Last month, we asked readers and our social media followers what questions they had for our machine learning engineers Lewis Fishgold, James McClain, and Rob Emanuele. We received a variety of questions that covered different topics within the world of geospatial machine learning. We decided to divide up the questions and answers into two posts. This first post will address getting started with machine learning. To learn more about the future of machine learning, read part 2.
We’re machine learning engineers: Ask us anything!
What concepts and technologies are most important to learn to get started in machine learning?
(jmgaray11@gmail.com via Azavea newsletter)
Lewis Fishgold: Machine learning is a big field, so it helps to have a target application in mind. If you are interested in making predictions or analyzing structured data (like that in a CSV file), you could learn some “classic ML” concepts such as unsupervised vs. supervised learning, training vs. validation data, evaluation metrics, and overfitting, and use a toolkit like scikit-learn. Then you will be able to apply algorithms and models such as decision trees, linear regression, and k-means clustering. If you would like a deeper understanding of how these methods work, I recommend taking Andrew Ng’s Machine Learning course.
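If it helps to make those terms concrete, here is a minimal scikit-learn sketch of that kind of workflow: a train/validation split, a decision tree, and an evaluation metric. The CSV file and column names are placeholders for whatever structured data you happen to have.

```python
# A minimal "classic ML" sketch with scikit-learn: train/validation split,
# a decision tree, and an evaluation metric. The CSV file and column names
# are placeholders for whatever structured data you are working with.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("my_data.csv")      # structured data, one row per sample
X = df.drop(columns=["label"])       # feature columns
y = df["label"]                      # target column

# Hold out validation data so you can detect overfitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

# A large gap between these two numbers is a sign of overfitting.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```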
If you would like to work with unstructured data like text and images, I recommend learning about deep learning (in addition to the classic ML concepts mentioned above) and tools like PyTorch and fastai. To learn more about how these methods work, I recommend the deeplearning.ai or fastai courses, where you will learn about backpropagation, stochastic gradient descent, convolutional networks, which are useful for images, and recurrent neural networks (RNNs), which are useful for text and other sequence modeling problems. If you are using deep learning, you will also need to run your code on a high-end GPU (graphics processing unit), which may involve learning about cloud computing if you do not own one.
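As a rough illustration of a few of those pieces, here is a self-contained PyTorch sketch of a tiny convolutional network trained with stochastic gradient descent; random tensors stand in for a real image dataset, and the layer sizes are arbitrary.

```python
# A bare-bones PyTorch sketch: a small convolutional network, stochastic
# gradient descent, and backpropagation. Random tensors stand in for a
# real image dataset so the example is self-contained.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer for images
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),                           # 10 hypothetical classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)     # a batch of 8 fake RGB images
labels = torch.randint(0, 10, (8,))    # fake class labels

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                    # backpropagation
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.3f}")
```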
To deeply understand these methods, and be able to modify them for custom use-cases, it’s important to have a solid grasp on the basics of multivariable calculus, probability theory, linear algebra, optimization, and information theory, but these can be picked up on an as-needed basis to some extent.
James McClain: Technologies first: To get started, I think that it is most important to have access to an environment in which you can experiment.
One such environment, which can be accessed without cost, is Google Colab. Colab provides you with a Jupyter notebook in which you can tinker and experiment, and also run modestly sized “real” experiments (I believe Colab restricts the amount of time that any one notebook cell can run, thereby limiting the size of your experiment).
When you are ready to move beyond Colab, you will most probably want to run experiments on cloud instances (that you pay for) or perhaps on your own hardware. If you wish to use the cloud, something like AWS (in particular its EC2 and Batch services) might meet your needs. If you want to use your own computer, then it will be helpful to be familiar with Linux and Docker and to have a fairly new NVIDIA graphics card.
To sum up: you should be familiar with Jupyter and the Python programming language (be familiar with Python both inside and outside of the Jupyter environment) as well as the libraries and at least one of the frameworks that are most commonly used for machine learning (e.g. PyTorch). You should also be familiar with how to use cloud services to run GPU-enabled jobs and/or have your own hardware with the right software.
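Whichever environment you choose, a quick sanity check like the one below (assuming PyTorch is installed) confirms that Python can actually see a GPU before you start a long training run.

```python
# Quick environment sanity check, assuming PyTorch is installed.
# Works the same in a Colab/Jupyter cell or a local Python session.
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)

if torch.cuda.is_available():
    device = torch.cuda.get_device_name(0)
    mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU available: {device} ({mem_gb:.1f} GB)")
else:
    print("No GPU detected; training will fall back to the (much slower) CPU.")
```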
Concepts: Familiarity with programming, particularly Python, is important. Also, some background knowledge in statistics will be helpful.
Rob Emanuele: It’s important to understand the core concepts of how a neural network works – e.g., backpropagation, stochastic gradient descent, convolutional neural networks (CNNs) (if working with imagery), and supervised training cycles in batches. I don’t believe you have to understand these at a deep level before getting your hands dirty, though; I feel like I’m always picking up more understanding of these concepts, both intuitive and technical, the more I work on real problems. You don’t necessarily need to understand the inner workings of a neural network before you start training models with a framework, but that understanding will help you debug things when they do go wrong.
That being said, I think the best way to learn is by doing. I remember training my first CNN and seeing it make predictions on aerial imagery. I hit a lot of problems and had to learn along the way until I got a model producing (not great) results, but the payoff of seeing a model I had trained making real predictions was so satisfying! So I would encourage people to pick a problem or a tutorial and get started; you’ll learn plenty along the way just by doing it the first few times.
What machine learning technologies and resources do you use or like most?
(jmgaray11@gmail.com via Azavea newsletter)
LF: I’m a big fan of PyTorch which lets you define complex models and then optimize them on GPUs. I think it’s very intuitive, and the documentation and ecosystem around it are great.
We also use AWS EC2 P3 instances, which allow us to rent NVIDIA Tesla V100 GPUs by the minute, and AWS Batch to run sequences of computational jobs in the cloud. I can’t claim that these tools are as easy to set up and use as PyTorch, but they allow us to scale up our capacity on demand and avoid spending money on expensive machines that would often sit idle.
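As a rough sketch of what submitting one of those jobs can look like in code, here is an example using boto3; the job queue and job definition names are hypothetical placeholders for GPU-enabled resources you would define in your own AWS account.

```python
# A sketch of submitting a training run to AWS Batch with boto3.
# The job name, queue, job definition, and command below are hypothetical
# placeholders; substitute the resources defined in your own account.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="train-building-detector",      # hypothetical job name
    jobQueue="gpu-job-queue",                # hypothetical GPU-backed queue
    jobDefinition="ml-training-job-def",     # hypothetical job definition
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "20"],
    },
)
print("Submitted job:", response["jobId"])
```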
JM: The machine learning framework that I like best is PyTorch. It is fairly simple to learn and use, especially if you have access to someone who is already knowledgeable in it. Unfortunately, there do not seem to be many (if any) books that you can read to immediately bring yourself up to speed on this framework. However, there are tutorials, Stack Overflow conversations, YouTube videos, and documentation that you can use.
The computing resources that I prefer are AWS Batch and AWS EC2 for running jobs in the cloud. I also think that it can be very helpful to have a local GPU (with 8 GB of video RAM minimum, 16 GB or more recommended) attached to a properly-configured Linux machine. Having local resources allows you to iterate and experiment more quickly/easily than you can with a cloud instance.
RE: I think arXiv is an amazing resource for understanding what’s possible with machine learning. It’s hard to keep up with all the advancements that are happening, but usually, if I have an idea where I think ML could be applicable and effective, I can find a ton of research on similar topics describing the state of the art of what is possible. As a big proponent of open source and open data, I’m elated to see the open culture around research information for machine learning that lets someone like me – without an academic background in the area – get up to speed on the latest and greatest ideas.
What’s the difference in workflows between classification work on pixels/segmented parts of images & deep learning work (e.g., CNNs)? Can one easily transform a workflow developed for per-pixel/image segment classification to be suitable for a CNN workflow?
(Mike Treglia @MikeTreglia)
LF: I think this question is getting at the split in the field between “classic remote sensing” approaches, which do land cover classification over individual pixels of low-resolution multispectral imagery using random forests and simple neural networks, and more recent approaches, which use CNNs to localize things like buildings in (usually RGB) high-resolution imagery. As James says, by using 1×1 convolutions, CNN workflows can handle the classic remote sensing use case, but this is perhaps overkill: you might not get a performance boost, and it will be more computationally expensive. In addition, you will lose some of the model interpretability afforded by approaches like random forests. On the other hand, by using 3×3 convolutions and unlocking the full power of CNNs, it may be possible to get an accuracy boost, and this would be an interesting experiment if it hasn’t been performed already.
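As a concrete illustration of that distinction, the sketch below defines two small PyTorch networks: one restricted to 1×1 convolutions, which classifies each pixel from its own band values alone, and one using 3×3 convolutions, which also draws on neighboring pixels. The band and class counts are arbitrary placeholders.

```python
# Illustrating the 1x1 vs. 3x3 distinction in PyTorch. With kernel_size=1,
# each pixel is classified from its own band values alone (the "classic
# remote sensing" per-pixel setting); with kernel_size=3, spatial context
# from neighboring pixels enters the prediction.
import torch
import torch.nn as nn

num_bands, num_classes = 8, 5       # e.g. multispectral bands, land cover classes

per_pixel_net = nn.Sequential(      # 1x1 convs: pixel-by-pixel classification
    nn.Conv2d(num_bands, 32, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),
)

spatial_net = nn.Sequential(        # 3x3 convs: uses neighboring pixels too
    nn.Conv2d(num_bands, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=3, padding=1),
)

chip = torch.randn(1, num_bands, 256, 256)   # a fake image chip
print(per_pixel_net(chip).shape)             # torch.Size([1, 5, 256, 256])
print(spatial_net(chip).shape)               # torch.Size([1, 5, 256, 256])
```

Both networks produce a per-pixel prediction map of the same shape; only the amount of spatial context used to make each prediction differs.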
JM: This answer may seem circular, but I hope it is useful.
Roughly speaking, if one starts within the context of a system that can represent convolutional neural networks but restricts the size of the convolutions to 1×1, then one is operating on a pixel-by-pixel level.
Given that, it is possible to imagine contexts in which moving a workflow from pixel-by-pixel to convolutional is easy: simply stop honoring the restriction.
In all probability, though, actual pixel-by-pixel workflows do things that are not directly representable within a framework such as PyTorch (not so bad), or things that actively interfere with the operation of the framework (e.g. holding significant GPU resources).
The extent to which one’s actual workflow resembles one of these scenarios or the other is (probably) the extent to which the move will be easy or difficult.
Also, pipelines that make use of machine learning frequently use other styles (including pixel-by-pixel) as preprocessing and/or post-processing steps.
RE: If we consider supervised machine learning, I think most machine learning workflows look very much the same from a high level – gather training data, preprocess and split that data, train your model, evaluate, and use the model to predict on new data. Whether the model is a CNN or a pixel-level random forest classifier, it still follows that same workflow. We set up Raster Vision to follow that workflow generally and to be pluggable for different modeling strategies. In fact, while deep learning is the focus of Raster Vision, you can use different modeling techniques – check out this awesome research project by our colleague Derek Dohler, who implemented a genetic programming model in Raster Vision for performing semantic segmentation on imagery.
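To underline how little the surrounding workflow changes when the model does, here is a small sketch of that split/train/evaluate loop with two interchangeable scikit-learn models; the synthetic features and labels are placeholders for real training data.

```python
# A sketch of the common supervised workflow: split the data, train,
# evaluate, then predict on new data. The modeling step is the only line
# that changes when you swap strategies (two scikit-learn models stand in;
# synthetic data keeps the example self-contained).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = np.random.rand(1000, 6)                  # stand-in features (e.g. pixel band values)
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # stand-in labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for model in [RandomForestClassifier(n_estimators=100), LogisticRegression()]:
    model.fit(X_train, y_train)                      # train
    score = f1_score(y_val, model.predict(X_val))    # evaluate
    print(type(model).__name__, "validation F1:", round(score, 3))

# The chosen model would then be used to predict on genuinely new data.
```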
If you have questions about the future of machine learning, check out part 2 of our AMA series.