Over the past several years, Azavea has devoted much time to machine learning, and in particular to machine learning on satellite imagery. This work led to the creation and release of our own labeling tool: GroundWork.
There isn’t much guidance available on labeling satellite imagery for machine learning, so, when we can, we run our own small experiments. One such opportunity arose last year when we needed to provide training data for an image classification model. The imagery spanned several cities, and the project required us to classify several features in each. Our question: given the need to classify more than one feature in each task, would it be quicker to label one project with multiple classes per task, or two projects with one class per task?
What’s the best way to label multiple features in training data?
Image classification models categorize an image as belonging to one or more categories of interest. In this case, our annotators at CloudFactory looked at images of each city overlaid with OpenStreetMap (OSM) labels. They were then to compare the image and the overlay to determine if two features (let’s say, highways and bridges) were well-mapped by OSM. The model’s goal was to identify and prioritize under-mapped areas.
Before beginning, members of the Raster Foundry and R&D teams wondered whether it would be more efficient to label each city one feature at a time or to label all features at once. Of particular concern was whether two or more OSM overlays would create visual clutter, and our experience with object detection projects had shown that locating different sorts of objects at once was mentally taxing. Still, other team members’ intuition was that classifying features separately would take more time.
After debating the subject, we decided to run an experiment to settle the question. We split our cities in half and created 6 image classification projects:
| City | Project I: Are the … in this image well-mapped? | Project II: Are the … in this image well-mapped? |
|---|---|---|
| A | Highways | Bridges |
| B | Highways | Bridges |
| C | Highways or bridges | — |
| D | Highways or bridges | — |
Testing single class projects versus multiple class projects
Speed: Tasks per hour
We first defined speed as the mean number of tasks labeled/validated per hour. Let’s break that down further:
When your imagery set is an entire city, segmenting it for annotation is necessary. We call these segments tasks. In our case, each task contained either one or two questions for the annotator to answer. A labeled task is one in which an annotator has answered all questions. Each task is then assigned to another annotator, who can either agree with the original decision(s) or make corrections. Once this process is complete, the task is validated.
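To make that lifecycle concrete, here is a minimal Python sketch of the idea. This is not GroundWork’s actual data model; the class and field names are hypothetical and only illustrate how a task moves from unlabeled to labeled to validated:

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskStatus(Enum):
    UNLABELED = "unlabeled"   # no annotator has answered yet
    LABELED = "labeled"       # first annotator answered every question
    VALIDATED = "validated"   # second annotator agreed or corrected the answers

@dataclass
class Task:
    questions: list[str]                      # e.g. ["Are the highways in this image well-mapped?"]
    answers: dict[str, bool] = field(default_factory=dict)
    status: TaskStatus = TaskStatus.UNLABELED

    def label(self, answers: dict[str, bool]) -> None:
        """First pass: a task counts as labeled once every question has an answer."""
        self.answers = answers
        if all(q in self.answers for q in self.questions):
            self.status = TaskStatus.LABELED

    def validate(self, corrections: dict[str, bool] | None = None) -> None:
        """Second pass: a reviewer agrees with the original answers or corrects them."""
        if corrections:
            self.answers.update(corrections)
        self.status = TaskStatus.VALIDATED
```

In the single class projects each task carries one question; in the multiple class projects it carries two.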
Who validates? In our workflow, that changes as the project progresses. At first, either I or one of our CloudFactory team leads verifies each task, and I provide regular feedback on what is going well and what needs improving. As team members become familiar with the expectations, one or more of them are added to the validation team. I depend on my team lead, Smita Shrestha, to make those decisions, and her judgment hasn’t been wrong yet.
Single Class Projects
| City | Labeling speed (tasks/hour) | Validating speed (tasks/hour) | Mean speed (tasks/hour) |
|---|---|---|---|
| A | 70 | 100 | 85 |
| B | 86 | 105 | 95 |
| Both | 78 | 102 | 90 |
Multiple Class Projects
| City | Labeling speed (tasks/hour) | Validating speed (tasks/hour) | Mean speed (tasks/hour) |
|---|---|---|---|
| C | 70 | 88 | 79 |
| D | 73 | 98 | 86 |
| Both | 72 | 93 | 82 |
At first glance, the single class projects may seem indisputably faster. But each single class task contains only one label, while tasks in the multiple class projects contain two, so a better comparison is labels completed per hour.
Speed: Labels per hour
In other words, the output we are attempting to measure is not the hours spent completing the project, but rather the number of validated labels we can generate to train our machine learning model. When we compare the data this way, the picture changes.
Single Class Projects
| Project | Completed labels | Project duration (hours) | Speed (labels/hour) |
|---|---|---|---|
| City A I | 3141 | 118.5 | 27 |
| City A II | 3217 | 126 | 26 |
| City B I | 3826 | 127.5 | 30 |
| City B II | 6467 | 127 | 51 |
| Mean speed | | | 33 |
Multiple Class Projects
| Project | Completed labels | Project duration (hours) | Speed (labels/hour) |
|---|---|---|---|
| City C | 18132 | 223 | 81 |
| City D | 24138 | 273.5 | 88 |
| Mean speed | | | 85 |
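As a sanity check on the arithmetic, here is a minimal Python sketch that recomputes the labels-per-hour figures from the tables above. The numbers are copied directly from the tables; the helper function and variable names are my own:

```python
# (project name, completed labels, project duration in hours), copied from the tables above
single_class = [
    ("City A I", 3141, 118.5),
    ("City A II", 3217, 126),
    ("City B I", 3826, 127.5),
    ("City B II", 6467, 127),
]
multiple_class = [
    ("City C", 18132, 223),
    ("City D", 24138, 273.5),
]

def labels_per_hour(projects):
    """Overall speed: total validated labels divided by total hours worked."""
    total_labels = sum(labels for _, labels, _ in projects)
    total_hours = sum(hours for _, _, hours in projects)
    return total_labels / total_hours

single_speed = labels_per_hour(single_class)      # ~33 labels/hour
multiple_speed = labels_per_hour(multiple_class)  # ~85 labels/hour
print(f"single class:   {single_speed:.0f} labels/hour")
print(f"multiple class: {multiple_speed:.0f} labels/hour")
print(f"ratio:          {multiple_speed / single_speed:.1f}x")  # ~2.6x
```

Computed as total labels over total hours, the means round to the 33 and 85 labels/hour reported above, which is where the roughly two-and-a-half-times difference in the next section comes from.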
Finding: Multiple class projects are more efficient
While annotators completed each task of the single class projects faster, producing an equivalent number of validated labels took them roughly two and a half times as long.
Other considerations
Other factors that might have influenced the outcome are:
- City A was much less developed than the other cities and contained the fewest number of tasks of any city;
- Project B II contained many more tasks than Project B I, as well as the other single class projects;
- Labelers completed all of the single class projects before working on the multiple class projects, so they may have gained insight as they labeled; and
- Not all tasks for City D were validated.
Image classification is about understanding the entire task
Though far more examples would be needed to come to a final conclusion on the matter, our experiment indicates that it is significantly quicker to identify multiple classes in one labeling project, at least for our team using GroundWork. The next question is: why?
I must admit that the data surprised me. My instinct was that labelers would “get into a groove,” so to speak, and work through the single class projects much faster. Raster Foundry software engineer Alex Kaminsky offered the following thoughts:
“This is a cool experiment. What I am drawing from it is that classification isn’t just about answering that one question but understanding the entire task and then answering the question. So the overhead of looking at the whole thing is reduced when you have multiple questions per project.”
I also asked our CloudFactory team to share their thoughts. They confirmed that adding a second question to a task was not a significant burden:
“The team finds it easy to label a couple of classes where they don’t have to think of many items to be labeled and worry about missing any classes.”
However, their sentiments warn us not to generalize too much. Returns seem to diminish when too many classes are added:
“On the other hand, multiple labeling from the worker’s point of view is a complex task.”
This was especially true of other types of labeling projects such as semantic segmentation, where labelers not only have to identify items, but draw polygons around them:
“…they have to make sure that all the classes are taken into consideration and also focus on quality at the same time. The labeling time is longer in this case as they have to be more careful and observe all the classes are included.”
What’s next?
The next occasion we have to run this experiment, we will, adjusting for what we learned this time. I’m also interested in knowing whether a similar experiment with semantic segmentation would yield different results. Have you run a labeling experiment, or can you offer a different analysis? Let me know!
Whether you’re looking to label your own imagery, understand what type of annotation is needed to answer your questions, or learn more about running a machine learning model, Azavea has a host of products and teams to support you.