Over the past several years, Azavea has devoted much time to machine learning, and in particular to machine learning on satellite imagery. This work led to the creation and release of our own labeling tool: GroundWork.
There isn’t much guidance available on labeling satellite imagery for machine learning, so, when we can, we run our own small experiments. One such opportunity arose last year when we needed to provide training data for an image classification model. The imagery spanned several cities, and the project required us to classify several features in each. Our question: given the need to classify more than one feature in each task, would it be quicker to label one project with multiple classes per task, or two projects with one class per task?
What’s the best way to label multiple features in training data?
Image classification models categorize an image as belonging to one or more categories of interest. In this case, our annotators at CloudFactory looked at images of each city overlaid with OpenStreetMap (OSM) labels. They were then to compare the image and the overlay to determine if two features (let’s say, highways and bridges) were well-mapped by OSM. The model’s goal was to identify and prioritize under-mapped areas.
Before beginning, members of the Raster Foundry and R&D teams wondered whether it would be more efficient to label each city one feature at a time or to label all features at once. Of particular concern was whether two or more OSM overlays would create visual clutter, and our experience with object detection projects had shown that locating different sorts of objects at once was mentally taxing. Still, other team members’ intuition was that classifying features separately would take more time.
After debating the subject, we decided to run an experiment to settle the question. We split our cities in half and created 6 image classification projects:
| City | Project I: Are the … in this image well-mapped? | Project II: Are the … in this image well-mapped? |
|---|---|---|
| A | Highways | Bridges |
| B | Highways | Bridges |
| C | Highways or bridges | — |
| D | Highways or bridges | — |
Testing single class projects versus multiple class projects
Speed: Tasks per hour
We first defined speed as the mean number of tasks labeled/validated per hour. Let’s break that down further:
When your imagery set is an entire city, segmenting it for annotation is necessary. We call these segments tasks. In our case, each task contained either one or two questions for the annotator to answer. A labeled task is one in which an annotator has answered all questions. Each task is then assigned to another annotator, who can either agree with the original decision(s) or make corrections. Once this process is complete, the task is validated.
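To make that lifecycle concrete, here is a minimal Python sketch of the idea. This is not GroundWork’s actual data model; the class and field names are hypothetical and only illustrate how a task moves from unlabeled to labeled to validated:

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskStatus(Enum):
    UNLABELED = "unlabeled"   # no annotator has answered yet
    LABELED = "labeled"       # first annotator answered every question
    VALIDATED = "validated"   # second annotator agreed or corrected the answers

@dataclass
class Task:
    questions: list[str]                      # e.g. ["Are the highways in this image well-mapped?"]
    answers: dict[str, bool] = field(default_factory=dict)
    status: TaskStatus = TaskStatus.UNLABELED

    def label(self, answers: dict[str, bool]) -> None:
        """First pass: a task counts as labeled once every question has an answer."""
        self.answers = answers
        if all(q in self.answers for q in self.questions):
            self.status = TaskStatus.LABELED

    def validate(self, corrections: dict[str, bool] | None = None) -> None:
        """Second pass: a reviewer agrees with the original answers or corrects them."""
        if corrections:
            self.answers.update(corrections)
        self.status = TaskStatus.VALIDATED
```

In the single class projects each task carries one question; in the multiple class projects it carries two.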
Who validates? In our workflow, that changes as the project progresses. At first, either I or one of our CloudFactory team leads verifies each task, and I provide regular feedback on what is going well and what needs improving. As team members become familiar with the expectations, one or more of them are added to the validation team. I depend on my team lead, Smita Shrestha, to make those decisions, and her judgment hasn’t been wrong yet.
Single Class Projects
| City | Labeling speed (tasks/hour) | Validating speed (tasks/hour) | Mean speed (tasks/hour) |
|---|---|---|---|
| A | 70 | 100 | 85 |
| B | 86 | 105 | 95 |
| Both | 78 | 102 | 90 |
Multiple Class Projects
| City | Labeling speed (tasks/hour) | Validating speed (tasks/hour) | Mean speed (tasks/hour) |
|---|---|---|---|
| C | 70 | 88 | 79 |
| D | 73 | 98 | 86 |
| Both | 72 | 93 | 82 |
At first glance, the single class projects may seem indisputably faster. But each single class task contains only one label, while tasks in the multiple class projects contain two, so a better comparison is labels completed per hour.
Speed: Labels per hour
In other words, the output we are attempting to measure is not the hours spent completing the project, but rather the number of validated labels we can generate to train our machine learning model. When we compare the data this way, the picture changes.
Single Class Projects
| Project | Completed labels | Project duration (hours) | Speed (labels/hour) |
|---|---|---|---|
| City A I | 3141 | 118.5 | 27 |
| City A II | 3217 | 126 | 26 |
| City B I | 3826 | 127.5 | 30 |
| City B II | 6467 | 127 | 51 |
| Mean speed | | | 33 |
Multiple Class Projects
| Project | Completed labels | Project duration (hours) | Speed (labels/hour) |
|---|---|---|---|
| City C | 18132 | 223 | 81 |
| City D | 24138 | 273.5 | 88 |
| Mean speed | | | 85 |
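As a sanity check on the arithmetic, here is a minimal Python sketch that recomputes the labels-per-hour figures from the tables above. The numbers are copied directly from the tables; the helper function and variable names are my own:

```python
# (project name, completed labels, project duration in hours), copied from the tables above
single_class = [
    ("City A I", 3141, 118.5),
    ("City A II", 3217, 126),
    ("City B I", 3826, 127.5),
    ("City B II", 6467, 127),
]
multiple_class = [
    ("City C", 18132, 223),
    ("City D", 24138, 273.5),
]

def labels_per_hour(projects):
    """Overall speed: total validated labels divided by total hours worked."""
    total_labels = sum(labels for _, labels, _ in projects)
    total_hours = sum(hours for _, _, hours in projects)
    return total_labels / total_hours

single_speed = labels_per_hour(single_class)      # ~33 labels/hour
multiple_speed = labels_per_hour(multiple_class)  # ~85 labels/hour
print(f"single class:   {single_speed:.0f} labels/hour")
print(f"multiple class: {multiple_speed:.0f} labels/hour")
print(f"ratio:          {multiple_speed / single_speed:.1f}x")  # ~2.6x
```

Computed as total labels over total hours, the means round to the 33 and 85 labels/hour reported above, which is where the roughly two-and-a-half-times difference in the next section comes from.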
Finding: Multiple class projects are more efficient
While annotators completed each task of the single class projects faster, producing an equivalent number of validated labels took them roughly two and a half times as long.
Other considerations
Other factors that might have influenced the outcome are:
- City A was much less developed than the other cities and contained the fewest number of tasks of any city;
- Project B II contained many more tasks than Project B I, as well as the other single class projects;
- Labelers completed all of the single class projects before working on the multiple class projects, so they may have gained insight as they labeled; and
- Not all tasks for City D were validated.
Image classification is about understanding the entire task
Though far more examples would be needed to come to a final conclusion on the matter, our experiment indicates that it is significantly quicker to identify multiple classes in one labeling project, at least for our team using GroundWork. The next question is: why?
I must admit that the data surprised me. My instinct was that labelers would “get into a groove,” so to speak, and work through the single class projects much faster. Raster Foundry software engineer Alex Kaminsky offered the following thoughts:
“This is a cool experiment. What I am drawing from it is that classification isn’t just about answering that one question but understanding the entire task and then answering the question. So the overhead of looking at the whole thing is reduced when you have multiple questions per project.”
I also asked our CloudFactory team to share their thoughts. They confirmed that adding a second question to a task was not a significant burden:
“The team finds it easy to label a couple of classes where they don’t have to think of many items to be labeled and worry about missing any classes.”
However, their sentiments warn us not to generalize too much. Returns seem to diminish when too many classes are added:
“On the other hand, multiple labeling from the worker’s point of view is a complex task.”
This was especially true of other types of labeling projects such as semantic segmentation, where labelers not only have to identify items, but draw polygons around them:
“…they have to make sure that all the classes are taken into consideration and also focus on quality at the same time. The labeling time is longer in this case as they have to be more careful and observe all the classes are included.”
What’s next?
The next occasion we have to run this experiment, we will, adjusting for what we learned this time. I’m also interested in knowing whether a similar experiment with semantic segmentation would yield different results. Have you run a labeling experiment, or can you offer a different analysis? Let me know!
Whether you’re looking to label your own imagery, understand what type of annotation is needed to answer your questions, or learn more about running a machine learning model, Azavea has a host of products and teams to support you.