How Your Data Collection Strategy Influences Your AI's Behavior
Updated: Mar 22, 2022
We've become increasingly comfortable with AI making critical decisions, from driving our cars to diagnosing our illnesses. However, many companies and communities have begun to notice a worrying trend: human biases are creeping into these AI systems.
Autonomous vehicles are more likely to fail to detect a black person than a white person when trained with a standard dataset, according to a new study out of the Georgia Institute of Technology. Twitter recently apologized for their 'racist' cropping algorithm, where image previews were found to focus on white faces over black ones. In an extreme example, OpenAI’s iGPT model was tasked to fill in the blanks in a cropped image of Alexandria Ocasio-Cortez, and consistently depicted her in revealing outfits across multiple tests.
The iGPT model was trained on ImageNet, an open-source dataset. The autonomous driving dataset was collected at Berkeley AI Research, and Twitter's dataset was most likely continuously collected.
One of the main culprits for these discriminatory results is biased training data. The training data that is collected is often either unrepresentative of reality or reflects existing prejudices.
In our first article, we explained why most of the work in building AI systems involves building datasets. Out of all of the steps that are required to build a dataset - collecting, cleaning, processing, labeling, and sorting data - data collection has the biggest impact on model performance and behavior. Assumptions and decisions made in this step have lasting effects that cascade downstream to dataset curation, labeling, and training.
In this article, we explain how your choice of data collection method influences AI behavior, and list options for acquiring balanced, diverse data to train bias-free and performant AI systems.
Why does data collection change AI behavior?
Data collection defines data content and distributions
How you choose to collect your data will determine what kind of data you end up collecting. The content of a dataset and its distributions will determine the performance and capabilities of the resulting model. Two quantitatively similar datasets can still result in two very different performing models, with different biases.
Content determines performance
The aim is to build a diverse dataset that covers the variations the model will encounter once deployed. In computer vision applications, this means diverse subject or object types, positions, and orientations; varied lighting and weather conditions; and varied perspectives. If the dataset contains a good mix of lighting conditions, the model will likely perform well under varying lighting since it will have seen similar examples during training. However, in most cases, training data only covers a small subset of all the possible variations that are present in real-world scenarios.
The example below illustrates this scarily common problem with data collection. Satellite imagery is collected on a clear day and used to train an aircraft detector. However, when testing on real-world data under various weather conditions, performance can be unsatisfactory. In other words, our training data has a coverage problem – it does not accurately represent the full distribution of possible inputs. The same concept can be applied to nearly any aspect of dataset content, including object attributes, lighting, background features, and many more.
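One rough way to spot a coverage problem before training is to compare the conditions recorded in your training metadata against the conditions you expect in deployment. The sketch below assumes each image carries a single condition label such as "clear" or "fog"; the function name `coverage_gaps`, the 5% threshold, and the sample metadata are all illustrative assumptions rather than part of any particular tool.

```python
from collections import Counter


def coverage_gaps(train_conditions, deploy_conditions, min_share=0.05):
    """Return deployment conditions that are under-represented in training.

    Each argument is a list of condition labels, one per image. A condition
    counts as a gap when its share of the training set is below `min_share`.
    """
    train_counts = Counter(train_conditions)
    total = len(train_conditions)
    gaps = {}
    for condition in sorted(set(deploy_conditions)):
        share = train_counts[condition] / total if total else 0.0
        if share < min_share:
            gaps[condition] = share
    return gaps


# Hypothetical metadata: imagery collected almost entirely on clear days.
train = ["clear"] * 98 + ["cloudy"] * 2
deploy = ["clear", "cloudy", "fog", "snow"]

print(coverage_gaps(train, deploy))
# {'cloudy': 0.02, 'fog': 0.0, 'snow': 0.0}
```

A check like this only covers attributes you thought to record, so it complements, rather than replaces, testing on real deployment data.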
Industry giants also face this problem
In the real world, the coverage problem has proved an issue even for the largest AI companies. Google’s AI eye scan diagnosis system reported stellar results in the lab, but failed when used in the real world. Vision systems in self-driving cars have also faltered when facing rare objects, unexpected weather conditions, sensor errors, and other varied scenarios. In 2016, a Tesla car crashed, killing the driver, when Autopilot failed to detect a white semi-truck crossing in front of it, due to its “white color against a brightly lit sky”.
Models trained on narrow data also tend to make highly confident wrong predictions in real-world applications.
Reasons for these biases
The data collection process is constrained by real-world resources, time, and labor. Data collectors make assumptions about where and how to look for data even before acquiring any and tend to favor cheaper and easier methods.
Making sense of common data collection strategies
We’re now aware of AI models’ sensitivity to the content of training data. But how can we acquire high-quality data that's varied enough to train models that will be deployed in the real world?
Traditionally, AI teams would manually collect data directly from the deployment scenario with a team of data collectors. For example, to build vision systems for self-driving cars, you would hire drivers to navigate the deployment area, capturing sensor data in various lighting and weather conditions.
Works well if collection and deployment conditions are similar.
Doesn’t require an extremely large amount of investment into infrastructure, sensors, data pipelines, and human labor.
Does not perform well in new deployments and scenarios. The data collected is often biased toward the locations, scenarios, and conditions it was collected under. Google’s AI was trained in very carefully controlled lab environments, so it could not deal with real-world lighting conditions that it had never seen during training.
The resulting dataset may contain biases from the collectors themselves. Data collectors may misinterpret instructions or assumptions. Collection instructions must be carefully constructed and guidelines must be clear. The example below illustrates completely different vehicle orientations obtained through two different interpretations of the collection instruction.
Portions of the datasets may have differing levels of quality, especially when collected by many different individuals.
Human or environmental errors can occur during collection because of miscalibration or poor operation, such as damaged or faulty sensors, wrong exposure settings, dirty lenses, or out-of-focus cameras.
Data collection runs must be repeated regularly since data degrades in relevance over time. The choice of sensors and sensor setup can also change, requiring additional rounds of collection to build entirely new datasets.
Manual collection remains a good approach if one does not expect much variation in new deployment conditions. However, for serious projects that are expected to have robust real-world performance guarantees on long-tail distributions or varying conditions, this approach will very quickly cause scalability and performance issues.
This method is predominantly used by players who already have networks of sensors deployed in the field with the necessary infrastructure to handle these data pipelines. A prime example is in the autonomous vehicle (AV) industry, where large networks of multiple cars continuously collect sensor readings while in service, and improve themselves by aggregating the experiences across the fleet – Tesla’s AI director even called its fleet "a large, distributed, mobile data center".
As Tesla has suggested, the most important data to collect is actually in the long tail - a term borrowed from statistics, and used to refer to infrequent but diverse scenarios. AV manufacturers utilize tricks to filter for these rare data points. For instance, whenever a driver executes a manual takeover from Autopilot, data is flagged as interesting. This data can then be easily retrieved to train the fleet on specific scenarios.
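The takeover trigger described above can be sketched as a simple filter over logged events. This is a minimal illustration, not a description of any manufacturer's actual pipeline: the `DriveEvent` fields, the clip IDs, and the `flag_interesting` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DriveEvent:
    clip_id: str             # hypothetical identifier for a recorded clip
    autopilot_engaged: bool  # was the driver-assist system active?
    manual_takeover: bool    # did the driver take back control?


def flag_interesting(events):
    """Return clip IDs where the driver took over while autopilot was engaged."""
    return [
        e.clip_id
        for e in events
        if e.autopilot_engaged and e.manual_takeover
    ]


events = [
    DriveEvent("clip-001", autopilot_engaged=True, manual_takeover=False),
    DriveEvent("clip-002", autopilot_engaged=True, manual_takeover=True),
    DriveEvent("clip-003", autopilot_engaged=False, manual_takeover=False),
]

print(flag_interesting(events))
# ['clip-002']
```

In practice a trigger like this would be one of several, since a takeover is only a proxy for "the model struggled here".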
Extremely large volumes of diverse data.
Rare and difficult real-world scenarios will be collected.
Data remains recent and relevant.
Extremely expensive and complex infrastructure. There's usually a high cost to set up infrastructure to capture, stream, and store data. Usually involves large-scale distributed deployment of products or sensors. The diversity of data depends on the size and spread of the fleet or network.
Sorting through the large volumes of data is labor-intensive. Curation is still required to produce balanced and diverse training data. Biases may still leak into training if they are not eliminated through curation, as in the Twitter example.
After a certain point, we face diminishing returns since a large proportion of the collected data is very similar. Tesla highlighted the importance of the long tail, which should contain rare and diverse scenarios - cars carrying bikes and equipment, car-carrying trailers, and even images of crashes and overturned vehicles.
This is the ideal option to get the most diverse data and most performant models for robust real-world AI systems. However, it is extremely costly to build, deploy and maintain.
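Because most fleet data is repetitive, one simple curation step is to cap how many samples each scenario contributes while keeping rare scenarios in full. The sketch below assumes every sample already has a scenario tag, which in practice is itself a labeling effort; the function name `cap_per_scenario`, the scenario names, and the counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cap_per_scenario(samples, max_per_scenario, seed=0):
    """Downsample so no scenario contributes more than `max_per_scenario` items.

    `samples` is a list of (sample_id, scenario) pairs. Rare scenarios are
    kept in full; common ones are randomly subsampled.
    """
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for sample_id, scenario in samples:
        by_scenario[scenario].append(sample_id)

    curated = []
    for scenario, ids in by_scenario.items():
        if len(ids) > max_per_scenario:
            ids = rng.sample(ids, max_per_scenario)
        curated.extend((sample_id, scenario) for sample_id in ids)
    return curated


# Hypothetical fleet data: many routine highway clips, very few rare events.
samples = [(f"hw-{i}", "highway") for i in range(1000)]
samples += [(f"ov-{i}", "overturned_vehicle") for i in range(3)]

curated = cap_per_scenario(samples, max_per_scenario=50)
print(Counter(scenario for _, scenario in curated))
# Counter({'highway': 50, 'overturned_vehicle': 3})
```

Capping does not create new rare examples; it only stops the common ones from dominating training.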
Purchase data directly from suppliers who are already collecting data at scale. Satellite companies like Maxar, BlackSky, and Planet Labs collect large amounts of imagery every day. This data can be used for AI analysis in agriculture, defense, logistics, and many other industries. Alternatively, public data marketplaces like AWS marketplace or bounding.ai are slowly emerging, as vendors put up their task-specific datasets for sale. Small, specific tasks like industrial object inspection and detection form the bulk of data currently found on marketplaces, but some vendors have bundled large amounts of data together - Shutterstock's self-driving car dataset, for example.
Does not require capital investment into sensors, infrastructure, and human operators.
Low turnaround time as you can buy the data off the shelf.
Data can be of high quality, depending on the vendor and price point.
The burden is on the buyer to know exactly what data they need. Otherwise, a great deal of trust is needed in the vendor.
Vendors need to know which parts of the world to look at to find specific objects under specific conditions. The easy “solution” is to purchase all possible images of an area or object under different conditions from a single vendor. However, this often results in a bias towards the characteristics of that specific area or object.
A lack of control over the content of the images. You can only buy what is available. Specific requests are possible but could be extremely expensive, for instance, re-tasking a satellite to collect high-elevation, aggressively-angled images of a specific target, under fog, at noon, from a specific sensor.
Buying data can be a great option for achieving baseline performance and detecting common objects if the right kind of data is selected and curated. However, when training a model to perform on rare or custom objects, it is often difficult to find enough appropriate samples.
Without proper curation, models may overfit on specific details in training data, such as backgrounds. This particular model learnt features from the ground surface to detect planes. The middle image shows red portions of the heatmap contributing greatly to the decision during training. In the rightmost image, a similar plane on a different surface could not be detected.
Large, free open-source datasets have been used to produce baselines for a long time. However, these datasets are predominantly scraped from various web sources. AI models tend to pick up biases from the way people are stereotypically portrayed on the internet. Currently, many researchers are trying to fix fundamental data problems with datasets like ImageNet.
Easily available and free.
Datasets have some basic QA and have been validated by other practitioners.
It is easy to compare model performance on benchmarks.
Many older datasets are poorly collected and constructed, and can contain systematic biases, as seen in the AOC example.
Are usually built for research tasks and not tailored to specific use-cases. May not match your required sensors, perspectives, objects, and contexts.
Can provide decent baseline performance, but does not guarantee good performance under rare or specific conditions. Further manual processing is required to tailor datasets to use cases. It’s possible that after filtering a large open-source dataset, there are very few relevant samples remaining to even train a performant model for your tasks.
Open-source datasets are a great choice for quick prototyping and bootstrapping a baseline model, but are less appropriate for real-world deployments. There are usually many content and coverage problems. For instance, the xView satellite image dataset has a large class imbalance problem. If we wanted to detect railway cars from satellite imagery, we would only be able to retrieve 17 examples out of the 1 million objects available.
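A quick class count is usually enough to reveal this kind of imbalance before you commit to a dataset. The sketch below uses toy numbers: only the 17 railway cars mirror the figure quoted above, while the other class names and counts are made up for illustration and are not the real xView statistics. The `rare_classes` helper is likewise an assumption.

```python
from collections import Counter


def rare_classes(labels, min_count):
    """Return {class_name: count} for classes with fewer than `min_count` examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_count}


# Toy label list; counts other than the 17 railway cars are invented.
labels = ["building"] * 5000 + ["small_car"] * 3000 + ["railway_car"] * 17

print(rare_classes(labels, min_count=100))
# {'railway_car': 17}
```

The right `min_count` depends on the task and model, so treat the threshold as a prompt to investigate rather than a hard rule.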
Synthetic data is created using virtual simulations and 3D graphics, and is used as a complement (in both senses of the word) to real-world data. In various use cases, it has been shown to match or surpass real data for training AI models.
Check out our previous post for a deeper dive into synthetic data and how it works.
Full control of dataset content and distributions.
Contains rare objects and scenarios.
High diversity in the dataset, over many different variables. For instance, weather conditions, lighting, object variation, material variation, and sensor conditions.
Makes developing AI like developing software. Synthetic data allows for rapid experimentation, testing, and data patching, and fault isolation through generating alternative datasets.
Extremely fast iteration. You can generate multiple datasets in an hour.
Bonus: Synthetic data can complement other data strategies. When combined with real data, it has been shown to match or outperform training on real data alone.
We can’t model what we don’t know. Synthetic data relies on real-world knowledge or data to model similar virtual worlds. For example, while it is easy to simulate virtual environments since physical objects and dynamics are well understood, it’s dangerous to attempt to create synthetic images of cancerous brain scans because we often don’t know the underlying processes well enough.
Synthetic data is a great Swiss-army knife. It is increasingly being adopted for bootstrapping models, patching missing data, isolating model failures, training on diverse and rare examples, and giving developers the ability to build more performant and bias-free AI systems in a short period of time.
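To make "full control of dataset content and distributions" concrete, here is a minimal sketch of how scene parameters for a synthetic dataset might be sampled so that every combination of conditions is covered evenly. It does not reflect any specific tool's API: the variable lists, the `camera_elevation_deg` field, and the `sample_scene_configs` function are all assumptions, and a real generator would pass these configurations to a renderer.

```python
import itertools
import random
from collections import Counter

# Hypothetical variables a scene generator might expose.
WEATHER = ["clear", "cloudy", "fog", "rain"]
TIME_OF_DAY = ["dawn", "noon", "dusk", "night"]
SURFACE = ["asphalt", "concrete", "grass", "snow"]


def sample_scene_configs(n, seed=0):
    """Return n scene configs, cycling through every combination of
    weather, time of day, and surface so each appears about equally often."""
    rng = random.Random(seed)
    combos = list(itertools.product(WEATHER, TIME_OF_DAY, SURFACE))
    rng.shuffle(combos)
    configs = []
    for i in range(n):
        weather, time_of_day, surface = combos[i % len(combos)]
        configs.append({
            "weather": weather,
            "time_of_day": time_of_day,
            "surface": surface,
            # Continuous parameters can be randomized per scene.
            "camera_elevation_deg": rng.uniform(10.0, 90.0),
        })
    return configs


configs = sample_scene_configs(128)
print(Counter(c["weather"] for c in configs))
```

With 64 combinations and 128 scenes, each combination appears exactly twice, so no weather, lighting, or surface condition is left to chance the way it would be in a real collection run.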
At the start of this article, we discussed examples of real-world AI misbehavior, including racism and sexism in predictions and results. Unlike software bugs, faulty data collection strategies do not cause system-halting errors. Systematic biases can leak into models without anyone noticing, potentially causing economic, social, or even physical harm. If we are to rely on machines to inform our decisions, we must begin designing good data collection strategies.
Without a proper collection methodology, unintended downstream effects are often introduced. We’ve illustrated several approaches to data collection, which can be used alone or combined as part of a larger data strategy. The choice of approach comes down to your constraints on time, cost, and performance.
Ultimately, real data remains valuable for training and testing, but building massive, diverse datasets for lots of classes requires a large amount of capital, labor, and time, which has proven to be an uphill battle. As a result, industry-leading companies and researchers have been adopting synthetic data in the last few years. With synthetic data, they've been able to speed up AI development and scale their capabilities in record time. As this trend grows, we're seeing more practitioners adopt synthetic data as their tool of choice.
At Bifrost, we’re building synthetic data tooling that lets you generate and customize your own datasets with the scale and variety you need. Get in touch or follow us on LinkedIn and Twitter to find out how we can help you navigate data collection and build performant AI models!