Data preparation
Assess your use case, source and prepare your data
What is training data?
If you have properly set up your Google Cloud account, you are now ready for the exercise. In this lesson, you will learn what questions you should ask while gathering the training data and how to prepare it to be used by AutoML Vision.
With training data, what we mean is examples of what we want our ML model to be able to recognise and categorise. In our case, this means providing a set of satellite images and telling the algorithm which ones are examples of amber mining and which are not.
Start with your use case
While putting together the dataset, always start from the problem you are asking ML to help you solve. Consider the following questions:
- What is the outcome you’re trying to achieve?
- What kinds of categories would you need to recognise to achieve this outcome?
- Is it possible for humans to recognise those categories? Although AutoML Vision can handle many more images and categories than humans can, if a human cannot recognise a specific category, then AutoML Vision will have a hard time as well.
- What kinds of examples would best reflect the type and range of data your system will classify?
Think about a story you are working on. How do the answers to those questions change your approach to the story and whether you need Machine Learning for it?
Assess your use case
In our case, these could be our answers:
- We want our model to be able to recognise instances of amber mining in satellite images we will present to it.
- We only need two categories: "YES: this image includes elements consistent with patterns that usually show amber mining activity" and "NO: this image doesn't include elements that suggest amber mining".
- Mostly yes: instances of amber mining are quite recognisable in satellite images because of the distinctive pockmark-like pattern of holes in the ground. But we'll see in the testing phase that it might not always be as easy as we think.
- Different background, different density of the holes, different colours. The more diverse the examples in our dataset, the better the algorithm will learn.
Source your data
Once you’ve established what data you need, the next step is to find a way to source it. In our case, we already have the dataset provided by Texty. But think of what might be your own use case: How and where can you find the images you need?
You might be able to source them from what your organisation collects or from third-parties. In both cases, make sure to review regulations about data protection in your region and the locations your application will serve.
No training data will ever be perfectly "unbiased", but you can improve your chances of building a "fair" ML model if you carefully consider potential sources of bias in your data and take steps to address them. Review our Introduction to Machine Learning to find out more about it.
Prepare your data
There are a few more things to keep in mind as you put together the training data:
Include enough labelled examples in each category: The minimum required by AutoML Vision is 100 examples per label. In general, the more labelled images you can bring to the training process, the better your model will be.
It’s important to include roughly similar amounts of training examples for each category. If you have an abundance of data for one label, use only part of it to avoid having a widely different amount of examples per category.
Find images that are visually similar to what you’re planning to ask the model to categorise. Ideally, your training examples are real-world data drawn from the same dataset you're planning to use the model to classify.
-
Google Historical Imagery: Google Earth Pro, Maps and Timelapse
LessonFind out where a photo was taken and when it was uploaded. -
-