Skip to main content
Go to dashboard
Not sure where to start? Take a short quiz to get personalised recommendations.
Lesson 4 of 7
Data preparation
Hands-on Machine Learning
What is Machine Learning
Investigating stories with Machine Learning
Google Cloud AutoML Vision
Data preparation
Training your Machine Learning model
Evaluate and Test
check_box_outline_blank Hands-on Machine Learning: Take the Quiz
Course
0% completed
5 minutes to complete

Data preparation

image12_3.png

Assess your use case, source and prepare your data

image12_3.png

What is training data?

image12_3_zA6aI42.png

If you have properly set up your Google Cloud account, you are now ready for the exercise. In this lesson, you will learn what questions you should ask while gathering the training data and how to prepare it to be used by AutoML Vision.

With training data, what we mean is examples of what we want our ML model to be able to recognise and categorise. In our case, this means providing a set of satellite images and telling the algorithm which ones are examples of amber mining and which are not.

image12_3_zA6aI42.png

Start with your use case

image40_2.png

While putting together the dataset, always start from the problem you are asking ML to help you solve. Consider the following questions:

  1. What is the outcome you’re trying to achieve?
  2. What kinds of categories would you need to recognise to achieve this outcome?
  3. Is it possible for humans to recognise those categories? Although AutoML Vision can handle many more images and categories than humans can, if a human cannot recognise a specific category, then AutoML Vision will have a hard time as well.
  4. What kinds of examples would best reflect the type and range of data your system will classify?


Think about a story you are working on. How do the answers to those questions change your approach to the story and whether you need Machine Learning for it?

image40_2.png

Assess your use case

image5_3.png

In our case, these could be our answers:


  1. We want our model to be able to recognise instances of amber mining in satellite images we will present to it.
  2. We only need two categories: "YES: this image includes elements consistent with patterns that usually show amber mining activity" and "NO: this image doesn't include elements that suggest amber mining".
  3. Mostly yes: instances of amber mining are quite recognisable in satellite images because of the distinctive pockmark-like pattern of holes in the ground. But we'll see in the testing phase that it might not always be as easy as we think.
  4. Different background, different density of the holes, different colours. The more diverse the examples in our dataset, the better the algorithm will learn.
image5_3.png

Source your data

image17_3.png

Once you’ve established what data you need, the next step is to find a way to source it. In our case, we already have the dataset provided by Texty. But think of what might be your own use case: How and where can you find the images you need?

You might be able to source them from what your organisation collects or from third-parties. In both cases, make sure to review regulations about data protection in your region and the locations your application will serve.


No training data will ever be perfectly "unbiased", but you can improve your chances of building a "fair" ML model if you carefully consider potential sources of bias in your data and take steps to address them. Review our Introduction to Machine Learning to find out more about it.

image17_3.png

Prepare your data

image50_2.png

There are a few more things to keep in mind as you put together the training data:


Include enough labelled examples in each category: The minimum required by AutoML Vision is 100 examples per label. In general, the more labelled images you can bring to the training process, the better your model will be.


It’s important to include roughly similar amounts of training examples for each category. If you have an abundance of data for one label, use only part of it to avoid having a widely different amount of examples per category.


Find images that are visually similar to what you’re planning to ask the model to categorise. Ideally, your training examples are real-world data drawn from the same dataset you're planning to use the model to classify.

image50_2.png
Congratulations! You've just finished Data preparation in progress
Recommended for you
How would you rate this lesson?
Your feedback will help us to continuously improve our lessons!
Save results and track your progress
By leaving this page you will lose all progress on your current lesson. Are you sure you want to continue and lose your progress?