Data preparation

Assess your use case, source and prepare your data

What is training data?


If you have properly set up your Google Cloud account, you are now ready for the exercise. In this lesson, you will learn what questions you should ask while gathering the training data and how to prepare it to be used by AutoML Vision.

By training data, we mean examples of what we want our ML model to recognise and categorise. In our case, this means providing a set of satellite images and telling the algorithm which ones are examples of amber mining and which are not.


Start with your use case


While putting together the dataset, always start from the problem you are asking ML to help you solve. Consider the following questions:

  1. What is the outcome you’re trying to achieve?
  2. What kinds of categories would you need to recognise to achieve this outcome?
  3. Is it possible for humans to recognise those categories? Although AutoML Vision can handle many more images and categories than humans can, if a human cannot recognise a specific category, then AutoML Vision will have a hard time as well.
  4. What kinds of examples would best reflect the type and range of data your system will classify?

Think about a story you are working on. How do the answers to those questions change your approach to the story and whether you need Machine Learning for it?


Assess your use case


In our case, these could be our answers:

  1. We want our model to be able to recognise instances of amber mining in satellite images we will present to it.
  2. We only need two categories: "YES: this image includes elements consistent with patterns that usually show amber mining activity" and "NO: this image doesn't include elements that suggest amber mining".
  3. Mostly yes: instances of amber mining are quite recognisable in satellite images because of the distinctive pockmark-like pattern of holes in the ground. But we'll see in the testing phase that it might not always be as easy as we think.
  4. Examples with different backgrounds, different densities of holes and different colours. The more diverse the examples in our dataset, the better the algorithm will learn.

Source your data


Once you’ve established what data you need, the next step is to find a way to source it. In our case, we already have the dataset provided by Texty. But think about your own use case: how and where can you find the images you need?

You might be able to source them from what your organisation collects or from third parties. In both cases, make sure to review the data protection regulations in your region and in the locations your application will serve.

No training data will ever be perfectly "unbiased", but you can improve your chances of building a "fair" ML model if you carefully consider potential sources of bias in your data and take steps to address them. Review our Introduction to Machine Learning to find out more about it.


Prepare your data


There are a few more things to keep in mind as you put together the training data:

Include enough labelled examples in each category: The minimum required by AutoML Vision is 100 examples per label. In general, the more labelled images you can bring to the training process, the better your model will be.
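As a rough illustration of the minimum-examples check, the sketch below counts labelled images per category and reports any label that falls short of the 100-example minimum mentioned above. The function name `labels_below_minimum`, the `(path, label)` tuple layout and the sample file names are assumptions made for this example; they are not part of AutoML Vision.

```python
from collections import Counter

# Minimum number of examples per label, as stated in this lesson.
MIN_EXAMPLES_PER_LABEL = 100


def labels_below_minimum(labelled_images, minimum=MIN_EXAMPLES_PER_LABEL):
    """Return {label: count} for every label with fewer than `minimum` examples.

    `labelled_images` is an iterable of (image_path, label) pairs
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter(label for _path, label in labelled_images)
    return {label: n for label, n in counts.items() if n < minimum}


# Hypothetical dataset: 120 "yes" images and 40 "no" images.
dataset = [(f"yes_{i}.jpg", "yes") for i in range(120)]
dataset += [(f"no_{i}.jpg", "no") for i in range(40)]

short = labels_below_minimum(dataset)
print(short)  # labels that still need more examples
```

An empty result means every label meets the minimum; anything else tells you which categories need more images before training.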

It’s important to include roughly similar numbers of training examples for each category. If you have an abundance of data for one label, use only part of it so that the number of examples per category does not differ widely.
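One simple way to even out the categories is to randomly keep only as many examples per label as the smallest category has. The sketch below does exactly that; it is stricter than the "roughly similar" guidance above and is only one possible approach. The function name `downsample_to_smallest`, the `(path, label)` layout and the sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_to_smallest(labelled_images, seed=0):
    """Randomly keep the same number of examples for every label.

    The target size is the size of the smallest category. `labelled_images`
    is an iterable of (image_path, label) pairs (an assumed layout).
    """
    by_label = defaultdict(list)
    for path, label in labelled_images:
        by_label[label].append(path)
    if not by_label:
        return []

    target = min(len(paths) for paths in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable

    balanced = []
    for label, paths in by_label.items():
        for path in rng.sample(paths, target):
            balanced.append((path, label))
    return balanced


# Hypothetical dataset: 500 "yes" images but only 150 "no" images.
dataset = [(f"yes_{i}.jpg", "yes") for i in range(500)]
dataset += [(f"no_{i}.jpg", "no") for i in range(150)]

balanced = downsample_to_smallest(dataset)
print(Counter(label for _path, label in balanced))
```

Sampling at random, rather than taking the first files in a folder, helps keep the retained images as varied as the original set.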

Find images that are visually similar to what you’re planning to ask the model to categorise. Ideally, your training examples are real-world data drawn from the same dataset you plan to classify with the model.
