Annotation & machine learning for data-journalists (Round 3)

France · Project type: prototype · Université Pierre et Marie Curie


Journalists often have to deal with large amounts of textual content. Extracting structured knowledge from these documents is difficult, so the text remains underused and important information likely goes undiscovered.

Automatic machine learning approaches can help, but they require a precise understanding of the information need up front, followed by manual annotation of the corpus.

We will build a tool for manually annotating textual documents with little or no prior idea of what one wants to extract. The tool will then semi-automatically help the annotator focus on the most important features and produce good-quality annotations.

The solution

Machine learning approaches are difficult for investigative journalists to apply. They require: 1/ knowing in advance the types of knowledge to be extracted, which reduces the chances of discovering information by serendipity; 2/ manually annotating a large volume of data (by several well-trained human annotators); 3/ repeating this heavy process for each new project. Our tool will reproduce the task of a person discovering a document and annotating it with a pencil and a highlighter pen. However, the machine will assist this process from the very first steps, and automatic annotations will be suggested as soon as possible.
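To make the assisted-annotation loop concrete, here is a minimal sketch of how suggestions could work: train a classifier on the few labels the journalist has already provided, auto-suggest labels for unlabeled passages the model is confident about, and route the most uncertain passage back to the human. Everything here is illustrative (the function name, the toy labels, the confidence threshold) and is not the project's actual design; it uses scikit-learn's TF-IDF vectorizer and logistic regression as stand-ins for whatever model the tool would use.

```python
# Hypothetical sketch of semi-automatic annotation suggestion.
# All names and thresholds are illustrative, not the project's real API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def suggest_annotations(labeled_texts, labels, unlabeled_texts, confidence=0.8):
    """Suggest labels for confident cases; flag the most uncertain for review."""
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary on all texts so unlabeled documents share the feature space.
    X = vectorizer.fit_transform(labeled_texts + unlabeled_texts)
    X_labeled, X_unlabeled = X[:len(labeled_texts)], X[len(labeled_texts):]

    model = LogisticRegression().fit(X_labeled, labels)
    probs = model.predict_proba(X_unlabeled)
    top_prob = probs.max(axis=1)

    # Auto-suggest a label wherever the model is confident enough.
    suggestions = {
        i: model.classes_[probs[i].argmax()]
        for i in range(len(unlabeled_texts))
        if top_prob[i] >= confidence
    }
    # The least confident document is the best one to ask the annotator about.
    ask_human = int(top_prob.argmin())
    return suggestions, ask_human


if __name__ == "__main__":
    labeled = ["tax fraud scandal", "budget fraud report",
               "football match result", "basketball game score"]
    labels = ["finance", "finance", "sport", "sport"]
    unlabeled = ["new fraud investigation", "tennis match today"]

    # confidence=0.0 forces a suggestion for every document, for demonstration.
    suggestions, ask_human = suggest_annotations(labeled, labels,
                                                 unlabeled, confidence=0.0)
    print(suggestions, ask_human)
```

In a real annotation session this loop would repeat: each answer from the journalist grows the labeled set, the model is retrained, and the suggestions improve, which is the standard active-learning pattern the paragraph above describes.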