For each country, we've selected up to 20 top news sources, mostly general (i.e. not sport-specific, finance-specific, etc.) and mostly in a local language. The idea is to monitor the popular news websites that people in that country actually read.
Throughout the day, we gather articles published on these websites using our AutoExtract API. The API gives us a clean article body and headline, without website elements, suggested articles, etc. It also lets us detect and filter out occasional non-article pages.
To decide whether an article is about COVID-19, we've built a list of COVID-19 synonyms and spellings in ~70 languages (~250 variants in total). This list of keywords is conservative: it contains only names for the coronavirus itself and excludes related keywords such as "pandemic", "virus" and "mask".
Then, if any of the keywords appears in the article body, article headline or article URL, we count the article as a COVID-19 article.
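The matching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the function name `is_covid_article`, the five sample keywords, case-insensitive substring matching and the percent-decoding of URLs are all assumptions made for the example.

```python
from urllib.parse import unquote

# Hypothetical sample only; the real list has ~250 variants in ~70 languages.
KEYWORDS = ["coronavirus", "covid-19", "covid19", "коронавирус", "新冠"]


def is_covid_article(headline, body, url, keywords=KEYWORDS):
    """Return True if any keyword occurs in the headline, body or URL."""
    # Assumption: decode percent-escapes so non-ASCII keywords can match URLs.
    fields = (headline, body, unquote(url or ""))
    # Assumption: case-insensitive substring match; casefold() handles
    # non-ASCII scripts better than lower().
    haystack = "\n".join(field.casefold() for field in fields if field)
    return any(keyword.casefold() in haystack for keyword in keywords)
```

Substring matching is used here instead of word-boundary matching because some languages do not separate words with spaces. With the conservative list, an article that mentions only "pandemic" or "mask" is not counted.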