In the wake of the COVID-19 outbreak, many organizations, and the data community in general have spent a lot of effort to make relevant data available to a wider audience. From the beginning, I have mainly relied on the John Hopkins dashboard and the Worldometer website to stay up-to-date with the latest numbers. Additional detailed information on country level is available from different sources, though it can be a bit difficult to collect and clean up all that data. In that sense, I particularly like the data sets curated by Starschema. Be sure to check out their Github page for more information if you are interested in working with COVID-19 data.
This is a great development in my opinion, and I hope it will only get easier to access publicly available data in the future. At the same time however, we need to acknowledge that understanding data is not easy, and it is important to educate people on how to use data correctly, especially when data becomes a public good. A lot has been written on data literacy and visualization best practices, but one subject that doesn’t get enough attention yet is the process of data validation, data quality, and understanding what the data means in the first place.
Describing the Data Sources
To illustrate what I mean, let’s take a look at the number of total confirmed cases in Belgium. We have two different data sources that report on this figure. Both of these sources are actually added to the Starschema data sets, but for this post we’ll use the files from the original sources:
Sciensano – The Belgian institute of health. We’ll use the data set called Dataset of confirmed cases by date, age, sex and province under the heading ‘Data’.
These files are included in the Tableau workbook. You can also download the files from the links provided above, but since the numbers are updated on a daily basis, it’s more convenient to use the static files (downloaded on April 17th) in the workbook if you want to follow along with the examples below.
Ultimately the numbers between Sciensano and John Hopkins should match. John Hopkins uses Worldometer as a source, and Worldometer themselves uses daily reports from Sciensano as their source.
Comparing the Numbers
With the data in hand, we can make a simple comparison on the daily numbers (in Tableau). There are a few things to be aware of in terms of the data structure:
- In the John Hopkins data set, the days are captured in columns instead of rows, so we’ll need to pivot the data. Rename Pivot Field Names to
[DATE]so we can blend with the Sciensano data source on date.
- The John Hopkins data set contains more countries than just Belgium, so we filter on Belgium.
- The number of confirmed cases in the John Hopkins data set are cumulative, whereas the Sciensano data set contains daily numbers. Therefore, we’ll use a
WINDOW_SUMfunction to also report on cumulative numbers for Sciensano.
We then blend the data sources and create a line chart that compares the number of confirmed cases from both sources on a daily basis:
What Sciensano reports on a daily basis is significantly higher that what John Hopkins reports, even though both sources are reporting on the total number of confirmed cases in Belgium. But which one is correct? You can imagine that people can lose faith in the numbers or analyses if we do not have a satisfactory answer for this.
Checking the John Hopkins Numbers
Before we jump into what the difference is between the two different sources, let’s first confirm that the John Hopkins numbers that we report are the same as what John Hopkins themselves report. To do that, we’ll turn to their online dashboard. When we filter on Belgium using the left-hand panel, we can check what the number of confirmed cases are on April 16th:
This number is the same what we presented earlier, albeit rounded. We can check the previous days as well, and we can see that the numbers also match our dashboard. Earlier I mentioned using the Worldometer website regularly, so let’s also check the number of confirmed cases in Belgium there:
This number matches exactly with what we have. Now we can be pretty confident we didn’t make any mistakes when retrieving the data from John Hopkins’ Github page.
Matching Up with the Sciensano Data Source
None of this still explains why the numbers are different from Sciensano. One theory is that the Worldometer and John Hopkins data refer to the number of cases that are reported during a particular day. However, this date of report is not the same as the date when the cases are first registered. It usually takes a day or more before these cases are reported to Sciensano (and therefore also to John Hopkins and Worldometer).
Fortunately, Sciensano maintains a Historical datasets page, where we can download the data sets from March 31st onwards. If our theory is correct, the total number of cases reported in each file should match the numbers from John Hopkins. So let’s go ahead and download all the files, and union them in Tableau so that we have one data source with all the historical files. The file names also contain a date, so we’ll use the
SPLIT function to extract the date. We then make a new comparison between the Sciensano and John Hopkins numbers based on the file date.
Eureka! We have an exact match between the Sciensano and John Hopkins numbers. So what Sciensano reports is based on when the cases are registered or tested, but it could take days before the data is collected by Sciensano. Sciensano then creates a report every day with the newly collected cases, and this report is the source for Worldometer, which in turn is the source for John Hopkins. So in the end, they are based on the same numbers, just from a different perspective.
Let’s spend a little bit more thinking about the data sets.
One thing to be aware of when using the Sciensano data set is that data from the most recent days are not complete yet. To visualise this, let’s take a look at the data from 3 separate historical files.
The darkest grey shows what the daily numbers are on April 14th, whereas the lighter grey shows what is being added in the subsequent two days. We clearly see that it can take quite a while before daily numbers are completed, so obviously we have to be careful when we perform trend analyses on the Sciensano data set.
There is also a large disparity between the number of confirmed cases in the weekdays and the weekends, which we can easily see if we colour the graph in a different way. Something else to consider before drawing any conclusions.
The John Hopkins data has a less obvious weekday – weekend disparity. Note that the effects of the weekend are shifted largely to Mondays and Tuesdays in the John Hopkins data set. To see how it works, let’s take Tuesday April 14th as an example. This file is actually already available in the morning, so it’s probably more accurate to say that the data is collected on Monday April 13th. The data collection on Monday then contains many cases from Saturday and Sunday, but also from other days. so the weekend effect is somewhat smoothed out.
By now, you probably noticed there is a peak on April 15th in the John Hopkins data set. If we read the daily Sciensano report of April 15th — which is in Dutch — it mentions that they started doing additional tests in nursing homes from April 10th onwards, and that April 15th was the 1st day that the results were added to the report. On top of that, the number of cases on April 14th is very low, even for a Sunday. I suspect there were some errors made during data collection, and that these were then corrected on April 15th, though this is just a theory and I couldn’t find mention of this anywhere.
In Conclusion… For Now
There is potentially more to be said and analysed about these – seemingly simple at first – data sets, but I’ll avoid going on for much longer in this blog post. The general idea should be clear anyway: using and interpreting data in the correct manner is often not a straightforward process. Especially if you have multiple data sources presumably reporting on the same thing, it may take some effort to validate the numbers between them. This process is usually messier than how I’ve presented it here, but ultimately I find it to be rewarding work.
Looking at COVID-19 data as a whole, this is just a small portion of how complex it can get. We’ve only looked at the number of confirmed cases in Belgium, but there are other metrics and detailed information in the data sets. Each country (and even state or province) will also have their own methods for collecting data, and for aggregating this data into a daily report before they are being picked up by Worldometer and John Hopkins, and each method could mean that we need to look at the data in a different way. Of course, you don’t have to understand every single nuance before making an analysis – nor would I even recommend it. But it helps to be aware that these differences exist, and — depending on the analysis you want to perform — how you can delve deeper to get a better understanding.