Analyzing COVID Data: Separating signal from noise and clutter.


In this report I describe how I interpret the data: the approaches I use and the pitfalls in each.

First some definitions:

Signal:  what I am looking for:  the growth of the disease in a population

Noise:  Random variations in the data unrelated to the signal

Clutter:  Other events that mimic the signal I am looking for.  One man's signal is another man's clutter.  The only way to remove clutter is to look at other data sources, such as press reports.


The data I use come from the Department of Health, which collects reports from the various labs and collates them into a common format.  The data are reported as cumulative numbers of cases, hospitalizations, and tests, by day and by geographical area.  The geographical areas include health district, locality, and Zip code, but not all data are reported at the same granularity.  Some data are available only by health district (e.g., cases by age or race), some by locality, while the numbers of tests and cases are tracked down to the Zip code.

The way I handle the data is to parse it into the appropriate geographic element, which gives me cumulative cases, cumulative tests, etc., by day for that location.  The next step is to take the time derivative of the data to find the rate of cases per day.  For those who do not remember calculus, this sounds hard, but really it is just subtracting the previous day's cumulative number of cases from today's.  An example is shown below:

Figure 1:  Comparison of the cumulative number of cases and the daily cases.  The daily cases are determined by subtracting the previous day's cumulative cases from today's.
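The differencing step can be sketched in a few lines of Python.  The cumulative counts below are invented for illustration, not real Department of Health numbers:

```python
# Daily new cases are the day-over-day difference of the cumulative counts.
# The cumulative totals here are made-up illustration values.
cumulative = [10, 14, 21, 33, 50, 72]

# Each daily value: today's cumulative total minus yesterday's.
daily = [cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))]

print(daily)  # [4, 7, 12, 17, 22]
```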

That is the data.  It can be fairly noisy.  For example, on May 7 there were very few new cases.  On that day, I heard much rejoicing about how low the number was, but it was just noise, related to a lab reporting issue.  Similar noise was introduced around the time of Hurricane Isaias.

The data are not perfect, but they are the data we have.

Noise & Handling noise


Noise is the "random" variation in the data.  In some systems, as the signal intensifies, the noise stays constant.  For example, turning up the radio will drown out the engine noise in the car.  However, if you are listening to a staticky station (yes, I am showing my age), turning up the volume will also amplify the static.

The noise present in the COVID-19 data is more like static: it increases with the number of cases reported.
Here is an example:
Figure 2:  Comparing the daily data to a seven-day average shows that the scatter is proportional to the average.

If we look at the distribution of the scatter, it is close to a 2 decibel (dB) scatter, normally distributed in dB.  Note that the decibel value here is defined as 20 log10(daily/average).
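This scatter measure can be sketched directly from the definition above.  The daily counts here are invented values, just to show the computation:

```python
import math

# Sketch of the seven-day trailing average and the scatter in decibels,
# where dB = 20 * log10(daily / average).  Daily counts are invented.
daily = [40, 55, 38, 62, 47, 51, 58, 70, 44, 66, 52, 61, 49, 75]

window = 7
scatter_db = []
for i in range(window - 1, len(daily)):
    avg = sum(daily[i - window + 1 : i + 1]) / window
    scatter_db.append(20 * math.log10(daily[i] / avg))

print([f"{db:+.2f}" for db in scatter_db])
```

A histogram of `scatter_db` over a long stretch of real data is what suggests the roughly normal, ~2 dB spread described above.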

Addressing Noise

The way to address the noise in this data is to average it away.  Because humans live life by the week, it is natural to use a seven-day average.  An alternative approach is to define a model the data should follow.  In the case of the growth of any disease, each day the number of newly infected people should be proportional to the number of people already infected.  In other words, if one person infects 3 other people, then two people will infect 6 people, and so on.  The implication is: if on day 1 one person infects one other person, then on day 2 there will be two people infected; the next day, those two people will infect two other people, and so on.  The number of new cases would go as 1, 2, 4, 8, 16, ... until infectious people start interacting with people who have already had the disease.  This is how we get herd immunity.
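The doubling chain described above is simple enough to sketch:

```python
# Each infected person infects one new person per day, so the number
# of newly infected people doubles each day (until immunity kicks in).
cases = [1]
for day in range(4):
    cases.append(cases[-1] * 2)

print(cases)  # [1, 2, 4, 8, 16]
```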

Mathematically, by assuming exponential growth, instead of averaging seven days we solve for the exponential parameters.  Fortunately, the math helps us.

The model is:

  C = A e^(Rn)

where C is the number of new cases on day n, A is a scaling constant (the number of people initially infected), and R is the growth rate.

If we take the natural logarithm of both sides, we end up with:

ln(C) = ln(A e^(Rn))

which can be rewritten as:

ln(C) = ln(A) + Rn

This is the equation of a line, y = mx + b, with y = ln(C), m = R, x = n, and b = ln(A).
(There is a use for algebra!!!!!!)

Using linear regression on the log of the new cases, I can solve for the slope and y-intercept.
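A minimal sketch of that fit, using ordinary least squares on the natural log of the cases.  The case counts here are synthetic, generated from C = 2 e^(0.2 n) so the true parameters are known:

```python
import math

# Synthetic daily case counts following exact exponential growth,
# C = A * e^(R * n) with A = 2 and R = 0.2 (illustration only).
days = list(range(10))
cases = [2 * math.exp(0.2 * n) for n in days]

# Take the natural log of the cases, then fit a line by least squares:
# slope = R, intercept = ln(A).
y = [math.log(c) for c in cases]
mean_x = sum(days) / len(days)
mean_y = sum(y) / len(y)
R = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(days, y)) / sum(
    (x - mean_x) ** 2 for x in days
)
lnA = mean_y - R * mean_x

print(f"R = {R:.3f}, A = {math.exp(lnA):.3f}")  # R = 0.200, A = 2.000
```

With real, noisy counts the fit would recover R and A only approximately, which is exactly why the residuals (the "error bars" below) matter.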

Here is how it looks in data:

This approach allows us to quickly identify when trends change, that is, when we are doing better or worse overall.  If we project forward with the model plus error bars, changes in the trends become obvious.

For example, by early April we could clearly see that the cases were falling below the trend, meaning we were flattening the curve.

In March the Virginia caseload was growing at 24% per day.  While in real time it was not clear that the curve was flattening until around April 5-7, in retrospect it started around March 29, which was 7 days after the shutdown began (3/15).  That first event dropped us from 21% to 9% per day; the second event, the stay-at-home order, dropped us to about 3-4% per day.  This is apparent in the "line plots" in the region section of each day's reports.
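The percent-per-day figures relate directly to the fitted exponent: each day the case count multiplies by e^R, so the daily growth is (e^R - 1) x 100%.  A quick sketch, with R = 0.215 as an illustrative value rather than an actual fit:

```python
import math

# Convert a fitted exponent R to a percent-per-day growth rate.
# Cases multiply by e^R each day, so growth = (e^R - 1) * 100.
R = 0.215  # illustrative value, not a real fitted parameter
growth_pct = (math.exp(R) - 1) * 100

print(f"{growth_pct:.0f}% per day")  # 24% per day
```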

