COVID19 Pandemic Trend Prediction

Hello dear reader,

Today we are all worried about the COVID19 pandemic, and not without reason. As a Big Data company, we at GeoDB have decided to take on the challenge and contribute to epidemiology research by building a predictive model of the pandemic's evolution, which you can check for every country on:

This post first appeared on Medium; this is an adapted version with slightly different content.

Let’s take a look at the methodology behind our analysis.


Epidemiology, like any field in science, uses maths to understand reality. The idea is to employ a collection of mathematical models that represent the spread of infectious diseases, which are, curiously enough, called Infectious Disease Models.

One of the most famous is the SIR model, a mathematical model that compares, over a certain period in a closed population:

  • The number of people who are susceptible to a contagious disease,
  • The number of people who have been infected by that disease,
  • The number of people who have recovered from the disease, together with those who show no symptoms thanks to a very good immune system.

This model will allow us to predict how the number of infected people evolves and to better understand the message the media are sending us about flattening the curve (as the following gif from TheSpinoffTV shows).

The first thing to say is that this model is old, 93 years old to be precise. It was developed in 1927 by W. O. Kermack and A. G. McKendrick, a biochemist and a physician, who used differential equations to model how a disease spreads in a closed population, with the aim of explaining the rise and fall in the total number of people infected over the course of an epidemic.

The SIR model divides the total population into 3 separate classes: susceptible, infected and recovered. It is important to highlight that the classes are separate; in other words, no one can be in more than one class at the same time.

We also have to take into account two more restrictions. The first is that each person has the same susceptibility to contracting the disease, and the second is that a person who has recovered cannot contract the disease again.

It is hard to accept that everyone has the same risk of contracting the disease. We already know that the elderly and people with cardiovascular or respiratory problems are more susceptible to contracting the virus and presenting severe symptoms, but this model does not take those differences into account. The model also assumes that if you have had the disease and have recovered, you will not be infected again.

Let’s do the maths

covid 1

To apply the mathematical model we have to make even more assumptions:

  1. The first assumption is that population growth or decrease occurs in equal proportion (g) for the Susceptible, Infected, and Recovered classes.
  2. We must also assume that a certain percentage (β) of the Susceptible population will be infected each day.
  3. A certain percentage (d) of the population will die each day.
  4. A fixed ratio (α) of the Infected population will recover from the disease each day.
  5. It follows that -1 ≤ g ≤ 1 and that β, d, and α each range from 0 to 1.

Using differential equations we can define the problem as follows:

covid 2
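To make the assumptions listed above concrete, here is a minimal Python sketch that steps the three classes forward one day at a time. It is only one possible reading of those assumptions, not necessarily the exact system shown in the figure: in particular, it assumes the death rate d acts on the Infected class only, and the names `Params`, `step` and `simulate` are hypothetical, chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    g: float      # daily population growth rate, -1 <= g <= 1
    beta: float   # daily fraction of Susceptible who become Infected
    d: float      # daily death fraction (ASSUMPTION: applied to Infected only)
    alpha: float  # daily fraction of Infected who recover

    def __post_init__(self):
        # Enforce the ranges stated in assumption 5.
        if not -1 <= self.g <= 1:
            raise ValueError("g must lie in [-1, 1]")
        for name in ("beta", "d", "alpha"):
            if not 0 <= getattr(self, name) <= 1:
                raise ValueError(f"{name} must lie in [0, 1]")


def step(s, i, r, p):
    """Advance the three classes by one day (explicit Euler step).

    Assumed form of the system:
        S' = g*S - beta*S
        I' = g*I + beta*S - d*I - alpha*I
        R' = g*R + alpha*I
    """
    ds = p.g * s - p.beta * s
    di = p.g * i + p.beta * s - p.d * i - p.alpha * i
    dr = p.g * r + p.alpha * i
    return s + ds, i + di, r + dr


def simulate(s0, i0, r0, p, days):
    """Return the list of (S, I, R) states for day 0 through day `days`."""
    states = [(float(s0), float(i0), float(r0))]
    for _ in range(days):
        states.append(step(*states[-1], p))
    return states


if __name__ == "__main__":
    # Illustrative parameter values only, not fitted to any real data.
    p = Params(g=0.0, beta=0.1, d=0.0, alpha=0.05)
    for day, (s, i, r) in enumerate(simulate(1000, 0, 0, p, 5)):
        print(day, round(s, 1), round(i, 1), round(r, 1))
```

Note that the classic Kermack and McKendrick formulation makes new infections proportional to the product of S and I; this sketch instead follows the simpler fixed-percentage assumption from the list above.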

This system of equations is dynamic. There are people who go from being Susceptible to Infected, and from Infected to dead or to Recovered. But when does it stop? When does it reach static equilibrium? In this case, equilibrium is the point at which people are no longer diagnosed with the disease and there are no patients recovering from the disease. This happens when the first derivatives are equal to 0.

covid 3

In this model the derivatives are equal to zero when S = I = R = 0. But this is impossible: there will always be people susceptible to infection. However, this is not the only point at which the system is in equilibrium. Remember, the original assumption is that population growth or decrease occurs in equal proportion (g) for the Susceptible, Infected, and Recovered classes. So now look at equation (4). Eureka! There is another static equilibrium at which S’ = I’ = R’ = 0.

covid 4

From the original assumption, β cannot be less than 0, because people move from susceptible to infected in one direction only. So the model balances out when the rate of population growth (g) equals the rate of infection (β).
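As a hedged illustration of this balance, suppose the equation for the Susceptible class takes the form below. This is one reading of assumptions 1 and 2 (growth at rate g, infection at rate β), and not necessarily the exact equation in the original figure:

```latex
\[
S' = gS - \beta S = (g - \beta)\,S
\]
\[
S' = 0 \iff S = 0 \quad \text{or} \quad g = \beta
\]
```

Under that assumption, S stops changing either when nobody is susceptible (S = 0) or when growth exactly offsets new infections (g = β), which is the second equilibrium described above.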

Now let’s apply the maths

Instead of working through the maths to solve the optimisation problem for a particular country on a particular day, we are developing an R package (covidTracker) that anyone can use and apply to a specific country. The package reads the data that Johns Hopkins University updates daily. With the package and all its dependencies installed, we are also developing a dashboard to visualise the daily evolution of COVID and see how the curve of infected people evolves.
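To give an idea of what reading this kind of data involves, here is a small Python sketch. It is not the covidTracker API (the package itself is written in R); the function name `country_totals` is hypothetical, and the sketch assumes the wide CSV layout of the Johns Hopkins time-series files: four metadata columns (`Province/State`, `Country/Region`, `Lat`, `Long`) followed by one column per date. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Metadata columns in the assumed layout; every other column is a date.
META_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def country_totals(csv_text):
    """Sum the per-province rows into one cumulative series per country.

    Returns (dates, totals), where totals maps a country name to a list of
    counts aligned with the list of date labels.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    dates = [c for c in reader.fieldnames if c not in META_COLUMNS]
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        series = totals[row["Country/Region"]]
        for k, date in enumerate(dates):
            # Treat empty cells as 0 and tolerate values written as "12.0".
            series[k] += int(float(row[date] or 0))
    return dates, dict(totals)


# Made-up sample in the assumed layout, not real figures.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Spain,40.0,-4.0,0,1
Hubei,China,30.9,112.2,444,444
Beijing,China,40.1,116.4,14,22
"""

if __name__ == "__main__":
    dates, totals = country_totals(SAMPLE)
    for country, series in totals.items():
        print(country, dict(zip(dates, series)))
```

Aggregating by country first keeps the later fitting step simple, since countries reported by province would otherwise appear as several separate rows.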

For Spain, as of today, March 19, the model predicts that the peak of infection will be reached on April 10, and that on April 15 the number of recoveries will exceed the number of infected, changing the outlook of the disease.

covid 5
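The two dates in a forecast like this correspond to two simple readings of the predicted curves: the day on which the infected count is highest, and the first day on which recoveries exceed infections. A minimal Python sketch of both follows; the function names are hypothetical and the numbers are invented for illustration, not the Spanish forecast.

```python
def peak_day(infected):
    """Index of the day with the highest infected count (first one on ties)."""
    return max(range(len(infected)), key=infected.__getitem__)


def first_crossover(infected, recovered):
    """First day on which recovered exceeds infected, or None if it never does."""
    for day, (i, r) in enumerate(zip(infected, recovered)):
        if r > i:
            return day
    return None


if __name__ == "__main__":
    # Invented daily values, for illustration only.
    infected = [1, 5, 9, 7, 4]
    recovered = [0, 1, 3, 6, 8]
    print("peak on day", peak_day(infected))
    print("recoveries overtake infections on day", first_crossover(infected, recovered))
```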


If you’ve come this far and understood the mathematics above, you’re probably thinking, “With all these restrictions, it is safe to assume that this model is kind of useless,” and that is just what we think too. There are much better models, but they need more variables or a situation in which the disease is more widespread. There are also models that apply graph theory and spatial analysis; they come to a comparable conclusion and also emphasise the need for containment to reduce the spread and severity of the disease.

But this model, although simple, gives early results with little data (only the number of infected), which is great in an emergency like the one the world is suffering now!

And do not forget that “All models are wrong, but some are useful.” (George E. P. Box).

Stay safe out there!

Until next time 🙂