July 17, 2020

During the coronavirus pandemic, data scientists have come to the aid of healthcare with their data modeling, predictions, and analysis of the causal factors of the disease. The effort resembles the early days of the historic mission to land a man on the Moon. Since March, when Governor Andrew Cuomo based his hospital capacity planning on a data model's predictions of infections, hospitalizations, and deaths, data scientists around the world have left no stone unturned.

Today, various data models project infections, hospitalizations, and deaths from the COVID-19 pandemic across the world, broken down by location, geographical area, and other factors. Attempts are being made to develop an algorithm to predict the pandemic's next moves. Data scientists, collaborating with medical scientists and doctors, are beginning to identify causal factors of infection and death. Work is also under way to find the optimal way to restore the economy by reopening businesses, schools, universities, religious and social institutions, sports and entertainment events, and national infrastructure services.

Forecasting models have provided critical information about the course of the COVID-19 pandemic. By attempting to predict both the timing of peak deaths and the total magnitude of mortality, these models have played an influential role in shaping the responses of policymakers and health systems alike. As data and models are updated regularly, a publicly available, transparent, and reproducible framework is emerging. Six models for which publicly available, multinational, and date-versioned mortality estimates exist have earned considerable trust. They come from DELPHI-MIT (Delphi), Youyang Gu (YYG), the Los Alamos National Laboratory (LANL), Imperial College London (Imperial), and IHME, which contributes two: the Curvefit model (IHME-CF), used between March 26 and the end of April, and the hybrid epidemiological compartment model (IHME-HSEIR), which IHME has used since early May. Collectively, these models cover 164 countries, as well as the 50 US states and Washington, DC, and accounted for more than 99% of all reported COVID-19 deaths on June 23, 2020.
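To make the idea of a compartment model concrete, here is a minimal sketch of the textbook SEIR model, showing how such a model yields the two quantities mentioned above: the timing of peak deaths and total mortality. It is not the IHME-HSEIR model or any of the six models listed. Every parameter value (population, `beta`, `sigma`, `gamma`, `ifr`) is a hypothetical chosen for illustration, and counting deaths as a fixed fraction of people leaving the infectious compartment is a simplifying assumption.

```python
def simulate_seir(population, beta, sigma, gamma, ifr, days, initial_exposed=10.0):
    """Step a basic SEIR model forward one day at a time (forward Euler).

    beta  = transmission rate, sigma = 1 / incubation period,
    gamma = 1 / infectious period, ifr = assumed infection fatality ratio.
    Returns the list of expected deaths per day.
    """
    s = population - initial_exposed  # susceptible
    e = initial_exposed               # exposed, not yet infectious
    i = 0.0                           # infectious
    r = 0.0                           # removed (recovered or dead)
    daily_deaths = []
    for _ in range(days):
        new_exposed = beta * s * i / population
        new_infectious = sigma * e
        new_removed = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        # Simplification: a fixed share of people leaving "infectious" die.
        daily_deaths.append(ifr * new_removed)
    return daily_deaths


def summarize(daily_deaths):
    """Return (day index of peak deaths, total deaths over the horizon)."""
    peak_day = max(range(len(daily_deaths)), key=daily_deaths.__getitem__)
    return peak_day, sum(daily_deaths)


# Hypothetical scenarios: identical except for the transmission rate.
unmitigated = simulate_seir(1_000_000, beta=0.5, sigma=0.2, gamma=0.1, ifr=0.005, days=365)
mitigated = simulate_seir(1_000_000, beta=0.3, sigma=0.2, gamma=0.1, ifr=0.005, days=365)
print("unmitigated (peak day, total deaths):", summarize(unmitigated))
print("mitigated   (peak day, total deaths):", summarize(mitigated))
```

Lowering `beta` in this toy setup delays the peak and reduces the total, which is the kind of scenario comparison policymakers draw from such models. Real forecasting models fit these parameters to reported data and quantify uncertainty, which this sketch does not attempt.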

Mauricio Santillana and Nicole Kogan of Harvard have developed a model to predict COVID-19 outbreaks two to three weeks in advance. The system uses real-time monitoring of Twitter, Google searches, and mobility data from smartphones, among other data streams. The algorithm, the researchers write, could function “as a thermostat, in a cooling or heating system, to guide intermittent activation or relaxation of public health interventions”; that is, it could support a smoother, safer reopening. “In most infectious-disease modeling, you project different scenarios based on assumptions made up front,” said Dr. Santillana, director of the Machine Intelligence Lab at Boston Children’s Hospital and an assistant professor of pediatrics and epidemiology at Harvard. “What we are doing here is observing, without making assumptions. The difference is that our methods are responsive to immediate changes in behavior and we can incorporate those.” Teams at Carnegie Mellon University, University College London, and the University of Texas have models incorporating some real-time data analysis as well.
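The thermostat analogy can be illustrated with a small sketch. This is not the Harvard team's algorithm, whose details are not given here. It only shows the general idea of combining several data streams into one signal and switching interventions on and off with two thresholds, so that the policy does not flip back and forth on small fluctuations. The stream names, the sample values, and the `tighten_at` and `relax_at` thresholds are all hypothetical.

```python
from statistics import mean


def composite_signal(streams):
    """Scale each stream to its own maximum, then average them day by day."""
    scaled = []
    for values in streams.values():
        top = max(values) or 1.0  # avoid dividing by zero on an all-zero stream
        scaled.append([v / top for v in values])
    return [mean(day) for day in zip(*scaled)]


def thermostat(signal, tighten_at=0.6, relax_at=0.3):
    """Return, for each day, whether interventions are active.

    Interventions switch on when the signal reaches `tighten_at` and stay on
    until it falls to `relax_at` or below (hysteresis, as in a thermostat).
    """
    active = False
    states = []
    for level in signal:
        if not active and level >= tighten_at:
            active = True
        elif active and level <= relax_at:
            active = False
        states.append(active)
    return states


# Hypothetical daily counts for two early-warning streams.
streams = {
    "symptom_searches": [1, 2, 5, 9, 10, 6, 3, 1],
    "symptom_tweets": [2, 2, 6, 8, 10, 7, 2, 1],
}
signal = composite_signal(streams)
print(thermostat(signal))
```

The gap between the two thresholds is the design choice that matters: with a single cutoff, a signal hovering near it would trigger repeated openings and closures.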

While data models may not be perfect, they enable decision making in both the short and the long term. Relying on them requires attention to the soundness and robustness of the model, its statistical and epidemiological assumptions, the accuracy of the data, the inclusion of relevant data streams, and ongoing validation and adaptation.


SOURCE: Youyang Gu, Independent Data Scientist