The broad goal of this competition is to use a wide range of data to predict the major risk factors of COVID-19's spread and impact on the U.S. in the near future. In responding to this pandemic, predicting which areas will be hardest hit in the near future is critical for properly distributing medical resources and enacting public health policy. In service of this goal, students will be asked to create a model which predicts, at any given point in time, the effect COVID-19 will have in each county in the United States 2 weeks in the future.
Modeling the spread of disease is of central importance in protecting lives around the world. This competition was designed in almost daily consultation with epidemiologists, virologists, and modelers from Caltech, UC Berkeley, Stanford, UCSF, and elsewhere, identifying a central issue in COVID-19 research and organizing a team of experts to help judge students' results. The field of epidemiology seeks to quantify predictions, enabling better policies to be formed in response to tragic outbreaks. The COVID-19 project held in CS 156b this term directly supports this response, introducing novel data science methods to address a critical problem. Models such as these have informed decisions about Ebola, Zika, and other major crises. Indeed, early models introduced by Imperial College London shaped U.K. policy. The vital importance of these models has prompted a recent call to action from the Royal Society to produce better prediction models using techniques including data science; this announcement came only a few months after a similar $3 million CDC mission to forecast the flu with data science. CS 156b seeks to improve such results and make them more robust and effective in tracking the pandemic to save lives.
The TAs will provide a canonical set of data for all competitors to use. Some of these data (e.g. demographics and past flu infection rates) will be fixed for the duration of the projects. Other data (e.g. new COVID cases) will be changing in real time over the course of the competition. All publicly available data will be distributed in the form of a git repo; instructions for accessing it are provided in the Initial Setup section below. It will be up to teams to manage how often they update the data being used to train their models. Each new update will contain all past versions of the variable datasets as well, so teams do not need to keep their own archives.
In order to help with this research, Unacast has agreed to provide us with proprietary social distancing data which are not publicly available. Instructions for accessing these data will be posted on the Piazza page. These data should not be distributed outside of Caltech or used for any purpose outside of COVID-19 research.
The canonical dataset will be uploaded to the course HPC instance for teams to use. The dynamic data on the HPC will be updated automatically every day.
The canonical dataset is intended to cover a wide range of factors which may influence COVID spread, and we anticipate that teams will be able to attain good performance with those data alone. However, teams are also welcome to use any additional publicly accessible datasets they find which they think may be useful. This does come with some conditions:
Every team must create a private git repository, shared with the TAs, to hold its data and code. This repo must have the TAs’ master repo set as a remote upstream in order to pull live data as it updates during the competition. Instructions for setting this up are available at Git Repo Initial Setup.
Since a critical purpose of these projects is to contribute to the epidemiological community, we emphasize both model performance and clarity of final reports. The reports will be evaluated by an expert committee of epidemiologists, virologists, and modelers to ensure validity and usefulness to the broader community.
In crafting a public health response to COVID-19, it is important to know many different things about its spread, including the number of infected individuals, the number of those infected requiring intensive care, and the survival rate due to the disease. However, many of these metrics are highly sensitive to the availability of tests, a factor which cannot effectively be predicted. Thus, we are focusing on absolute survival rate as an evaluation metric, since it should closely reflect the true number of lives saved from the disease.
The spread of COVID-19 is inherently semi-random and dependent on many external factors. To reflect this, rather than asking teams to predict a specific survival rate, we are asking them to predict a likelihood distribution of different survival rates. A pinball loss will be used to evaluate scores, encouraging predicted distributions to be both accurate and confident. An introduction to this metric can be found at this webpage, and more formal descriptions are provided in the introduction of this paper and in this paper. If the metric proves unreliable in some way during the course of the competition, it may be replaced with a better metric.
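For teams unfamiliar with the pinball loss, the following is a minimal sketch of how it can be computed, assuming the predicted distribution is expressed as a set of quantiles. The quantile levels, the example numbers, and the function names `pinball_loss` and `mean_pinball_loss` are illustrative assumptions only; they are not the official scoring code or submission format.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one prediction at quantile level q in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by q,
    over-prediction (y_true < y_pred) is weighted by 1 - q.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


def mean_pinball_loss(y_true, quantile_preds):
    """Average pinball loss over a dict mapping quantile level -> predicted value."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return sum(losses) / len(losses)


# Hypothetical forecast for a single county: quantile level -> predicted value.
# These levels and numbers are made up for illustration.
forecast = {0.1: 2.0, 0.5: 5.0, 0.9: 10.0}
observed = 6.0

score = mean_pinball_loss(observed, forecast)
print(f"mean pinball loss: {score:.4f}")
```

A lower score is better. A forecast whose quantiles sit close to the observed value scores well, while a forecast that is needlessly wide, or confidently wrong, is penalized, which is why the metric rewards distributions that are both accurate and confident.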
As this research problem relates to an immediate public health concern, we do not wish to restrict teams from using whatever resources are available to construct the best models they can. As such, teams are allowed to collaborate as much as they want, up to and including exchange of code. Students are also allowed to examine and make use of code which has been publicly posted by research groups at other universities working on COVID modeling. The only condition is that if a team uses any code they did not write (outside of standard data science or ML packages), they must post a link to that code publicly on the Piazza forum and also mention it in their progress report. We ask under the honor code that students not blindly copy and paste code which they do not understand.