Bob Henson • November 30, 2012 | What if we could use the data from fevered searches for flu information on the Web, plus humidity observations, to help predict the course of an outbreak? If new research lives up to its promise, we’ll soon be able to do just that.
This intriguing advance is outlined in a paper this week in Proceedings of the National Academy of Sciences (PNAS) and highlighted in an NCAR news release. The study authors are Jeff Shaman, a biologist and atmospheric scientist by training who works at Columbia University’s Mailman School of Public Health, and NCAR’s Alicia Karspeck, an atmospheric scientist and expert on bringing data into computer models—the not-so-simple task known as data assimilation.
The core data for the project are geographic patterns that emerge as people use Google to search for influenza-related information. Perhaps not surprisingly, these searches are strongly correlated with local spikes in flu prevalence, so they serve as a real-time monitoring tool, viewable at Google Flu Trends (part of Google.org, the company’s philanthropic arm).
The trick for Shaman and Karspeck was to get these data into a predictive model in a useful way. To do this, they called on one of the most popular data assimilation techniques used in weather forecasting (more on this below). And their predictive model includes information on the kind of weather that helps flu to spread—which might not be the weather you imagine.
For more than half a century, laboratory work and incidence patterns have suggested that influenza viruses tend to spread more readily when the air is dry. However, pinning down the effect has proven difficult. Some researchers suspected that the reason for influenza’s clear wintertime peaks might be school schedules, cold-weather gatherings indoors, or other nonmeteorological factors.
When it comes to weather, “people had been looking exclusively at relative humidity and temperature,” says Shaman. “They’d only seen marginal relationships between both of these markers and influenza.”
It occurred to Shaman that a more powerful weather variable might be absolute humidity (AH), the literal amount of moisture in the air. Absolute humidity doesn’t change when an air parcel is heated or cooled, while relative humidity does.
In a 2009 study published in PNAS, Shaman and colleagues showed that AH bore a much stronger and more consistent relationship than did relative humidity to virus survival and flu transmission rates. Since AH doesn’t change when you go indoors (unless you have a humidifier or air conditioner), whatever influence it has on the flu virus would be consistent in both environments. Interestingly, the virus itself is borne on tiny droplets, so exactly why it would spread most readily in a dry atmosphere remains a mystery. “I’ve seen about five competing hypotheses,” says Shaman.
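To make the distinction concrete, here is a short sketch (not part of the study) that converts temperature and relative humidity into absolute humidity. It uses the standard Magnus approximation for saturation vapor pressure and the ideal gas law; the constants are textbook values, and the function name is my own.

```python
import math

def absolute_humidity(temp_c, rel_humidity_pct):
    """Water vapor density in g/m^3, computed from temperature (deg C)
    and relative humidity (%) via the Magnus approximation."""
    # saturation vapor pressure in hPa (Magnus formula)
    e_sat = 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))
    # actual vapor pressure in Pa
    e = (rel_humidity_pct / 100.0) * e_sat * 100.0
    r_v = 461.5  # specific gas constant for water vapor, J/(kg K)
    return e / (r_v * (temp_c + 273.15)) * 1000.0  # kg/m^3 -> g/m^3

# Winter air at 0 deg C and 90% RH holds about half the moisture of mild
# air at 20 deg C and 50% RH. Heat that winter air indoors and its RH
# plummets, but its absolute humidity stays the same -- which is why dry
# winter air stays dry inside.
print(absolute_humidity(0.0, 90.0))   # ~4.4 g/m^3
print(absolute_humidity(20.0, 50.0))  # ~8.6 g/m^3
```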
Whatever the explanation, Shaman found that flu epidemics were somewhat more likely to occur several weeks after a region experienced relatively low AH values. This was a useful building block for a predictive model of influenza spread, which Shaman introduced in a 2010 PLoS Biology paper.
But there’s much more than weather driving flu epidemics. While AH can shape the climatological signature of flu in temperate regions, he says, “what it can’t do on its own without data assimilation is produce individual outbreaks that match observations.” To do this, Shaman’s model would also need to incorporate current information on what the virus was actually doing, and then extend that behavior into the future.
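Models of this kind typically couple humidity to a classic SIRS (susceptible-infected-recovered-susceptible) framework by letting the reproductive number rise as absolute humidity falls. The sketch below is an illustrative toy version with invented parameters, not the published model:

```python
import math

def run_sirs(q, days=200, n=100_000, i0=10,
             r0_max=2.5, r0_min=1.2, a=-180.0,
             d_infect=4.0, l_immune=4 * 365.0):
    """Toy humidity-forced SIRS model, stepped daily with Euler updates.
    q is a constant specific humidity (kg/kg); lower q means a higher
    reproductive number R0. Returns the peak number of infected people.
    All parameter values here are illustrative, not the study's."""
    s, i = n - i0, float(i0)
    peak = i
    for _ in range(days):
        # R0 decays exponentially toward r0_min as humidity increases
        r0 = (r0_max - r0_min) * math.exp(a * q) + r0_min
        beta = r0 / d_infect          # transmission rate per day
        new_inf = beta * s * i / n    # new infections
        new_rec = i / d_infect        # recoveries
        waned = (n - s - i) / l_immune  # recovered people losing immunity
        s += waned - new_inf
        i += new_inf - new_rec
        peak = max(peak, i)
    return peak

# Dry winter air (q = 0.002 kg/kg) drives a much larger outbreak than
# humid summer air (q = 0.012 kg/kg).
print(run_sirs(0.002) > run_sirs(0.012))  # True
```

Even in this toy version, humidity alone only sets the climatological stage; matching a particular season's outbreak is exactly the gap that data assimilation fills.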
In 2009, Shaman discussed the unfolding research over dinner with Karspeck at the annual meeting of the American Geophysical Union (this year’s conference kicks off next week). “Alicia and I were officemates in grad school at Columbia, so I knew she was doing data assimilation work,” he says. “I wanted to build on that first flu paper and develop some sort of predictive model. She said that sounded like a really good project for data assimilation.”
The challenge was to keep Shaman’s flu-prediction model from running out of control. Just as small, unobservable weather features can lead to big forecast errors over time—often referred to as the “butterfly effect,” though that analogy should be used with care—there are aspects of flu transmission that can’t be easily measured but that greatly influence the odds of an epidemic. These range from how many people in a given area get flu vaccinations to how often people travel elsewhere and bring a virus back with them.
Karspeck had a particular tool in mind for handling this uncertainty: the ensemble Kalman filter. Introduced to meteorology in the 1990s, it’s become one of the leading methods for bringing large datasets into numerical weather forecast models, especially when there are other variables that can’t be directly observed in a quantitative way. Clouds are a good example: their highly variegated nature can’t be portrayed with precision in a model, so clouds are instead parameterized (estimated) based on temperature, humidity, and other observable factors.
“There are elements in flu transmission that you can observe and those you can’t,” Karspeck explains. “One of the powerful aspects of data assimilation in general, and our method in particular, is the ability to estimate those unknown pieces of information, which turn out to be important in making forecasts.”
In a nutshell, the ensemble Kalman filter allows a forecast from the flu prediction model to be “nudged” toward the most recent Google Flu Trends search data, which serve as a proxy for flu incidence. In addition to this basic task, the ensemble Kalman filter uses a more sophisticated strategy to track the unobserved variables (such as the number of flu-susceptible people) as well as the observed ones (Google searches). This allows the model to bring a complete picture of the social and physical setting for flu transmission forward in time. “We’re inferring all of this other information by using estimates of the number of infected people,” Karspeck says. (The AH component of the model is handled separately from the Kalman filter.)
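The nudging step can be illustrated with a bare-bones, perturbed-observation ensemble Kalman filter update. This is a generic textbook formulation, not the code used in the study: each ensemble member is pulled toward the observation in proportion to the ensemble-estimated covariance between its state and the observed quantity.

```python
import numpy as np

def enkf_update(ensemble, y_obs, obs_idx, obs_var, rng):
    """One stochastic (perturbed-observation) EnKF analysis step.
    ensemble: array of shape (n_members, n_state); the state component
    at obs_idx is the one that is directly observed."""
    n = ensemble.shape[0]
    hx = ensemble[:, obs_idx]            # each member's predicted observation
    x_pert = ensemble - ensemble.mean(axis=0)
    hx_pert = hx - hx.mean()
    # sample covariances estimated from the ensemble
    p_xh = x_pert.T @ hx_pert / (n - 1)  # cov(state, predicted obs)
    p_hh = hx_pert @ hx_pert / (n - 1)   # var(predicted obs)
    gain = p_xh / (p_hh + obs_var)       # Kalman gain (one value per state var)
    # perturbing the observation keeps the analysis spread statistically honest
    y_pert = y_obs + rng.normal(0.0, np.sqrt(obs_var), size=n)
    return ensemble + np.outer(y_pert - hx, gain)

rng = np.random.default_rng(0)
# 200-member ensemble over a 2-variable state; only variable 0 is observed
prior = rng.normal([5.0, 0.3], [2.0, 0.1], (200, 2))
post = enkf_update(prior, 8.0, 0, 0.25, rng)
print(post[:, 0].mean())  # pulled from ~5 toward the observed 8
```

Note that the gain is a vector over the whole state: unobserved components move too, via their sample covariance with the observed one. That cross-covariance is the mechanism behind "inferring all of this other information" from the flu-incidence proxy alone.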
To use a loose and somewhat goofy analogy, imagine you’re off on a blind date. You start out with a set of expectations about what certain actions from your date—a smile, a raised eyebrow, a bit of personal history—might mean for the future. Your mental map relating current observations to possible outcomes is akin to the flu-prediction model, except the latter is quantitative and probabilistic. You also know that you can’t read your date’s mind, but you can use the behaviors you observe to infer what she or he is thinking. In this rough analogy, the ensemble Kalman filter relates each observed behavior to each unobserved thought. As the night goes on, the clues you notice will tend to push your future scenarios closer to each other and to what’s really happening in your date’s mind.
Romance might not be a practical setting for using the real-life ensemble Kalman filter, but the tool has potential in many other areas, according to Jeff Anderson, who directs the NCAR-based Data Assimilation Research Testbed. This group of about a half dozen scientists explores new ways of doing assimilation and helps other researchers apply those techniques. Anderson led the creation of the version of the ensemble Kalman filter used in the Shaman-Karspeck study.
“We’ve already had some users try out ensemble filters in a variety of other fields, such as economic forecasting,” says Anderson. “They open up the possibility of addressing very large problems, and they make it much easier to use complex prediction models.” One case where ensemble filters have especially great untapped potential, according to Anderson, is election forecasting.
Shaman, Karspeck, and colleagues plan to continue their journey toward what they hope will become a practical real-time system for predicting flu outbreaks weeks in advance. First, though, will come more testing over different regions and longer time periods in order to validate the model’s performance. “Over time, as we do more and more of these studies, we’ll be able to do more robust validation,” says Karspeck. A couple of factors will ease the way. The data involved are readily available to the public, and although the interrelationships are complex, the eventual number crunching is quite manageable. As Karspeck notes, “all this work was done on our laptops.”