Upon entering the New York Academy of Sciences (NYAS) foyer, guests are greeted by a large bust of Darwin, with preserved pages and replicas from his early works adorning the walls. Darwin revolutionized science with curiosity and the power of observation; who knows what he could have accomplished with the informatics and computational resources available to scientists today?
It was fitting that last Friday the NYAS held its First International Workshop on Climate Informatics at its downtown offices, on a beautiful day when everyone seemed to be fleeing the city ahead of Hurricane Irene. Aside from being a wonderful venue for a workshop — I enjoyed reading the pages of Darwin's "The Descent of Man" displayed on the wall — the discussions gave me much food for thought.
As with any small, single-track conference, the majority of the talks were good, covering a range of climate data, statistical methods, and applications. And as is often the case, I was most impressed by the talks that addressed topics outside my own discipline, particularly the machine learning discussion from Arindam Banerjee of the University of Minnesota.
But the highlight came during the breakout sessions, which offered in-depth discussion of the challenges and opportunities in applying new methods to climate data management and analysis. Topics ranged from the multi-petabyte data management issues faced by paleoclimatologists to the management and manipulation of the large datasets associated with global climate modeling and Earth Observation (EO) technologies.
Overall, the workshop showed the early confluence of two communities: on one side, climate scientists looking for new tools and techniques; on the other, data scientists and statisticians looking for new problems to tackle.
Data poor to data rich
One of the event's more interesting side notes came from a breakout session where we explored the field's transition from data-poor to data-rich. As an applied scientist, I would certainly say that climate researchers have been blessed with more data, both spatially and temporally. While the days of stitching various datasets together to test an idea may be behind us, the main issues now come down to scale. Is global coverage at 4 km resolution good enough for satellite observations? Can we build a robust model with data at this scale? Do interpolation methods for precipitation and temperature work across varied physiographic environments?
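The interpolation question is concrete enough to sketch. Inverse-distance weighting (IDW) is one of the simplest schemes for estimating temperature at an ungauged point from nearby stations, and its blindness to terrain is exactly why such methods can struggle across different physiographic environments. The station coordinates and temperatures below are invented purely for illustration:

```python
import math

def idw_interpolate(stations, target, power=2):
    """Inverse-distance-weighted estimate at `target` from (x, y, value) stations."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return value  # target coincides with a station
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Hypothetical station temperatures: (x_km, y_km, deg_C)
stations = [(0.0, 0.0, 21.5), (10.0, 0.0, 19.8), (0.0, 10.0, 20.4)]
estimate = idw_interpolate(stations, (4.0, 4.0))
```

Because the weights depend only on horizontal distance, the estimate ignores elevation, aspect, and coastal effects — adding covariates for those (as in co-kriging or regression-based methods) is where the real methodological questions begin.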
While more data helps alleviate some of the scientific challenges we faced in the past, it also raises new questions. Further, each year of global observations grows the database of reanalysis data — as an example, look at the Modern-Era Retrospective Analysis for Research and Applications (MERRA) maintained at NASA's Goddard Space Flight Center.
That said, I’ll default to the position that too much data is a good problem to have.
Path forward for the data community
The timing of this event was useful for another reason. The upcoming Strata Summit in New York will bring together data scientists and others in the data domain to address the challenges and strategies this growing community faces. I'll be giving a talk on new ways to collect, generate, and apply atmospheric and oceanic data in a decision-making context, under the rubric of atmospheric analytics. Beyond the talk, I'm eager to learn how I can better utilize the data I'm working with, and to bring back new tools to share with colleagues in other fields who may face similar big data challenges.