Can there be such a thing as too much data to work with? Sometimes the answer is yes. Advances in data generation and storage have begun to outpace the growth of any given machine's bandwidth, creating a bottleneck when attempting to move and work with huge amounts of data. The problem is particularly acute when working with multiple massive datasets from more than one source, stored across more than one location, as is often the case with satellite or geophysical data. Because such datasets can each contain thousands to millions of observations, the cost and difficulty of moving them renders them practically immovable. How, then, can one consolidate all of this distributed data and draw conclusions from it, especially when inferences are commonly sought over a period of time (resulting in even more data points)?
Dorit Hammerling of the National Center for Atmospheric Research and her colleagues are working on this problem; their ultimate goal is to make inferences from a massive dataset without having to move substantial amounts of the data. She explained how they have approached the issue by combining spatial statistics with Bayesian hierarchical modeling. The methodology begins with identifying the process of interest to be modeled and specifying the spatial domain over which the process occurs. Vectors for both the data used in the model and the necessary parameters are then set up for eventual use within the model.
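The hierarchical structure described above can be illustrated with a small simulation. The sketch below is a generic three-level Bayesian hierarchical model (parameter model, process model, data model); the specific distributions, variance values, and the one-dimensional domain are invented for illustration and are not taken from Hammerling's work.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Parameter model: priors on the process and measurement-error variances
#    (gamma priors chosen purely for illustration).
sigma2_process = rng.gamma(shape=2.0, scale=1.0)  # variance of the latent field
sigma2_noise = rng.gamma(shape=2.0, scale=0.1)    # measurement-error variance

# 2. Process model: a latent spatial field over a 1-D domain of 50 sites,
#    with an exponential covariance inducing spatial dependence.
n = 50
locs = np.linspace(0.0, 1.0, n)
cov = sigma2_process * np.exp(-np.abs(locs[:, None] - locs[None, :]) / 0.2)
latent_field = rng.multivariate_normal(np.zeros(n), cov)

# 3. Data model: noisy observations of the latent field at each site.
observations = latent_field + rng.normal(0.0, np.sqrt(sigma2_noise), size=n)
```

Inference then runs in the opposite direction: given the observations, one works back up the hierarchy to the latent process and its parameters.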
The models Hammerling and her colleagues used for this approach were low rank spatial models. These models provide fast, accurate inference for large datasets while accommodating a non-stationary covariance function, which she explained is essential for many geophysical processes. In particular, Hammerling's team chose to work with a spatial random effects model, a member of the low rank family that represents the spatial process as a combination of a small number of basis functions with random coefficients.
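A minimal sketch of that low rank structure, assuming Gaussian basis functions and invented sizes (1,000 sites, 10 basis functions; none of these choices are taken from the team's actual model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Field at n sites built from r << n basis functions with random coefficients.
n, r = 1000, 10
locs = np.linspace(0.0, 1.0, n)
centers = np.linspace(0.0, 1.0, r)

# Gaussian (bell-shaped) basis functions centered across the domain: n x r.
S = np.exp(-((locs[:, None] - centers[None, :]) ** 2) / (2 * 0.05**2))

K = np.eye(r)                                  # covariance of the random effects
eta = rng.multivariate_normal(np.zeros(r), K)  # r random coefficients
tau2 = 0.01                                    # fine-scale / measurement noise
y = S @ eta + rng.normal(0.0, np.sqrt(tau2), size=n)

# The implied covariance of y is S @ K @ S.T + tau2 * I: a rank-r matrix
# plus a diagonal, which is what keeps inference cheap even for huge n.
```

Because the basis-function matrix S need not come from a stationary covariance, the same construction can represent spatially varying dependence, the non-stationarity noted above.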
To test this newly developed methodology, data from the National Oceanic and Atmospheric Administration (NOAA) was used to make an inference on total precipitable water (TPW) in the atmosphere, or more simply put, the amount of water in the atmospheric column. NOAA's information on the subject had to be assembled from three separate collection sources, each contributing huge datasets: readings from GPS units (which provide highly accurate data, but only over a limited area around each unit), the Geostationary Operational Environmental Satellite (GOES), and the Microwave Integrated Retrieval System (MIRS).
The low rank spatial models worked as well as Hammerling and her colleagues could have hoped: the method proved well suited to dealing with massive amounts of spatial data, making quick and accurate inferences possible in a much simpler and more efficient manner than past methods. The methodology also reduced the computational time and effort required to run the model, one more reason why low rank spatial models may be the future of working with massive spatial datasets.