This Friday, Texas A&M’s own Dr. Jeffrey Hart presented to a packed room at the AMS department colloquium. His talk, titled “Testing equality of a large number of densities,” covered research he conducted with one of his students at Texas A&M.
Dr. Hart was interested in how to handle situations in which the number of data sources is far greater than the amount of data each source provides (mathematically, p data sets of size n with p >> n). To study this, he constructed a non-parametric hypothesis test whose null hypothesis is that the densities are all equal, so that the estimate from each small sample should roughly agree with the estimate from the pooled data, much as individual group means are compared with an overall mean. Hart chose a non-parametric test over a parametric model in order to limit the number of assumptions required.
In particular, he based the test on kernel density estimators rather than the more classical empirical distribution functions (EDFs). Hart claimed that test statistics built from kernel density estimators, which tend to parallel the sum of squares in a one-way ANOVA, are simply more powerful and provide more useful results than those built from EDFs. Here, the kernel is a unimodal density symmetric about zero, most often the standard normal density. Combined with a parameter called the “bandwidth” (a data-driven parameter that controls the smoothness of the estimate), this produced estimates that were rather accurate when tested numerically. The kernel estimates did tend to slightly undershoot the true density’s peaks and slightly overshoot its valleys, but this bias was well within the range generally accepted for a non-parametric estimator, and the estimates did an excellent job of capturing the overall shape of the data’s distribution.
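To make the kernel and bandwidth concrete, here is a minimal sketch of a kernel density estimate with a standard normal kernel. Everything in it is an illustrative assumption rather than something shown in the talk: the function names, the normal reference rule used for the bandwidth, and the simulated data. In particular, `between_sample_statistic` is only a hypothetical ANOVA-like quantity (a size-weighted sum of integrated squared differences between each sample's estimate and the pooled estimate). It is not Hart's test statistic, whose exact form was not given here.

```python
import numpy as np


def gaussian_kde(x_grid, sample, bandwidth):
    """Kernel density estimate on x_grid using a standard normal kernel."""
    x_grid = np.asarray(x_grid, dtype=float)
    sample = np.asarray(sample, dtype=float)
    # Scaled distance between every grid point and every observation.
    u = (x_grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels centered at each observation, then rescale.
    return kernel.mean(axis=1) / bandwidth


def normal_reference_bandwidth(sample):
    """One common data-driven bandwidth choice (assumed here, not from the talk)."""
    sample = np.asarray(sample, dtype=float)
    return 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)


def between_sample_statistic(samples, x_grid, bandwidth):
    """Hypothetical ANOVA-like quantity: size-weighted integrated squared
    difference between each sample's estimate and the pooled estimate."""
    pooled = np.concatenate([np.asarray(s, dtype=float) for s in samples])
    f_pooled = gaussian_kde(x_grid, pooled, bandwidth)
    dx = x_grid[1] - x_grid[0]  # assumes an evenly spaced grid
    total = 0.0
    for s in samples:
        f_i = gaussian_kde(x_grid, s, bandwidth)
        total += len(s) * np.sum((f_i - f_pooled) ** 2) * dx
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n = 200, 5  # many data sources, few observations each (p >> n)
    samples = [rng.normal(size=n) for _ in range(p)]
    grid = np.linspace(-6.0, 6.0, 601)
    h = normal_reference_bandwidth(np.concatenate(samples))
    print(between_sample_statistic(samples, grid, h))
```

A larger bandwidth gives a smoother estimate and a smaller one a rougher estimate, which is also why peaks tend to be flattened and valleys filled in.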
However, Hart explained that one pressing issue with this estimator still needs to be worked out: when the pooled data come from several different distributions (for example, normal distributions that differ in their centers and other parameters), it is often difficult to determine which distribution a given data point came from. As part of planned future research, Dr. Hart is considering techniques such as wavelets and other non-analytic, numerical methods to resolve this issue.