Modal Inference: Halfway between clustering and mixture analyses

Bruce G. Lindsay
Willaman Professor and Department Head
Department of Statistics
Pennsylvania State University

Abstract

Given
a set of n data points and an appropriate kernel K, one has the natural
nonparametric density estimator formed by averaging the values of the
kernel over the data. The smoothness of the estimator depends on a
bandwidth parameter h. We consider the modes of the resulting density
as indicators of important substructures within the data. A natural
extension of the EM algorithm can be used to find the modes, and the
method of steepest ascent can then be used to assign the individual data
points to modes, providing a clustering of the data points through their
modal association. If we also let the bandwidth parameter h range from 0
to infinity, we can construct a hierarchical clustering of the data
points. Besides providing satisfying clustering results that lie
somewhere between those of clustering algorithms and a formal mixture
analysis, the estimation method raises interesting inferential questions
that sit between the two points of view. One question: is it a mistake to
use mean squared error as a bandwidth criterion in density estimation?
Co-authors: Jia Li, Penn State University; Surajit Ray, Boston University
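
To make the ideas in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a Gaussian kernel, for which the kernel density estimate is an equal-weight mixture with one component per data point. An EM-style update then reduces to a weighted average of the data: the E-step computes the posterior weight of each data point at the current location, and the M-step moves to the weighted mean. The function names (`kde`, `modal_em`, `modal_clusters`) and the tolerances are hypothetical choices made for illustration. The same ascent, started from each data point, stands in here for the steepest-ascent step that assigns points to modes.

```python
import numpy as np

def kde(x, data, h):
    """Gaussian kernel density estimate at x: the average of the kernel over the data."""
    d = data.shape[1]
    sq_dist = np.sum((data - x) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.mean(np.exp(-0.5 * sq_dist / h ** 2)) / norm

def modal_em(x0, data, h, tol=1e-8, max_iter=1000):
    """Ascend the density estimate from x0 with EM-style updates (Gaussian kernel assumed)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior weight of each data point's kernel at the current x.
        log_w = -0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2
        w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
        w /= w.sum()
        # M-step: for a Gaussian kernel the maximizer is the weighted mean.
        x_new = w @ data
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clusters(data, h, merge_tol=1e-3):
    """Cluster points by the mode each one ascends to; merge_tol is an illustrative threshold."""
    modes = []
    labels = np.empty(len(data), dtype=int)
    for i, point in enumerate(data):
        m = modal_em(point, data, h)
        for k, known in enumerate(modes):
            if np.linalg.norm(m - known) < merge_tol:
                labels[i] = k
                break
        else:
            # No previously found mode is close enough: record a new one.
            modes.append(m)
            labels[i] = len(modes) - 1
    return np.array(modes), labels

# Two tight groups of points, far apart.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
modes_small, labels_small = modal_clusters(data, h=1.0)
modes_large, labels_large = modal_clusters(data, h=100.0)
```

With the small bandwidth the two groups ascend to separate modes, while the large bandwidth merges every point into a single mode. Repeating the clustering over an increasing sequence of bandwidths is one way to trace out the hierarchy that the abstract describes.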