A New Approach to Finding and Testing Clusters
David W. Scott
Rice University
Finding clusters in multivariate data is a
fundamental problem underlying many investigations.
The most widely used algorithm is hierarchical
clustering, which forms clusters by successively merging
points that are close together. A more sophisticated
algorithm is k-means, which iteratively reassigns
points to the closest cluster center. Perhaps the most
sophisticated algorithm is mixture modeling, in
which the entire dataset is fit by a complicated
combination of multivariate Normal distributions.
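To make the k-means description above concrete, here is a minimal sketch of the iterative reassignment it refers to, written in Python with NumPy; this is an illustration of the standard algorithm, not the author's proposed method, and the function name and data are hypothetical:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate point reassignment and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers by choosing k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
             for j in range(k)]
        )
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```

On well-separated data this loop typically converges in a few iterations, but as the abstract notes, its behavior degrades when clusters differ in shape or numerosity.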
In practice, these methods can be fooled. The
most difficult problem is determining the correct
number of clusters. A second problem is that
all of these methods work
best when the clusters have the same shape (spherical,
for example) and the same numerosity (number of points).
We examine some new research that aims to handle
the difficult yet practical case: multivariate data,
different cluster numerosity, different cluster
shapes, and an unknown number of clusters. Our
solution relies upon novel fitting technology and
interactive graphical visualization.