Scaled Principal Components and Correspondence Analysis:
clustering and ordering

Chris Ding
Lawrence Berkeley National Laboratory


 
Correspondence Analysis (COA) is a method in multivariate statistics for
analyzing contingency tables using a technique similar to principal
component analysis (PCA). The relationship between COA and PCA, however,
has not been fully explored so far.

In this paper, we first develop the theory of scaled principal component
analysis (SPCA) on a matrix of pairwise similarities. When this approach is
extended to the asymmetric (and rectangular) similarities of contingency
tables, the resulting SPCA components are precisely those of COA.
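
For concreteness, the scaled decompositions can be written as follows. This is a sketch in one common notation, not necessarily the notation of the paper: the symbols W, D, z_k, q_k, P, D_r, D_c, u_k, v_k, f_k and g_k are all assumptions introduced here, and the COA scores are stated only up to the usual normalization conventions.

```latex
% Symmetric similarities W = (w_{ij}), degrees d_i = \sum_j w_{ij},
% D = \mathrm{diag}(d_1, \dots, d_n); (\lambda_k, z_k) are eigenpairs of
% D^{-1/2} W D^{-1/2}, and q_k is the k-th scaled principal component.
\[
  W = D^{1/2} \Big( \sum_{k} \lambda_k \, z_k z_k^{T} \Big) D^{1/2},
  \qquad q_k = D^{-1/2} z_k .
\]
% Contingency table P with row sums r_i and column sums c_j,
% D_r = \mathrm{diag}(r_i), D_c = \mathrm{diag}(c_j); (\sigma_k, u_k, v_k)
% are singular triplets of D_r^{-1/2} P D_c^{-1/2}.
\[
  P = D_r^{1/2} \Big( \sum_{k} \sigma_k \, u_k v_k^{T} \Big) D_c^{1/2},
  \qquad f_k = D_r^{-1/2} u_k, \quad g_k = D_c^{-1/2} v_k .
\]
```

In both cases the leading term (eigenvalue or singular value 1) is trivial, so the informative components start from the second one.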

SPCA is motivated by data ordering (ordination). Given n objects and pairwise
similarities among them, we seek an optimal ordering such that similarities
between adjacent objects are maximized while similarities between distant
objects are minimized. When such an ordination objective function is optimized,
the SPCA components are the continuous solutions for the desired index
permutations.

Extending this approach to simultaneously ordering the rows and columns of a
contingency table yields SPCA components that are precisely those of COA.

SPCA can also be derived from data clustering. Given n objects and pairwise
similarities among them, we seek to partition them into two clusters
such that the between-cluster similarities are minimized while the within-cluster
similarities are maximized. When such an objective function is optimized, the
SPCA components are the continuous solutions for the desired cluster membership
indicators. Extending this approach to simultaneous clustering of the rows and
columns of a contingency table yields cluster membership indicators that are
precisely those of COA.
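
The clustering and ordering ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the scaled component is q1 = D^{-1/2} z1, where z1 is the second eigenvector of D^{-1/2} W D^{-1/2}, computes it by power iteration, and treats a contingency table as a bipartite similarity matrix. The function name `scaled_component` and the toy data are hypothetical.

```python
import math

def scaled_component(W, iters=200):
    """First nontrivial scaled component q1 = D^{-1/2} z1 of a symmetric,
    nonnegative similarity matrix W (list of lists). Illustrative sketch."""
    n = len(W)
    d = [sum(row) for row in W]                    # degrees d_i = sum_j W_ij
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    total = sum(d)
    z0 = [math.sqrt(di / total) for di in d]       # trivial eigenvector of S (eigenvalue 1)
    # Deterministic start; assumed not orthogonal to z1 (true for the toy data below).
    z = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(z, z0))   # deflate the trivial direction
        z = [a - proj * b for a, b in zip(z, z0)]
        # Power step with S + I; the shift makes every eigenvalue nonnegative.
        z = [z[i] + sum(S[i][j] * z[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in z))
        z = [a / norm for a in z]
    return [z[i] / math.sqrt(d[i]) for i in range(n)]

# Toy symmetric similarities: even-numbered objects are similar to each other,
# odd-numbered objects likewise, and the two groups are only weakly similar.
def similarity(i, j):
    if i == j:
        return 0.0
    return 1.0 if i % 2 == j % 2 else 0.1

W = [[similarity(i, j) for j in range(6)] for i in range(6)]
q = scaled_component(W)
cluster_a = sorted(i for i in range(6) if q[i] > 0)    # sign of q1 = cluster indicator
cluster_b = sorted(i for i in range(6) if q[i] <= 0)
order = sorted(range(6), key=lambda i: q[i])           # sorting by q1 = ordering

# Toy contingency table: rows 0-1 co-occur mostly with columns 0-2,
# rows 2-3 mostly with columns 3-5.
P = [[5 if (i < 2) == (j < 3) else 1 for j in range(6)] for i in range(4)]
rows, cols = len(P), len(P[0])
n = rows + cols
B = [[0.0] * n for _ in range(n)]                      # bipartite matrix [[0, P], [P^T, 0]]
for i in range(rows):
    for j in range(cols):
        B[i][rows + j] = B[rows + j][i] = float(P[i][j])
s = scaled_component(B)
row_scores, col_scores = s[:rows], s[rows:]            # rows and columns scored jointly

print(cluster_a, cluster_b, order)
print(row_scores, col_scores)
```

On the symmetric toy data the sign of q1 separates the objects {0, 2, 4} from {1, 3, 5}, and sorting by q1 places each group contiguously. On the contingency table, rows and columns that belong to the same block receive scores of the same sign. Power iteration is used here only to keep the sketch dependency-free; a library eigensolver or SVD would be the usual choice.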

Underlying the objective function optimizations for data ordering and clustering
is a fundamental property of SPCA: cluster self-aggregation. In the space
spanned by the $K$ SPCA components, objects within each cluster self-aggregate
towards each other; in a properly defined connectivity matrix, connections between
different clusters are automatically suppressed while connections within the same
cluster are enhanced.

We illustrate the SPCA theory with examples and apply it to the analysis of
DNA microarray gene expression profiles.