Principal Component Analysis

Principal component analysis is a technique that allows us to reduce the size of a data set while keeping as much of the information in the data set as possible.

The problem

Given a large data set:
    1) it has a large number of observations, W; i.e. S = { s1, s2, ..., sW }
    2) each observation has a large number of dimensions, D; i.e. si = { si1, si2, ..., siD }, i = 1, ..., W

To reduce the size of the data set, we can either reduce the number of observations, W, or reduce the number of dimensions, D. In principal component analysis, we try to reduce the number of dimensions of the data set while keeping as much information as possible. In a later class, we will discuss how to reduce the number of observations by styling.

Evaluation of data set

In a graphical representation, the original data set is represented in a space with D axes. After reducing the dimension, the transformed data set can be represented in a space with fewer dimensions. To evaluate how much of the information in the original data set has been captured in the transformed space, we use one measure: variation (variance). By measuring the variation along the axes of both the original and the transformed space, we know what percentage of the variation on the original axes is explained by the variation on the transformed axes. For example, we can ask how much of the variation in a 2D space can be explained by variation in a 1D space.

If all of the variation in the original space (with dimension D1) can be captured in the reduced space (with dimension D2 < D1), we then know that the intrinsic dimension of the original data is at most D2, and therefore less than D1.

One drawback: how do we know that the dimension(s) with maximum variation capture the characteristics of the data set that we want to keep? For certain types of data, such as data distributed in layers, the dimension of maximum variation does not capture the characteristics of the data set.

Measure of variations

The variation of the data set is measured by its variance-covariance matrix. For the 2D example,
    VAR = V = |V11, C12|
              |C21, V22|
where V11 and V22 are the variances along the two axes and C12 = C21 is the covariance between them.

Our goal is to find a transformation that makes V11 as large as possible in the transformed axis system. Further, we assume that: 1) we only consider the second moments captured by the variance-covariance matrix, 2) we only consider linear transformations.
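As a small illustration of the variance-covariance matrix for the 2D case, here is a minimal sketch in plain Python. The function name `covariance_matrix_2d` and the four sample points are made up for this example, and the sample (n - 1) denominator is an assumption; the notes do not say which normalization is intended.

```python
def covariance_matrix_2d(points):
    """Return [[V11, C12], [C21, V22]] for a list of (x, y) observations."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Assumption: sample variance and covariance with the n - 1 denominator.
    v11 = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    v22 = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    c12 = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # The matrix is symmetric: C21 equals C12.
    return [[v11, c12], [c12, v22]]

# Hypothetical 2D observations, for illustration only.
points = [(2.0, 1.0), (4.0, 3.0), (6.0, 5.0), (8.0, 9.0)]
V = covariance_matrix_2d(points)
```

The diagonal entries are the variances along the original axes; a large off-diagonal entry relative to them suggests that a rotated axis could carry more of the variation than either original axis.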

For the 2D example, the transformation matrix is represented as:
    L = |L11, L12| = |cos(C),  sin(C)|
        |L21, L22|   |-sin(C), cos(C)|
where C is the axis rotation angle. By trying different values of the rotation angle, the maximum V11 in the transformed space can be found.
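The angle search can be sketched as a brute-force scan. With the rotation matrix L above, the first transformed axis points along (cos(C), sin(C)), so the variance along it is cos(C)^2 * V11 + 2 * sin(C) * cos(C) * C12 + sin(C)^2 * V22. The function names, the 0.1 degree step, and the numeric values of V11, V22 and C12 below are assumptions chosen only for illustration.

```python
import math

def rotated_v11(v11, v22, c12, angle):
    """Variance along the first axis after rotating the axes by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return c * c * v11 + 2.0 * c * s * c12 + s * s * v22

def best_rotation(v11, v22, c12, steps=1800):
    """Scan angles in [0, pi) and return (angle, V11') with the largest V11'."""
    candidates = (math.pi * k / steps for k in range(steps))
    return max(((a, rotated_v11(v11, v22, c12, a)) for a in candidates),
               key=lambda pair: pair[1])

# Hypothetical entries of a 2D variance-covariance matrix.
v11, v22, c12 = 20.0 / 3.0, 35.0 / 3.0, 26.0 / 3.0

angle, max_v11 = best_rotation(v11, v22, c12)

# A rotation leaves the total variance V11 + V22 unchanged, so this ratio is
# the fraction of the 2D variation explained by the single best axis.
explained = max_v11 / (v11 + v22)
```

Scanning only [0, pi) is enough because rotating an axis by 180 degrees gives the same line. The scan is only practical in 2D; in higher dimensions the eigenvector or SVD route is the usual choice.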

In general, the dimensions (axes) corresponding to the maximum variances are called the principal components of the data set. They can be found through Singular Value Decomposition (SVD) of the mean-centered data.
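A minimal sketch of the SVD route, assuming NumPy is available. The synthetic data set (200 observations in 3 dimensions that vary mostly along one direction) and all variable names are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 3 dimensions, one dominant direction.
latent = rng.normal(size=(200, 1))
S = latent @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 3))

# PCA works on mean-centered data.
centered = S - S.mean(axis=0)

# Rows of Vt are orthonormal directions, ordered by decreasing singular value.
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt

# Variance along each component, and its share of the total variance.
variances = singular_values ** 2 / (S.shape[0] - 1)
explained_ratio = variances / variances.sum()

# Project onto the first principal component only (3D reduced to 1D).
projected = centered @ components[:1].T
```

Here `explained_ratio` plays the role of the "percentage of variation explained" discussed earlier: if its first entry is close to 1, a single axis captures almost all of the variation.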

Eigenvalue and eigenvector approach

For the variance-covariance matrix V of the original data set S, let us find vectors Li and scalars ei that satisfy
                                    V * Li  =  ei * Li   (i = 1, 2, ..., D)
where Li is called an eigenvector and ei is called an eigenvalue of V.

The variance-covariance matrix of the data set S has a set of eigenvectors and a corresponding set of eigenvalues. Because this matrix is symmetric, it can be proved that its eigenvectors can be chosen to be mutually orthogonal. Each eigenvector defines a new axis, and the corresponding eigenvalue represents the variance along this axis. By looking at the spectrum of the eigenvalues, we can find the dominant eigenvectors that capture most of the variation of the data set. Those eigenvectors define a new axis system. Since the eigenvectors are orthogonal, data transformed into this new coordinate system are decoupled (uncorrelated).
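The eigenvalue spectrum can be inspected with a short NumPy sketch. The synthetic 4D data built from two underlying factors, the 99% cut-off, and the variable names are all assumptions made for illustration; in practice the cut-off is a judgment call.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 500 observations in 4 dimensions driven by 2 factors.
factors = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 0.0, 1.0, 0.5],
                   [0.0, 1.5, -1.0, 0.5]])
S = factors @ mixing + 0.05 * rng.normal(size=(500, 4))

# Variance-covariance matrix (D x D); columns of S are the dimensions.
V = np.cov(S, rowvar=False)

# eigh is meant for symmetric matrices and returns ascending eigenvalues,
# so reorder to put the dominant eigenvectors first.
eigenvalues, eigenvectors = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Smallest number of axes whose eigenvalues cover 99% of the total variance.
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(cumulative, 0.99) + 1)

# Transform the centered data into the new axis system (first k axes only).
transformed = (S - S.mean(axis=0)) @ eigenvectors[:, :k]
```

For this synthetic data the spectrum drops sharply after the second eigenvalue, which is how the intrinsic dimension shows up. The covariance matrix of `transformed` is diagonal, which is the decoupling described above.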

One drawback: when we have a very large data set, there are many eigenvalues and eigenvectors, and calculating them is very time consuming.

Neural network approach

We can use a neural network to find the reduced dimensions of a large data set.

First, configure a three-layer neural network with linear mappings between layers. The number of input nodes and the number of output nodes both equal the number of dimensions of the data set, and the number of hidden units is smaller than the number of dimensions of the data set.

Second, train the neural network with the data set, using each observation as both the input and the target output. This is called auto-association or self-supervised learning.

Third, after training, the hidden units represent the reduced dimensions. The weights leading into the hidden units represent the mapping from the original space to the reduced space.
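The three steps above can be sketched with NumPy as a linear network trained by plain gradient descent on the squared reconstruction error. Everything specific here is an assumption for illustration: the synthetic data, the single hidden unit, the learning rate, the number of iterations, the omission of bias terms (the data are centered instead), and the names `W_in` and `W_out`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 200 observations in 3 dimensions that vary mostly
# along a single direction, plus a little noise.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.5, -0.5]]) + 0.05 * rng.normal(size=(200, 3))
X = X - X.mean(axis=0)  # center the data; this sketch has no bias terms

n_obs, n_dims = X.shape
n_hidden = 1  # fewer hidden units than dimensions

# Step 1: linear maps input -> hidden and hidden -> output.
W_in = 0.1 * rng.normal(size=(n_dims, n_hidden))
W_out = 0.1 * rng.normal(size=(n_hidden, n_dims))

def reconstruction_loss(X, W_in, W_out):
    """Mean squared distance between each observation and its reconstruction."""
    residual = X @ W_in @ W_out - X
    return float(np.mean(np.sum(residual ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_in, W_out)

# Step 2: train with each observation as both input and target output.
learning_rate = 0.05
for _ in range(2000):
    hidden = X @ W_in
    residual = hidden @ W_out - X
    grad_out = 2.0 / n_obs * hidden.T @ residual
    grad_in = 2.0 / n_obs * X.T @ residual @ W_out.T
    W_in -= learning_rate * grad_in
    W_out -= learning_rate * grad_out

final_loss = reconstruction_loss(X, W_in, W_out)

# Step 3: the hidden activations are the data in the reduced space.
reduced = X @ W_in
```

With linear mappings, the hidden unit ends up spanning the same direction as the dominant principal component, although the weights themselves are not normalized the way eigenvectors are.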

To do non-linear mapping, we need to add two more layers: one after the input layer and one before the output layer. They take care of the non-linear mapping from the input layer to the hidden units and from the hidden units to the output layer.

With enough data, the neural network approach can find the intrinsic dimension of the data set.

Box counting approach

Define a length e, and construct a box with sides of length e in the space.
1) Start with one dimension, gradually increase the length e, and count the number of data points in the box: N.
2) If N changes in proportion to e, then the data could be one-dimensional;
    if N changes in proportion to e^2, then the data could be two-dimensional;
    ......
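A minimal sketch of this counting procedure in plain Python. If N grows like e^d, then log N grows like d * log e, so the exponent d can be read off as the slope of log N against log e. Fitting that slope by least squares, centering the box at the origin, the chosen box sizes, and the two synthetic point sets are all assumptions made for this example.

```python
import math

def count_in_box(points, side):
    """Count the points inside a box of the given side length centered at the origin."""
    half = side / 2.0
    return sum(1 for p in points if all(abs(x) <= half for x in p))

def estimated_dimension(points, sides):
    """Least-squares slope of log N against log e over the given box sides."""
    xs = [math.log(e) for e in sides]
    ys = [math.log(count_in_box(points, e)) for e in sides]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

sides = [0.2, 0.4, 0.8, 1.6]

# Hypothetical data sets, both stored with two coordinates per point:
# points along a diagonal line, and points filling a square.
line = [(k / 1000.0, k / 1000.0) for k in range(-1000, 1001)]
square = [(i / 100.0, j / 100.0) for i in range(-100, 101) for j in range(-100, 101)]

dim_line = estimated_dimension(line, sides)      # close to 1
dim_square = estimated_dimension(square, sides)  # close to 2
```

Both data sets live in a 2D space, but the count for the line grows only in proportion to e, which is what reveals its one-dimensional structure.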

Multidimensional Scaling

It operates on the distances between points and projects the points onto lower-dimensional sub-planes, so that nearby points remain nearby while far-apart points remain far apart.
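One common variant is classical (metric) multidimensional scaling, sketched below with NumPy. The notes do not say which variant is meant, so this choice is an assumption, as are the function name `classical_mds` and the synthetic points (30 points lying on a 2D plane inside a 3D space). Only the matrix of pairwise distances is passed to the function.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from their pairwise distances."""
    n = distances.shape[0]
    # Double-center the squared distances to recover an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

rng = np.random.default_rng(2)

# Hypothetical data: 30 points on a 2D plane embedded in 3D.
plane = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0]])
points = rng.normal(size=(30, 2)) @ plane

# Pairwise Euclidean distances in the original 3D space.
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

embedded = classical_mds(D, n_components=2)

# Pairwise distances after the reduction to 2D.
D_embedded = np.linalg.norm(embedded[:, None, :] - embedded[None, :, :], axis=-1)
```

Because these points really do lie on a plane, the 2D embedding reproduces the distances up to rounding error. For data that are not exactly low-dimensional, the distances are only approximated.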