To reduce the size of a data set, we can either reduce the number of observations, W, or reduce the number of dimensions, D. Principal component analysis reduces the dimension of the data set while keeping as much information as possible. In a later class, we will discuss how to reduce the number of observations by styling.
If all of the variation in the original space (with dimension D1) can be captured in a reduced space (with dimension D2 < D1), then the intrinsic dimension of the original data is at most D2, and therefore less than D1.
One drawback: how do we know that the dimensions with maximum variation capture the characteristics of the data set that we want to keep? For certain types of data, such as data distributed in layers, the dimension of maximum variation does not capture the characteristics of the data set.
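The drawback above can be illustrated with a small sketch. The data here are hypothetical: two thin horizontal layers that are long along x and separated by a small gap along y. The names `layer_a`, `layer_b`, and `variance` are chosen only for this illustration.

```python
# Hypothetical "layered" data: two thin layers, long along x,
# separated by a gap of 1.0 along y.
layer_a = [(k * 0.5, 0.0) for k in range(-10, 11)]
layer_b = [(k * 0.5, 1.0) for k in range(-10, 11)]
points = layer_a + layer_b

def variance(values):
    """Sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

var_x = variance([p[0] for p in points])  # variation along the layers
var_y = variance([p[1] for p in points])  # variation across the layers

# Keeping only the maximum-variance axis (x) makes the two layers coincide,
# while the low-variance axis (y) is the one that separates them.
layers_overlap_on_x = sorted(p[0] for p in layer_a) == sorted(p[0] for p in layer_b)
gap_on_y = min(p[1] for p in layer_b) - max(p[1] for p in layer_a)

print(var_x > var_y, layers_overlap_on_x, gap_on_y)
```

In this constructed case the axis with the most variance carries no information about which layer a point belongs to, which is exactly the situation the drawback describes.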
Our goal is to find a transformation that makes V11, the variance along the first axis, as large as possible in the transformed axis system. Further, we assume that: 1) we only consider the second moments captured by COV, and 2) we only consider linear transformations.
For the 2D example, the transformation matrix is represented as:
$$
L = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix}
  = \begin{pmatrix} \cos(C) & \sin(C) \\ -\sin(C) & \cos(C) \end{pmatrix}
$$
where C is the axis rotation angle. By trying different values of the rotation angle, the maximum V11 in the transformed space can be found.
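The angle search described above can be sketched as a brute-force scan. This is a minimal illustration, not a recommended algorithm: the data points, the 1-degree step, and the helper name `v11_after_rotation` are all assumptions made for the example.

```python
import math

# Hypothetical 2D data lying roughly along the line y = x.
data = [(-2.0, -1.8), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 1.9)]

def v11_after_rotation(points, angle):
    """Variance along the first axis after rotating the axes by `angle` (radians)."""
    # First row of the rotation matrix: x' = cos(C) * x + sin(C) * y
    xs = [math.cos(angle) * x + math.sin(angle) * y for x, y in points]
    mean = sum(xs) / len(xs)
    return sum((v - mean) ** 2 for v in xs) / (len(xs) - 1)

# Try angles from 0 to 179 degrees in 1-degree steps and keep the best one.
best_deg = max(range(180), key=lambda d: v11_after_rotation(data, math.radians(d)))
best_v11 = v11_after_rotation(data, math.radians(best_deg))

print(best_deg, best_v11)
```

Only 0 to 179 degrees need to be scanned, because rotating the axis by a further 180 degrees flips its sign without changing the variance along it.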
In general, the dimensions (axes) corresponding to the maximum variances are called the principal components of the data set. They can be found through Singular Value Decomposition (SVD).
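As a rough sketch of how SVD yields the principal components, the example below uses NumPy's `numpy.linalg.svd` on centered data. The data set is synthetic and the variable names (`latent`, `scores`, and so on) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations in 3 dimensions,
# with most of the variation along a single direction.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                      # center each dimension
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                              # rows are the principal axes
variances = s ** 2 / (len(X) - 1)            # variance along each principal axis
scores = Xc @ Vt.T                           # data expressed in the principal axes

print(variances)
```

The singular values come back in descending order, so the first row of `components` is the axis of maximum variance.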
For the data set S, there is a set of eigenvectors and a corresponding set of eigenvalues. It can be proved that all of the eigenvectors are orthogonal. Each eigenvector defines a new axis, and the corresponding eigenvalue represents the variance along this axis. By looking at the spectrum of the eigenvalues, we can find the dominant eigenvectors that capture most of the variation in the data set. Those eigenvectors define a new axis system. Since the eigenvectors are orthogonal, data transformed into this new coordinate system are decoupled.
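The eigenvector view can be sketched with NumPy by decomposing the covariance matrix of the data. The data set, the `mixing` matrix used to generate it, and the 95% cutoff for choosing dominant eigenvectors are assumptions made for this illustration, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data set S: 500 observations in 4 correlated dimensions.
mixing = np.array([[3.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0],
                   [0.5, 0.2, 0.3, 0.0],
                   [0.1, 0.1, 0.1, 0.1]])
S = rng.normal(size=(500, 4)) @ mixing

cov = np.cov(S, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Spectrum: cumulative fraction of the total variance.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95) + 1)  # assumed 95% cutoff

# Transform the data into the axes defined by the k dominant eigenvectors.
T = (S - S.mean(axis=0)) @ eigvecs[:, :k]
cov_T = np.atleast_2d(np.cov(T, rowvar=False))

print(k, explained)
```

The covariance matrix of the transformed data `cov_T` is diagonal up to rounding error, which is what "decoupled" means here, and its diagonal entries are the leading eigenvalues.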
One drawback: when we have a very large data set, it has many eigenvalues and eigenvectors, and calculating them is very time consuming.
First, configure a three-layer neural network with linear mappings between layers. The numbers of input and output nodes are both equal to the number of dimensions of the data set, and the number of hidden units is smaller than the data set dimension.
Second, train the neural network with the data set, using each observation as both the input and the target output. This is called auto-association or self-supervised learning.
Third, after training, the hidden units represent the reduced dimensions. The weights linking the input layer to the hidden units represent the mapping from the original space to the reduced space.
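The three steps above can be sketched as a small linear auto-associative network trained with plain gradient descent. This is a minimal sketch under stated assumptions: the data are synthetic (3 dimensions generated from 2 underlying factors), and the learning rate, iteration count, and names such as `W_enc` and `W_dec` are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 dimensions generated from only 2 underlying factors.
factors = rng.normal(size=(200, 2))
X = factors @ np.array([[1.0, 0.5, 0.0],
                        [0.0, 0.5, 1.0]])
X = X - X.mean(axis=0)

# Step 1: three layers with linear mappings; fewer hidden units than dimensions.
n_in, n_hidden = X.shape[1], 2
W_enc = 0.1 * rng.normal(size=(n_in, n_hidden))   # input -> hidden
W_dec = 0.1 * rng.normal(size=(n_hidden, n_in))   # hidden -> output

# Step 2: train with each observation as both input and target.
lr = 0.05                                         # assumed learning rate
for _ in range(3000):
    H = X @ W_enc                                 # hidden activations
    err = H @ W_dec - X                           # reconstruction error
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Step 3: the hidden units give the reduced representation.
reduced = X @ W_enc
mse = float(np.mean((reduced @ W_dec - X) ** 2))

print(reduced.shape, mse)
```

Because the synthetic data have only two underlying factors, two hidden units are enough to reconstruct them almost exactly; with a single hidden unit the reconstruction error would stay noticeably higher.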
To do non-linear mapping, we need to add two more layers: one after the input layer and one before the output layer. They take care of the non-linear mapping from the input layer to the hidden units and from the hidden units to the output layer.
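The five-layer arrangement can be sketched as a forward pass only. The layer sizes, the tanh activation, and the names `encode` and `decode` are assumptions made for this illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_map, n_hidden = 3, 8, 1    # hypothetical layer sizes

# Layers: input -> mapping (non-linear) -> hidden -> demapping (non-linear) -> output
W1, b1 = 0.5 * rng.normal(size=(n_in, n_map)), np.zeros(n_map)
W2, b2 = 0.5 * rng.normal(size=(n_map, n_hidden)), np.zeros(n_hidden)
W3, b3 = 0.5 * rng.normal(size=(n_hidden, n_map)), np.zeros(n_map)
W4, b4 = 0.5 * rng.normal(size=(n_map, n_in)), np.zeros(n_in)

def encode(X):
    """Non-linear mapping from the original space to the reduced space."""
    return np.tanh(X @ W1 + b1) @ W2 + b2

def decode(Z):
    """Non-linear mapping from the reduced space back to the original space."""
    return np.tanh(Z @ W3 + b3) @ W4 + b4

X = rng.normal(size=(5, n_in))     # five hypothetical observations
Z = encode(X)                      # reduced representation
X_hat = decode(Z)                  # reconstruction (untrained weights)

print(Z.shape, X_hat.shape)
```

Training would adjust all four weight matrices so that `X_hat` approximates `X`, in the same auto-associative way as in the linear case.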
With enough data, the neural network approach can find the intrinsic dimension of the data set.