1.    Notes for Class 10 (Feb 24, 1999)

1.1.        Probability Distribution

Probability distributions play a very important role in data mining. We are concerned not only with predicting the mean of a data series, as in a traditional regression model, but also with the probability distribution of the next value.

1.1.1.     Why do we want to know the probability distribution?

Knowing the probability distribution enables us to price options, calculate value-at-risk, select portfolios, and so on.

In most applications, if there were no uncertainty at all, we would only need to pick the candidate that maximizes the criterion. For example, if we knew the future returns of equities for certain, we would simply pick the stock with the maximum return. In the face of uncertainty, however, we need to diversify, and we need to know the probability distribution to construct a diversified portfolio.

1.1.2.     Gaussian noise model

So far in this class, we have only considered probability distributions in the output domain (i.e. y). Further, we have only considered Gaussian distributions in the output domain, i.e. we assume that the noisy output data are normally distributed. We represent their distribution as:

N[mu(x), sigma^2(x)]

where: x is the data series
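
Written out as a density, the notation above says that, for a given x, the output y is Gaussian with mean mu(x) and with the variance given by the second argument of N. The block below restates the standard Gaussian density in that notation; it adds nothing beyond the model already assumed.

```latex
y \mid x \;\sim\; N\!\left[\mu(x),\, \sigma^2(x)\right],
\qquad
p(y \mid x) \;=\; \frac{1}{\sqrt{2\pi\,\sigma^2(x)}}\,
\exp\!\left( -\frac{\bigl(y - \mu(x)\bigr)^2}{2\,\sigma^2(x)} \right)
```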

1.1.3.     Non-Gaussian noise model

However, we all know that many financial time series are not normally distributed. For example, stock returns deviate from the Gaussian distribution: they have heavy tails and are skewed. To get a more accurate estimate of the behavior of a data series, we need to consider probability distributions beyond the Gaussian.

1.2.        Modeling Non-Gaussian Distributions

There is a huge number of ways to move beyond the Gaussian distribution, just as there is in the generalization from linear to nonlinear models. Historically, people used parameterized distributions and compiled them in big books.

Today, data mining techniques provide a powerful alternative: learning the distributions from the data.

1.2.1.     How do we know a distribution is non-Gaussian

·         Moments

Certain characteristics of the distribution of the data can be examined with summary statistics based on moments. For example, the heavy tails and skewness of a stock return series can be examined using the kurtosis and skewness functions in Matlab. Kurtosis, for example, while still a central moment, amplifies the outliers in the data because of the 4th power.
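
The moment-based check can be sketched in a few lines. This is a minimal Python illustration using only the standard library, not the Matlab functions mentioned above. The two data series are synthetic and hypothetical, and kurtosis is reported without subtracting 3, so a Gaussian gives a value near 3.

```python
import random

def central_moment(x, k):
    """k-th central moment: the mean of (x - mean)^k."""
    m = sum(x) / len(x)
    return sum((v - m) ** k for v in x) / len(x)

def skewness(x):
    """Third central moment divided by the cubed standard deviation."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    """Fourth central moment divided by the squared variance (about 3 for a Gaussian)."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2

rng = random.Random(0)
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Hypothetical fat-tailed series: mostly calm draws plus occasional high-volatility draws.
fat = [rng.gauss(0.0, 5.0 if rng.random() < 0.05 else 1.0) for _ in range(20000)]

print("gaussian:   skew=%.2f kurt=%.2f" % (skewness(gauss), kurtosis(gauss)))
print("fat-tailed: skew=%.2f kurt=%.2f" % (skewness(fat), kurtosis(fat)))
```

The fat-tailed series should show a kurtosis well above 3, because the rare large draws dominate the sum of 4th powers.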

·         Stem-and-leaf plots

Stem-and-leaf plots provide a visual presentation of the data distribution. However, they are only useful for small data sets.
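
For a small data set, a stem-and-leaf display can be built by hand. The sketch below is a minimal Python version that assumes non-negative integer data and uses the tens digit as the stem and the ones digit as the leaf. The function name and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines 'stem | leaves' for non-negative integer data."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)  # stem = tens and above, leaf = last digit
    lo, hi = min(groups), max(groups)
    # Emit every stem in range, including empty ones, so gaps in the data stay visible.
    return ["%3d | %s" % (s, "".join(str(d) for d in groups.get(s, [])))
            for s in range(lo, hi + 1)]

for line in stem_and_leaf([12, 15, 21, 23, 23, 28, 30, 47]):
    print(line)
```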

·         QQplots

Quantile-quantile plots are a powerful tool for examining whether the data come from a given distribution. They are implemented by plotting the sorted data against sorted data from a reference distribution:

plot( sort({data from reference distribution}), sort(x) )
                        where x is the data series under consideration.
To see if x is normally distributed:

plot( sort(randn(length(x),1)), sort(x) )

An approximately linear plot suggests that the data in x are normally distributed. A data series with fat tails will produce a plot that drops down at the left end and shoots up at the right end.
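
The same comparison can be sketched without a plotting library by computing the points of the QQ plot and inspecting its two ends. This minimal Python sketch makes two substitutions relative to the Matlab line above: it uses exact standard-normal quantiles (via `statistics.NormalDist`) instead of a random normal sample as the reference, and it standardizes the data first. The data series are synthetic and hypothetical.

```python
import random
from statistics import NormalDist, mean, pstdev

def qq_points(x):
    """Pair standard-normal quantiles with the sorted, standardized data."""
    n = len(x)
    m, s = mean(x), pstdev(x)
    data_q = sorted((v - m) / s for v in x)
    # Plotting positions (k + 0.5) / n avoid the infinite quantiles at 0 and 1.
    ref_q = [NormalDist().inv_cdf((k + 0.5) / n) for k in range(n)]
    return ref_q, data_q

rng = random.Random(1)
normal_x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
fat_x = [rng.gauss(0.0, 5.0 if rng.random() < 0.05 else 1.0) for _ in range(5000)]

for name, x in [("normal", normal_x), ("fat-tailed", fat_x)]:
    ref_q, data_q = qq_points(x)
    # Vertical distance from the 45-degree line at the two ends of the plot.
    left, right = data_q[0] - ref_q[0], data_q[-1] - ref_q[-1]
    print("%-10s left end: %+.2f  right end: %+.2f" % (name, left, right))
```

For the fat-tailed series the left end falls below the line and the right end rises above it, which is the pattern described above.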

1.2.2.     Mixture of Normals (unconditional)

This method uses combinations of Gaussian distributions to model a non-Gaussian distribution. It can be viewed using the same hidden-node architecture discussed before; however, the definition of the hidden units and the meaning of the outputs are very different.

In the regression model discussed before, the hidden nodes are defined either as predetermined nonlinear functions of the input, such as x, x^2, ..., or as squashing functions such as tanh. The goal of the regression model is to predict the conditional mean and/or variance of the data set.

In the mixture model, the hidden units represent different Gaussian distributions, each with its own mu_i and sigma_i. Its goal is to obtain the probability density. These basic Gaussian distributions are used as the building blocks to construct the probability distribution of the data set: the distribution of the data set is approximated by a mixture of Gaussian distributions.

Let g_i represent the weight of the i-th Gaussian distribution, with
             g_i >= 0
            sum_i(g_i) = 1
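
The two constraints on the weights are what make the weighted sum of Gaussian densities a valid density itself. A minimal Python sketch, with hypothetical component parameters, evaluates such a mixture and checks numerically that it integrates to about 1:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussian densities; weights must be >= 0 and sum to 1."""
    assert all(g >= 0 for g in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(g * normal_pdf(x, m, s) for g, m, s in zip(weights, mus, sigmas))

# Hypothetical two-component mixture.
weights, mus, sigmas = [0.7, 0.3], [0.0, 3.0], [1.0, 0.5]

# Riemann sum over [-10, 10]; the area should be close to 1.
step = 0.01
area = sum(mixture_pdf(-10.0 + k * step, weights, mus, sigmas) * step for k in range(2001))
print("area under mixture density: %.4f" % area)
```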

The algorithm for constructing the mixture model is as follows:
Given:         data set x_t, t = 1, ..., N, with overall mean and variance
                   mu = sum_t(x_t) / N
                   sigma^2 = sum_t[(x_t - mu)^2] / N
The mixture distribution:
                  P(x) = sum_i[ g_i * N(x; mu_i, sigma_i^2) ]

Step 1: assume mu_i and sigma_i are known;
            compute for each data point x_t how likely it is to come from the distribution N(mu_i, sigma_i^2):
            g_it = P(i | x_t) = N(x_t; mu_i, sigma_i^2) / sum_j[ N(x_t; mu_j, sigma_j^2) ]

Step 2:  assume g_it is given;
            compute mu_i and sigma_i:
            mu_i = sum_t[ g_it * x_t ] / sum_t[ g_it ]
            sigma_i^2 = sum_t[ g_it * (x_t - mu_i)^2 ] / sum_t[ g_it ]

Steps 1 and 2 are run iteratively, which continuously refines g_it, mu_i, and sigma_i^2. Dempster, Laird and Rubin proved that this algorithm will converge to a local optimum.

This algorithm not only finds the distribution from the data set, instead of looking it up in a book, but also discovers hidden states.
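
The two alternating steps, commonly known as the EM algorithm, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in class. The data are synthetic, the starting values are hypothetical, and the sketch goes beyond the formulas above in two places: it carries mixing weights inside Step 1 and updates them as the average of g_it (the usual EM form), and it floors the variance to guard against a component collapsing onto a single point.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_mixture(x, mus, sigmas, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by alternating Step 1 and Step 2."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k, n = len(mus), len(x)
    for _ in range(n_iter):
        # Step 1: g[i][t] = P(i | x_t), given the current mu_i and sigma_i.
        g = [[0.0] * n for _ in range(k)]
        for t, xt in enumerate(x):
            dens = [weights[i] * normal_pdf(xt, mus[i], sigmas[i]) for i in range(k)]
            total = sum(dens)
            for i in range(k):
                g[i][t] = dens[i] / total
        # Step 2: g-weighted mean and variance for each component, given g[i][t].
        for i in range(k):
            g_sum = sum(g[i])
            mus[i] = sum(git * xt for git, xt in zip(g[i], x)) / g_sum
            var = sum(git * (xt - mus[i]) ** 2 for git, xt in zip(g[i], x)) / g_sum
            sigmas[i] = math.sqrt(max(var, 1e-6))  # assumption: variance floor
            weights[i] = g_sum / n                 # assumption: weight update, not in the notes
    return mus, sigmas, weights

# Synthetic data drawn from two hidden states (hypothetical parameters).
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(600)] + [rng.gauss(3.0, 0.5) for _ in range(400)]

mus, sigmas, weights = em_mixture(data, mus=[-1.0, 1.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
for i in range(2):
    print("component %d: weight=%.2f mu=%.2f sigma=%.2f" % (i, weights[i], mus[i], sigmas[i]))
```

With well-separated states like these, the fitted components should land close to the parameters used to generate the data. With poor starting values the iteration can settle in a different local optimum.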

1.3.        Class 11

1.3.1.     Mixture depends on some input (conditional mixture)

·         Mixture of experts, also called Gated experts

·         (see 5 pages of slides)

1.3.2.     Direct quantile prediction

·         Maximum likelihood framework

·         Two implementations

Kernels (lazy method)
Neural network (eager method)

·         Application to Value at Risk