Class 9 Notes

Review of Class 8: Neural Network Learning

1.1.        Logistic Regression

­          The output unit is logistic, has a target, and is directly connected to the inputs.

1.1.1.    Input-Output Space

­          An activation function is defined in the input-output space to map the input parameters to the output parameters.

­          For logistic regression, the logistic function is used as the activation function. It is defined as:
                                                    y = 1 / [1 + exp(-Sigma)]
                                where          Sigma = w0 + sum(wi * xi)
                                                    xi is the value of the i-th input (a single scalar in the univariate case, the components of a vector in the multivariate case)
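
As a minimal sketch, the logistic output can be computed directly from the two formulas above. The function name and the example numbers are illustrative assumptions, not part of the class material.

```python
import math

def logistic_output(w0, weights, inputs):
    """Return y = 1 / (1 + exp(-Sigma)) with Sigma = w0 + sum(wi * xi)."""
    sigma = w0 + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-sigma))

# Hypothetical two-input example; the numbers are made up for illustration.
y = logistic_output(w0=0.1, weights=[0.5, -0.25], inputs=[1.0, 2.0])
print(y)
```

The output always lies strictly between 0 and 1, which is what allows it to be read as a probability in the interpretation that follows.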

1.1.2.    Weight-Error Space

­          An error function is defined in the weight-error space to measure the deviation of network results from that of the training set. For logistic regression, the cross-entropy cost function is used as the error function. It is defined as:
                                                Ek = dk * ln(1/yk) + (1-dk) * ln(1/(1-yk))
                                where        k is the index of the pattern
                                                yk is the output value from the network
                                                dk is the “desired” value or target from training data
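
A minimal sketch of the per-pattern cross-entropy error, assuming the second term reads ln(1/(1-yk)). The function name and the sample values are illustrative assumptions.

```python
import math

def cross_entropy(d, y):
    """Ek = dk * ln(1/yk) + (1 - dk) * ln(1/(1 - yk)) for a single pattern."""
    return d * math.log(1.0 / y) + (1.0 - d) * math.log(1.0 / (1.0 - y))

# Target is 1 in both cases; only the network output differs.
print(cross_entropy(1, 0.9))  # output close to the target: small error
print(cross_entropy(1, 0.1))  # output far from the target: large error
```

The error grows without bound as the output approaches the wrong extreme, so confident mistakes are penalized heavily.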

1.1.3.    Interpretation in ML framework

­          Cross entropy is the negative logarithm of the Bernoulli (coin-flip) probability that we encountered before.

­          Note: the Bernoulli parameter is a function of the network input, i.e., the probability is conditional on the input.

­          To be precise, this probabilistic part is the noise model

­          I.e., exp(-error function) is a model of the noise only, not of the observation, which is signal + noise.

­          The errors are assumed to be statistically independent.

­          This allows us to write the likelihood as a product
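
The link between the product form of the likelihood and the summed cross-entropy can be checked numerically. This is a sketch under the independence assumption stated above; the targets and outputs are made-up numbers and the function names are illustrative.

```python
import math

def bernoulli_likelihood(targets, outputs):
    """Product over patterns of y^d * (1 - y)^(1 - d), valid for independent errors."""
    likelihood = 1.0
    for d, y in zip(targets, outputs):
        likelihood *= (y ** d) * ((1.0 - y) ** (1 - d))
    return likelihood

def total_cross_entropy(targets, outputs):
    """Sum over patterns of the per-pattern cross-entropy error."""
    return sum(d * math.log(1.0 / y) + (1 - d) * math.log(1.0 / (1.0 - y))
               for d, y in zip(targets, outputs))

targets = [1, 0, 1]          # illustrative targets
outputs = [0.8, 0.3, 0.6]    # illustrative network outputs
print(-math.log(bernoulli_likelihood(targets, outputs)))
print(total_cross_entropy(targets, outputs))
```

Both lines print the same value up to rounding: taking the negative logarithm turns the product over patterns into a sum of per-pattern errors.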

1.1.4.    Network Training

­          Based on the definition of the activation function and error function, the training process updates the weights of the network step by step down the gradient of the error in the weight-error space. For logistic regression, the change in weight after pattern k is:
                                                Delta( wi ) = eta * xki * (dk - yk)
                                where     eta is the step size (“learning rate”)
                                               xki is the activation from below
                                              (dk - yk) is the error from above
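
A minimal sketch of this update rule, applied after every pattern. The toy data (logical OR), the step size, the number of passes, the small random starting weights, and all function names are assumptions chosen for illustration.

```python
import math
import random

def predict(w0, w, x):
    """Logistic output y for input vector x."""
    sigma = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-sigma))

def train_logistic(patterns, eta=0.5, passes=200, seed=0):
    """Apply Delta(wi) = eta * xki * (dk - yk) after every pattern."""
    rng = random.Random(seed)
    w0 = rng.uniform(-0.1, 0.1)                  # small random starting weights
    w = [rng.uniform(-0.1, 0.1) for _ in patterns[0][0]]
    for _ in range(passes):
        for x, d in patterns:
            error = d - predict(w0, w, x)        # error from above
            w0 += eta * error                    # bias: its input is the constant 1
            w = [wi + eta * xi * error for wi, xi in zip(w, x)]
    return w0, w

# Toy training set (logical OR): (inputs, target)
or_patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w0, w = train_logistic(or_patterns)
for x, d in or_patterns:
    print(x, d, round(predict(w0, w, x), 3))
```

Each weight change uses only quantities local to the connection: the input feeding it and the error at the output it feeds.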

­          For Logistic Regression, faster training methods exist. What is important here, however, is to bring out the parallels to MLP neural networks

1.2.        Multilayer Perceptron (MLP) / Backpropagation

­          Input and output nodes are linked through layer(s) of hidden units.

­          Each node is connected to the nodes in the adjacent layers through weighted connections.

­          Comparison between MLP and logistic regression

­          Similar: logistic activation function

­          Different: it is now the hidden units which have this activation function, and we no longer have a target for them

1.2.1.    Input-Output Space

­          An activation function is defined in the input-output space to map the input parameters to the output parameters.

­          For MLP, a hyperbolic tangent function is used as the activation function. It is defined as:
                                                    y = [1 - exp(-net)] / [1 + exp(-net)] = tanh(net / 2)
                                where          net = w0 + sum(wi * xi)
                                                    xi is the input value from training set
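
A quick numerical check relates this activation to the logistic function of logistic regression. The ratio [1 - exp(-net)] / [1 + exp(-net)] equals tanh(net / 2), which is the logistic function rescaled from the range (0, 1) to (-1, 1). Function names and test values are illustrative assumptions.

```python
import math

def mlp_activation(net):
    """y = [1 - exp(-net)] / [1 + exp(-net)], as written in the notes."""
    return (1.0 - math.exp(-net)) / (1.0 + math.exp(-net))

def logistic(net):
    """Logistic function from the logistic regression section."""
    return 1.0 / (1.0 + math.exp(-net))

for net in (-2.0, 0.0, 0.5, 3.0):
    # The three printed values agree up to rounding.
    print(net, mlp_activation(net), math.tanh(net / 2.0), 2.0 * logistic(net) - 1.0)
```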

1.2.2.    Weight-Error Space

­          An error function is defined in the weight-error space to measure the deviation of network results from that of the training set. For MLP, the squared error cost function is used as the error function. It is defined as:
                                                Ei = 1/2 * Sum (di - yi)^2
                                where        i  is the index of the “pattern”
                                                yi is the output value from the network
                                                di is the “desired” value or target from training data

1.2.3.    Interpretation in ML framework

­           

1.2.4.    Network Training

­          Based on the definition of the activation function and error function, the training process updates the weights of the network step by step down the gradient in the weight-error space. For MLP, the change in weight is:
                                                Delta( w ) = eta * activation from below * error from above
The activation from below depends on the activation function, while the error from above is backpropagated from the output layer toward the input layer through the hidden units (see handout 10-27 for details).
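
The rule above can be sketched for a network with one hidden layer. This is not the derivation from the handout: it assumes tanh hidden units, a single linear output unit, squared error, and small random starting weights, and every name in it is hypothetical.

```python
import math
import random

def init_params(n_in, n_hidden, seed=0):
    """Small random weights; 'W1'/'b1' feed the hidden layer, 'w2'/'b2' the output."""
    rng = random.Random(seed)
    def small():
        return rng.uniform(-0.5, 0.5)
    return {
        "W1": [[small() for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [small() for _ in range(n_hidden)],
        "w2": [small() for _ in range(n_hidden)],
        "b2": small(),
    }

def forward(p, x):
    """Return hidden activations h and the (assumed linear) output y."""
    h = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(p["W1"], p["b1"])]
    y = p["b2"] + sum(w * hj for w, hj in zip(p["w2"], h))
    return h, y

def backprop_step(p, x, d, eta=0.1):
    """One pattern update; returns the squared error measured before the update."""
    h, y = forward(p, x)
    delta_out = d - y                                   # error at the output unit
    # Error from above for hidden unit j: output error passed back through
    # the connecting weight, times the slope of tanh, which is 1 - h^2.
    delta_hid = [delta_out * p["w2"][j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    for j in range(len(h)):
        p["w2"][j] += eta * h[j] * delta_out            # activation from below is h[j]
    p["b2"] += eta * delta_out
    for j in range(len(h)):
        for i in range(len(x)):
            p["W1"][j][i] += eta * x[i] * delta_hid[j]  # activation from below is x[i]
        p["b1"][j] += eta * delta_hid[j]
    return 0.5 * (d - y) ** 2

params = init_params(n_in=2, n_hidden=2)
x, d = [1.0, -1.0], 1.0                                 # one made-up pattern
errors = [backprop_step(params, x, d) for _ in range(50)]
print(errors[0], errors[-1])
```

The hidden units have no target of their own, so their error term is built entirely from the output error that is passed back to them.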

1.3.        Choices in model building

1.3.1.    Number of hidden units

1.3.2.    Update frequency

­          Pattern: update the weights of the network once after the presentation of every pattern.

­          Block: update the weights of the network once after the presentation of a block of patterns (more than one pattern, and fewer than all of them).

­          Epoch: update the weights of the network once after the presentation of all patterns.
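
The three update frequencies differ only in how many patterns are accumulated before the weights change. A minimal sketch for logistic regression, with a hypothetical block_size parameter covering all three cases; the data and step size are made up.

```python
import math

def train_one_pass(patterns, w0, w, eta, block_size):
    """One pass through the data.

    block_size = 1                  -> pattern update
    1 < block_size < len(patterns)  -> block update
    block_size = len(patterns)      -> epoch update
    """
    n_updates = 0
    for start in range(0, len(patterns), block_size):
        block = patterns[start:start + block_size]
        step0 = 0.0
        step = [0.0] * len(w)
        for x, d in block:                       # accumulate with weights held fixed
            sigma = w0 + sum(wi * xi for wi, xi in zip(w, x))
            error = d - 1.0 / (1.0 + math.exp(-sigma))
            step0 += error
            step = [si + xi * error for si, xi in zip(step, x)]
        w0 += eta * step0                        # one weight update per block
        w = [wi + eta * si for wi, si in zip(w, step)]
        n_updates += 1
    return w0, w, n_updates

data = [([0.0], 0), ([0.2], 0), ([0.4], 0), ([0.6], 1), ([0.8], 1), ([1.0], 1)]
for block_size in (1, 3, len(data)):
    _, _, n = train_one_pass(data, 0.0, [0.0], eta=0.1, block_size=block_size)
    print(block_size, n)
```

With six patterns, the three settings make 6, 2, and 1 weight updates per pass, respectively.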

1.4.        Overfitting

­          Underfitting: model not complex enough for the data

­          Overfitting: model too complex compared to the data

­          Though overfitting may reduce the in-sample error, it usually increases the out-of-sample error. Several methods can be used to reduce overfitting:

1.4.1.    Stop early

­          For noisy data, we never want to train all the way to “convergence”

­          Most important: MONITOR in-sample and out-of-sample performance (“learning curve”)

­          Contrast to statisticians: often primarily interested in the final results

­          Neural networks: focus on process

­          Starting with small weights means starting with linear model (WHY?)
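
One way to act on the monitored learning curve is sketched below: remember the pass with the lowest out-of-sample error and stop once it has not improved for a few passes. The patience parameter, the function name, and the error values are assumptions for illustration, not a rule given in class.

```python
def early_stopping_point(out_of_sample_errors, patience=3):
    """Return (best_pass, best_error) from a monitored out-of-sample learning curve.

    Scanning stops once `patience` passes go by without a new minimum.
    """
    best_pass = 0
    best_error = float("inf")
    for current_pass, err in enumerate(out_of_sample_errors):
        if err < best_error:
            best_pass, best_error = current_pass, err
        elif current_pass - best_pass >= patience:
            break
    return best_pass, best_error

# Made-up curve: out-of-sample error falls, then rises as overfitting sets in.
curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
print(early_stopping_point(curve))
```

In practice the weights from the best pass would be saved and restored, since the in-sample error typically keeps falling after that point.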

1.4.2.    Use pseudo data or hints

­          Generate more data that reflect your beliefs about the process

­          Can be in the neighborhood of existing data points, via small local perturbations, or can come from discrete symmetries, e.g., reflection for foreign exchange, where viewing the rate from the other country should give the same model
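
Both kinds of pseudo data can be sketched in a few lines. The reflection here assumes that inputs and target are returns that simply change sign when the exchange rate is quoted from the other country's side; that assumption, the noise scale, and the function name are illustrative only.

```python
import random

def make_pseudo_data(patterns, noise_scale=0.01, seed=0):
    """Return the original patterns plus perturbed and reflected copies."""
    rng = random.Random(seed)
    augmented = list(patterns)
    for x, d in patterns:
        # Local small perturbation: nearby inputs keep the same target.
        jittered = [xi + rng.gauss(0.0, noise_scale) for xi in x]
        augmented.append((jittered, d))
        # Discrete symmetry (assumed): flip the sign of inputs and target.
        augmented.append(([-xi for xi in x], -d))
    return augmented

returns = [([0.01, -0.02], 0.005), ([-0.03, 0.01], -0.01)]   # made-up patterns
for pattern in make_pseudo_data(returns):
    print(pattern)
```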

1.4.3.    Normalize data in creative ways

­          Stock picking example where every stock for every day was normalized to have the sum of the squared inputs equal to unity

­          I.e., project on hypersphere
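
A minimal sketch of this normalization for a single input vector. The function name and the handling of an all-zero vector are assumptions.

```python
import math

def project_to_unit_sphere(inputs):
    """Scale the inputs so that the sum of their squares equals 1."""
    norm = math.sqrt(sum(x * x for x in inputs))
    if norm == 0.0:
        return list(inputs)      # assumption: leave an all-zero vector unchanged
    return [x / norm for x in inputs]

scaled = project_to_unit_sphere([3.0, 4.0])
print(scaled, sum(x * x for x in scaled))
```

After the projection only the direction of the input vector carries information; its overall magnitude is removed.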

1.4.4.    Use robust error function

­          E.g., log(cosh(error)) reduces the contribution of outliers.

­          Quadratic for small errors, linear for large errors (SHOW)

­          (But… what if the outlier is not an outlier—we never know)
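
The two regimes of log(cosh()) can be seen numerically. The sketch below uses the identity log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2) to avoid overflow for large errors; the function name and the sample errors are illustrative.

```python
import math

def log_cosh(error):
    """log(cosh(error)), written in a form that does not overflow for large |error|."""
    a = abs(error)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

for e in (0.01, 0.1, 1.0, 10.0, 100.0):
    # Columns: error, log(cosh(e)), quadratic e^2 / 2, linear |e| - ln 2
    print(e, log_cosh(e), 0.5 * e * e, abs(e) - math.log(2.0))
```

For small errors the second and third columns agree; for large errors the second and fourth agree, so a single outlier contributes in proportion to its size rather than to its square.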

1.4.5.    Model and penalize network complexity

­          Weight decay: W = W * lambda at each update, e.g., lambda = 0.98

­          Assumes that weights are drawn from a Gaussian (as prior probability)

­          Weight elimination (an effective count of the weights: if a weight needs to be large, do not penalize it further!)
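
The two penalties can be contrasted in a short sketch. The decay step follows the formula above. The weight-elimination penalty uses one commonly cited form, the sum of (w/scale)^2 / (1 + (w/scale)^2); that form, the scale parameter, and the function names are assumptions here, since the notes do not spell out the formula.

```python
def decay_weights(weights, lam=0.98):
    """Weight decay: shrink every weight by the factor lambda."""
    return [w * lam for w in weights]

def weight_elimination_penalty(weights, scale=1.0):
    """Assumed form: each term is near 0 for small weights and saturates near 1
    for large ones, so the sum behaves like a count of the weights in use."""
    return sum((w / scale) ** 2 / (1.0 + (w / scale) ** 2) for w in weights)

weights = [0.01, -0.02, 5.0, -40.0]          # made-up weights: two small, two large
print(decay_weights(weights))
print(weight_elimination_penalty(weights))   # close to 2: two weights "in use"
```

Because each term saturates, making an already large weight larger costs almost nothing extra, unlike a quadratic penalty.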

1.4.6.    Penalize output curvature

­          Rather than going via the weights, go directly for the functional mapping
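
A minimal sketch of a curvature penalty for a model with one input and one output, estimating the second derivative of the mapping by central differences at a set of sample points. The choice of sample points, the step h, and the function name are assumptions; the notes do not specify how the curvature is measured.

```python
def curvature_penalty(f, sample_points, h=0.01):
    """Sum of squared second derivatives of f, estimated by central differences."""
    total = 0.0
    for x in sample_points:
        second = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)
        total += second ** 2
    return total

points = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(curvature_penalty(lambda x: 3.0 * x + 1.0, points))  # straight line: essentially zero
print(curvature_penalty(lambda x: x * x, points))          # parabola: positive penalty
```

The penalty depends only on the input-output mapping f, so two networks with very different weights but the same mapping are penalized equally.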

1.4.7.    Reformulate the task

­          Fewer dimensions…