The output is logistic, has a target, and is directly connected to the input.
An activation function is defined in the input-output space to map the input parameters to the output parameters.
For logistic regression, a logistic function is used as the activation function. It is defined as:
y = 1 / [1 + exp(-Sigma)]
where Sigma = w0 + sum(wi * xi)
xi is the value of the input (can be a scalar (univariate) or a vector (multivariate))
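As a minimal sketch of the activation defined above, the following Python computes the logistic output for a bias w0, weights wi, and inputs xi. The function names and the example values are illustrative assumptions, not part of the notes.

```python
import math

def logistic(sigma):
    """Logistic activation: y = 1 / (1 + exp(-sigma))."""
    return 1.0 / (1.0 + math.exp(-sigma))

def logistic_output(w0, w, x):
    """Network output for bias w0, weights w, and inputs x (equal-length sequences)."""
    sigma = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return logistic(sigma)

# Hypothetical example values: Sigma = 0 + 1*2 + (-1)*2 = 0, so y = 0.5
y = logistic_output(0.0, [1.0, -1.0], [2.0, 2.0])
```

The output always lies strictly between 0 and 1, which is what allows it to be read as a probability.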
An error function is defined in the weight-error space to measure the deviation of the network output from the targets in the training set. For logistic regression, the cross-entropy cost function is used as the error function. It is defined as:
Ek = dk * ln(1/yk) + (1 - dk) * ln(1/(1 - yk))
where k is the index of the pattern
yk is the output value from the network
dk is the “desired” value or target from training data
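The cross-entropy error for a single pattern can be sketched as follows. This assumes a target dk of 0 or 1 and an output yk strictly between 0 and 1; the function name and the sample values are illustrative.

```python
import math

def cross_entropy(d, y):
    """E_k = d * ln(1/y) + (1 - d) * ln(1/(1 - y)), for target d in {0, 1} and 0 < y < 1."""
    return d * math.log(1.0 / y) + (1.0 - d) * math.log(1.0 / (1.0 - y))

# Hypothetical outputs for a pattern whose target is 1
e_good = cross_entropy(1, 0.9)  # confident and correct: small error
e_bad = cross_entropy(1, 0.1)   # confident and wrong: large error
```

A confident wrong answer is penalized much more heavily than a confident right answer is rewarded.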
Cross entropy is the negative logarithm of the probability that we encountered before (Bernoulli, coin flips)
Note: the parameter is a function of the network input (conditional)
To be precise, this probabilistic part is the noise model
I.e., exp(-error function) is a model of the noise only, not of the observation, which is signal + noise
The errors are assumed to be statistically independent.
This allows us to write the likelihood as a product
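The link between the product form of the likelihood and the summed cross-entropy error can be checked numerically. In this sketch the targets and outputs are made-up values, and each pattern is treated as an independent Bernoulli trial with parameter yk.

```python
import math

def bernoulli_prob(d, y):
    """Probability of observing target d (0 or 1) when the network output is y."""
    return y if d == 1 else 1.0 - y

# Hypothetical targets and network outputs for four patterns
targets = [1, 0, 1, 1]
outputs = [0.8, 0.3, 0.6, 0.9]

# Independence lets the likelihood factor into a product over patterns
likelihood = 1.0
for d, y in zip(targets, outputs):
    likelihood *= bernoulli_prob(d, y)

# The summed cross-entropy error is the negative log of that product
total_error = sum(-math.log(bernoulli_prob(d, y)) for d, y in zip(targets, outputs))
recovered = math.exp(-total_error)
```

Minimizing the summed error is therefore the same as maximizing the likelihood under the independence assumption.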
Based on the definitions of the activation function and the error function, the training process updates the weights of the network step by step along the negative gradient in the weight-error space. For logistic regression, the change in weight is derived as:
Delta( wi ) = eta * xki * (dk - yk)
where eta is the step size (“learning rate”)
xki is the activation from below
(dk - yk) is the error from above
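A minimal sketch of this update rule, applied after every pattern, is shown below. The toy data (logical OR), the learning rate, the number of epochs, and the treatment of w0 as a weight on a constant input of 1 are all assumptions made for illustration.

```python
import math

def logistic(sigma):
    return 1.0 / (1.0 + math.exp(-sigma))

def train_logistic(patterns, targets, eta=0.5, epochs=1000):
    """Apply Delta(wi) = eta * xki * (dk - yk) after each pattern."""
    w0, w = 0.0, [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            y = logistic(w0 + sum(wi * xi for wi, xi in zip(w, x)))
            err = d - y                      # error from above
            w0 += eta * err                  # bias: its input is fixed at 1
            w = [wi + eta * xi * err for wi, xi in zip(w, x)]
    return w0, w

# Hypothetical toy problem: logical OR
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
D = [0, 1, 1, 1]
w0, w = train_logistic(X, D)
preds = [1 if logistic(w0 + sum(wi * xi for wi, xi in zip(w, x))) > 0.5 else 0
         for x in X]
```

Note that no derivative of the activation appears in the update: for the logistic output combined with the cross-entropy error, it cancels.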
For logistic regression, faster training methods exist. What is important here, however, is to bring out the parallels to MLP neural networks.
Input and output nodes are linked through layer(s) of hidden units.
Each node is connected to the nodes in the adjacent layers through their weights.
Comparison between MLP and logistic regression
Similar: logistic activation function
Different: it is now the hidden units which have this activation function, and we no longer have a target for them
An activation function is defined in the input-output space to map the input parameters to the output parameters.
For MLP, a hyperbolic tangent function is used as the activation function. It is defined as:
y = [1 - exp(-net)] / [1 + exp(-net)] = tanh(net/2)
where net = w0 + sum(wi * xi)
xi is the input value from the training set
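The expression above is a hyperbolic tangent of half the net input, and also a rescaled logistic function. The following sketch checks both identities numerically; the function names and the test points are illustrative.

```python
import math

def mlp_activation(net):
    """y = (1 - exp(-net)) / (1 + exp(-net)), as defined in the notes."""
    return (1.0 - math.exp(-net)) / (1.0 + math.exp(-net))

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

points = [-2.0, -0.5, 0.0, 0.5, 2.0]
# Largest deviation from tanh(net / 2) over the test points
max_diff_tanh = max(abs(mlp_activation(n) - math.tanh(n / 2.0)) for n in points)
# Largest deviation from the rescaled logistic 2 * logistic(net) - 1
max_diff_logistic = max(abs(mlp_activation(n) - (2.0 * logistic(n) - 1.0)) for n in points)
```

The second identity is why the hidden units can be described as having a logistic-type activation: the output range is simply stretched from (0, 1) to (-1, 1).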
An error function is defined in the weight-error space to measure the deviation of the network output from the targets in the training set. For MLP, the squared error cost function is used as the error function. It is defined as:
Ei = 1/2 * Sum (di - yi)^2
where i is the index of the “pattern”
yi is the output value from the network
di is the “desired” value or target from training data
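A small sketch of the squared error for one pattern follows. The notes do not say what the sum runs over; this example assumes it runs over the output units of the network for a single pattern. The function name and values are illustrative.

```python
def squared_error(d, y):
    """E = 1/2 * sum of (d - y)^2 over the output units (assumed) for one pattern."""
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y))

# Hypothetical targets and outputs for a network with two output units
e = squared_error([1.0, 0.0], [0.8, 0.1])  # 0.5 * (0.04 + 0.01), about 0.025
```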
Based on the definitions of the activation function and the error function, the training process updates the weights of the network step by step along the negative gradient in the weight-error space. For MLP, the change in weight is derived as:
Delta( w ) = eta * activation from below * error from above
Activation from below depends on the activation function, while error from above is back-propagated from the output layer toward the input layer through the hidden units (see handout 10-27 for details).
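The rule "eta * activation from below * error from above" can be sketched for a network with one hidden layer and a single output unit. The handout is not reproduced here, so the architecture, the use of the same activation at the output, and all names and numbers are assumptions for illustration.

```python
import math

def act(net):
    return math.tanh(net / 2.0)      # equals (1 - exp(-net)) / (1 + exp(-net))

def act_deriv(y):
    return 0.5 * (1.0 - y * y)       # derivative of act, written in terms of its output y

def backprop_step(x, d, W1, b1, W2, b2, eta):
    """One pattern update; returns the new weights and the error before the update."""
    # Forward pass
    h = [act(b + sum(w * xi for w, xi in zip(row, x))) for row, b in zip(W1, b1)]
    y = act(b2 + sum(w * hj for w, hj in zip(W2, h)))
    # Error from above at the output unit, for E = 1/2 * (d - y)^2
    delta_out = (d - y) * act_deriv(y)
    # Back-propagate through the old output weights to the hidden units
    delta_hid = [delta_out * w * act_deriv(hj) for w, hj in zip(W2, h)]
    # Delta(w) = eta * activation from below * error from above
    new_W2 = [w + eta * hj * delta_out for w, hj in zip(W2, h)]
    new_b2 = b2 + eta * delta_out
    new_W1 = [[w + eta * xi * dh for w, xi in zip(row, x)]
              for row, dh in zip(W1, delta_hid)]
    new_b1 = [b + eta * dh for b, dh in zip(b1, delta_hid)]
    return new_W1, new_b1, new_W2, new_b2, 0.5 * (d - y) ** 2

# Hypothetical starting weights and a single training pattern
W1, b1, W2, b2 = [[0.1, -0.2], [0.3, 0.1]], [0.0, 0.0], [0.2, -0.1], 0.0
errors = []
for _ in range(50):
    W1, b1, W2, b2, e = backprop_step([1.0, -1.0], 0.5, W1, b1, W2, b2, eta=0.5)
    errors.append(e)
```

Unlike the logistic regression case, the derivative of the activation appears explicitly, because it no longer cancels against the error function.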
Pattern: update the weights of the network once after the presentation of every pattern.
Block: update the weights of the network once after the presentation of a block of patterns (more than one pattern, and fewer than all of them).
Epoch: update the weights of the network once after the presentation of all patterns.
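The three update modes differ only in how many patterns are presented before the weights change. The sketch below uses a hypothetical single-weight linear unit and made-up data, and it sums (rather than averages) the weight changes within a block; both choices are assumptions.

```python
def one_pass(data, w, eta, block_size):
    """One pass over data, updating w after every block_size patterns.

    block_size = 1 is pattern mode, block_size = len(data) is epoch mode,
    and anything in between is block mode. Returns (w, number_of_updates).
    """
    updates = 0
    accumulated = 0.0
    for k, (x, d) in enumerate(data, start=1):
        y = w * x                        # hypothetical single-weight linear unit
        accumulated += x * (d - y)       # activation from below * error from above
        if k % block_size == 0 or k == len(data):
            w += eta * accumulated
            accumulated = 0.0
            updates += 1
    return w, updates

# Hypothetical data generated by d = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_pattern, n_pattern = one_pass(data, 0.0, 0.01, block_size=1)
w_block, n_block = one_pass(data, 0.0, 0.01, block_size=2)
w_epoch, n_epoch = one_pass(data, 0.0, 0.01, block_size=len(data))
```

In pattern mode each update already sees the effect of the previous one within the same pass, so the three modes generally end a pass at different weights.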
Underfitting: model not complex enough for the data
Overfitting: model too complex compared to the data
Though overfitting may reduce the in-sample error, it usually increases the out-of-sample error. Several methods can be used to reduce overfitting:
For noisy data, never want to reach “convergence”
Most important: MONITOR in-sample and out-of-sample performance (“learning curve”)
Contrast to statisticians: often primarily interested in the final results
Neural networks: focus on process
Starting with small weights means starting with linear model (WHY?)
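One way to see why: with small weights the net input to each unit is small, and near zero the activation is almost a straight line (y is approximately net/2). The sketch below compares the activation with that linear approximation; the sample net inputs are illustrative.

```python
import math

def act(net):
    """MLP activation from the notes: (1 - exp(-net)) / (1 + exp(-net))."""
    return (1.0 - math.exp(-net)) / (1.0 + math.exp(-net))

# Small net input (small weights): activation is nearly linear
small_gap = abs(act(0.01) - 0.01 / 2.0)
# Large net input (large weights): the unit saturates and the linear rule fails
large_gap = abs(act(3.0) - 3.0 / 2.0)
```

A network in which every unit operates in this linear regime is a composition of linear maps, which is itself a linear model.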
Generate more data that reflect your beliefs about the process
Can be in the neighborhood by local small perturbations, or can be discrete symmetries, e.g., reflection for foreign exchange, where viewing the rate from the other country should give the same model
Stock picking example where every stock for every day was normalized to have the sum of the squared inputs equal to unity
I.e., project on hypersphere
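The normalization described above can be sketched as dividing each input vector by its Euclidean length, so that the squared inputs sum to one. The function name, the handling of an all-zero vector, and the sample values are assumptions.

```python
import math

def project_on_hypersphere(x):
    """Scale x so that the sum of its squared components equals 1."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return list(x)  # assumption: leave an all-zero vector unchanged
    return [v / norm for v in x]

# Hypothetical input vector for one stock on one day
z = project_on_hypersphere([3.0, 4.0])   # length 5, so the result is [0.6, 0.8]
sum_of_squares = sum(v * v for v in z)
```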
E.g., log(cosh()) reduces the contribution of outliers.
Quadratic for small errors, linear for large errors (SHOW)
(But… what if the outlier is not an outlier—we never know)
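The two regimes of log(cosh(e)) can be shown numerically: for small errors it is close to e^2 / 2, and for large errors it is close to |e| - ln 2. The function name and the sample errors below are illustrative.

```python
import math

def log_cosh(e):
    """Robust error function log(cosh(e))."""
    return math.log(math.cosh(e))

# Small error: behaves like the quadratic e^2 / 2
small_gap = abs(log_cosh(0.01) - 0.01 ** 2 / 2.0)
# Large error: behaves like the linear |e| - ln 2
large_gap = abs(log_cosh(20.0) - (20.0 - math.log(2.0)))
# Compare with squared error for the same large deviation
ratio_to_squared = log_cosh(20.0) / (0.5 * 20.0 ** 2)
```

Because the growth is only linear for large errors, a single outlier pulls on the weights far less than it would under squared error.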
Weight decay: W = W * lambda; lambda = 0.98
Assumes that weights are drawn from a Gaussian (as prior probability)
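A minimal sketch of the decay step follows, using lambda = 0.98 from the notes. The function name and starting weights are made up, and in practice this step would presumably be interleaved with the gradient updates, which is an assumption here. A multiplicative shrink of this kind is what a gradient step on a penalty proportional to the sum of squared weights produces, which is where the Gaussian prior comes in.

```python
def apply_weight_decay(weights, lam=0.98):
    """Shrink every weight toward zero: W = W * lambda."""
    return [w * lam for w in weights]

# Hypothetical weights, decayed ten times with no gradient updates in between
w = [1.0, -2.0, 0.5]
for _ in range(10):
    w = apply_weight_decay(w)
```

Weights that the data do not keep pushing up fade away geometrically, by a factor of lambda per step.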
Weight elimination (effective count of weights: if a weight should be large, don’t penalize it extra!)
Rather than going via the weights, go directly for the functional mapping
Fewer dimensions…