PD Stefan Bosse
University of Siegen - Dept. Maschinenbau
University of Bremen - Dept. Mathematics and Computer Science
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models -
Adapting dynamic parameters of a functional network is an iterative optimization problem
Commonly the solution space is infinite, i.e., there is no single unique solution to the optimization problem.
Basic training is demonstrated for an Artificial Neural Network
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - A simple Artificial Neuron
A simple neuron (perceptron) is a mapping function f (a model) that maps an n-dimensional input vector x onto a scalar output value y:
f(\vec{x}, \vec{w}, b) = g\left( \sum_{i=1}^{n} w_i x_i + b \right)
Here w is the weight vector and b an offset or bias (the dynamic parameters). The function g is called the transfer or activation function and is normally not parametrized.
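As a minimal sketch, such a neuron can be written in R; the sigmoid transfer function g and the example parameter values below are assumptions for illustration only:

# Minimal sketch of a single neuron with n inputs, assuming g = sigmoid.
sigmoid = function(x) 1/(1+exp(-x))
neuron  = function(x, w, b) sigmoid(sum(w*x) + b)   # g(sum_i w_i*x_i + b)

# usage: two inputs, arbitrary example parameters
neuron(c(0.5, 1.0), w=c(0.2, -0.4), b=0.1)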
A single neuron with a single input p and an output o. w is a weighting factor (a weight for incoming p) and b is a bias (offset)
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - A Multi-input Artificial Neuron
A single neuron with an input vector p and a scalar output o. w is a weighting factor vector (a weight for incoming p) and b is a bias (offset)
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Artificial Neural Network
An ANN is a function graph consisting of interconnected neurons. It is a graph G(V,E) with a set of vertices V (the neurons) and edges E connecting them.
Commonly, neurons are arranged and grouped in layers, but this is not mandatory. There is always one input and one output layer. Hidden layers lie between the input and output layers.
The input layer (commonly) consists of n neurons for n input variables (attributes).
The output layer (commonly) consists of m neurons for m output variables (regression) or m target classes (classification)
Commonly (but not mandatorily), each neuron of a layer i is connected to the outputs of all neurons of the previous layer i-1.
Neural network with neurons arranged in one layer
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Loss and Error Functions
Assume there is a set of data samples D; each sample contains an input feature vector x and an output target feature vector y.
The goal of model training is to find a model function that maps x onto y with minimal error over all instances (at least on average).
The loss or error function defines the mismatch between a training or test sample and the output of the function f (here for one scalar output y):
y = f(\vec{x})
\mathrm{MAE}(y, y_0) = |y_0 - y|
\mathrm{MBE}(y, y_0) = y_0 - y
\mathrm{MSE}(y, y_0) = (y_0 - y)^2
\vec{y} = f(\vec{x})
\mathrm{MAE}(\vec{y}, \vec{y}_0) = \frac{1}{n}\sum_{i=1}^{n} |y_i - y_{0,i}|
\mathrm{MBE}(\vec{y}, \vec{y}_0) = \frac{1}{n}\sum_{i=1}^{n} (y_i - y_{0,i})
\mathrm{MSE}(\vec{y}, \vec{y}_0) = \frac{1}{n}\sum_{i=1}^{n} (y_i - y_{0,i})^2
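A small R sketch of these error measures, assuming a prediction vector y and a target vector y0 of equal length n:

# Error measures for prediction vector y and target vector y0 (same length).
MAE = function(y, y0) mean(abs(y - y0))    # mean absolute error
MBE = function(y, y0) mean(y - y0)         # mean bias error (signed)
MSE = function(y, y0) mean((y - y0)^2)     # mean squared error

y  = c(0.1, 0.4, 0.9); y0 = c(0.0, 0.5, 1.0)
c(MAE(y, y0), MBE(y, y0), MSE(y, y0))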
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Training by Error Backpropagation
Most CNN layers involve parameters which need to be tuned appropriately for a given computer vision task (e.g., image classification or object detection).
Assume again a single perceptron neuron with only two inputs x1 and x2.
Then we can change the respective weight parameter w_i simply by computing the "forward" application error and subtracting the error, multiplied by the current input value, from the weight w_i (a rough approximation!):
w'_i = w_i - \alpha (y - y_0) x_i
data = list(
  list(x1=0, x2=0, y=0),
  list(x1=1, x2=0, y=0.3),
  list(x1=0, x2=1, y=0.5),
  list(x1=1, x2=1, y=1)
)
sigmoid = function(x) { 1/(1+exp(-x)) }
neuron  = function(x1, x2, w, b) {
  accu = x1*w[1] + x2*w[2]   # weighted sum of the inputs
  sigmoid(accu + b)          # activation with bias offset
}
w = c(0,0); b = 0
samples = 1:4
rate = 0.01                       # learning rate alpha
for (run in 1:1000) {
  set = sample(samples, 1)        # randomly pick one training sample
  row = data[[set]]
  y   = neuron(row$x1, row$x2, w, b)
  err = y - row$y                 # forward application error
  w[1] = w[1] - rate*err*row$x1   # weight update scaled by input and rate
  w[2] = w[2] - rate*err*row$x2
  b    = b    - rate*err
}
print(w); print(b)
Training with randomly selected sample instances
for (index in 1:4) {
  row = data[[index]]
  y   = neuron(row$x1, row$x2, w, b)
  print(paste('Index', index, 'Predicted', y, 'Error', y - row$y))
}
Test with sample instances
w'_i = w_i - \alpha \frac{\partial (y - y_0)}{\partial w_i}
The learning rate α determines the steps to be taken along the slope to achieve the goal. Too large steps could result in jumping over or missing the point of the global minimum (also known as overshooting), and too small steps result in a very slow process of achieving the goal. This is a hyperparameter that needs to be tuned. In practice, people often start with 0.01 and either decrease or increase it accordingly. (Aminah Mardiyyah Rufai)
But: We have a lot of different training samples, and if we change the parameter only based on the error from the current sample we will not converge to an average!
Therefore, only a small fraction given by the learning rate parameter α is used!
Up to here we considered only one functional node (one neuron).
If parameters of functions of previous nodes/layers must be adapted, the process is a little more complicated, although the same principle applies: in general, the derivative of the error function with respect to the respective weight/parameter to be adjusted must be computed:
\frac{\partial E}{\partial w_i}
Matt Mazur, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ Example network with two input nodes, two inner (hidden) nodes, and two output nodes
\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial out_i} \cdot \frac{\partial out_i}{\partial net_i} \cdot \frac{\partial net_i}{\partial w_i}
In the hidden (inner) layer, we start with the same formula, but slightly modified to account for the fact that the output of each hidden-layer neuron contributes to the output (and therefore the error) of multiple output neurons.
We know that out_h1 affects both out_o1 and out_o2; therefore, the gradient needs to take into account its effect on both output neurons:
\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}
\frac{\partial E}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}
Error backpropagation from output to inner layer nodes must consider error accumulation by multiple nodes
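The following R sketch illustrates this accumulation for a small 2-2-2 network with sigmoid neurons (the structure of the Mazur example). All numerical values below are arbitrary illustration values, a squared-error loss E = 1/2 Σ(t − out)² and a single bias per layer are assumptions:

sigmoid  = function(x) 1/(1+exp(-x))
dsigmoid = function(o) o*(1-o)            # derivative written via the output o

x  = c(1.0, 0.5)                          # two inputs (illustrative values)
Wh = matrix(c(0.1, 0.3, -0.2, 0.4), 2, 2, byrow=TRUE)   # hidden-layer weights
Wo = matrix(c(0.2, -0.5, 0.7, 0.1), 2, 2, byrow=TRUE)   # output-layer weights
bh = 0.05; bo = -0.1                      # one bias per layer (assumption)
t  = c(0, 1)                              # target outputs

# forward pass
out_h = sigmoid(Wh %*% x + bh)
out_o = sigmoid(Wo %*% out_h + bo)

# backward pass: output-layer delta terms for E = 0.5*sum((t-out_o)^2)
delta_o = (out_o - t) * dsigmoid(out_o)

# dE/dout_h1 accumulates the contribution of BOTH output neurons
dE_douth1 = sum(delta_o * Wo[, 1])

# chain rule down to w1 (the weight connecting input x1 to hidden neuron h1)
dE_dw1 = dE_douth1 * dsigmoid(out_h[1]) * x[1]
print(dE_dw1)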
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Weight Initialization
A correct weight initialization is the key to stably train very deep networks. An ill-suited initialization can lead to the vanishing or exploding gradient problem during error back-propagation.
A common approach to weight initialization in CNNs is the Gaussian random initialization technique. This approach initializes the convolutional and the fully connected layers using random matrices whose elements are sampled from a Gaussian distribution with zero mean and a small standard deviation (e.g., 0.1 and 0.01).
The uniform random initialization approach initializes the convolutional and the fully connected layers using random matrices whose elements are sampled from a uniform distribution (instead of a normal distribution as in the earlier case) with a zero mean and a small standard deviation (e.g., 0.1 and 0.01).
A random initialization of a neuron makes the variance of its output directly proportional to the number of its incoming connections (a neuron's fan-in measure). The Xavier initialization compensates for this by scaling the weight variance with the fan-in and fan-out:
\mathrm{Var}(w) = \frac{2}{n_{fin} + n_{fout}}
where w are network weights. Note that the fan-out measure is used in the variance above to balance the back-propagated signal as well. Xavier initialization works quite well in practice and leads to better convergence rates.
Neurons (or filters with transfer functions) with a ReLU non-linearity do not follow the assumptions made for the Xavier initialization; for them, the variance is scaled by the fan-in only (He initialization):
\mathrm{Var}(w) = \frac{2}{n_{fin}}
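As a minimal R sketch, both variance-scaled initializations for a fully connected layer with n_in inputs and n_out outputs could look like this (Gaussian sampling is assumed here; uniform sampling with the same variance works as well):

# Variance-scaled weight initialization for an n_in x n_out layer.
xavier_init = function(n_in, n_out)
  matrix(rnorm(n_in*n_out, mean=0, sd=sqrt(2/(n_in + n_out))), n_in, n_out)

he_init = function(n_in, n_out)            # for ReLU non-linearities
  matrix(rnorm(n_in*n_out, mean=0, sd=sqrt(2/n_in)), n_in, n_out)

W1 = xavier_init(64, 32)
W2 = he_init(64, 32)
c(var(as.vector(W1)), var(as.vector(W2)))  # approx. 2/96 and 2/64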
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Pre-training
One approach to avoid the gradient diminishing or exploding problem is to use layer-wise pre-training in an unsupervised fashion.
The unsupervised pre-training can be followed by a supervised fine-tuning stage to make use of any available annotations.
However, due to the new hyper-parameters, the considerable amount of effort involved in such an approach, and the availability of better initialization techniques, layer-wise pre-training is seldom used now to enable the training of very deep CNN-based networks.
(not a good idea)
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Supervised Pre-Training
In practical scenarios, it is desirable to train very deep networks, but we do not have a large amount of annotated data available for many problem settings.
A very successful practice in such cases is to first train the neural network on a related but different problem, where a large amount of training data is already available.
Afterward, the learned model can be “adapted” to the new task by initializing with weights pre-trained on the larger dataset.
This process is called “fine-tuning” and is a simple, yet effective, way to transfer learning from one task to another.
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Training and Validation (Test)
The set of data samples is commonly split into two sub-sets: a training set and a validation (test) set.
For gradient error back-propagation, commonly linear error functions are used. For the validation, higher-order functions (like MSE) can be used.
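A minimal R sketch of such a random split; the 80/20 ratio and the generic data frame df below are assumptions for illustration:

# Random 80/20 split of a data frame df into training and test sub-sets.
split_data = function(df, ratio=0.8) {
  idx = sample(nrow(df), size=floor(ratio*nrow(df)))
  list(train=df[idx, ], test=df[-idx, ])
}

df   = data.frame(x1=runif(100), x2=runif(100), y=runif(100))
sets = split_data(df)
c(nrow(sets$train), nrow(sets$test))     # 80 and 20 rows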
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Regularization
Since deep convolutional and other neural networks have a large number of parameters, they tend to over-fit the training data during the learning process.
Regularization approaches aim to avoid this problem using several intuitive ideas.
We can categorize common regularization approaches into the following classes, based on their central idea: data augmentation; dropout; ensemble model averaging; early stopping; and parameter norm penalties and constraints (ℓ1 norm, ℓ2 norm, max-norm, and elastic net constraints). The first four are discussed in the following slides.
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Data Augmentation
Data augmentation is the easiest, and often a very effective way of enhancing the generalization power of CNN models. Especially for cases where the number of training examples is relatively low, data augmentation can enlarge the dataset (by factors of 16x, 32x, 64x, or even more) to allow a more robust training of large-scale models.
Data augmentation is performed by making several copies of a single image using straightforward operations such as rotations, cropping, flipping, scaling, translations, and shearing. These operations can be performed separately or combined, e.g., to form copies which are both flipped and cropped.
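As a toy R sketch of two of these operations on an image stored as a grey-value matrix (real augmentation pipelines typically use dedicated image libraries):

# Image stored as a grey-value matrix; two simple augmentation operations.
flip_horizontal = function(img) img[, ncol(img):1]      # mirror left/right
rotate90        = function(img) t(img[nrow(img):1, ])   # rotate 90 deg clockwise

img = matrix(1:12, nrow=3, byrow=TRUE)
flip_horizontal(img)
rotate90(img)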
Khan, 2018 Examples of data augmentation using image cropping, flipping, and rotation
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Drop Out
One of the most popular approaches for neural network regularization is the dropout technique.
During network training, each neuron is activated with a fixed probability (usually 0.5 or set using a validation set).
This random sampling of a sub-network within the full-scale network introduces an ensemble effect during the testing phase, where the full network is used to perform prediction.
Activation dropout works really well for regularization purposes and gives a significant boost in performance on unseen data in the test phase.
A random dropout layer generates a mask m ∈ B^m, where each element m_i is independently sampled from a Bernoulli distribution with probability p of being on (and 1−p of being off).
\vec{a}^{\,l} = \vec{m} \odot f\left(\hat{W} \cdot \vec{a}^{\,l-1} + \vec{b}^{\,l}\right)
Here, a ∈ ℝ^n and b ∈ ℝ^m denote the activations and biases, respectively, W ∈ ℝ^{m×n} is the weight matrix, and f is the transfer function.
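A minimal R sketch of this masking for one layer; the sigmoid transfer function, the layer sizes, and the keep probability p below are illustrative assumptions:

# Dropout on one layer: a_l = m * f(W a_{l-1} + b_l), m ~ Bernoulli(p)
sigmoid = function(x) 1/(1+exp(-x))

dropout_layer = function(a_prev, W, b, p=0.5) {
  m = rbinom(nrow(W), size=1, prob=p)      # Bernoulli mask, one bit per neuron
  m * sigmoid(W %*% a_prev + b)            # element-wise masking of activations
}

W = matrix(rnorm(4*3), nrow=4); b = rnorm(4); a_prev = runif(3)
dropout_layer(a_prev, W, b)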
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Ensemble Model Averaging
The ensemble averaging approach is another simple, but effective, technique where a number of models are learned instead of just a single model.
Each model has different parameters due to different random initializations, different hyper-parameter choices (e.g., architecture, learning rate) and/or different sets of training inputs.
The output from these multiple models is then combined to generate a final prediction score.
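A hedged sketch of the combination step; train_model() and predict_model() are hypothetical stand-ins for any of the training and prediction routines shown earlier in this module:

# Ensemble averaging over k independently trained models.
# train_model() and predict_model() are hypothetical stand-ins.
ensemble_predict = function(x, k=5, train_model, predict_model) {
  models = lapply(1:k, function(i) { set.seed(i); train_model() })  # different random init per model
  preds  = sapply(models, function(m) predict_model(m, x))
  mean(preds)                        # combined (averaged) prediction score
}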
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Early Stopping
The overfitting problem occurs when a model performs very well on the training set but behaves poorly on unseen data.
Early stopping is applied to avoid overfitting in the iterative gradient-based algorithms.
This is achieved by evaluating the performance on a held-out validation set at different iterations during the training process.
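A hedged sketch of the stopping rule; train_one_epoch() and validation_error() are hypothetical stand-ins for the actual training and evaluation routines, and the patience value is an assumption:

# Early stopping: stop when the validation error has not improved for
# 'patience' consecutive epochs.
early_stopping = function(train_one_epoch, validation_error,
                          max_epochs=1000, patience=10) {
  best = Inf; wait = 0
  for (epoch in 1:max_epochs) {
    train_one_epoch()
    err = validation_error()
    if (err < best) { best = err; wait = 0 } else { wait = wait + 1 }
    if (wait >= patience) break      # validation error stopped improving
  }
  epoch                              # number of epochs actually trained
}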
Khan, 2018: An illustration of the early stopping approach during network training, using the validation error for decision making instead of a pre-defined training error threshold.
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Gradient-based CNN Learning
The CNN learning process tunes the parameters of the network such that the input space is correctly mapped to the output space.
Each iteration which updates the parameters using the complete training set is called a “training epoch".
Each training iteration at time t modifies the parameters using the following update equation (the same for linear filter mask weights as well as for non-linear neuronal functions):
\theta_t = \theta_{t-1} - \alpha \delta_t, \quad \delta_t = \nabla_{\theta} F(\theta_t)
But in contrast to a neuron, which sees fixed input data for a given data sample, the filter mask of a convolution operation moves as a window over the entire input matrix!
Let's say we have a 3×3 image I and a 2×2 filter W. Sliding this filter over the image will produce a 2×2 output (no padding).
O_{11} = I_{11}W_{11} + I_{12}W_{12} + I_{21}W_{21} + I_{22}W_{22}
O_{12} = I_{12}W_{11} + I_{13}W_{12} + I_{22}W_{21} + I_{23}W_{22}
O_{21} = I_{21}W_{11} + I_{22}W_{12} + I_{31}W_{21} + I_{32}W_{22}
O_{22} = I_{22}W_{11} + I_{23}W_{12} + I_{32}W_{21} + I_{33}W_{22}
Averaging the four outputs gives a scalar o (used here, for simplicity, directly as the loss L):
o = \frac{O_{11} + O_{12} + O_{21} + O_{22}}{4}
The gradient of the loss with respect to the filter weights is then the matrix
\frac{\partial L}{\partial W} = \begin{bmatrix} \frac{\partial L}{\partial W_{11}} & \frac{\partial L}{\partial W_{12}} \\ \frac{\partial L}{\partial W_{21}} & \frac{\partial L}{\partial W_{22}} \end{bmatrix}
The error must be computed and accumulated for all pixels of the input image!
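A small R sketch of this accumulation for the 3×3/2×2 example above, assuming (for illustration only) that the averaged output o is used directly as the loss L:

# Gradient of L = o = mean(O) w.r.t. the 2x2 filter W, accumulated over all
# window positions of the 3x3 input I.
I = matrix(1:9, nrow=3, byrow=TRUE)            # toy 3x3 input image
W = matrix(c(0.1, 0.2, 0.3, 0.4), nrow=2, byrow=TRUE)

O  = matrix(0, 2, 2)
dW = matrix(0, 2, 2)                           # accumulated dL/dW
for (i in 1:2) for (j in 1:2) {
  patch  = I[i:(i+1), j:(j+1)]
  O[i,j] = sum(patch * W)                      # window output O_ij
  dW     = dW + patch/4                        # dL/dW contribution of O_ij
}
print(dW)      # every filter weight collects input values from ALL positions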
Gradient descent algorithms work by computing the gradient of the objective function with respect to the network parameters, followed by a parameter update in the direction of the steepest descent.
The basic version of the gradient descent, termed “batch gradient descent,” computes this gradient on the entire training set.
However, the training sets can be very large in computer vision problems, and therefore learning via the batch gradient descent can be prohibitively slow because for each parameter update, it needs to compute the gradient on the complete training set.
Stochastic Gradient Descent (SGD) performs a parameter update for each set of input and output that are present in the training set.
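For contrast, a hedged sketch of one epoch of batch gradient descent for the single-neuron example trained earlier (same rough update rule as in the lecture code, but the gradient is accumulated over all samples before a single update):

# One epoch of batch gradient descent for the neuron defined earlier:
# the gradient is accumulated over ALL samples, then applied once.
grad_w = c(0,0); grad_b = 0
for (index in 1:4) {
  row = data[[index]]
  err = neuron(row$x1, row$x2, w, b) - row$y
  grad_w = grad_w + err * c(row$x1, row$x2)
  grad_b = grad_b + err
}
w = w - rate * grad_w / 4        # averaged gradient over the batch
b = b - rate * grad_b / 4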
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Gradient Computation
\nabla = \frac{\partial u}{\partial v} \approx \frac{\Delta u}{\Delta v} = \frac{u_i - u_{i-1}}{v_i - v_{i-1}}
But such a difference formula tends to be very inaccurate for large gradients (which are not known in advance and are dynamic). Analytical differentiation (of a node function) is therefore preferred where possible.
On the other hand, analytically deriving the derivatives of complex expressions is time-consuming and laborious. Furthermore, it is necessary to model the layer operation as a closed-form mathematical expression. However, it provides an accurate value for the derivative at each point.
Gradients of functions f can be computed by:
Numerical differentiation (finite differences): \frac{\Delta f}{\Delta x} = \frac{f(x+h) - f(x)}{h}
Analytical differentiation (for simple functions)
Symbolic differentiation (for complex functions)
Programmed (automatic) differentiation
Every computer program is implemented using a programming language, which only supports a set of basic functions (e.g., addition, multiplication, exponentiation, logarithm and trigonometric functions). Automatic differentiation uses this modular nature of computer programs to break them into simpler elementary functions. The derivatives of these simple functions are computed symbolically and the chain rule is then applied repeatedly to compute any order of derivatives of complex programs.
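A small R illustration of the numerical, analytical, and symbolic routes for the sigmoid function; base R's D() is used here for the symbolic derivative and is only a simple stand-in for a full automatic-differentiation framework:

# Comparing differentiation routes on f(x) = 1/(1+exp(-x)).
f  = function(x) 1/(1+exp(-x))
x0 = 0.5; h = 1e-6

num  = (f(x0+h) - f(x0)) / h               # numerical (finite difference)
ana  = f(x0)*(1-f(x0))                     # analytical: f'(x) = f(x)(1-f(x))
dsym = D(expression(1/(1+exp(-x))), "x")   # symbolic derivative via base R's D()
sym  = eval(dsym, list(x=x0))

print(c(numerical=num, analytical=ana, symbolic=sym))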
Khan, 2018 Relationships between different differentiation methods
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Summary
Error backpropagation requires a previous forward computation to get the error and to compute the error gradients (Bazaga et al., 2019).
PD Stefan Bosse - AFEML - Module F: Training and Validation of data-driven Models - Understanding CNN by Visualization
The visualization can be categorized into three types, depending on the network signal that is used to obtain it, i.e., weights, activations, and gradients. We summarize some of these three types of visualization methods below.
Visualization of regions which are important for the correct prediction from a deep network.
This is an iterative method to get either a heatmap of regions showing their contribution to a classification result, or to mask out irrelevant regions.
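A hedged sketch of the occlusion idea; predict_prob() is a hypothetical stand-in for the network's probability of the correct class, and the patch size and grey value are assumptions:

# Occlusion-based importance map: slide a grey patch over the image and
# record how the class probability changes at each position.
occlusion_heatmap = function(img, predict_prob, patch=8) {
  H = nrow(img); W = ncol(img)
  heat = matrix(NA, H - patch + 1, W - patch + 1)
  for (i in 1:(H - patch + 1)) for (j in 1:(W - patch + 1)) {
    occluded = img
    occluded[i:(i+patch-1), j:(j+patch-1)] = 0.5   # grey occluder patch
    heat[i, j] = predict_prob(occluded)            # drop => important region
  }
  heat
}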
(a) The grey regions in the input images are sequentially occluded and the output probability of the correct class is plotted as a heat map (blue regions indicate high importance for correct classification). (b) Segmented regions in an image are occluded until only the minimal image details that are required for correct scene class prediction are left.