PD Stefan Bosse
University of Siegen - Dept. Maschinenbau
University of Bremen - Dept. Mathematics and Computer Science
PD Stefan Bosse - AFEML - Module G: Advanced Object Detectors and Classifiers
How can we find, localize, and classify objects in engineering images?
What are the challenges and pitfalls?
Which models exist already - are they suitable?
Classifying the entire image containing one object ⇒ Simple Task (e.g., by using CNN or ANN models)
Finding (detecting) and classifying (multiple) objects ⇒ Challenging task!
Object classification output: A discrete class label c ∈ C
Object localization output: A position, i.e., a bounding box b ∈ ℝⁿ, or a centre point p_c(x,y,..)
Object detection output: Probability that there is one object (or multiple objects)
Object recognition output: Class and bounding box
Region proposal: A ROI either contains an object (foreground) or contains no object (background)
Object classification, localization, detection, and higher-order feature prediction (e.g., geometrical features)
How to find objects in an image? The search parameter space is very big and high-dimensional!
This is a chicken-and-egg problem: To find relevant regions we first need some understanding (i.e., classification) of objects, and only then can we estimate the bounding box.
Generic features like edges, geometries (closed polygon paths), colour boundaries, or colour similarities can help to find ROIs
Object recognition is a hybrid classification and regression problem!
machinelearningmastery.com/object-recognition-with-deep-learning/ Overview of Object Recognition Computer Vision Tasks: Taxonomy of object recognition
Many spatially distributed multi-object classifier models consist of two stages: region proposal and classification
Given an input tensor of size CxHxW constructed from pixel values of some image ...
A. Brown, 2017, NVIDIA
Region proposal networks (RPN) deliver a map of bounding boxes, i.e., they simultaneously predict object bounding boxes and objectness scores at each position.
Each bounding box is assigned a probability that the region contains an object feature (or not)
An RPN is a fully convolutional network that is trained end-to-end to produce high-quality region proposals, which are used by Fast R-CNN for object detection
The false positive and false negative prediction rates are not equally critical: A higher FP rate is acceptable, a higher FN rate is bad (missed objects, e.g., smaller ones)!
https://towardsdatascience.com/region-proposal-network-a-detailed-view-1305c7875853 Anchor boxes generation
Examples of region proposals originating from fixed spaced anchor points
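The generation of anchor boxes from fixed spaced anchor points can be sketched as follows. This is only a minimal illustration, not the implementation of a specific framework: the function name `generate_anchors` and the parameters `stride`, `scales`, and `ratios` are assumptions chosen for this example.

```python
import math

def generate_anchors(image_w, image_h, stride, scales, ratios):
    """Return anchor boxes (x1, y1, x2, y2) centred on a regular grid of anchor points.

    stride : distance in pixels between neighbouring anchor points (assumed parameter)
    scales : side lengths of the square reference boxes (assumed parameter)
    ratios : width/height ratios applied to each scale (assumed parameter)
    """
    anchors = []
    for cy in range(stride // 2, image_h, stride):
        for cx in range(stride // 2, image_w, stride):
            for s in scales:
                for r in ratios:
                    # Keep the box area at s*s while changing the aspect ratio
                    w = s * math.sqrt(r)
                    h = s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(64, 64, stride=16, scales=[16, 32], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 4 x 4 anchor points, 2 scales, 3 ratios each
```

Every anchor point carries the same set of box shapes, so the number of anchors grows with the product of grid points, scales, and ratios.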
The R-CNN family of methods refers to the R-CNN, which may stand for “Regions with CNN Features” or “Region-Based Convolutional Neural Network,” developed by Ross Girshick, et al.
This includes the techniques:
The R-CNN model is comprised of three modules; they are:
The anchor boxes are created a priori as a best guess; they are dummy boxes that differ from the actual objects of interest.
Also, many boxes may not contain any object. So we need to learn whether a given box is foreground or background, and at the same time we need to learn the offsets that adjust the foreground boxes to fit the objects.
These two tasks are achieved by two convolution layers on the feature map obtained from the backbone network. These layers are rpn_cls_score and rpn_bbox_pred; the architecture is shown below.
Once these fg/bg scores and offsets are learned using convolution layers, some portions of fg and bg boxes are considered according to confidence scores.
The offsets are applied to those boxes to get the actual ROIs to be processed further.
This post-processing of anchor boxes using offsets is called proposal generation.
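Proposal generation as described above can be sketched in a few lines. The slides do not state how the offsets are encoded; the sketch assumes the commonly used centre/size parameterization (dx, dy relative to the anchor size, dw, dh as logarithmic scale factors). As a simplification it only keeps the `top_k` anchors with the highest foreground score. All function and parameter names are hypothetical.

```python
import math

def apply_offsets(anchor, offsets):
    """Shift and scale one anchor box (x1, y1, x2, y2) by offsets (dx, dy, dw, dh).

    Assumed encoding: dx, dy move the centre in units of the anchor width/height,
    dw, dh scale width/height by exp(dw), exp(dh).
    """
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = offsets
    wa, ha = x2 - x1, y2 - y1
    cxa, cya = x1 + wa / 2, y1 + ha / 2
    cx, cy = cxa + dx * wa, cya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def generate_proposals(anchors, fg_scores, offsets, top_k=2):
    """Keep the top_k anchors by foreground score and apply their offsets."""
    order = sorted(range(len(anchors)), key=lambda i: fg_scores[i], reverse=True)
    return [apply_offsets(anchors[i], offsets[i]) for i in order[:top_k]]

example_anchors = [(0, 0, 10, 10), (10, 10, 20, 20), (5, 5, 15, 15)]
example_scores = [0.2, 0.9, 0.5]
example_offsets = [(0, 0, 0, 0)] * 3
proposals = generate_proposals(example_anchors, example_scores, example_offsets)
```

With all offsets set to zero the proposals are identical to the selected anchors, which is a quick sanity check for the encoding.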
Scaling of images to a normalized image (tensor/volume) size
In order to extract features from a region of a given image, the region is first converted to make it compatible with the network input. More precisely, irrespective of the candidate region’s aspect ratio or size, all pixels in a tight bounding box around it are warped to the required size.
A CNN convolution operation always applies a kernel of fixed size, and a CNN with fully connected layers requires an input matrix of fixed size
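The warping of a candidate region to a fixed input size can be sketched with nearest-neighbour sampling. This is a minimal illustration for a single-channel image stored as a list of rows; the function name `warp_region` and the choice of nearest-neighbour sampling are assumptions, real pipelines typically use an image library with interpolation.

```python
def warp_region(image, box, out_h, out_w):
    """Crop box = (x1, y1, x2, y2) (x2, y2 exclusive) from a 2D image and
    resample it to out_h x out_w, ignoring the original aspect ratio."""
    x1, y1, x2, y2 = box
    in_h, in_w = y2 - y1, x2 - x1
    warped = []
    for r in range(out_h):
        # Nearest source row for output row r
        src_r = y1 + min(int(r * in_h / out_h), in_h - 1)
        row = []
        for c in range(out_w):
            src_c = x1 + min(int(c * in_w / out_w), in_w - 1)
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped

# 4 x 6 test image whose pixel value encodes its position (10 * row + column)
image = [[10 * r + c for c in range(6)] for r in range(4)]
warped = warp_region(image, (1, 0, 5, 2), out_h=4, out_w=4)
```

A region of 2 x 4 pixels is stretched to 4 x 4 pixels here, i.e., each source row is repeated twice.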
https://cs.stanford.edu/people/karpathy/convnetjs/
Pure JavaScript framework
Supports:
Structure model of TensorFlow
Runtime Platform Model
See also:
The idea: If you have ever used a digital single-lens reflex (DSLR) camera, you will have noticed the focus points in its viewfinder.
An image viewed by a camera can be segmented into a matrix of focal points, some covering an object ⇒ ROI proposals
Full image object classification (we know there is an object)
Assume the input image is 64x64 pixels; then there is an image layer (green text in the figure above) of 8x8 pixels (channels are ignored here).
That is what we want: like the DSLR focus points (probability points), each "pixel" of this layer tells us whether an object was detected there.
Typically, an RGB image has three channels (red, green, blue); here we output one channel holding the probability.
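How a 64x64 input image relates to an 8x8 probability map can be sketched by building the training target for such a map. This is a minimal illustration, assuming that each object is labelled by its centre point in pixel coordinates; the function name `make_objectness_target` and its parameters are hypothetical.

```python
def make_objectness_target(centres, image_size=64, grid_size=8):
    """Build a grid_size x grid_size map with 1.0 in every cell that holds an object centre."""
    cell = image_size / grid_size  # pixels per grid cell (8 for 64 -> 8)
    target = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y in centres:
        col = min(int(x // cell), grid_size - 1)
        row = min(int(y // cell), grid_size - 1)
        target[row][col] = 1.0
    return target

# Two labelled objects with centres at (x=10, y=20) and (x=60, y=63)
target = make_objectness_target([(10, 20), (60, 63)])
for row in target:
    print(row)
```

Each output "pixel" is responsible for an 8x8 pixel patch of the input image.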
Now we try to find parts of an object, the "focal points", the seed for a ROI bounding box proposal
The image is statically segmented, and each segment is a feature indicator for an object proposal "something is there"
Static segmentation of an image into equally sized patches; each segment is an object feature detector (there is something nearby)
About the bounding box and the classification.
As shown above, we can use a one-channel image as the probability of the object’s existence.
And what about the bounding box? A standard bounding box needs at least four numbers (x1, y1, x2, y2; some people use x1, y1, width, height instead, which does not matter).
Now each segment consists of four channels (or each segment outputs a four dimensional vector): The ROI bounding box {x1,y1,x2,y2}
Region proposal by different output channels of each segment
[probability,x1,y1,x2,y2]
[probability,x1,y1,x2,y2,p1,p2,..,pn]
with pi as the probability that the ROI contains an object (or part of it) of class ci
Region prediction and object classification in segment (anchor) by more output channels for each segment
All we need is a CNN with input shape (width,height,depth) of the input image, and an output "image" of shape (q,p,5+n).
The total number of output "pixels" determines the granularity, accuracy, and maximal number of objects that can be classified in one input image.
But we do not want to back-propagate bounding box errors for segments without an object, nor to output objects that do not exist (false positives).
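Reading detections from an output "image" of shape (q,p,5+n) can be sketched as follows. This is a minimal illustration on nested lists; the function name `decode_output` and the probability threshold of 0.5 are assumptions, not part of the original material.

```python
def decode_output(output, threshold=0.5):
    """Collect (probability, box, class_index) for every output cell whose object
    probability exceeds the threshold.

    output[row][col] = [probability, x1, y1, x2, y2, p1, ..., pn]
    """
    detections = []
    for row in output:
        for cell in row:
            prob, box, class_probs = cell[0], tuple(cell[1:5]), cell[5:]
            if prob > threshold:
                # Index of the most probable class for this segment
                best = max(range(len(class_probs)), key=lambda i: class_probs[i])
                detections.append((prob, box, best))
    return detections

# q = p = 2 segments, n = 2 classes, only one segment reports an object
empty = [0.0] * 7
output = [[empty, [0.9, 1, 2, 3, 4, 0.1, 0.8]],
          [empty, empty]]
detections = decode_output(output)
```

Segments below the threshold are simply ignored, so at most q*p objects can be reported per image.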
As already mentioned, we need different loss functions:
Up to here we assumed that each segment classifier can recognize one entire object.
Often, an object is partially covered by another object, and the "background" object is then detected as different fragments output by different segments
Post-processing should try to merge these fragments
Khan: R-CNN object detection system. Input to R-CNN is an RGB image. It then extracts region proposals (Module A), computes features for each proposal using a deep CNN, e.g., AlexNet (Module B), and then classifies each region using class-specific linear SVMs (Module C).
Given an image, the first module (Module A) uses selective search [Uijlings et al., 2013] to generate category-independent region proposals, which represent the set of candidate detections available to the object detector.
The second module (Module B) is a deep CNN (e.g., AlexNet or VGGnet), which is used to extract a fixed-length feature vector from each region.
In both cases (AlexNet or VGGnet), the feature vectors are high-dimensional.
In order to extract features from a region of a given image, the region is first converted to make it compatible with the network input. More precisely, irrespective of the candidate region’s aspect ratio or size, all pixels in a tight bounding box around it are warped to the required size.
Next, features are computed by forward propagating a mean-subtracted RGB image through the network and reading off the output values of the last fully connected layer just before the soft-max classifier.
After feature extraction, one linear SVM per class is learned, which is the third module (C) of this detection system.
Note that selective search produces many region proposals
Multiple stages must be trained independently
Training is slow (84h), takes a lot of disk space
R-CNN runtime roughly 47 seconds per image (even using GPUs)
Brown, 2017
Region proposal and classification networks use different loss (error) functions for different features
Ground-truth and intersection over union (IoU)
Each bounding box (i.e. detection) is associated with a confidence (sometimes called rank)
Detections are assigned to ground truth objects and judged to be true/false positives by measuring overlap
To be considered a correct detection (i.e. true positive), the area of overlap a_ovl between the predicted bounding box BB_p and the ground truth bounding box BB_gt (training label) must exceed 0.5 according to:
$$a_{ovl} = \frac{\mathrm{area}(BB_p \cap BB_{gt})}{\mathrm{area}(BB_p \cup BB_{gt})}$$
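The overlap criterion can be computed directly from the corner coordinates of both boxes. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the function names `iou` and `is_true_positive` are chosen for this example.

```python
def iou(box_p, box_gt):
    """Intersection over union of a predicted and a ground truth box (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(box_p, box_gt, threshold=0.5):
    """A detection counts as correct if the overlap exceeds the threshold."""
    return iou(box_p, box_gt) > threshold

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-shifted box: 50 / 150
```

Note that a box shifted by half its width already falls below the 0.5 threshold, so the criterion is fairly strict.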
Brown, 2017 Intersection over union illustration
Brown, 2017 Intersection over union examples
The output of an autoencoder (AE) is then compared with the input by calculating the mean absolute error (MAE)
If the MAE is above a threshold, something "not normal" was found!
Basic architecture of an anomaly detector
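The threshold test on the reconstruction error can be sketched as follows. This is a minimal illustration on flattened pixel lists. The function `dummy_autoencoder` is only a stand-in for a trained autoencoder (an assumption for this example): it reproduces values in the "normal" range [0, 1] and clips everything else.

```python
def mean_absolute_error(original, reconstructed):
    """MAE between two equally long sequences of pixel values."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)

def is_anomaly(sample, reconstruct, threshold):
    """Flag a sample if its reconstruction error is above the threshold."""
    return mean_absolute_error(sample, reconstruct(sample)) > threshold

def dummy_autoencoder(sample):
    # Stand-in for a trained AE (assumption): only "normal" values are reproduced
    return [min(max(v, 0.0), 1.0) for v in sample]

normal = [0.1, 0.5, 0.9]
damaged = [0.1, 3.0, 0.9]
print(is_anomaly(normal, dummy_autoencoder, threshold=0.1))
print(is_anomaly(damaged, dummy_autoencoder, threshold=0.1))
```

The threshold itself has to be chosen from the error distribution of known normal samples.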
There is broad consensus that successful training of deep networks requires many thousands of annotated training samples.
U-Net is a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently.
The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
Ronneberger et al. showed that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Ronneberger U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
Fast R-CNN combines stages B and C of R-CNN and is trained with multi-task loss (log loss and smooth L1 loss)
A Fast R-CNN network takes as input an image and a set of object proposals
The network first processes the whole image with conv and pooling layers to produce a conv feature map.
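The multi-task loss named above (log loss for the class, smooth L1 loss for the bounding box) can be sketched for a single RoI. This is a minimal illustration; the convention that class index 0 is the background class, for which no box loss is counted, and the weighting factor `lam` are assumptions of this sketch.

```python
import math

def smooth_l1(x):
    """Quadratic for small errors, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Log loss on the true class plus smooth L1 loss on the four box values.

    Assumption: class 0 is background, so the box loss is skipped for it.
    """
    loss_cls = -math.log(class_probs[true_class])
    loss_loc = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return loss_cls + (lam * loss_loc if true_class > 0 else 0.0)

# Foreground RoI: class 1 predicted with probability 0.5, box slightly off
loss = multi_task_loss([0.5, 0.5], 1, (0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 0.0, 2.0))
```

The smooth L1 term is less sensitive to outlier boxes than a squared error, because large deviations only contribute linearly.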
The role of the RoI pooling layer is to convert the features in a valid RoI into small feature maps of fixed size (X × Y, e.g., 7 × 7) using max-pooling.
A RoI itself is a rectangular window that is characterized by a 4-tuple that defines its top-left corner (a, b) and its height and width (x,y).
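RoI max-pooling can be sketched on a single-channel feature map stored as a list of rows. This is a minimal illustration, assuming the 4-tuple is ordered as (top row, left column, height, width); the function name `roi_max_pool` and the way the window is split into bins are choices made for this example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (top, left, height, width) into a fixed out_h x out_w map."""
    top, left, height, width = roi
    pooled = []
    for i in range(out_h):
        # Row range of bin i, at least one row wide
        r0 = top + (i * height) // out_h
        r1 = max(top + ((i + 1) * height) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = max(left + ((j + 1) * width) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# 4 x 4 feature map with values 0..15
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(feature_map, (1, 1, 2, 3), 2, 2))
```

RoIs of different size and aspect ratio are mapped to the same output size, which is what the following fully connected layers require.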
Fast R-CNN is still using an independent region proposal stage.
Each feature vector is then given as input to fully connected neural layers, which branch into two sibling output layers.
Fast R-CNN architecture and data flow
Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network.
The RPN simultaneously predicts object bounds and objectness scores at each position.
The RPN is trained end-to-end to generate high-quality region proposals which are used by Fast R-CNN
The RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. Using an “attention” mechanism, the RPN component tells the unified network where to look.
Faster R-CNN generates about 300 proposals per image
Faster R-CNN: Combining RPN with fast R-CNN object detector
Typical classification networks take fixed-size images as input and produce non-spatial output maps, that are fed to a soft-max layer to perform classification.
The spatial information is lost, because these networks use fixed dimension fully connected layers in their architecture.
Pixel classifier: Central pixel classification by neighbourhood (high spatial accuracy)
Segment classifier: Segment classification by entire segment pixels (medium spatial accuracy)
Fully convolutional networks can take (1) input images of any size and produce (2) spatial output maps. These two aspects make the fully convolutional models a natural choice for semantic segmentation.
Difference between static segment classification and pixel classification using a moving and sliding window
Khan: Pixel classifier with convolutional layers only
Brown, 2017 Ciresan - Neuronal membrane segmentation
Brown, 2017 Different levels of object detection and segmentation
COCO: Common Objects in Context, SSD: Single Shot Detector
YOLO ("You Only Look Once"): An open-source object detection method that recognizes objects in images and videos quickly. SSD, in contrast, runs a convolutional network on the input image only once and computes a feature map.
Up to here we only considered data-driven, model-less region proposal networks
But in measuring technologies and with measuring data we mostly have an idea about the geometric features of ROIs
Edge detection, e.g., combined with point clustering methods, can be used to propose ROIs very fast and accurately
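A model-based region proposal by edge detection and point clustering can be sketched as follows. This is only one possible realization under strong simplifications: edges are detected by thresholding forward intensity differences, and edge points are clustered by 8-connectivity (flood fill). The function name `propose_rois` and the parameter `edge_threshold` are hypothetical.

```python
def propose_rois(image, edge_threshold):
    """Return bounding boxes (x1, y1, x2, y2) of clusters of connected edge pixels."""
    h, w = len(image), len(image[0])
    # 1. Edge detection: intensity jump to the right or lower neighbour
    edges = set()
    for r in range(h):
        for c in range(w):
            gx = abs(image[r][min(c + 1, w - 1)] - image[r][c])
            gy = abs(image[min(r + 1, h - 1)][c] - image[r][c])
            if max(gx, gy) > edge_threshold:
                edges.add((r, c))
    # 2. Point clustering: group 8-connected edge pixels by flood fill
    rois = []
    while edges:
        stack = [edges.pop()]
        cluster = []
        while stack:
            r, c = stack.pop()
            cluster.append((r, c))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    n = (r + dr, c + dc)
                    if n in edges:
                        edges.remove(n)
                        stack.append(n)
        # 3. One ROI per cluster: the box enclosing all its edge pixels
        rows = [p[0] for p in cluster]
        cols = [p[1] for p in cluster]
        rois.append((min(cols), min(rows), max(cols), max(rows)))
    return sorted(rois)

# 10 x 10 test image with two bright rectangular objects on a dark background
image = [[0.0] * 10 for _ in range(10)]
for r in range(1, 4):
    for c in range(1, 4):
        image[r][c] = 1.0
for r in range(6, 9):
    for c in range(5, 9):
        image[r][c] = 1.0
rois = propose_rois(image, edge_threshold=0.5)
```

No training is needed here; the only prior knowledge is that objects are bounded by intensity edges.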
(Top) Region proposal by search and data-driven learning (Bottom) Model-based image transformation and point clustering
Three-dimensional data, e.g., from X-ray CT scans, can be reduced to a set of lower-dimensional data:
CNNs can be applied to data of arbitrary dimension; indeed, n-dimensional data is commonly processed as a linear one-dimensional array (linearly packed multi-dimensional data)
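The linear packing of multi-dimensional data can be sketched for a small volume. This is a minimal illustration, assuming row-major order (the last index runs fastest); the function names `flat_index` and `flatten_volume` are chosen for this example.

```python
def flat_index(index, shape):
    """Position of an n-dimensional index in the linearly packed (row-major) array."""
    pos = 0
    for i, n in zip(index, shape):
        pos = pos * n + i
    return pos

def flatten_volume(volume):
    """Pack a 3D volume volume[z][y][x] into a one-dimensional list."""
    return [v for plane in volume for row in plane for v in row]

# Volume of shape (2, 3, 4) whose voxel value encodes its index: 100*z + 10*y + x
shape = (2, 3, 4)
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(3)] for z in range(2)]
flat = flatten_volume(volume)
print(flat[flat_index((1, 2, 3), shape)])  # voxel z=1, y=2, x=3
```

The same index arithmetic works for any number of dimensions, only the shape tuple changes.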
Z-profile signals as 1D images as input for a CNN damage classifier (ND: No damage class, D1: Damage 1, D2: Damage 2, and so on) ⇒ Pixel segmentation detector!
(Left) Damage feature maps retrieved from four different CNN classifiers for the specimens A (training and prediction), B, C, and D. (Right) CT image volume and selected x-y slice visualization (A-B) with centred resin defect in the PREG layer.