Convolutional Neural Networks

How it works

1. Convolution operation

Convolution makes the detector respond to features regardless of where they appear in the input -> coarse, location-invariant detection
stride (see Stride below) = the step size by which the feature detector moves: commonly 1 or 2
feature detector = kernel = filter: usually 3x3, though some models use larger ones (e.g. AlexNet uses 11x11 in its first layer)
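
To make the sliding-window idea concrete, here is a minimal NumPy sketch of a single-channel convolution without padding. The function name `conv2d`, the toy image, and the vertical-edge kernel are illustrative assumptions, not taken from any particular library. Like most deep learning frameworks, it computes cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    n_h, n_w = image.shape
    k_h, k_w = kernel.shape
    out_h = (n_h - k_h) // stride + 1
    out_w = (n_w - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride          # top-left corner of the window
            window = image[r:r + k_h, c:c + k_w]
            out[i, j] = np.sum(window * kernel)    # elementwise product, then sum
    return out

# Hypothetical 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical 3x3 detector that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (4, 4)
print(feature_map[0])      # [0. 3. 3. 0.]
```

Every row of `feature_map` is identical: the detector fires wherever the edge is, independent of the vertical position.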

2. Pooling

Pooling = down-sampling = aggregate results over a window of values
reduces the number of features, which makes the model more robust to overfitting
types
- max pooling: represent each subregion by its maximum activation
- global pooling: reduce the spatial dimensions of each feature map to 1x1
- sum pooling, average pooling (~subsampling)
max pooling is often preferred to average pooling, as it makes the output somewhat invariant to small shifts in the input
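
The pooling types above can be sketched in a few lines of NumPy. This is a minimal illustration assuming non-overlapping square windows (stride equal to the window size). The helper `pool2d` and the sample activations are hypothetical.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of a 2D feature map."""
    h, w = x.shape
    out_h, out_w = h // size, w // size
    # Crop so the map divides evenly, then expose each window on axes 1 and 3.
    windows = x[:out_h * size, :out_w * size].reshape(out_h, size, out_w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    if mode == "sum":
        return windows.sum(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [0, 0, 1, 0],
              [0, 9, 0, 0]], dtype=float)

max_pooled = pool2d(x, mode="max")   # [[4, 8], [9, 1]]
avg_pooled = pool2d(x, mode="avg")   # [[2.5, 6.5], [2.25, 0.25]]
global_max = x.max()                 # global max pooling: one value per feature map

# Shift the strong activation (9) by one pixel inside its window:
shifted = x.copy()
shifted[3, 1], shifted[2, 0] = 0.0, 9.0
same_after_shift = np.array_equal(pool2d(shifted, mode="max"), max_pooled)
```

`same_after_shift` is `True`: max pooling gives the same output when an activation moves within its window, while average pooling spreads that activation over the whole window.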

3. Flattening

4. Full connection

The flattened vector is fed into a fully connected network (ANN) as its input layer
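
Steps 3 and 4 can be sketched together. The shapes here are assumptions chosen only for illustration (8 pooled feature maps of 4x4 and 10 output units), and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 8 feature maps of size 4x4.
feature_maps = rng.standard_normal((8, 4, 4))

# 3. Flattening: reshape all feature maps into a single 1D vector.
flat = feature_maps.reshape(-1)              # shape (128,)

# 4. Full connection: one dense layer with 10 output units (untrained weights).
weights = rng.standard_normal((10, flat.size)) * 0.01
bias = np.zeros(10)
logits = weights @ flat + bias               # shape (10,)
```

In a real network the dense layer would be followed by a nonlinearity or a softmax, and its weights would be learned together with the kernels.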

Size calculation

Assuming that the input shape is $n_{h} \times n_{w}$ and the convolution kernel shape is $k_{h} \times k_{w}$, the output shape will be $(n_{h} - k_{h} + 1) \times (n_{w} - k_{w} + 1)$.
Two techniques, padding and stride, also change the output shape:

Padding

Padding can increase the height and width of the output
padding = add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image
if we add a total of $p_{h}$ rows of padding (roughly half on top and half on bottom) and a total of $p_{w}$ columns of padding (roughly half on the left and half on the right), the output shape will be $(n_{h} - k_{h} + p_{h} + 1) \times (n_{w} - k_{w} + p_{w} + 1)$
In many cases, we will want to set $p_{h} = k_{h} - 1$ and $p_{w} = k_{w} - 1$ to give the input and output the same height and width
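
As a quick check of the $p = k - 1$ rule, the sketch below zero-pads a toy image with `np.pad` and computes the resulting output shape. The helper `same_padding` is a hypothetical name. It splits the total padding as evenly as possible and puts the extra row or column at the end when $k$ is even.

```python
import numpy as np

def same_padding(k):
    """Split a total padding of k - 1 into (before, after) amounts."""
    total = k - 1
    before = total // 2
    return before, total - before

image = np.ones((5, 5))        # n_h = n_w = 5
k_h, k_w = 3, 3                # kernel shape

pad_h = same_padding(k_h)      # (1, 1): one row on top, one on the bottom
pad_w = same_padding(k_w)      # (1, 1): one column on each side
padded = np.pad(image, (pad_h, pad_w), mode="constant", constant_values=0)

# Output shape of a stride-1 convolution over the padded input:
# (n_h - k_h + p_h + 1) x (n_w - k_w + p_w + 1)
out_shape = (padded.shape[0] - k_h + 1, padded.shape[1] - k_w + 1)
print(padded.shape, out_shape)   # (7, 7) (5, 5)
```

This is one reason odd kernel sizes are convenient: $k - 1$ is even, so the same amount of padding goes on both sides.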

Stride

Stride can reduce the resolution of the output
stride = the number of rows and columns traversed per slide
when the stride for the height is $s_{h}$ and the stride for the width is $s_{w}$, the output shape is $\lfloor (n_{h} - k_{h} + p_{h} + s_{h}) / s_{h} \rfloor \times \lfloor (n_{w} - k_{w} + p_{w} + s_{w}) / s_{w} \rfloor$
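
The general case can be wrapped in a small helper. This is a sketch, and `conv_output_shape` is a hypothetical name. It uses the floor form $\lfloor (n - k + p + s) / s \rfloor$, which is the same as $\lfloor (n - k + p) / s \rfloor + 1$ and reduces to $n - k + p + 1$ when $s = 1$. Here `p` is the total padding per dimension, as defined in the Padding section.

```python
def conv_output_shape(n, k, p=(0, 0), s=(1, 1)):
    """Output (height, width) for input n, kernel k, total padding p, stride s."""
    return tuple(
        (n_i - k_i + p_i + s_i) // s_i   # integer division implements the floor
        for n_i, k_i, p_i, s_i in zip(n, k, p, s)
    )

print(conv_output_shape((8, 8), (3, 3)))                          # (6, 6)
print(conv_output_shape((8, 8), (3, 3), p=(2, 2)))                # (8, 8)
print(conv_output_shape((8, 8), (3, 3), p=(2, 2), s=(2, 2)))      # (4, 4)
print(conv_output_shape((8, 8), (3, 5), p=(0, 2), s=(3, 4)))      # (2, 2)
```

The third call shows stride 2 halving the resolution when the padding is chosen as $p = k - 1$.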