
LeNet Architecture - The Original CNN

Input (32x32) -> C1: Convolution -> S2: Subsampling -> C3: Convolution -> S4: Subsampling -> C5: Convolution -> F6: FC -> Output: 10 (Gaussian connections)
6@28x28 -> 6@14x14 -> 16@10x10 -> 16@5x5 -> 120 -> 84 -> 10
[Figure: LeNet-5 architecture diagram]
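A minimal PyTorch sketch of this pipeline, assuming a 1@32x32 input; Tanh and average pooling stand in for the original scaled-tanh activations, trainable subsampling, and Gaussian output layer:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 1@32x32 -> 6@28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: 6@28x28 -> 6@14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 6@14x14 -> 16@10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4: 16@10x10 -> 16@5x5
            nn.Conv2d(16, 120, kernel_size=5), # C5: 16@5x5 -> 120@1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 120@1x1 -> 120
            nn.Tanh(),
            nn.Linear(120, 84),                # F6: 120 -> 84
            nn.Tanh(),
            nn.Linear(84, 10),                 # Output: 84 -> 10
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```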

Convolution vs. Subsampling

  • Subsampling changes spatial dimension, NOT feature space depth.
  • Convolution can change both.
  • Example (subsampling, S2): 6@28x28 -> 6@14x14 (see the shape check below).
    • Spatial dimension decreases.
    • Depth stays the same (still 6 feature maps).
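A quick shape check of this example, as a minimal sketch assuming PyTorch (2x2 average pooling standing in for LeNet's trainable subsampling):

```python
import torch
import torch.nn as nn

# Subsampling halves the spatial size but leaves the depth at 6.
x = torch.randn(1, 6, 28, 28)   # 6@28x28
y = nn.AvgPool2d(2)(x)          # 2x2 pooling, stride 2
print(y.shape)                  # torch.Size([1, 6, 14, 14]) -> 6@14x14
```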

Depth vs Spatial Dimension

  • As the network deepens, channel depth increases while the spatial dimension shrinks.
  • Example (convolution, C3): 6@14x14 -> 16@10x10 (sketched below).
    • Input channel depth = 6; the layer has M = 16 kernels, one per output map.
    • Each output map M = f(convolution + bias).
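The corresponding convolution step, sketched under the same PyTorch assumption (no padding):

```python
import torch
import torch.nn as nn

# C3: 5x5 convolution over 6 input channels producing 16 output maps.
# Depth grows 6 -> 16; spatial size shrinks 14 -> 10 (no padding).
x = torch.randn(1, 6, 14, 14)            # 6@14x14
conv = nn.Conv2d(6, 16, kernel_size=5)
print(conv(x).shape)                     # torch.Size([1, 16, 10, 10])
```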

Benefits of Convolution vs. FC

  • Fewer parameters.
    • Conv layer: N_p = M(C·K² + 1), where C = input channels, K = kernel size, M = output feature maps.
    • LeNet C1: 6 kernels of 5x5 on 1 input channel -> N_p = 6(1·5² + 1) = 156.
    • An FC layer mapping the same 32x32 input to the 6@28x28 output would need (32·32 + 1) · 4704 ≈ 4.8 million parameters (see the sketch after this list).
  • Translational invariance: a conv kernel detects a feature anywhere in the image, whereas an FC layer ties each weight to a fixed pixel position and so must be trained on all translations of an image.
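A sketch of this parameter-count comparison, assuming PyTorch (the FC layer is a hypothetical stand-in that maps the 32x32 input to the same 6@28x28 output):

```python
import torch.nn as nn

def n_params(layer):
    return sum(p.numel() for p in layer.parameters())

# C1 as a convolution: N_p = M(C*K^2 + 1) = 6(1*25 + 1) = 156
print(n_params(nn.Conv2d(1, 6, kernel_size=5)))    # 156

# The same mapping as a fully connected layer:
# (32*32 + 1) * (6*28*28) = 1025 * 4704 = 4,821,600
print(n_params(nn.Linear(32 * 32, 6 * 28 * 28)))   # 4821600
```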

When to Use Convolution vs FC

  • Convolution: Feature Extraction.
  • FC: Classification.
  • Note: At the end of LeNet, use FC because the output is a score over 10 digit classes (a human-readable integer), not a spatial feature map.

AlexNet Architecture

  • 5 convolutions, 3 FC layers (inspected in the sketch below).
  • Kernel sizes shrink with depth: 11x11 -> 5x5 -> 3x3.
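To see the five-conv / three-FC split concretely, torchvision's AlexNet can be inspected (a sketch assuming a recent torchvision is installed; weights=None skips downloading pretrained weights):

```python
import torchvision.models as models

alexnet = models.alexnet(weights=None)  # random init, no download
print(alexnet.features)    # 5 conv layers with ReLU and overlapping max-pooling
print(alexnet.classifier)  # dropout + 3 FC layers
```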

5 Differences between Convolution/FC

  1. Fewer parameters: N_p = M(C_i·K² + 1) for a conv layer, vs. one weight per input-output pair for FC.
  2. Translational invariance: the same kernel detects a feature regardless of where it appears in the image.
  3. Dimensionality reduction: pooling shrinks the spatial dimensions.
  4. FC training: needs to be trained on all translations of an image because each weight is tied to a fixed pixel.
  5. Hierarchies: a CNN builds up a feature hierarchy (edges -> textures -> object parts) layer by layer; an MLP has no such structure and treats every input dimension equally at each layer.

LeNet vs. AlexNet

  • Datasets: LeNet used MNIST (small grayscale digits); AlexNet used ImageNet, a far larger dataset of high-resolution color images across 1000 classes.
  • Hardware: GPU parallelization became much better.
  • Algorithmic Differences:
    • ReLU activation: converges faster than saturating activations (tanh/sigmoid), allowing for larger NNs.
    • Local Response Normalization: normalizes a neuron's activation against its neighbors across channels.
    • Overlapping pools: stride smaller than kernel size (s < z, e.g., 3x3 pooling with stride 2; see the sketch after this list). This slightly reduced overfitting.
    • Data Augmentation:
      • Transforms dataset images to increase quantity.
      • Cropping and extracting patches, plus horizontal reflections.
      • Increases the data by a factor of 2048 (32x32 crop positions x 2 reflections).
      • Alters RGB intensities by adding multiples of their principal components, scaled by random factors.
    • Dropout Method:
      • At each training step, activations are randomly dropped (p = 0.5), forcing the network to learn a slightly different structure each time.
      • Increases robustness but roughly doubles training time.
    • Ensemble effect: at test time the network runs with all weights (scaled down), which approximates averaging the many thinned networks seen during training; ensembling is almost always a benefit (see the sketch after this list).
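A sketch of two of these tricks in PyTorch (the tensor sizes are illustrative, chosen to match AlexNet's first pooling stage):

```python
import torch
import torch.nn as nn

# Overlapping pooling: kernel (3) larger than stride (2), as in AlexNet.
x = torch.randn(1, 96, 55, 55)
pool = nn.MaxPool2d(kernel_size=3, stride=2)
print(pool(x).shape)          # torch.Size([1, 96, 27, 27])

# Dropout: in training mode each activation is zeroed with probability 0.5
# (survivors are scaled by 2 so the expected activation is unchanged).
drop = nn.Dropout(p=0.5)
drop.train()
print(drop(torch.ones(8)))    # roughly half the entries are zero
```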
