ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Main Objective: Advance the state of the art in image classification, object localization, and object detection at scale.
Competition Timeline: Ran annually from 2010 to 2017.
Organized by: The ImageNet project, at Princeton and Stanford.
Dataset: Subset of the full ImageNet database, covering 1,000 object categories with over 1.2 million training images, 50,000 validation images, and 100,000 test images.

Tasks

Image Classification: Assign a single class label to an image. Top-1 and Top-5 accuracy were standard metrics.
Object Localization: Predict both the class label and the bounding box of the primary object in the image.
Object Detection: Detect and localize multiple objects of different categories in an image with bounding boxes (2013–2017).
Scene Parsing (later years): Pixel-wise labeling of scenes into object categories.

Catalyst for Deep Learning: ILSVRC popularized deep convolutional neural networks (CNNs), which outperformed traditional hand-crafted feature methods.
Benchmarking: Provided a standardized benchmark for comparing algorithms in computer vision.
Industrial Relevance: Architectures and training techniques developed for ILSVRC carried over into applications such as self-driving cars, facial recognition, and large-scale image search.

2010–2011: Pipelines built on hand-crafted features (SIFT, HOG, bag-of-words) with shallow classifiers such as SVMs dominated.
2012: AlexNet (Krizhevsky, Sutskever, and Hinton) cut Top-5 error from ~26% to ~15% using a deep CNN trained on GPUs.
2013: ZFNet refined AlexNet with visualization-guided improvements.
2014: GoogLeNet (Inception) introduced inception modules for efficient depth and parameter reduction and won the classification task; VGG, the runner-up, showed the value of deeper networks built from small 3×3 filters.
2015: ResNet introduced residual connections, enabling networks over 100 layers deep; it achieved ~3.6% Top-5 error, below the ~5% error commonly cited for a trained human annotator.
2016–2017: Winning entries were mostly variants and ensembles of ResNet and Inception-ResNet. With gains becoming incremental, the challenge concluded after 2017.

Top-1 Accuracy: Percentage of times the top predicted label matches the ground truth.
Top-5 Accuracy: Percentage of times the ground truth label is among the top 5 predictions.
mAP (mean Average Precision): Standard metric for object detection tasks.
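
The two accuracy metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official ILSVRC evaluation code; the function name `top_k_accuracy` and the toy score matrix (3 images, 6 made-up classes) are assumptions for the example.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (their order does not matter).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data: one row of class scores per image (hypothetical values).
scores = np.array([
    [0.10, 0.50, 0.05, 0.20, 0.10, 0.05],  # true class 1 has the highest score
    [0.30, 0.05, 0.25, 0.20, 0.15, 0.05],  # true class 3 has the third-highest score
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],  # true class 5 has the lowest score
])
labels = np.array([1, 3, 5])

top1 = top_k_accuracy(scores, labels, k=1)  # only the first image counts as correct
top5 = top_k_accuracy(scores, labels, k=5)  # the first two images count as correct
print(f"Top-1: {top1:.3f}, Top-5: {top5:.3f}")
```

Top-5 error, the figure usually quoted for ILSVRC results, is simply one minus the Top-5 accuracy.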

Motivation: ZFNet set out to improve upon AlexNet by using feature visualization as diagnostic feedback.
Technique: Introduced deconvolutional network (deconvnet) visualization, which projects feature activations back to input pixel space to “look inside” CNNs.

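The deconvnet approximately reverses max-pooling by using “switches” that record the position of the maximum activation within each pooling region.
During unpooling, each pooled value is placed back at its recorded position, approximating the original spatial structure.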
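
The switch mechanism can be illustrated with a small NumPy sketch. It assumes a single 2-D feature map and non-overlapping 2×2 pooling for simplicity, which is not necessarily the pooling configuration ZFNet used; the function names are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling on a 2-D map; also returns the argmax 'switches'."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "map must tile evenly into windows"
    # Rearrange so the last axis holds one flattened pooling window.
    windows = (x.reshape(h // size, size, w // size, size)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // size, w // size, size * size))
    switches = windows.argmax(axis=-1)  # position of the max inside each window
    pooled = windows.max(axis=-1)
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, switches] = pooled
    # Undo the window rearrangement to recover the full-resolution layout.
    return (windows.reshape(ph, pw, size, size)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * size, pw * size))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [9, 0, 1, 1],
              [0, 0, 1, 7]], dtype=float)

pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
print(pooled)
print(restored)
```

The restored map keeps only the maxima, each at its original location. Pooling discards the other values, so the inversion is approximate rather than exact.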

Purpose: Assess contribution of model components.
Examples in ZFNet:
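- Occlusion tests slid a gray box over the input and tracked how the correct-class probability changed, checking that the model relied on the object rather than its surroundings.
- Visualization revealed that AlexNet’s large first-layer filters and stride were suboptimal.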
- Showed certain filters correspond to semantic object parts.
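
An occlusion test of the kind listed above can be sketched as follows. This is a minimal illustration under stated assumptions: `predict_prob` is a hypothetical callable returning the correct-class probability for an image, and the toy "classifier" used here (mean brightness of the image centre) merely stands in for a trained network.

```python
import numpy as np

def occlusion_map(image, predict_prob, patch=4, stride=4, fill=0.5):
    """Slide a gray patch over the image and record the score at each position."""
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = fill  # gray box
            heat[i, j] = predict_prob(occluded)
    return heat

# Hypothetical stand-in for a model: the score depends only on the centre region.
def toy_predict_prob(img):
    return float(img[4:8, 4:8].mean())

image = np.zeros((12, 12))
image[4:8, 4:8] = 1.0  # the "object" sits in the centre

heat = occlusion_map(image, toy_predict_prob)
print(heat)
```

Low values in the resulting map mark regions the prediction depends on. In this toy case only the centre position lowers the score, because that is where the object is.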

Observations:
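- AlexNet’s 11×11 first-layer filters captured mostly very high and very low frequencies with little mid-frequency coverage, and the stride of 4 caused aliasing artifacts in second-layer features.
- ZFNet reduced the first-layer filter size to 7×7 and the stride to 2, which retained more information and gave sharper feature maps.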
Impact: Enabled clearer, more discriminative representations, guiding future architecture design.
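
The effect of the smaller stride on spatial resolution follows from the standard convolution output-size formula, floor((N + 2P - F) / S) + 1. The sketch below applies it to both first-layer configurations. The input sizes (227 and 224 pixels) and zero padding are assumptions made for illustration, and reported feature-map sizes can differ by a pixel depending on the padding an implementation uses.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1

# First-layer settings from the notes above; input sizes and padding are assumed.
alexnet_out = conv_output_size(227, kernel=11, stride=4)
zfnet_out = conv_output_size(224, kernel=7, stride=2)

print(f"11x11, stride 4: {alexnet_out} x {alexnet_out} feature map")
print(f" 7x7,  stride 2: {zfnet_out} x {zfnet_out} feature map")
```

Under these assumptions, the stride-2 layer samples the input about twice as densely in each dimension, which is consistent with the reduced aliasing described above.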