Overview of Computer Vision

Computer vision (CV) is the science of teaching machines to interpret and understand visual data. Since AlexNet's win at ILSVRC 2012, CV has shifted from algorithmic, rule-based systems to data-driven, optimization-based systems.

The core of modern CV lies in the mathematical formalization of the learning problem, where a model learns a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps an input space of images $\mathcal{X}$ to an output space of predictions $\mathcal{Y}$.

Mathematical Formalism of Core CV Tasks

At its heart, a CV task is an optimization problem where the goal is to find a set of model parameters $\theta$ that minimize a loss function $\mathcal{L}(\theta)$.
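
As a minimal sketch of what "finding $\theta$ that minimizes a loss" means in practice, the snippet below runs plain gradient descent on a one-parameter toy loss. The quadratic loss, the learning rate, and the step count are illustrative assumptions, not taken from any particular CV model.

```python
def loss(theta):
    """Hypothetical toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)


theta = 0.0           # initial parameter value (assumed)
learning_rate = 0.1   # step size (assumed)
for _ in range(100):
    theta -= learning_rate * grad(theta)  # step against the gradient
```

Real CV models follow the same loop, only with millions of parameters and a loss computed on batches of images.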

Image Classification

Assigning one or more labels to an image as a whole (e.g., deciding whether the image contains a cat or a dog).

Example:

Consider single-label multiclass classification. Given an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ (height, width, channels), the model outputs a probability distribution $\mathbf{p} = [p_1, \dots, p_K]$ over $K$ classes. Several losses are possible (cross-entropy, MSE, and others); the standard choice is cross-entropy:

$$ \mathcal{L}_{CE}(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log p_{i,k} $$

where $N$ is the number of samples, $y_{i,k}$ is a one-hot encoded ground-truth label, and $p_{i,k}$ is the predicted probability. The model parameters $\theta$ are updated using stochastic gradient descent (SGD) or its variants.
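
The cross-entropy sum above can be sketched in a few lines of plain Python. The softmax step that turns raw scores into probabilities, the small `eps` guard inside the logarithm, and the example logits are assumptions added for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_labels, probs, eps=1e-12):
    """Sum of -y_{i,k} * log p_{i,k} over samples i and classes k."""
    loss = 0.0
    for y_i, p_i in zip(one_hot_labels, probs):
        for y_ik, p_ik in zip(y_i, p_i):
            loss -= y_ik * math.log(p_ik + eps)  # eps avoids log(0)
    return loss


# Two hypothetical samples with K = 3 classes.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [[1, 0, 0], [0, 0, 1]]
probs = [softmax(z) for z in logits]
loss = cross_entropy(labels, probs)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's term.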

Object Detection

This task is a composite problem involving both classification and regression. For each object, the model predicts a class label and a bounding box. The total loss is typically a weighted sum of two components:

$$ \mathcal{L}_{Total} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{loc} $$

Here, $\mathcal{L}_{cls}$ is a classification loss, and $\mathcal{L}_{loc}$ is a regression loss (e.g., L1 loss or IoU loss) for the bounding box coordinates. The parameter $\lambda$ balances the two terms.
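
A rough sketch of the weighted sum for a single object follows. The corner-based box format `(x1, y1, x2, y2)`, the choice of L1 for $\mathcal{L}_{loc}$, the helper names, and the example numbers are all assumptions made for illustration; an IoU helper is included because IoU-based losses are another option mentioned above.

```python
import math


def l1_box_loss(pred_box, true_box):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box))


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    """L_total = L_cls + lam * L_loc for a single object."""
    l_cls = -math.log(class_probs[true_class])  # cross-entropy, one-hot label
    l_loc = l1_box_loss(pred_box, true_box)
    return l_cls + lam * l_loc


total = detection_loss(
    [0.7, 0.2, 0.1], 0, (10, 10, 50, 50), (12, 8, 48, 52), lam=0.5
)
```

Raising `lam` makes box accuracy count for more relative to the class prediction; an IoU-based variant could use `1 - iou(pred_box, true_box)` in place of the L1 term.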

Segmentation (Semantic and Instance)

These are pixel-level classification problems. The loss is often a pixel-wise cross-entropy loss or a Dice loss, which measures the overlap between the predicted and ground-truth masks.
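
To make the overlap idea concrete, here is a minimal Dice loss sketch for a binary mask flattened into a list. It uses the common form $1 - 2|A \cap B| / (|A| + |B|)$; the smoothing constant `eps` and the example masks are assumptions.

```python
def dice_loss(pred_mask, true_mask, eps=1e-7):
    """1 - Dice coefficient for flat lists of per-pixel values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    denom = sum(pred_mask) + sum(true_mask)
    dice = (2.0 * intersection + eps) / (denom + eps)  # eps avoids 0 / 0
    return 1.0 - dice


true_mask = [0, 1, 1, 0, 1, 0]
perfect = dice_loss(true_mask, true_mask)            # full overlap
partial = dice_loss([0, 1, 0, 0, 1, 1], true_mask)   # two of three pixels hit
disjoint = dice_loss([1, 0, 0, 1, 0, 1], true_mask)  # no overlap
```

The loss is close to 0 for a perfect mask and close to 1 when prediction and ground truth do not overlap at all.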

Theoretical Underpinnings of Architectures

The architectural choice defines the function $f(\mathbf{x}; \theta)$ and the inductive biases it possesses.

Convolutional Neural Networks (CNNs)

The core of a CNN is the convolutional layer. For discrete data such as images, the convolution integral becomes a summation over kernel offsets $(m, n)$, where $I$ is the input image and $K$ is the kernel:

$$ (I * K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) K(m, n) $$
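
The summation can be written out directly as nested loops. This is a sketch for small lists of lists, restricted to positions where the kernel fully overlaps the image (a "valid" convolution, which is an assumption about border handling); the example image and kernel are made up.

```python
def conv2d_valid(image, kernel):
    """Discrete 2-D convolution over fully overlapping positions only."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(kh - 1, ih):
        row = []
        for j in range(kw - 1, iw):
            # (I * K)(i, j) = sum_m sum_n I(i - m, j - n) K(m, n)
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    acc += image[i - m][j - n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel)
```

The `i - m`, `j - n` indexing is what flips the kernel. Many deep learning libraries compute the unflipped sum (cross-correlation) under the name "convolution"; since the kernel is learned, the distinction rarely matters in practice.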

The key theoretical principles of CNNs are:

Parameter Sharing: A single kernel is applied across the entire image.
Equivariance to Translation: Shifting an object results in a corresponding shift in the feature map.
Hierarchy of Features: Deeper layers learn more complex, abstract features.
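
The first two principles can be checked on a toy 1-D signal. The sketch below assumes a two-weight kernel and a made-up signal; it shows that moving the input pattern two steps to the right moves the output by the same two steps, and compares the kernel's weight count with that of a hypothetical dense layer of the same input and output size.

```python
def conv1d_valid(signal, kernel):
    """1-D discrete convolution over fully overlapping positions."""
    k = len(kernel)
    return [
        sum(signal[i - m] * kernel[m] for m in range(k))
        for i in range(k - 1, len(signal))
    ]


kernel = [1.0, -1.0]               # the same 2 weights are reused everywhere
signal = [0, 0, 3, 5, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]     # same pattern, moved 2 steps to the right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

shared_weights = len(kernel)             # independent of the signal length
dense_weights = len(signal) * len(out)   # one weight per input/output pair
```

Here `out_shifted` equals `out` moved two positions to the right, which is translation equivariance; the exact match holds because the pattern stays away from the borders.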

Feature Extraction — What to Look For

Features with high variance

Uncorrelated features

Discriminative features

Orthogonal, unit-norm feature vectors
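
Several of these criteria can be checked numerically. The helpers below are a small sketch with hypothetical names and made-up feature columns: they compute the variance of a feature, the Pearson correlation between two features, and whether a set of vectors is orthonormal.

```python
import math


def variance(xs):
    """Population variance of one feature column."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))


def is_orthonormal(vectors, tol=1e-9):
    """True if every vector has unit length and all pairs are orthogonal."""
    for a, u in enumerate(vectors):
        for b, v in enumerate(vectors):
            dot = sum(p * q for p, q in zip(u, v))
            target = 1.0 if a == b else 0.0
            if abs(dot - target) > tol:
                return False
    return True


# Hypothetical feature columns measured on five images.
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [2.1, 3.9, 6.2, 8.0, 9.8]   # nearly a multiple of f1: redundant
f3 = [5.0, 5.1, 4.9, 5.0, 5.0]   # almost constant: carries little information
```

In this toy data, `f2` adds little beyond `f1` because the two are almost perfectly correlated, and `f3` has too little variance to separate images.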

Vision Transformers (ViTs)

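ViTs discard convolutions in favor of the attention mechanism. The core is the multi-head self-attention layer, in which each head computes scaled dot-product attention over queries $Q$, keys $K$, and values $V$ of key dimension $d_k$: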

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Unlike CNNs, which have a local receptive field, the self-attention mechanism allows a ViT to model global dependencies between all patches, regardless of their spatial distance.
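
A single attention head can be sketched with plain lists. For simplicity this example assumes $Q = K = V$ equal to three made-up patch embeddings; in an actual ViT each would come from a separate learned linear projection, and several heads would run in parallel.

```python
import math


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three hypothetical patch embeddings with d_k = 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(patches, patches, patches)
```

Every row of `weights` assigns a nonzero weight to every patch, which is the global dependency described above: no patch is out of reach because of spatial distance.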

Learning Paradigms and Theoretical Concepts

Self-Supervised Learning (SSL)

This paradigm generates supervision from the data itself. The core idea is to train a model to solve a pretext task.

Contrastive Learning: Models learn to pull together (in an embedding space) different augmented views of the same image while pushing apart views of different images.
Masked Image Modeling (MIM): A high percentage of image patches is masked, and the model is trained to reconstruct the missing pixels.
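
The pull-together, push-apart objective of contrastive learning can be sketched with an InfoNCE-style loss. The specific loss form, the cosine similarity, the temperature value, and the toy 2-D embeddings are assumptions; the text above does not commit to a particular formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log of the softmax weight on the positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.1]                          # one augmented view of an image
same_image_view = [0.9, 0.2]                 # another view of the same image
other_images = [[-1.0, 0.3], [0.0, -1.0]]    # views of different images

good = contrastive_loss(anchor, same_image_view, other_images)
bad = contrastive_loss(anchor, other_images[0],
                       [same_image_view, other_images[1]])
```

The loss is small when the two views of the same image are the closest pair (`good`) and large when a different image is treated as the positive (`bad`).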

Generative Models (GANs & Diffusion)

These models learn the underlying probability distribution of the training data $p_{data}(\mathbf{x})$ to generate new samples.

Generative Adversarial Networks (GANs): A generator $G$ and a discriminator $D$ play a minimax game with the value function:
$$ \min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))] $$
Diffusion Models: These models learn to reverse a gradual, iterative process of adding Gaussian noise to an image. This process can be mathematically defined as a Markov chain.
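
The forward (noising) half of a diffusion model is simple enough to sketch directly. The snippet assumes a DDPM-style step $x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$ with a linear schedule of noise variances; the schedule endpoints, the step count, and the four-value "image" are illustrative choices.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def forward_step(x_prev, beta, rng):
    """One Markov step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise."""
    return [
        math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
        for x in x_prev
    ]


def signal_fraction(betas):
    """Product of (1 - beta_t): scaling applied to x_0's variance overall."""
    frac = 1.0
    for beta in betas:
        frac *= 1.0 - beta
    return frac


rng = random.Random(0)
betas = linear_beta_schedule(1000)
x = [1.0, -1.0, 0.5, 0.0]           # a tiny stand-in for image pixels
for beta in betas:
    x = forward_step(x, beta, rng)  # depends only on the previous x
remaining = signal_fraction(betas)
```

Each step uses only the previous state, which is what makes the process a Markov chain. After all steps `remaining` is close to zero, so the result is nearly pure Gaussian noise; the learned model is trained to run this chain in reverse.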

Regularization and Optimization

Regularization techniques are mathematical tools used to prevent overfitting and improve model generalization.

L2 Regularization (Weight Decay): Adds a penalty term to the loss function that is proportional to the square of the magnitude of the weights.
$$ \mathcal{L}_{new}(\theta) = \mathcal{L}_{original}(\theta) + \lambda \sum_{i} \theta_i^2 $$
Dropout: Randomly zeroes ("drops out") neuron activations during training, which can be viewed as implicitly training an ensemble of subnetworks that share weights.
Data Augmentation: Expands the training data by applying transformations, forcing the model to learn representations that are invariant to these changes.
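
The three techniques above can each be sketched in a few lines. The inverted-dropout rescaling by `1 / keep_prob`, the horizontal flip as the example augmentation, and all numeric values are assumptions chosen for illustration.

```python
import random


def l2_regularized_loss(original_loss, weights, lam):
    """L_new = L_original + lam * sum_i theta_i^2."""
    return original_loss + lam * sum(w * w for w in weights)


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with prob. drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


def horizontal_flip(image):
    """A simple augmentation: mirror each row of a 2-D image."""
    return [list(reversed(row)) for row in image]


weights = [0.5, -2.0, 1.0]
penalized = l2_regularized_loss(original_loss=0.8, weights=weights, lam=0.01)

rng = random.Random(0)
dropped = dropout([1.0] * 10, drop_prob=0.5, rng=rng)
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
```

Rescaling the surviving activations keeps their expected value unchanged, so no adjustment is needed at inference time when dropout is switched off.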

Overfitting Scenario

Train on 20 samples → good results

New dataset → poor performance

Reasons:

Too many free parameters relative to the amount of training data

Overfitting → memorization

Inadequate complexity

Unbalanced data

Unnormalized data

Fix:

More data

Regularization
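
The memorization failure mode can be reproduced with a toy example. Everything here is hypothetical: a lookup-table "model" with one stored answer per training sample stands in for a model with too many free parameters, and a one-parameter threshold rule stands in for a suitably constrained model. Both are fit on 20 samples and then evaluated on new data generated by the same rule.

```python
def make_data(n, offset):
    """n evenly spaced points in (0, 1); the label is 1 when x > 0.5."""
    xs = [i / n + offset for i in range(n)]
    return xs, [int(x > 0.5) for x in xs]


class Memorizer:
    """One stored answer per training sample; unseen inputs get label 0."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, x):
        return self.table.get(x, 0)


class ThresholdModel:
    """A single free parameter: the midpoint between the two class means."""

    def fit(self, xs, ys):
        zeros = [x for x, y in zip(xs, ys) if y == 0]
        ones = [x for x, y in zip(xs, ys) if y == 1]
        self.threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, x):
        return int(x > self.threshold)


def accuracy(model, xs, ys):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model.predict(x) == y for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(20, 0.025)   # the 20 training samples
test_x, test_y = make_data(50, 0.01)      # a new dataset from the same rule

memorizer = Memorizer().fit(train_x, train_y)
simple = ThresholdModel().fit(train_x, train_y)
```

Both models are perfect on the training set, but only the threshold rule carries over to the new data; the lookup table falls back to guessing a single class for every unseen input.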