Chapter 1: Introduction to Neural Networks

Perceptron

Perceptron (Rosenblatt, 1958)

The simplest form of a neural network. It thresholds a weighted sum of inputs:

$$y = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
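[Figure: perceptron structure. Inputs x₁, x₂, x₃ with weights w₁, w₂, w₃, plus a constant +1 bias input with weight b, feed a summation node Σ followed by a step function that outputs y.]
Figure 1: Perceptron, a single neuron that thresholds a weighted input sum into 0 or 1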
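
The thresholding rule above can be sketched in a few lines of plain Python. The weights and biases below are hand-picked, hypothetical values (not taken from the text) chosen so that the same unit realizes logical AND and logical OR.

```python
def perceptron(x, w, b):
    """Return 1 if the weighted sum w.x + b is strictly positive, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hand-picked (hypothetical) parameters that realize logical AND and OR.
AND_W, AND_B = [1.0, 1.0], -1.5
OR_W, OR_B = [1.0, 1.0], -0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(x, AND_W, AND_B) for x in inputs]
or_outputs = [perceptron(x, OR_W, OR_B) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
```

Only the bias differs between the two gates: shifting $b$ moves the decision boundary without rotating it.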

Limitations of the Perceptron

A perceptron splits the input space with a single straight line (a hyperplane in higher dimensions), so it can only solve linearly separable problems. Patterns such as XOR, whose classes cannot be separated by one straight line, are beyond its reach.
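
One way to see the limitation in practice is to train a perceptron on AND and on XOR. The sketch below assumes the classic perceptron learning rule, which the text does not describe: add `lr * (target - prediction) * x` to the weights after each mistake. The function name, the epoch budget, and the learning rate are illustrative choices.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Classic perceptron rule: nudge (w, b) after each misclassified sample."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            z = w[0] * x[0] + w[1] * x[1] + b
            error = target - (1 if z > 0 else 0)
            if error != 0:
                mistakes += 1
                w = [w[0] + lr * error * x[0], w[1] + lr * error * x[1]]
                b += lr * error
        if mistakes == 0:
            return w, b, True   # a full pass with no mistakes: converged
    return w, b, False          # no separating line found within the budget

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)[2]
xor_converged = train_perceptron(XOR)[2]
print(and_converged, xor_converged)  # True False
```

XOR fails for any epoch budget, not just this one. It would require $b \le 0$, $w_1 + b > 0$, $w_2 + b > 0$, and $w_1 + w_2 + b \le 0$. Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b \ge 0$, which contradicts the last one.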

In 1969, Minsky and Papert formalized this limitation in their book Perceptrons, which is widely credited with contributing to the research slowdown known as the first AI winter. Practical training of multilayer networks via backpropagation (Rumelhart et al., 1986) later made nonlinear problems, including XOR, tractable.

Multilayer Perceptron (MLP)

Multilayer Perceptron

A neural network built by stacking multiple layers. Composing nonlinear functions lets it represent complex functions.

[Figure: multilayer perceptron with two hidden layers. From left to right: input layer, hidden layer 1, hidden layer 2, output layer.]
Figure 2: MLP with two hidden layers. The input, hidden 1, hidden 2, and output layers are fully connected
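
As a small illustration of what stacking buys, the sketch below solves XOR with step units. It uses a single hidden layer rather than the two shown in Figure 2, and the weights are hand-picked, hypothetical values, not learned ones.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """Two-layer step network with hand-picked (hypothetical) weights."""
    h_or = step(x1 + x2 - 0.5)         # hidden unit 1: fires for OR
    h_nand = step(-x1 - x2 + 1.5)      # hidden unit 2: fires for NAND
    return step(h_or + h_nand - 1.5)   # output unit: AND of the hidden units

xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```

Each hidden unit draws one line through the input space, and the output unit combines the two half-planes into a region that no single line could carve out.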

Forward Propagation

The output of the $l$-th layer, with $\mathbf{h}^{(0)} = \mathbf{x}$ as the network input:

$$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$$
  • $\mathbf{W}^{(l)}$: weight matrix of the $l$-th layer
  • $\mathbf{b}^{(l)}$: bias vector of the $l$-th layer
  • $\sigma$: activation function
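
The layer equation above can be sketched as a loop over `(W, b)` pairs. This is a minimal pure-Python version. The 2-3-1 network and its numbers are arbitrary illustrations, and applying the sigmoid at every layer, including the last, is an assumption that mirrors the formula as written.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, h, b, activation):
    """One layer: activation(W h + b), with W given as a list of rows."""
    return [activation(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)]

def forward(x, layers, activation=sigmoid):
    """Apply every (W, b) pair in order, starting from h^(0) = x."""
    h = x
    for W, b in layers:
        h = dense(W, h, b, activation)
    return h

# Hypothetical 2-3-1 network; the numbers are arbitrary illustrations.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # W^(1), b^(1)
    ([[1.0, -1.0, 0.5]], [0.2]),                                  # W^(2), b^(2)
]
output = forward([1.0, 2.0], layers)
print(output)
```

In practice the same computation is usually written with a matrix library, and the output layer often uses a different activation from the hidden layers.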

Universal Approximation Theorem

An MLP with a single hidden layer and a suitable nonlinear activation (for example, a sigmoid) can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden units (Cybenko, 1989; Hornik, 1991).

Practical caveat: being "approximable" does not imply being "learnable". The theorem guarantees that suitable weights exist, not that training will find them, and for some functions the number of hidden units required grows exponentially with the input dimension. Modern deep learning stacks many layers because the same function can often be represented far more efficiently, with far fewer parameters, by multiple layers.
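
To give the theorem some intuition in one dimension, the sketch below hand-constructs a single-hidden-layer network of $n$ steep sigmoid units whose sum forms a staircase that tracks a target function on $[0, 1]$. This is an illustration under stated assumptions, not the proof technique of the cited papers. The target $x^2$, the steepness factor `20 * n`, and the helper names are all hypothetical choices.

```python
import math

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def staircase_net(f, n):
    """Build g(x) = f(0) + sum_k v_k * sigmoid(s * (x - c_k)) with n hidden units."""
    s = 20.0 * n                                                     # unit steepness
    centers = [(k - 0.5) / n for k in range(1, n + 1)]               # switch-on points
    heights = [f(k / n) - f((k - 1) / n) for k in range(1, n + 1)]   # output weights
    def g(x):
        return f(0.0) + sum(v * sigmoid(s * (x - c)) for v, c in zip(heights, centers))
    return g

def max_error(f, n, grid=200):
    """Largest absolute gap between f and the network on an even grid over [0, 1]."""
    g = staircase_net(f, n)
    return max(abs(f(i / grid) - g(i / grid)) for i in range(grid + 1))

def target(x):
    return x * x

errors = {n: max_error(target, n) for n in (5, 50)}
print(errors)
```

The error should shrink as `n` grows, which matches the "given enough hidden units" clause. The construction says nothing about whether gradient-based training would find these weights.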

Activation Functions

Role of the Activation Function

It introduces nonlinearity. Without it, no matter how many layers you stack, the network would compute only a composition of affine transformations, which collapses to a single affine map $\mathbf{W}\mathbf{x} + \mathbf{b}$.
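
The collapse can be checked numerically. In the sketch below, two layers without an activation are compared against one merged layer with $\mathbf{W} = \mathbf{W}_2 \mathbf{W}_1$ and $\mathbf{b} = \mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2$. The integer weights are hypothetical and chosen only so that the comparison is exact.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical integer weights so the comparison is exact.
W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]
W2, b2 = [[1, -1, 2], [0, 1, 1]], [2, 3]
x = [7, -2]

# Two stacked layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer.
W_merged = matmul(W2, W1)             # W2 W1
b_merged = add(matvec(W2, b1), b2)    # W2 b1 + b2
one_layer = add(matvec(W_merged, x), b_merged)

print(two_layers == one_layer)  # True
```

Inserting any nonlinear function between the two layers breaks this identity, which is what gives depth its extra expressive power.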

[Figure: representative activation functions, one panel each. Sigmoid: σ(z) = 1/(1+e⁻ᶻ); tanh: tanh(z); ReLU: max(0, z); Leaky ReLU: max(αz, z), α≈0.1.]
Figure 3: Common activation functions. Sigmoid, tanh, ReLU, and Leaky ReLU compared

Comparison of Activation Functions

| Function   | Output Range | Characteristics                                                   |
|------------|--------------|-------------------------------------------------------------------|
| Sigmoid    | (0, 1)       | Convenient for probabilistic output. Prone to vanishing gradients. |
| tanh       | (-1, 1)      | Zero-centered. Larger gradients than Sigmoid.                     |
| ReLU       | [0, ∞)       | Fast to compute. Suffers from the Dead ReLU problem.              |
| Leaky ReLU | (-∞, ∞)      | Mitigates the Dead ReLU problem.                                  |
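
The four functions compared above are short enough to write out directly. In this sketch the default `alpha=0.01` follows the value quoted in the Dead ReLU discussion. It is an assumption, and other slopes are also used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)   # assumes 0 < alpha < 1

# tanh is available directly as math.tanh.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), relu(z), leaky_relu(z))
```

For negative inputs, ReLU returns exactly zero while Leaky ReLU returns a small negative value, which is the difference the output ranges in the table reflect.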

Why ReLU Is Popular

  • Fast to compute: only max(0, z).
  • Resistant to vanishing gradients: the gradient is 1 in the positive region.
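  • Sparsity: every unit with a negative preactivation outputs exactly zero (roughly half of the units when preactivations are centered around zero), so activations are sparse and cheap to compute.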
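
The vanishing-gradient point can be made concrete by multiplying activation derivatives along a chain of layers. The sketch below is deliberately simplified: it ignores the weight matrices, which also scale the gradient, and assumes a hypothetical depth of 10.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth   # best case for sigmoid: 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth         # any positive preactivation: 1.0
print(sigmoid_chain, relu_chain)
```

Even in its best case the sigmoid factor shrinks the gradient by about a factor of a million over ten layers, whereas active ReLU units pass it through unchanged.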

What Is the Dead ReLU Problem?

ReLU outputs 0 for any negative input, and its gradient is also 0 there. If a large weight update (for example, one caused by an overly high learning rate) pushes a neuron into a regime where its preactivation $z$ is negative for every training input, the neuron's output and gradient are always 0. No gradient then flows back to its weights, so the unit stops learning and cannot recover. This failure mode is called the Dead ReLU problem.

Common countermeasures include Leaky ReLU ($\max(\alpha z, z)$ with $\alpha \approx 0.01$), PReLU (where $\alpha$ is learned), and ELU, all of which keep a small gradient on the negative side. Lowering the learning rate and using Glorot/He initialization are also effective preventive measures.
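
A dead unit is easy to reproduce by hand. In the sketch below, the neuron, its strongly negative bias, and the input set are all hypothetical. The point is only to compare the gradient that ReLU and Leaky ReLU pass back once every preactivation is negative.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

# Hypothetical neuron whose bias was pushed far negative by a bad update.
w, b = [0.5, -0.3], -10.0
inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

preactivations = [w[0] * x[0] + w[1] * x[1] + b for x in inputs]
relu_grads = [relu_grad(z) for z in preactivations]
leaky_grads = [leaky_relu_grad(z) for z in preactivations]

print(max(preactivations) < 0)  # True: z is negative for every input
print(relu_grads)               # [0.0, 0.0, 0.0, 0.0, 0.0]
print(leaky_grads)              # [0.01, 0.01, 0.01, 0.01, 0.01]
```

With ReLU, the weight gradient is zero on every sample, so no update can move the unit back. The small negative-side slope of Leaky ReLU leaves a path to recovery.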

Output Layer and Loss Function

Output Layer by Task

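| Task                      | Output Layer      | Loss Function        |
|---------------------------|-------------------|----------------------|
| Regression                | Linear (identity) | MSE                  |
| Binary Classification     | Sigmoid           | Binary Cross-Entropy |
| Multiclass Classification | Softmax           | Cross-Entropy        |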
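
The three pairings in the table can be sketched with the standard library alone. The function names and sample values below are illustrative assumptions, and a real implementation would also guard against `log(0)`.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(t, p):
    """t is the 0/1 label, p the sigmoid output in (0, 1)."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_class, probs):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])
print(mse([1.0, 2.0], [1.5, 2.0]))        # 0.125
print(binary_cross_entropy(1, 0.9))
print(probs, cross_entropy(0, probs))
```

In each row the output activation shapes the prediction into the form the loss expects: an unbounded value for MSE, a single probability for binary cross-entropy, and a probability vector that sums to 1 for cross-entropy.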

Summary

  • Perceptron: a single neuron, only linearly separable problems
  • MLP: multiple layers, can represent nonlinear functions
  • Activation function: introduces nonlinearity; ReLU is the mainstream choice
  • Universal approximation theorem: a single hidden layer with enough units can approximate any continuous function, though not necessarily efficiently

References