Chapter 1: Introduction to Neural Networks
Perceptron
Perceptron (Rosenblatt, 1958)
The simplest form of a neural network. It thresholds a weighted sum of inputs:
$$y = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
Limitations of the Perceptron
A perceptron splits the input space with a single line (a hyperplane in higher dimensions), so it can only solve linearly separable problems. Patterns such as XOR, whose classes cannot be separated by a straight line, cannot be represented by any choice of weights.
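
To make the limitation concrete, here is a minimal sketch in Python with NumPy. The weights used for AND are an illustrative hand-picked choice, and the coarse grid search over weights is only a sanity check (not a proof) that no setting in the grid reproduces XOR.

```python
import itertools
import numpy as np

def perceptron(x, w, b):
    """Return 1 if w.x + b > 0, else 0 (the thresholding rule)."""
    return int(np.dot(w, x) + b > 0)

# The four binary input patterns and the targets for AND and XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def fits(w, b, targets):
    """Check whether the perceptron (w, b) reproduces every target."""
    return all(perceptron(x, w, b) == t for x, t in zip(X, targets))

# Hand-picked weights (an illustrative choice) that realize AND.
and_ok = fits(np.array([1.0, 1.0]), -1.5, AND)

# Coarse grid over w1, w2, b in [-2, 2]: no setting reproduces XOR.
grid = np.linspace(-2.0, 2.0, 9)
xor_ok = any(
    fits(np.array([w1, w2]), b, XOR)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

print(and_ok, xor_ok)  # True False
```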
In 1969, Minsky and Papert formalized this limitation in their book Perceptrons, which contributed to the research slowdown later known as the AI winter. Practical training of multilayer networks via backpropagation (Rumelhart et al., 1986) finally made nonlinear problems, including XOR, tractable.
Multilayer Perceptron (MLP)
Multilayer Perceptron
A neural network built by stacking multiple layers. Composing nonlinear functions lets it represent complex functions.
Forward Propagation
The output of the $l$-th layer:
$$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$$
- $\mathbf{W}^{(l)}$: weight matrix of the $l$-th layer
- $\mathbf{b}^{(l)}$: bias vector of the $l$-th layer
- $\sigma$: activation function
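
The layer-by-layer rule can be sketched as a short loop. The following is a minimal illustration, not a reference implementation: the `forward` helper, the 2-2-1 layer sizes, and the hand-set weights are all assumptions chosen so that the network computes XOR, with ReLU in the hidden layer and an identity output.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Apply h = sigma(W h_prev + b) for each (W, b, sigma) in order."""
    h = x
    for W, b, sigma in layers:
        h = sigma(W @ h + b)
    return h

# Hand-set weights (illustrative) for a 2-2-1 network that computes XOR:
# hidden units are relu(x1 + x2) and relu(x1 + x2 - 1), output is h1 - 2*h2.
layers = [
    (np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.0, -1.0]), relu),
    (np.array([[1.0, -2.0]]), np.array([0.0]), identity),
]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [float(forward(np.array(x, dtype=float), layers)[0]) for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

With the hidden ReLU layer in place, two layers suffice for a pattern that a single perceptron cannot represent.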
Universal Approximation Theorem
An MLP with a single hidden layer and a suitable nonlinear activation (for example, a sigmoid) can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden units (Cybenko, 1989; Hornik, 1991).
Practical caveat: being "approximable" does not imply being "learnable". The theorem guarantees that suitable weights exist, not that training will find them, and the number of hidden units required can grow exponentially. Modern deep learning stacks many layers because the same function can often be represented far more efficiently, with far fewer parameters, by multiple layers.
Activation Functions
Role of the Activation Function
It introduces nonlinearity. Without it, no matter how many layers you stack, the network would compute only a composition of linear (affine) transformations, which is itself linear.
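
A quick numerical check of this collapse, as a sketch with arbitrary random weights (the shapes and the seed are assumptions made for illustration): two stacked layers without an activation equal a single layer with weights $\mathbf{W}^{(2)}\mathbf{W}^{(1)}$ and bias $\mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation function between them.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapsed = np.allclose(two_layers, one_layer)
print(collapsed)  # True
```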
Comparison of Activation Functions
| Function | Output Range | Characteristics |
|---|---|---|
| Sigmoid | (0, 1) | Convenient for probabilistic output. Prone to vanishing gradients. |
| tanh | (-1, 1) | Zero-centered. Larger gradients than Sigmoid. |
| ReLU | [0, ∞) | Fast to compute. Suffers from the Dead ReLU problem. |
| Leaky ReLU | (-∞, ∞) | Mitigates the Dead ReLU problem. |
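
The four functions in the table can be written in a few lines of NumPy. This is a minimal sketch; the slope `alpha=0.01` for Leaky ReLU is a common default assumed here, and the sample grid `z` is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1 this equals z when z > 0 and alpha * z otherwise.
    return np.maximum(alpha * z, z)

# Evaluate each function on a small grid and report the observed range.
z = np.linspace(-5.0, 5.0, 11)
for name, f in [("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    out = f(z)
    print(f"{name:10s} min={out.min():.3f} max={out.max():.3f}")
```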
Why ReLU Is Popular
- Fast to compute: only max(0, z).
- Resistant to vanishing gradients: the gradient is 1 in the positive region.
- Sparsity: every unit with a negative preactivation outputs exactly zero (often roughly half of the units), so activations are sparse and cheap to compute.
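
The vanishing-gradient point can be illustrated with a small calculation. This sketch looks only at the activation-derivative factor of the chain rule and ignores the weight matrices, which is a simplifying assumption. The depth of 10 is arbitrary. The sigmoid derivative peaks at 0.25 (at $z = 0$), so even in the best case the factor shrinks geometrically with depth, while ReLU contributes a factor of exactly 1 for positive preactivations.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return (z > 0).astype(float)

depth = 10  # arbitrary number of layers for illustration

# Best case for sigmoid: every preactivation is 0, where the derivative is 0.25.
sigmoid_factor = np.prod(sigmoid_grad(np.zeros(depth)))

# ReLU with positive preactivations: every derivative is 1.
relu_factor = np.prod(relu_grad(np.ones(depth)))

print(sigmoid_factor, relu_factor)  # about 9.5e-07 versus 1.0
```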
What Is the Dead ReLU Problem?
ReLU outputs 0 for any negative input and its gradient is also 0 there. If a large negative weight update pushes a neuron into a regime where its preactivation $z$ is always negative, the neuron's output and gradient become permanently 0 and the unit stops learning. This failure mode is called the Dead ReLU problem.
Common countermeasures include Leaky ReLU ($\max(\alpha z, z)$ with $\alpha \approx 0.01$), PReLU (where $\alpha$ is learned), and ELU, all of which keep a small gradient on the negative side. Lowering the learning rate and using Glorot/He initialization are also effective preventive measures.
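
A small sketch of the failure mode and of the Leaky ReLU countermeasure follows. The weights, the bias of -10 (standing in for the result of a large negative update), and the random inputs in [0, 1) are all hypothetical values chosen so that the preactivation is negative for every input.

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))  # inputs in [0, 1)
w = np.array([0.5, -0.3, 0.2])            # hypothetical weights
b = -10.0                                 # hypothetical bias after a bad update

z = X @ w + b  # at most 0.7 - 10, so negative for every input

# ReLU: no gradient flows through this unit for any sample, so it cannot recover.
print(relu_grad(z).sum())        # 0.0

# Leaky ReLU: a small gradient (alpha) survives for every sample.
print(leaky_relu_grad(z).min())  # 0.01
```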
Output Layer and Loss Function
Output Layer by Task
| Task | Output Layer | Loss Function |
|---|---|---|
| Regression | Linear (identity) | MSE |
| Binary Classification | Sigmoid | Binary Cross-Entropy |
| Multiclass Classification | Softmax | Cross-Entropy |
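
The pairings in the table can be sketched in NumPy as follows. The function names, the integer-label convention for the multiclass case, and the example logits are assumptions made for illustration. Production code would typically also clip probabilities before taking logarithms.

```python
import numpy as np

def mse(y_pred, y_true):
    """Regression: mean squared error on a linear (identity) output."""
    return np.mean((y_pred - y_true) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y):
    """Binary classification: p is the sigmoid output, y is 0 or 1."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Multiclass: labels are integer class indices, one per row of probs."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Example: two samples, three classes (illustrative logits).
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
probs = softmax(logits)
print(probs.sum(axis=1))             # each row sums to 1
print(cross_entropy(probs, labels))  # positive scalar loss
```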
Summary
- Perceptron: a single neuron, only linearly separable problems
- MLP: multiple layers, can represent nonlinear functions
- Activation function: introduces nonlinearity; ReLU is the mainstream choice
- Universal approximation theorem: a single hidden layer can approximate any continuous function, given enough hidden units