What is the difference between a perceptron and a multilayer perceptron (MLP)?

A perceptron is the simplest classifier: a single neuron that thresholds a weighted sum of inputs, so it can only solve linearly separable problems (it cannot solve XOR). An MLP is a network of stacked layers; by applying a nonlinear activation function at each layer, it can express complex nonlinear functions, and with enough hidden units it can approximate any continuous function on a compact (closed and bounded) input domain (the universal approximation theorem; Cybenko, 1989).

Why do activation functions need to be nonlinear?

The composition of linear transformations is itself a linear transformation, so if the activation is linear (e.g., the identity function) then no matter how many layers you stack you only get the expressive power of a single layer. Inserting a nonlinear function such as Sigmoid, tanh, or ReLU is what gives stacking layers any meaning and lets the network represent complex functions.

When should I use Sigmoid versus ReLU?

ReLU is the mainstream choice in modern deep learning. It is fast to compute (just max(0, z)), it has a gradient of 1 in the positive region so it suffers less from vanishing gradients, and roughly half of the units output zero, which gives a useful sparsity. Sigmoid is prone to vanishing gradients and so is avoided in hidden layers, but it is still used in the output layer for binary classification (mapping to a probability in (0, 1)) and in the gating mechanisms of LSTMs.

How do I choose the output layer and loss function?

The choice is determined by the task. For regression, use a linear (identity) output layer with MSE loss. For binary classification, use a Sigmoid output with Binary Cross-Entropy loss. For multiclass classification, use a Softmax output with Cross-Entropy loss. Each pairing can be derived as the negative log-likelihood of a probabilistic model of the target: a normal distribution with fixed variance, a Bernoulli distribution, and a categorical distribution, respectively.
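A sketch of where these pairings come from, writing $\hat{y}$ for the network output and $s^2$ for an assumed fixed noise variance (both symbols are introduced here for illustration only):

$$-\log \mathcal{N}(y \mid \hat{y}, s^2) = \frac{(y - \hat{y})^2}{2 s^2} + \text{const}$$

$$-\log \mathrm{Bernoulli}(y \mid \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big]$$

$$-\log \mathrm{Categorical}(\mathbf{y} \mid \hat{\mathbf{y}}) = -\sum_{k} y_k \log \hat{y}_k$$

Averaged over a dataset, the first expression is MSE up to a scale factor and an additive constant, the second is Binary Cross-Entropy, and the third (with one-hot $\mathbf{y}$) is Cross-Entropy.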

Chapter 1: Introduction to Neural Networks

Perceptron

Perceptron (Rosenblatt, 1958)

The simplest form of a neural network. It thresholds a weighted sum of inputs:

$$y = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

Figure 1: Perceptron — a single neuron that thresholds a weighted input sum into 0 or 1
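The threshold rule above can be sketched in a few lines of NumPy. The function name `perceptron` and the weights `w`, `b` are illustrative choices, not values from the text; they happen to implement logical AND.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if w.x + b > 0, otherwise 0 (the threshold rule)."""
    return int(np.dot(w, x) + b > 0)

# Hypothetical hand-picked parameters that implement logical AND.
w = np.array([1.0, 1.0])
b = -1.5

and_outputs = [perceptron(np.array(x, dtype=float), w, b)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) yields a weighted sum above zero (1 + 1 - 1.5 = 0.5), so it is the only one classified as 1.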

Limitations of the Perceptron

A perceptron splits the input space with a single line (a hyperplane in higher dimensions), so it can only solve linearly separable problems. Patterns such as XOR, which cannot be separated by a straight line, are unreachable.
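To make the limitation concrete, here is a minimal sketch of the classic perceptron learning rule applied to AND (linearly separable) and XOR (not). The helper names, the learning rate, and the epoch budget are assumptions chosen for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Perceptron learning rule: adjust w and b on each misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = int(np.dot(w, xi) + b > 0)
            update = lr * (yi - pred)  # 0 when the prediction is correct
            if update != 0:
                w += update * xi
                b += update
                errors += 1
        if errors == 0:  # every point classified correctly: converged
            break
    return w, b

def accuracy(X, y, w, b):
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

w_and, b_and = train_perceptron(X, y_and)
w_xor, b_xor = train_perceptron(X, y_xor)
acc_and = accuracy(X, y_and, w_and, b_and)
acc_xor = accuracy(X, y_xor, w_xor, b_xor)
print(acc_and, acc_xor)
```

AND is learned perfectly within a few epochs. For XOR the loop uses its whole epoch budget and the accuracy stays below 1.0, because no single line separates the two classes.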

In 1969, Minsky and Papert formalized this limitation in their book Perceptrons, which is widely credited with contributing to the slowdown in neural network research associated with the first AI winter. Practical training of multilayer networks via backpropagation (Rumelhart et al., 1986) finally made nonlinear problems, including XOR, tractable.

Multilayer Perceptron (MLP)

Multilayer Perceptron

A neural network built by stacking multiple layers. Composing nonlinear functions lets it represent complex functions.

Figure 2: MLP with two hidden layers — input, hidden 1, hidden 2, and output fully connected

Forward Propagation

The output of the $l$-th layer, where $\mathbf{h}^{(0)} = \mathbf{x}$ is the network input:

$$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$$

$\mathbf{W}^{(l)}$: weight matrix of the $l$-th layer
$\mathbf{b}^{(l)}$: bias vector of the $l$-th layer
$\sigma$: activation function
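The layer recurrence can be sketched as a loop over (W, b) pairs. The weights below are hypothetical, hand-picked values (not learned, and not from the text) that make a two-unit ReLU hidden layer compute XOR. As a simplifying assumption, the final layer is left linear rather than passed through the activation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers, activation=relu):
    """Apply h = activation(W h + b) layer by layer; the last layer stays linear."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b
        is_last = (i == len(layers) - 1)
        h = z if is_last else activation(z)
    return h

# Hypothetical hand-picked parameters: hidden units compute
# relu(x1 + x2) and relu(x1 + x2 - 1); the output is h1 - 2 * h2.
layers = [
    (np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.0, -1.0])),
    (np.array([[1.0, -2.0]]), np.array([0.0])),
]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [float(forward(np.array(x, dtype=float), layers)[0]) for x in inputs]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```

With one hidden layer and a nonlinearity, the network represents a function that no single perceptron can.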

Universal Approximation Theorem

An MLP with a single hidden layer and a suitable nonlinear activation (for example, a sigmoid) can approximate any continuous function on a compact input domain to arbitrary precision, given enough hidden units (Cybenko, 1989; Hornik, 1991).

Practical caveat: "approximable" does not imply "learnable". The theorem guarantees that suitable weights exist, not that training will find them, and for some functions the number of hidden units required can grow exponentially. Modern deep learning stacks many layers because the same function can often be represented far more efficiently, with far fewer parameters, by multiple layers.

Activation Functions

Role of the Activation Function

It introduces nonlinearity. Without it, no matter how many layers you stack the network would just compute a composition of linear transformations (which is again linear).
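A quick numerical check of this collapse, as a sketch with randomly drawn (hypothetical) weights: two stacked layers without an activation give exactly the same output as one layer with weight W2 W1 and bias W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation in between (shapes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

same_output = bool(np.allclose(two_layers, one_layer))
print(same_output)  # True
```

The extra layer adds parameters but no expressive power until a nonlinearity is placed between the two matrix multiplications.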

Figure 3: Common activation functions — Sigmoid, tanh, ReLU, and Leaky ReLU compared

Comparison of Activation Functions

Function	Output Range	Characteristics
Sigmoid	(0, 1)	Convenient for probabilistic output. Prone to vanishing gradients.
tanh	(-1, 1)	Zero-centered. Larger gradients than Sigmoid.
ReLU	[0, ∞)	Fast to compute. Suffers from the Dead ReLU problem.
Leaky ReLU	(-∞, ∞)	Mitigates the Dead ReLU problem.
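The four functions in the table can be written directly in NumPy. This is a minimal sketch; the slope `alpha=0.01` for Leaky ReLU is a common default rather than a required value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Identity for positive inputs, small slope alpha for the rest.
    return np.where(z > 0, z, alpha * z)

z = np.linspace(-5.0, 5.0, 11)  # sample points -5, -4, ..., 5
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    out = fn(z)
    print(f"{name:10s} min={out.min():.4f} max={out.max():.4f}")
```

On these sample points the printed minima and maxima stay inside the output ranges listed in the table.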

Why ReLU Is Popular

Fast to compute: only max(0, z).
Resistant to vanishing gradients: the gradient is 1 in the positive region.
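Sparsity: every unit with a negative preactivation outputs exactly zero (roughly half of the units when preactivations are centered around zero), which makes the representation sparse and efficient.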
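The vanishing-gradient point can be illustrated with a rough back-of-the-envelope sketch. It looks only at the activation derivative that backpropagation multiplies in at each layer and deliberately ignores the weight matrices. The depth of 10 is a hypothetical value.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)

depth = 10  # hypothetical number of layers

# Best case for Sigmoid: every preactivation sits at z = 0.
sigmoid_factor = float(sigmoid_grad(np.array(0.0)) ** depth)
# ReLU with every preactivation positive.
relu_factor = float(relu_grad(np.array(1.0)) ** depth)

print(sigmoid_factor)  # about 9.5e-07
print(relu_factor)     # 1.0
```

Even in Sigmoid's most favorable case the product of derivatives shrinks by a factor of 4 per layer, while active ReLU units pass the gradient through unchanged.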

What Is the Dead ReLU Problem?

ReLU outputs 0 for any negative input, and its gradient is also 0 there. If a large weight update (for example, one caused by a learning rate that is too high) pushes a neuron into a regime where its preactivation $z$ is negative for every input, the neuron's output and gradient are 0 everywhere, so no gradient reaches its weights and the unit stops learning. This failure mode is called the Dead ReLU problem.

Common countermeasures include Leaky ReLU ($\max(\alpha z, z)$ with $\alpha \approx 0.01$), PReLU (where $\alpha$ is learned), and ELU, all of which keep a small gradient on the negative side. Lowering the learning rate and using Glorot/He initialization are also effective preventive measures.
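A small sketch of a dead unit, using made-up inputs and parameters: once the bias is so negative that the preactivation is below zero for every input, the ReLU gradient is zero everywhere, while Leaky ReLU still passes a gradient of alpha.

```python
import numpy as np

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))  # hypothetical inputs in [0, 1)
w = np.array([0.5, -0.2, 0.1])            # hypothetical weights
b = -10.0                                  # bias after a hypothetical bad update

z = X @ w + b  # at most 0.5 + 0.1 - 10 < 0, so negative for every input

dead = bool(np.all(relu_grad(z) == 0.0))
leaky_alive = bool(np.all(leaky_relu_grad(z) > 0.0))
print(dead, leaky_alive)  # True True
```

Because the gradient with respect to `w` and `b` is scaled by the activation derivative, the ReLU unit receives no update on any of these inputs, whereas the Leaky ReLU unit can still recover.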

Output Layer and Loss Function

Output Layer by Task

Task	Output Layer	Loss Function
Regression	Linear (identity)	MSE
Binary Classification	Sigmoid	Binary Cross-Entropy
Multiclass Classification	Softmax	Cross-Entropy
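The three pairings in the table can be sketched for a single example as follows. The function names are illustrative. Production libraries usually fuse the output activation and the loss for numerical stability, which this sketch does not attempt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y, p):
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def cross_entropy(y_onehot, p):
    return float(-np.sum(y_onehot * np.log(p)))

# Regression: linear output, MSE.
reg_loss = mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))  # (0 + 4) / 2 = 2.0

# Binary classification: Sigmoid output, Binary Cross-Entropy.
z = 0.7  # hypothetical logit
bce = binary_cross_entropy(1.0, sigmoid(z))

# Multiclass: Softmax output, Cross-Entropy. With two classes and logits
# [z, 0], softmax reduces to sigmoid(z), so the loss matches bce.
ce = cross_entropy(np.array([1.0, 0.0]), softmax(np.array([z, 0.0])))

print(reg_loss, bce, ce)
```

The agreement between `bce` and `ce` shows that Sigmoid with Binary Cross-Entropy is the two-class special case of Softmax with Cross-Entropy.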

Summary

Perceptron: a single neuron, only linearly separable problems
MLP: multiple layers, can represent nonlinear functions
Activation function: introduces nonlinearity; ReLU is the mainstream choice
Universal approximation theorem: a single hidden layer with enough units can approximate any continuous function on a compact domain