Chapter 3: Supervised Learning
What Is Supervised Learning?
Supervised Learning
Given pairs of input $\mathbf{x}$ and correct label $y$, learn a function $f$ that predicts the output for new inputs.
$$\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\} \rightarrow f: \mathbf{x} \mapsto y$$What Is the "Supervisor"?
The correct label $y$ acts as the supervisor. When the model's prediction is wrong, the gap between the prediction and the label (the error) is fed back to correct the model.
Regression
Regression
The case where the output $y$ is a continuous value. The task is to predict a numeric quantity.
Examples of Regression Problems
- Housing price prediction: floor area, building age, location → price (in tens of thousands of yen)
- Sales forecasting: ad spend, season, day of week → revenue (yen)
- Temperature forecasting: past weather data → tomorrow's high temperature (°C)
- Stock price prediction: past price movements → next-day closing price (yen)
Classification
Classification
The case where the output $y$ is a discrete category. The task is to predict a class.
Examples of Classification Problems
- Spam detection: email contents → spam / not spam (binary classification)
- Image recognition: an image → cat / dog / bird / ... (multiclass classification)
- Medical diagnosis: test results → positive / negative
- Sentiment analysis: text → positive / negative / neutral
Regression vs. Classification
| Aspect | Regression | Classification |
|---|---|---|
| Output | Continuous value (real number) | Discrete value (category) |
| Examples | Price, temperature | Class label |
| Evaluation metrics | MSE, MAE | Accuracy, F1 score |
Training Data and Test Data
Splitting the Data
Divide the available data into a training set and a test set.
- Training data: used to fit the model
- Test data: used to evaluate generalization performance (never used for training)
Why the Split Matters
Strong performance on the training data is meaningless if performance on unseen data is poor.
The test set serves as a simulation of "unseen data" so we can evaluate generalization performance.
What Is Overfitting?
Overfitting is the phenomenon where a model becomes so specialised to the training data that its performance on unseen data drops. A model that scores high on training data but poorly on test data is showing signs of overfitting. The train/test split is the most basic technique for detecting it.
Key Principles
- The test set must never be used for training.
- Evaluate on the test set only once, for the final evaluation.
- If you need to evaluate multiple times, prepare a separate validation set.
- The 70-80% / 20-30% ratio is just a guideline. With small datasets, shift toward training (e.g., 90% / 10%) or use cross-validation to average over multiple splits.
Summary
- Supervised learning: learn from input-label pairs
- Regression: predict a continuous value (price, temperature, etc.)
- Classification: predict a category (spam detection, image recognition, etc.)
- Data splitting: separate training and test sets to evaluate generalization performance