Statistics 2nd ed

Lesson 16 — Machine Learning Basics

Machine learning is where statistics meets computers.
Instead of writing out every formula by hand, we let a computer estimate patterns directly from data.


What is Machine Learning?

Machine learning uses algorithms whose predictions improve automatically as they see more data ("experience").

  • Supervised learning: the computer is given examples together with the correct answers (labels) and learns to predict them.
  • Unsupervised learning: the computer is given data without answers and looks for patterns on its own.

Supervised Learning

Goal: predict an outcome Y from one or more inputs X.

Examples:

  • Predict exam scores from study hours
  • Predict house price from size, location, and age

Steps:

  1. Split data into training set and test set
  2. Train the model on training data
  3. Evaluate the trained model on the test set (data it has not seen)
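The three steps above can be sketched in a few lines of Python with NumPy. The data values, the 70/30 split, and the helper name `train_test_split` are all assumptions made for illustration. The model is a straight line fitted with `np.polyfit`, which is one option among many.

```python
import numpy as np

# Hypothetical data: study hours (X) and exam scores (Y) for 10 students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
scores = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0, 78.0, 83.0, 85.0, 91.0])


def train_test_split(x, y, test_fraction=0.3, seed=0):
    """Step 1: shuffle the rows and hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))               # random order of row indices
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]


x_train, y_train, x_test, y_test = train_test_split(hours, scores)

# Step 2: train (fit a straight line) using the training rows only.
b, a = np.polyfit(x_train, y_train, deg=1)        # polyfit returns the slope first

# Step 3: evaluate on the held-out rows, which the model never saw.
test_mse = np.mean((y_test - (a + b * x_test)) ** 2)
print(f"train size={len(x_train)}, test size={len(x_test)}, test MSE={test_mse:.2f}")
```

The key design point is that the test rows play no part in fitting, so the test error is an honest estimate of how the model behaves on new data.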

Formula (simple linear regression as machine learning):
$$\hat{Y} = a + bX$$

Here, the computer “learns” the intercept $$a$$ and the slope $$b$$ from the training data.
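The lesson does not say how the two numbers are learned. A common choice, assumed here, is ordinary least squares, which picks the line with the smallest sum of squared errors. The sketch below computes both values from the textbook least-squares formulas on a small made-up data set.

```python
import numpy as np

# Hypothetical training data: study hours (X) and exam scores (Y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 72.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope: how X and Y vary together, divided by how much X varies.
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Least-squares intercept: forces the line through the point (x_mean, y_mean).
a = y_mean - b * x_mean

print(f"a = {a:.1f}, b = {b:.1f}")                       # a = 47.3, b = 4.9
print(f"predicted score for 6 hours: {a + b * 6:.1f}")   # 76.7
```

With these numbers, each extra hour of study adds about 4.9 points to the predicted score.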


Unsupervised Learning

Goal: find hidden structure in the data.

Examples:

  • Group students by study habits
  • Cluster shoppers by buying patterns

Algorithms:

  • k-means clustering
  • Hierarchical clustering

No “correct answer” is given; the computer organizes the data on its own, based on how similar the observations are.
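To make k-means clustering concrete, here is a minimal sketch of the usual two-step loop: assign each point to its nearest center, then move each center to the mean of its points. The student data, the function name `k_means`, and its parameters are hypothetical. Library implementations add refinements, such as smarter starting centers, that are left out here.

```python
import numpy as np


def k_means(points, k, n_iter=20, seed=0):
    """Basic k-means: returns a cluster label per point and the k centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):        # an empty cluster keeps its old center
                new_centers[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break                          # centers stopped moving: converged
        centers = new_centers
    return labels, centers


# Hypothetical students: (study hours per week, practice quizzes completed).
students = np.array([[2.0, 1.0], [3.0, 2.0], [2.0, 2.0], [3.0, 1.0],
                     [10.0, 8.0], [11.0, 9.0], [10.0, 9.0], [11.0, 8.0]])

labels, centers = k_means(students, k=2)
print(labels)    # one group number per student; which group is 0 or 1 is arbitrary
print(centers)   # the average student in each group
```

No labels were supplied. The two groups emerge only from the distances between students.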


Overfitting vs. Generalization

  • Overfitting: the model memorizes the training data but fails on new data.
  • Generalization: the model captures the underlying pattern and works on new data.

Example:
If a student memorizes past exam answers (overfit), they may fail a new test.
If they learn the concepts (generalize), they succeed.
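The same contrast can be shown with data. The sketch below uses made-up numbers with an assumed linear pattern plus random noise. It fits a straight line and a degree-9 polynomial to the same ten training points, then compares their errors on the training data and on fresh test data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_pattern(x):
    # Assumed underlying relationship (hypothetical): score rises 4 points per hour.
    return 50.0 + 4.0 * x


# Ten noisy training points and ten noisy test points at different x values.
x_train = np.linspace(0.0, 10.0, 10)
y_train = true_pattern(x_train) + rng.normal(0.0, 5.0, size=x_train.size)
x_test = np.linspace(0.5, 9.5, 10)
y_test = true_pattern(x_test) + rng.normal(0.0, 5.0, size=x_test.size)


def mse(model, x, y):
    """Mean squared error of a fitted model on the given data."""
    return float(np.mean((y - model(x)) ** 2))


line = Polynomial.fit(x_train, y_train, deg=1)     # simple model
wiggly = Polynomial.fit(x_train, y_train, deg=9)   # passes through every training point

for name, model in [("line (degree 1)", line), ("wiggly (degree 9)", wiggly)]:
    print(f"{name}: train MSE={mse(model, x_train, y_train):.2f}, "
          f"test MSE={mse(model, x_test, y_test):.2f}")
```

The degree-9 curve has essentially zero training error because it memorizes the noise. On test data such a curve typically does worse than the straight line, which is the signature of overfitting.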


Key Concepts

  • Training set: data used to build the model
  • Test set: data used to evaluate performance
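  • Accuracy: how well the model predicts new data, measured on the test set (for example, the fraction of correct predictions for a classifier, or the average squared error for a regression)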
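As a sketch, here are two common ways to put a number on performance over a test set of $$n$$ examples. These are standard definitions, not necessarily the measures this course uses elsewhere. For a model that predicts categories:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{n}$$

For a model that predicts numbers, such as $$\hat{Y} = a + bX$$, the mean squared error (MSE) is used instead, where smaller is better:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$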

Visuals

Figure 16.1 — Supervised learning example: regression line predicting Y from X.

Figure 16.2 — Unsupervised learning example: scatterplot with clusters (k-means).

Figure 16.3 — Overfitting vs. generalization: wiggly curve vs. smooth line.


Why This Matters

Machine learning grows directly out of statistics:

  • Regression → prediction
  • ANOVA (comparing groups) → classification (assigning observations to groups)
  • Clustering → organizing data

By learning the basics of ML, students see how statistics powers AI.
