Statistics 2nd ed

Appendix 1 — Symbols and Notation (Cheat Sheet)

A quick reference to the symbols used in this book.

Symbol | Meaning | Example
$$\Sigma$$ | Summation (add them up) | $$\Sigma X = 2+4+6=12$$
$$\bar{X}$$ | Sample mean | $$\bar{X} = \tfrac{12}{3} = 4$$
$$\mu$$ | Population mean | “The true average of all scores”
$$s$$ | Sample standard deviation | Spread of quiz scores
$$\sigma$$ | Population standard deviation | Spread of SAT scores
$$df$$ | Degrees of freedom | $$df = n-1 = 29$$ if $$n=30$$
$$t$$ | t-test statistic | Compare two group means
$$F$$ | ANOVA statistic | Compare 3+ group means
$$r$$ | Pearson correlation | Strength of linear relationship
$$R^2$$ | Coefficient of determination | Proportion of variance explained
$$\chi^2$$ | Chi-square statistic | Compare observed vs. expected counts
$$p$$ | Probability value | “p < 0.05” → significant result
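
As a quick check on a few of these symbols, here is a minimal Python sketch using the three values from the summation example. The variable names are illustrative only, and the standard-library `statistics` module is one of several ways to compute these quantities.

```python
import statistics

scores = [2, 4, 6]  # the three values from the summation example

total = sum(scores)                 # ΣX: add the values up
mean = statistics.mean(scores)      # X̄: sample mean
spread = statistics.stdev(scores)   # s: sample standard deviation (divides by n - 1)
df = len(scores) - 1                # df: degrees of freedom, n - 1

print(total, mean, spread, df)
```

Here the sum is 12 and the mean is 4, matching the table; with only three values, df is 2.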


Lesson 17 — Regression Beyond the Line

Simple regression predicts Y from one X.
But in real life, outcomes often depend on several variables, or the relationship may not be a straight line.

This lesson introduces multiple regression and logistic regression.


Multiple Regression

Formula:

$$\hat{Y} = a + b_1X_1 + b_2X_2 + \dots + b_kX_k$$

In words:
$$\text{Predicted Y} = \text{intercept} + (b_1 \times X_1) + (b_2 \times X_2) + \dots$$

Where:

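  • $$\hat{Y}$$ = predicted value of Y
  • $$a$$ = intercept (predicted Y when all predictors equal 0)
  • $$X_1, X_2, \dots, X_k$$ = predictors
  • $$b_1, b_2, \dots, b_k$$ = slopes (weights for each predictor)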

Example: Predicting college GPA from:

  • High school GPA ($$X_1$$)
  • Study hours ($$X_2$$)

Equation:
$$\hat{Y} = 1.0 + 0.5X_1 + 0.1X_2$$

Interpretation:

  • For each 1-point increase in HS GPA, predicted college GPA rises 0.5, holding study hours constant.
  • For each extra study hour, predicted college GPA rises 0.1, holding HS GPA constant.
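
The example equation can be tried out with a short Python sketch. The function name and the input values (a 3.0 high school GPA and 10 study hours) are hypothetical, chosen only to show how the coefficients combine.

```python
def predict_college_gpa(hs_gpa, study_hours):
    """Predicted college GPA from the example equation 1.0 + 0.5*X1 + 0.1*X2."""
    intercept = 1.0
    b_hs_gpa = 0.5        # slope for high school GPA (X1)
    b_study_hours = 0.1   # slope for study hours (X2)
    return intercept + b_hs_gpa * hs_gpa + b_study_hours * study_hours


# Hypothetical student: HS GPA of 3.0 and 10 study hours.
baseline = predict_college_gpa(hs_gpa=3.0, study_hours=10)

# Raise HS GPA by one point while keeping study hours fixed.
one_point_higher = predict_college_gpa(hs_gpa=4.0, study_hours=10)

print(round(baseline, 2))                     # 3.5
print(round(one_point_higher - baseline, 2))  # 0.5
```

The difference between the two predictions equals the HS GPA slope, which is what "holding the other predictor constant" means in practice.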

Coefficient of Determination

In multiple regression, $$R^2$$ tells us the proportion of variance explained by all predictors together.

Example: $$R^2 = 0.65$$ → predictors explain 65% of the outcome’s variability.


Logistic Regression

What if the outcome is yes/no (categorical)?
Example: Will a student pass or fail?

We use logistic regression.

Formula:

$$P(Y=1) = \frac{1}{1 + e^{-(a + bX)}}$$

In words:
$$\text{Probability of success} = \frac{1}{1 + e^{-(\text{intercept} + \text{slope} \times X)}}$$

Output: probability between 0 and 1.

Example: Predicting pass/fail from study hours.

  • Equation: $$P = \frac{1}{1 + e^{-( -2 + 0.5X )}}$$
  • If X = 6 hours: $$-2 + 0.5(6) = 1$$, so $$P = \frac{1}{1 + e^{-1}} \approx 0.73$$
  • About 73% chance of passing.
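
The same calculation can be sketched in Python. The function name `pass_probability` is an illustrative choice; the intercept (-2) and slope (0.5) come from the example equation above.

```python
import math


def pass_probability(study_hours, intercept=-2.0, slope=0.5):
    """Logistic model: P = 1 / (1 + e^-(intercept + slope * X))."""
    linear_score = intercept + slope * study_hours
    return 1.0 / (1.0 + math.exp(-linear_score))


print(round(pass_probability(6), 2))  # 0.73

# The curve is S-shaped: the probability rises with study hours
# but always stays between 0 and 1.
for hours in (0, 4, 6, 10):
    print(hours, round(pass_probability(hours), 2))
```

At 4 hours the linear part is exactly 0, so the predicted probability is 0.5; this is the midpoint of the curve for this equation.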

Visuals

Figure 17.1 — Multiple regression plane: Y predicted from two predictors.

Figure 17.2 — Logistic regression curve: probability vs. study hours.


Why This Matters

  • Multiple regression = prediction with many factors
  • Logistic regression = prediction when the outcome is categorical
  • $$R^2$$ = proportion of outcome variance explained by the predictors

These methods expand the power of regression beyond a straight line, preparing for modern predictive modeling.


Lesson 10 — Regression


Correlation tells us the strength of the relationship between two variables.
Regression goes one step further: it gives us an equation to predict one variable from another.

 


The Regression Equation

The regression line predicts Y from X.

Symbolic formula:
$$\hat{Y} = a + bX$$

Formula in words:
$$\text{Predicted Y} = \text{intercept} + (\text{slope} \times X)$$

Where:

  • $$\hat{Y}$$ = predicted value of Y
  • $$a$$ = intercept (predicted value of Y when X = 0)
  • $$b$$ = slope (change in Y for each 1-unit change in X)

Slope and Intercept

The slope is calculated as:

$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

The intercept is:

$$a = \bar{Y} - b\bar{X}$$
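
The two formulas translate directly into code. The following is a minimal sketch in plain Python; the function name `fit_line` is an illustrative choice, and the data are the study-hours values used in this lesson's example.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Numerator: sum of cross-products of deviations from the means.
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations in X.
    squares_x = sum((x - mean_x) ** 2 for x in xs)

    slope = cross_products / squares_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line([2, 4, 6], [50, 60, 80])
print(round(slope, 1), round(intercept, 1))  # 7.5 33.3
```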


Example

Study hours (X) and test scores (Y):

  • X = [2, 4, 6]
  • Y = [50, 60, 80]
  • $$\bar{X} = 4, \quad \bar{Y} = 63.3$$

Step 1: Slope

  • Numerator = Σ(X – X̄)(Y – Ȳ) = 60
  • Denominator = Σ(X – X̄)² = 8
  • $$b = \tfrac{60}{8} = 7.5$$

Step 2: Intercept

  • $$a = 63.3 - (7.5)(4) = 33.3$$

Regression equation:
$$\hat{Y} = 33.3 + 7.5X$$

Interpretation: each extra study hour adds about 7.5 points to the predicted test score.


Coefficient of Determination

The square of the correlation, $$r^2$$, shows the proportion of variance explained by regression.

Here: $$r = 0.98$$, so $$r^2 \approx 0.96$$, and about 96% of score variation is explained by study hours.
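
One way to check this value is to compare the variation left over around the regression line with the total variation in Y. The sketch below assumes the slope of 7.5 from the worked example and uses the identity r² = 1 − (residual sum of squares / total sum of squares), which holds for a least-squares line; the variable names are illustrative.

```python
hours = [2, 4, 6]
scores = [50, 60, 80]

mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

slope = 7.5                          # from the worked example
intercept = mean_y - slope * mean_x  # about 33.3

# Predicted score for each student, then the unexplained and total variation.
predicted = [intercept + slope * x for x in hours]
ss_residual = sum((y - p) ** 2 for y, p in zip(scores, predicted))
ss_total = sum((y - mean_y) ** 2 for y in scores)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 2))  # 0.96
```

The result is about 0.96, the square of the correlation r ≈ 0.98 for the same data.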

 


Definition

  • Regression: predicts one variable from another using a line
  • Slope (b): how much Y changes per unit change in X
  • Intercept (a): expected value of Y when X = 0
  • r²: proportion of variance explained by regression

Visuals

Figure 10.1 — Scatterplot with regression line (Y predicted from X).

Figure 10.2 — Illustration of slope (rise/run) and intercept.


Why This Matters

Regression is a predictive tool.
It connects statistical description to practical forecasting by estimating how much the outcome (Y) changes with the predictor (X).
It is the basis for more advanced models used in science, business, and data analysis.


Lesson 9 — Correlation


Correlation measures the strength and direction of the relationship between two variables.
It tells us whether high values of one variable go with high (or low) values of another.


Pearson’s r

The most common measure is Pearson’s correlation coefficient, $$r$$.
It ranges from –1 to +1.

  • $$r = +1$$ → perfect positive correlation (as X increases, Y increases).
  • $$r = -1$$ → perfect negative correlation (as X increases, Y decreases).
  • $$r = 0$$ → no linear relationship.

Symbolic formula:
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$$

Formula in words:
$$r = \frac{\text{sum of the cross-products of deviations from the mean}}{\text{square root of (sum of squared deviations in X × sum of squared deviations in Y)}}$$


Example

Suppose study hours (X) and test scores (Y) are:

  • X = [2, 4, 6]
  • Y = [50, 60, 80]

Means:

  • $$\bar{X} = 4$$
  • $$\bar{Y} = 63.3$$

Cross-products of deviations:

  • (2–4)(50–63.3) = (–2)(–13.3) = 26.6
  • (4–4)(60–63.3) = (0)(–3.3) = 0
  • (6–4)(80–63.3) = (2)(16.7) = 33.4

Sum cross-products = 60

Sum squares X = (–2)² + 0² + 2² = 8
Sum squares Y = (–13.3)² + (–3.3)² + 16.7² ≈ 466.7

So:
$$r = \frac{60}{\sqrt{8 \times 466.7}} \approx \frac{60}{\sqrt{3733}} \approx \frac{60}{61.1} \approx 0.98$$

A very strong positive correlation.
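
The hand calculation above can be reproduced with a short Python sketch. The function name `pearson_r` is an illustrative choice; the steps follow the symbolic formula term by term.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: cross-products over the root of the two sums of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)

    return cross_products / math.sqrt(squares_x * squares_y)


r = pearson_r([2, 4, 6], [50, 60, 80])
print(round(r, 2))       # 0.98
print(round(r ** 2, 2))  # 0.96
```

Because the code keeps full precision for the mean of Y instead of rounding it to 63.3, intermediate values may differ slightly from the hand calculation, but the rounded results agree.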


Coefficient of Determination

The square of correlation is $$r^2$$.
It represents the proportion of variance in Y explained by X.

Example above:
$$r^2 = (0.98)^2 = 0.96$$

So about 96% of the variation in scores is explained by study hours.


Definition

  • Correlation: degree of linear relationship between two variables.
  • Pearson’s r: ranges from –1 to +1.
  • Coefficient of determination (r²): proportion of explained variance.

Visuals

Figure 9.1 — Scatterplot with positive correlation (points rising, line upward).

Figure 9.2 — Scatterplots showing r ≈ +1, r ≈ 0, r ≈ –1.


Why This Matters

Correlation is the first step in studying relationships.
It helps identify whether variables move together, setting the stage for regression analysis.
