Statistics 2nd ed

About This Textbook

Statistics for High School Students: Pre-College is a free, comprehensive, and interactive online textbook written by Dr. Michael Nikoletseas, a professor and researcher with numerous publications in neuroscience, philosophy of science, and mathematics. It uses only simple arithmetic, straightforward formulas, and plain English, so it is highly accessible. Despite its simplicity, it covers both elementary and advanced statistics topics, as well as modern data science concepts.

Mission & Vision

Our mission is to deliver a statistics textbook that:

  • helps students across a wide range of disciplines (from social and behavioral sciences to engineering and mathematics) acquire a deep understanding of statistical reasoning, not just procedural techniques;
  • presents key statistical concepts in a manner that bridges theory and practice, emphasizing interpretive insight (“what does this mean?”) alongside computational method;
  • adopts an open mindset toward pedagogy: the site is structured for readability, modular use (individual chapters may be used independently if desired), and easy updates as the field evolves;
  • integrates modern elements (resampling, simulation, an introduction to machine learning, robust inference) while preserving the classical foundations (distributions, hypothesis testing, ANOVA, regression) so students are well-grounded for further work.

Who This Is For

This textbook is ideal for:

  • high school students preparing for biology or social science majors;
  • students in a one- or two-semester introductory statistics sequence who want more than formula memorization;
  • non‐mathematics majors (e.g. philosophy of science) who need to understand how to interpret and apply statistical reasoning in their discipline;
  • mathematics or statistics majors seeking a readable, web‐enabled resource that complements more formal references;
  • educators who want a ready‐to-use, modular, up‐to‐date resource for their course, including figures, examples, and modern topics.

Author & Credentials

Dr. Michael Nikoletseas is the author of this textbook and brings a unique interdisciplinary background: his published works span neuroscience, philosophy of science, and mathematics, and are held in leading academic libraries (Harvard, Oxford, Princeton). His ambition with this text is to raise the bar for clarity, coherence, and depth in undergraduate statistics education.

With this online text, he applies the same analytical rigor he uses in his philosophical and mathematical writing: clear definitions, structured exposition, precise notation, and an emphasis on the limits of inference and interpretation (a theme that resonates with his broader work in epistemology).

Structure of the Textbook

The book is arranged into chapters, each designed to stand on its own while also fitting into an integrated whole. Typical chapters proceed in this order:

  1. Introduction & motivation
  2. Essential theory and notation (mathematics and formulas)
  3. Detailed examples and figures (copyable images for instructor use)
  4. Worked problems, with step-by-step solutions and commentary
  5. Live self-test quizzes
  6. A contact form for asking questions about the chapter
  7. Advanced topics, extensions, and links (for students preparing for further study)

Current chapters already include: descriptive statistics, probability, distributions, the normal distribution, hypothesis testing, t‐tests, one‐way and multi‐way ANOVA (including mixed designs and post-hoc comparisons), resampling and simulation, machine learning foundations, and big data computational statistics.

Contact & Feedback

Your feedback is valuable. If you spot an error, have a suggestion for improvement, or want to request supplementary material, use the Feedback link in the main menu. To ask questions, use the contact form below each chapter.

Acknowledgements

The creation of this textbook has drawn on countless influences—from classical mathematics and modern statistics pedagogy to insights from neuroscience, philosophy of science, and epistemology. Special thanks to readers and educators who engage with the text, write with questions, and propose improvements. Together we advance statistical literacy and interpretive clarity.

Thank you for visiting StatisticsTextbook.com. May this textbook serve you well in your statistical journey.

— Michael Nikoletseas

For Students: How to Use statisticstextbook.com

A simple guide for starting, studying in order, and reviewing.

Audience: Pre-college and high school students

1. What This Site Is

statisticstextbook.com is a free, page-by-page statistics textbook. You can read it in order like a print book, or use it as a reference when you need help with a topic.

Most students do best by moving from the foundations (data, variability, probability) into core tests (t-tests and ANOVA), and then into modern topics (resampling, big data, and an introduction to machine learning).

2. How to Use This Textbook

  1. Start with the first lesson.
  2. Follow the Next / Previous links. Each lesson ends with navigation links so you can keep the correct order without guessing what comes next.
  3. Keep a small “definitions” page in your notes. Write down the meaning of key terms (mean, variance, standard deviation, probability, distribution) as you encounter them.
  4. For each test, practice three skills: (1) stating the question, (2) doing the computation, and (3) interpreting the result in words.
  5. Use the review pages when you get stuck.

3. Reading the Math

Formulas are displayed with MathJax so they stay clear on different screens. If a formula looks unfamiliar, read it slowly and connect each symbol to a meaning in words.

4. Why This Format Helps

  • Clear sequence: lessons build from basic ideas to core tests.
  • Readable math: formulas render cleanly across devices.
  • Study-friendly: minimal distractions and no sign-in required.
  • Open access: free to use for learning and review.

5. Summary

Use the textbook in order if you are learning statistics for the first time, and use it as a reference when you need a quick explanation or a worked example. If you study steadily and keep your own notes of definitions and interpretations, the material becomes much easier over time.

© 2025.

Mixed (Split-Plot) ANOVA

Goal. Test a between-subjects factor (Group: Drug vs. Placebo) and a within-subjects factor (Time: Weeks 1–3), plus their interaction, on exam scores.

Design & Experiment

  • Between-subjects factor: Group = {Drug, Placebo}
  • Within-subjects factor: Time = {Week 1, Week 2, Week 3}
  • Balanced: 8 participants per group (\(n_g=8\)), 3 repeated measures per participant (\(k=3\)).

Participants are randomly assigned to Drug or Placebo. The same exam is given at Week 1, Week 2, and Week 3.

Figure 1: Mixed design layout (Drug vs Placebo × Weeks 1–3).


Data

Group: Drug (8 participants × 3 weeks)

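| Subject | W1 | W2 | W3 | Row sum | Row mean |
|---|---|---|---|---|---|
| D1 | 70 | 74 | 78 | 222 | 74.00 |
| D2 | 69 | 73 | 77 | 219 | 73.00 |
| D3 | 71 | 75 | 79 | 225 | 75.00 |
| D4 | 72 | 76 | 80 | 228 | 76.00 |
| D5 | 68 | 72 | 76 | 216 | 72.00 |
| D6 | 70 | 74 | 78 | 222 | 74.00 |
| D7 | 73 | 77 | 81 | 231 | 77.00 |
| D8 | 71 | 76 | 80 | 227 | 75.67 |
| Column sums | 564 | 597 | 629 | Group sum = 1790 | Group mean \( \bar X_{\text{Drug}} = 1790/24 = 74.5833 \) |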

Group: Placebo (8 participants × 3 weeks)

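| Subject | W1 | W2 | W3 | Row sum | Row mean |
|---|---|---|---|---|---|
| P1 | 70 | 71 | 72 | 213 | 71.00 |
| P2 | 69 | 70 | 71 | 210 | 70.00 |
| P3 | 71 | 72 | 73 | 216 | 72.00 |
| P4 | 72 | 73 | 74 | 219 | 73.00 |
| P5 | 68 | 69 | 70 | 207 | 69.00 |
| P6 | 70 | 71 | 72 | 213 | 71.00 |
| P7 | 69 | 70 | 71 | 210 | 70.00 |
| P8 | 71 | 72 | 73 | 216 | 72.00 |
| Column sums | 560 | 568 | 576 | Group sum = 1704 | Group mean \( \bar X_{\text{Plac}} = 1704/24 = 71.0000 \) |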

Totals. Grand sum = 1790 + 1704 = 3494, total observations \(N = 16\times3 = 48\), grand mean \( \bar X = 3494/48 = 72.7917\).

Figure 2: Mean profiles over weeks (Drug rises sharply; Placebo ~ flat).


Step 1 — Marginal Means

By Time (across both groups; 16 participants each week): \[ \bar X_{\text{W1}}=\tfrac{1124}{16}=70.2500,\qquad \bar X_{\text{W2}}=\tfrac{1165}{16}=72.8125,\qquad \bar X_{\text{W3}}=\tfrac{1205}{16}=75.3125, \] where column sums are \(1124, 1165, 1205\).

By Group (across all weeks): \[ \bar X_{\text{Drug}}=74.5833,\qquad \bar X_{\text{Placebo}}=71.0000. \]
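The marginal means above can be reproduced directly from the two data tables. The following is a minimal sketch using only the Python standard library; the variable names (`drug`, `placebo`, `column_sums`) are illustrative choices, not part of the textbook.

```python
# Scores copied from the Drug and Placebo tables
# (rows = participants, columns = Weeks 1-3).
drug = [
    [70, 74, 78], [69, 73, 77], [71, 75, 79], [72, 76, 80],
    [68, 72, 76], [70, 74, 78], [73, 77, 81], [71, 76, 80],
]
placebo = [
    [70, 71, 72], [69, 70, 71], [71, 72, 73], [72, 73, 74],
    [68, 69, 70], [70, 71, 72], [69, 70, 71], [71, 72, 73],
]
k = 3  # repeated measures per participant


def column_sums(rows):
    """Sum each week (column) over the participants of one group."""
    return [sum(col) for col in zip(*rows)]


drug_cols = column_sums(drug)        # [564, 597, 629]
placebo_cols = column_sums(placebo)  # [560, 568, 576]

# Time marginal means pool both groups: 16 scores per week.
n_subjects = len(drug) + len(placebo)
time_means = [(d + p) / n_subjects for d, p in zip(drug_cols, placebo_cols)]

# Group marginal means pool all weeks: 24 scores per group.
drug_mean = sum(drug_cols) / (len(drug) * k)
placebo_mean = sum(placebo_cols) / (len(placebo) * k)
grand_mean = (sum(drug_cols) + sum(placebo_cols)) / (n_subjects * k)

print(time_means)  # [70.25, 72.8125, 75.3125]
print(round(drug_mean, 4), round(placebo_mean, 4), round(grand_mean, 4))
# 74.5833 71.0 72.7917
```

Keeping the column sums as an intermediate step mirrors the hand calculation, so each printed value can be checked against the tables.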


Step 2 — Sums of Squares (SS)

Decompose total variability into Between-Subjects and Within-Subjects parts.

2A. Total

\[ SS_{\text{total}}=\sum (X_{igt}-\bar X)^2=\mathbf{527.9167}. \]

2B. Between-Subjects

Let each subject’s mean be \(\bar X_{i\cdot}\). Then \[ SS_{\text{BS-total}}=k\sum_{i=1}^{16}(\bar X_{i\cdot}-\bar X)^2=\mathbf{247.2500}. \] Split into Group and Subjects-within-Group: \[ SS_{\text{Group}}=k\sum_{g} n_g(\bar X_{g\cdot\cdot}-\bar X)^2=\mathbf{154.0833}, \] \[ SS_{\text{Subj}(g)}=k\sum_{i\in g}(\bar X_{i\cdot}-\bar X_{g\cdot\cdot})^2=\mathbf{93.1667}. \]

2C. Within-Subjects

\(SS_{\text{WS-total}}=SS_{\text{total}}-SS_{\text{BS-total}}=\mathbf{280.6667}.\)

Decompose into Time, Group×Time, and residual Error, where \(N_s=16\) is the total number of subjects and \(n_g=8\) is the number of subjects per group: \[ SS_{\text{Time}}=N_s\sum_{t}(\bar X_{\cdot\cdot t}-\bar X)^2=\mathbf{205.0417}, \] \[ SS_{\text{Group}\times\text{Time}} =\sum_{g,t} n_g\Big(\bar X_{g\cdot t}-\bar X_{g\cdot\cdot}-\bar X_{\cdot\cdot t}+\bar X\Big)^2 =\mathbf{75.0417}, \] \[ SS_{\text{Error(WS)}}=SS_{\text{WS-total}}-SS_{\text{Time}}-SS_{\text{G}\times\text{T}} =\mathbf{0.5833}. \]
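As a cross-check on the partition in Step 2, here is a hedged sketch that recomputes every sum of squares from the raw scores with plain Python. The helper `mean` and all variable names are assumptions made for illustration. Subjects-within-Group is obtained by subtraction, which matches the direct formula in 2B.

```python
drug = [
    [70, 74, 78], [69, 73, 77], [71, 75, 79], [72, 76, 80],
    [68, 72, 76], [70, 74, 78], [73, 77, 81], [71, 76, 80],
]
placebo = [
    [70, 71, 72], [69, 70, 71], [71, 72, 73], [72, 73, 74],
    [68, 69, 70], [70, 71, 72], [69, 70, 71], [71, 72, 73],
]
groups = [drug, placebo]
k = 3  # time points per subject


def mean(values):
    return sum(values) / len(values)


all_scores = [x for grp in groups for row in grp for x in row]
grand = mean(all_scores)
n_subjects = sum(len(grp) for grp in groups)

# 2A. Total variability around the grand mean.
ss_total = sum((x - grand) ** 2 for x in all_scores)

# 2B. Between-subjects: subject means vs. grand mean, k scores per subject.
ss_bs = k * sum((mean(row) - grand) ** 2 for grp in groups for row in grp)
group_means = [mean([x for row in grp for x in row]) for grp in groups]
ss_group = sum(k * len(grp) * (gm - grand) ** 2
               for grp, gm in zip(groups, group_means))
ss_subj = ss_bs - ss_group  # Subjects within Group

# 2C. Within-subjects: Time, Group x Time, and residual error.
ss_ws = ss_total - ss_bs
time_means = [mean([row[t] for grp in groups for row in grp]) for t in range(k)]
ss_time = n_subjects * sum((tm - grand) ** 2 for tm in time_means)

ss_gxt = 0.0
for grp, gm in zip(groups, group_means):
    for t in range(k):
        cell_mean = mean([row[t] for row in grp])
        ss_gxt += len(grp) * (cell_mean - gm - time_means[t] + grand) ** 2

ss_error = ss_ws - ss_time - ss_gxt

for name, value in [("total", ss_total), ("group", ss_group), ("subj(g)", ss_subj),
                    ("time", ss_time), ("GxT", ss_gxt), ("error", ss_error)]:
    print(f"SS_{name} = {value:.4f}")
```

The printed values should agree with the bold figures above to four decimals.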

Figure 3: Partitioning diagram (Between: Group + Subj(Group); Within: Time + G×T + Error).


Step 3 — Degrees of Freedom (df) & Mean Squares (MS)

Here \(g=2\) is the number of groups, \(N_s=16\) the total number of subjects, and \(k=3\) the number of time points. \[ \begin{aligned} &df_{\text{Group}}=g-1=1,\qquad df_{\text{Subj}(g)}=N_s-g=16-2=14,\\ &df_{\text{Time}}=k-1=2,\qquad df_{\text{G}\times\text{T}}=(g-1)(k-1)=2,\\ &df_{\text{Error(WS)}}=(N_s-g)(k-1)=(16-2)\times2=28,\\ &df_{\text{Total}}=N_sk-1=48-1=47. \end{aligned} \]

\[ \begin{aligned} &MS_{\text{Group}}=\frac{SS_{\text{Group}}}{df_{\text{Group}}}= \frac{154.0833}{1}= \mathbf{154.0833},\qquad MS_{\text{Subj}(g)}=\frac{93.1667}{14}= \mathbf{6.6548},\\ &MS_{\text{Time}}=\frac{205.0417}{2}= \mathbf{102.5208},\qquad MS_{\text{G}\times\text{T}}=\frac{75.0417}{2}= \mathbf{37.5208},\\ &MS_{\text{Error(WS)}}=\frac{0.5833}{28}= \mathbf{0.02083}. \end{aligned} \]


Step 4 — F Tests & p-values

Between-subjects test: \[ F_{\text{Group}}=\frac{MS_{\text{Group}}}{MS_{\text{Subj}(g)}}=\frac{154.0833}{6.6548}= \mathbf{23.1538}, \quad df=(1,14),\quad p\approx \mathbf{0.00028}. \]

Within-subjects tests: \[ F_{\text{Time}}=\frac{MS_{\text{Time}}}{MS_{\text{Error(WS)}}} =\frac{102.5208}{0.02083}= \mathbf{4921.0},\quad df=(2,28),\quad p\ll 10^{-20}. \] \[ F_{\text{G}\times\text{T}}=\frac{MS_{\text{G}\times\text{T}}}{MS_{\text{Error(WS)}}} =\frac{37.5208}{0.02083}= \mathbf{1801.0},\quad df=(2,28),\quad p\ll 10^{-20}. \]
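The key point in Step 4 is that each effect is divided by its own error term: Group by Subjects-within-Group, and Time and Group×Time by the within-subjects error. The sketch below recomputes the three F ratios from the rounded SS and df values reported above, so its results differ slightly from 4921.0 and 1801.0. For the two tests with \(df_1=2\) it uses the closed-form right tail of an F(2, df2) distribution, \((1+2F/df_2)^{-df_2/2}\), which is a standard identity and not something the textbook states. The dictionary keys and function name are illustrative.

```python
# SS and df copied (rounded to four decimals) from Steps 2 and 3.
ss = {"group": 154.0833, "subj": 93.1667, "time": 205.0417,
      "gxt": 75.0417, "error": 0.5833}
df = {"group": 1, "subj": 14, "time": 2, "gxt": 2, "error": 28}
ms = {name: ss[name] / df[name] for name in ss}

# Between-subjects effect uses Subjects-within-Group as its error term.
f_group = ms["group"] / ms["subj"]
# Within-subjects effects use the within-subjects residual.
f_time = ms["time"] / ms["error"]
f_gxt = ms["gxt"] / ms["error"]


def f_tail_df1_is_2(f_value, df2):
    """Right-tail probability for F(2, df2); valid only when df1 == 2."""
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


p_time = f_tail_df1_is_2(f_time, df["error"])
p_gxt = f_tail_df1_is_2(f_gxt, df["error"])

print(f"F_group = {f_group:.2f}")
print(f"F_time  = {f_time:.0f}, p = {p_time:.1e}")
print(f"F_GxT   = {f_gxt:.0f}, p = {p_gxt:.1e}")
```

Both within-subjects p-values come out many orders of magnitude below \(10^{-20}\), consistent with the table that follows.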

Figure 4: F distributions with observed statistics marked.


Mixed ANOVA Summary Table

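| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between: Group | 154.0833 | 1 | 154.0833 | 23.1538 | 0.00028 |
| Between: Subjects within Group | 93.1667 | 14 | 6.6548 | | |
| Within: Time | 205.0417 | 2 | 102.5208 | 4921.0 | < 1e-20 |
| Within: Group × Time | 75.0417 | 2 | 37.5208 | 1801.0 | < 1e-20 |
| Within: Error (Subj×Time within Group) | 0.5833 | 28 | 0.02083 | | |
| Total | 527.9167 | 47 | | | |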

Interpretation

Group: Drug > Placebo overall (significant between-subjects effect).
Time: Scores increase across weeks (strong within-subjects effect).
Group × Time: The Drug group improves sharply week-to-week while the Placebo group changes little (significant interaction).

Figure 5: Interaction plot showing non-parallel lines (Drug rising; Placebo flat).

Assumptions (checklist)

  • Independence between subjects; correct grouping.
  • Approximate normality within each Group×Time cell.
  • Homogeneity of variance across groups (between-subjects).
  • Sphericity for the within-subject factor Time (apply Greenhouse–Geisser/Huynh–Feldt corrections if violated).

Note: The residual within-subject error is intentionally small in this teaching dataset, so the Time and G×T F values are very large. Real data typically have larger residual variability.

Practice self-test quiz

Practice problems and self-test quizzes appear in the space below. For full access, sign up for free.

Repeated-Measures ANOVA

Goal. Test whether performance changes across four conditions measured on the same participants.

Design & Experiment

  • Within-subjects factor: Condition with 4 levels (C1, C2, C3, C4).
  • s = 8 participants measured in k = 4 conditions ⇒ total observations \(N = s \times k = 32\).
  • Example context: the same students take four weekly quizzes after different study activities.

Figure 1: Profile plot (each subject as a line across the four conditions).


Data

Scores (rows = participants S1–S8; columns = conditions C1–C4):

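| Subject | C1 | C2 | C3 | C4 | Row sum | Row mean |
|---|---|---|---|---|---|---|
| S1 | 70 | 74 | 75 | 81 | 300 | 75.00 |
| S2 | 73 | 75 | 78 | 82 | 308 | 77.00 |
| S3 | 68 | 73 | 73 | 78 | 292 | 73.00 |
| S4 | 74 | 79 | 81 | 85 | 319 | 79.75 |
| S5 | 71 | 74 | 78 | 82 | 305 | 76.25 |
| S6 | 70 | 72 | 76 | 78 | 296 | 74.00 |
| S7 | 73 | 77 | 80 | 84 | 314 | 78.50 |
| S8 | 74 | 77 | 80 | 84 | 315 | 78.75 |
| Column sums | 573 | 601 | 621 | 654 | Grand sum = 2449 | Grand mean \( \bar X = 2449/32 = 76.53125 \) |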

Figure 2: Means ± SEM for C1–C4 (bar/line).


Step 1 — Condition Means (and sample variances)

\[ \begin{aligned} \bar X_{\mathrm{C1}} &= 573/8 = 71.625, \quad & s^2_{\mathrm{C1}} &= 4.8393 \\ \bar X_{\mathrm{C2}} &= 601/8 = 75.125, \quad & s^2_{\mathrm{C2}} &= 5.5536 \\ \bar X_{\mathrm{C3}} &= 621/8 = 77.625, \quad & s^2_{\mathrm{C3}} &= 7.6964 \\ \bar X_{\mathrm{C4}} &= 654/8 = 81.750, \quad & s^2_{\mathrm{C4}} &= 7.0714 \end{aligned} \]


Step 2 — Sums of Squares

Notation: \(s=8\) subjects, \(k=4\) conditions, grand mean \( \bar X = 76.53125\).

2A. Total

\[ SS_{\text{total}}=\sum_{i=1}^{s}\sum_{j=1}^{k}\bigl(X_{ij}-\bar X\bigr)^2 =\mathbf{611.96875}. \]

2B. Conditions (Treatment)

\[ SS_{\text{cond}}= s \sum_{j=1}^{k}\bigl(\bar X_{\cdot j}-\bar X\bigr)^2 = 8 \left[(71.625-76.53125)^2 + (75.125-76.53125)^2 + (77.625-76.53125)^2 + (81.75-76.53125)^2\right] =\mathbf{435.84375}. \]

2C. Subjects

\[ SS_{\text{subj}}= k \sum_{i=1}^{s}\bigl(\bar X_{i\cdot}-\bar X\bigr)^2 = 4 \sum_{i=1}^{8}\bigl(\bar X_{i\cdot}-76.53125\bigr)^2 =\mathbf{162.71875}. \]

2D. Error (Residual)

\[ SS_{\text{error}}= SS_{\text{total}} - SS_{\text{cond}} - SS_{\text{subj}} = 611.96875 - 435.84375 - 162.71875 =\mathbf{13.40625}. \]
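The four sums of squares in Step 2 can be verified from the score table with a few lines of standard-library Python. This is a sketch for checking hand calculations; the names `scores`, `cond_means`, and `subj_means` are illustrative.

```python
# Rows = participants S1-S8, columns = conditions C1-C4.
scores = [
    [70, 74, 75, 81], [73, 75, 78, 82], [68, 73, 73, 78], [74, 79, 81, 85],
    [71, 74, 78, 82], [70, 72, 76, 78], [73, 77, 80, 84], [74, 77, 80, 84],
]
s = len(scores)     # number of subjects
k = len(scores[0])  # number of conditions

grand = sum(sum(row) for row in scores) / (s * k)

# 2A. Total: every score against the grand mean.
ss_total = sum((x - grand) ** 2 for row in scores for x in row)

# 2B. Conditions: column means against the grand mean, s scores per column.
cond_means = [sum(col) / s for col in zip(*scores)]
ss_cond = s * sum((m - grand) ** 2 for m in cond_means)

# 2C. Subjects: row means against the grand mean, k scores per row.
subj_means = [sum(row) / k for row in scores]
ss_subj = k * sum((m - grand) ** 2 for m in subj_means)

# 2D. Error: what is left after removing condition and subject effects.
ss_error = ss_total - ss_cond - ss_subj

print(ss_total, ss_cond, ss_subj, ss_error)
```

Removing the subject term is what separates this analysis from a one-way ANOVA on the same numbers: stable differences between participants no longer count as error.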

Figure 3: Partitioning variance diagram (Total → Conditions + Subjects + Error).


Step 3 — Degrees of Freedom & Mean Squares

\[ \begin{aligned} df_{\text{cond}} &= k-1 = 3, \\ df_{\text{subj}} &= s-1 = 7, \\ df_{\text{error}} &= (s-1)(k-1) = 7\times3 = 21, \\ df_{\text{total}} &= sk-1 = 31. \end{aligned} \]

\[ MS_{\text{cond}} = \frac{SS_{\text{cond}}}{df_{\text{cond}}} =\frac{435.84375}{3}=\mathbf{145.28125},\qquad MS_{\text{error}} = \frac{SS_{\text{error}}}{df_{\text{error}}} =\frac{13.40625}{21}=\mathbf{0.6383928571}. \]


Step 4 — Test Statistic & p-value

\[ F = \frac{MS_{\text{cond}}}{MS_{\text{error}}} = \frac{145.28125}{0.6383928571} =\mathbf{227.5734}. \] With \(df_1=3\) and \(df_2=21\), this is extremely large. The right-tail p-value is effectively \(p \lt 10^{-12}\) (i.e., \(p \ll .001\)).
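To see where a bound like \(p<10^{-12}\) comes from without statistical software, the right tail of the F distribution can be approximated numerically. The sketch below is one possible approach, not the method used by the textbook. It relies on the standard identity \(P(F>f)=I_x(df_2/2,\,df_1/2)\) with \(x=df_2/(df_2+df_1 f)\), where \(I_x\) is the regularized incomplete beta function, and evaluates the integral with Simpson's rule. The function name `f_right_tail` and the step count are assumptions.

```python
import math


def f_right_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for an F(df1, df2) distribution.

    Integrates the beta density on [0, x], x = df2 / (df2 + df1 * f_value),
    with Simpson's rule. Assumes f_value > 0, df2 >= 2, and an even `steps`.
    """
    a, b = df2 / 2, df1 / 2
    x = df2 / (df2 + df1 * f_value)
    # log B(a, b) through log-gamma keeps the normalizing constant stable.
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3 / math.exp(log_beta)


p_value = f_right_tail(227.5734, 3, 21)
print(f"p = {p_value:.1e}")  # far below 1e-12
```

In practice a library routine such as SciPy's `scipy.stats.f.sf` would normally be used instead; the hand-rolled version only makes the calculation visible.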

Figure 4: F distribution with observed F marked and right-tail region shaded.


Repeated-Measures ANOVA Summary Table

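| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Conditions (within) | 435.84375 | 3 | 145.28125 | 227.5734 | < 1e-12 |
| Subjects | 162.71875 | 7 | 23.24554 | | |
| Error (residual) | 13.40625 | 21 | 0.63839 | | |
| Total | 611.96875 | 31 | | | |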

Interpretation

Mean performance increases steadily from C1 → C4, and the repeated-measures ANOVA shows a highly significant effect of Condition, \(F(3,21)=227.57,\, p\ll .001\). Follow-ups (e.g., paired t-tests with Bonferroni/Holm) can localize which pairs of conditions differ.

Assumptions (checklist)

  • Sphericity (equal variances of the differences between condition pairs). If violated, apply Greenhouse–Geisser or Huynh–Feldt correction to \(df\).
  • Approximately normal scores within each condition.
  • No carryover/fatigue effects that confound order (counterbalancing helps).

Figure 5: Sphericity concept sketch (pairwise difference variances).

Factorial ANOVA

Goal. Test the effects of Method (Lecture vs. Online) and Time (Early vs. Late) on exam scores, and whether there is an interaction between Method and Time.

Design & Experiment

  • Factor A (Method): Lecture vs. Online
  • Factor B (Time): Early vs. Late
  • Balanced design: \(n=5\) per cell ⇒ total \(N=20\).

Students are randomly assigned to one of four cells (Method × Time). After a short module, all students take the same 100-point exam.

Figure 1: 2 × 2 layout (Method × Time).


Data

Scores by cell (five students per cell):

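| Method | Time | Scores | Cell Mean |
|---|---|---|---|
| Lecture | Early | 68, 68, 70, 72, 72 | 70.0 |
| Lecture | Late | 76, 76, 78, 80, 80 | 78.0 |
| Online | Early | 70, 70, 72, 74, 74 | 72.0 |
| Online | Late | 71, 71, 73, 75, 75 | 73.0 |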

Within each cell the sample variance is 4 (SD = 2), so the within-cell sum of squares is \((n-1)s^2 = 4\times4 = 16\) per cell.

Figure 2: Means with SEM by Time, separate lines for Method.

Figure 3: Interaction plot (Lecture rises sharply; Online nearly flat).


Step 1 — Marginal Means and Grand Mean

Cell means: \[ \bar X_{\text{Lecture,Early}}=70,\; \bar X_{\text{Lecture,Late}}=78,\; \bar X_{\text{Online,Early}}=72,\; \bar X_{\text{Online,Late}}=73. \] Marginal means: \[ \bar X_{\text{Lecture}}=\frac{70+78}{2}=74,\quad \bar X_{\text{Online}}=\frac{72+73}{2}=72.5; \qquad \bar X_{\text{Early}}=\frac{70+72}{2}=71,\quad \bar X_{\text{Late}}=\frac{78+73}{2}=75.5. \] Grand mean: \[ \bar X=\frac{70+78+72+73}{4}=73.25. \]


Step 2 — Sums of Squares (Between)

Balanced design formulas (with \(n\) per cell, \(a=b=2\)):

  • \(SS_A = nb \sum_a(\bar X_{a\cdot}-\bar X)^2\), here \(nb=10\).
  • \(SS_B = na \sum_b(\bar X_{\cdot b}-\bar X)^2\), here \(na=10\).
  • \(SS_{AB} = n \sum_{a,b}\big(\bar X_{ab}-\bar X_{a\cdot}-\bar X_{\cdot b}+\bar X\big)^2\), here \(n=5\).

Compute each term:

Factor A (Method): \[ \begin{aligned} SS_A &= 10\Big[(74-73.25)^2 + (72.5-73.25)^2\Big]\\ &= 10\big[0.75^2 + (-0.75)^2\big] = 10(0.5625+0.5625)=\mathbf{11.25}. \end{aligned} \]

Factor B (Time): \[ \begin{aligned} SS_B &= 10\Big[(71-73.25)^2 + (75.5-73.25)^2\Big]\\ &= 10\big[(-2.25)^2 + (2.25)^2\big] = 10(5.0625+5.0625)=\mathbf{101.25}. \end{aligned} \]

Interaction \(A\times B\): For each cell compute \(d_{ab}=\bar X_{ab}-\bar X_{a\cdot}-\bar X_{\cdot b}+\bar X\). Here each \(d_{ab}=\pm1.75\) so \(d_{ab}^2=3.0625\) and there are four cells: \[ SS_{AB}=5\times(4\times3.0625)=\mathbf{61.25}. \]


Step 3 — Within-Group (Error) and Total SS

Within each cell, \((n-1)s^2=16\). With 4 cells: \[ SS_{\text{within}}=\mathbf{64.00}. \]

Total: \[ SS_{\text{total}}=SS_A+SS_B+SS_{AB}+SS_{\text{within}} =11.25+101.25+61.25+64.00=\mathbf{238.75}. \]


Step 4 — Degrees of Freedom & Mean Squares

\[ \begin{aligned} &df_A=a-1=1,\quad df_B=b-1=1,\quad df_{AB}=(a-1)(b-1)=1,\\ &df_{\text{within}}=N-ab=20-4=\mathbf{16},\quad df_{\text{total}}=N-1=19. \end{aligned} \] \[ MS_A=\frac{11.25}{1}=11.25,\quad MS_B=\frac{101.25}{1}=101.25,\quad MS_{AB}=\frac{61.25}{1}=61.25,\quad MS_{\text{within}}=\frac{64.00}{16}=\mathbf{4.00}. \]


Step 5 — F Tests & p-values

\[ F_A=\frac{MS_A}{MS_{\text{within}}}=\frac{11.25}{4}= \mathbf{2.8125},\qquad F_B=\frac{MS_B}{MS_{\text{within}}}=\frac{101.25}{4}= \mathbf{25.3125},\qquad F_{AB}=\frac{MS_{AB}}{MS_{\text{within}}}=\frac{61.25}{4}= \mathbf{15.3125}. \] With \(df_1=1\), \(df_2=16\): \[ p_A \approx 0.11\;(\text{n.s.}),\quad p_B < 0.001,\quad p_{AB} \approx 0.001. \]
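Because every effect in this 2 × 2 design has \(df_1=1\), each F statistic equals the square of a t statistic with 16 degrees of freedom. This gives a rough cross-check against a t table. The two-tailed 5% critical value for 16 df is about 2.12, a standard table value rather than something stated above.

\[ |t_A|=\sqrt{2.8125}\approx 1.68,\qquad |t_B|=\sqrt{25.3125}\approx 5.03,\qquad |t_{AB}|=\sqrt{15.3125}\approx 3.91. \]

Only \(|t_A|\) falls below 2.12, which agrees with Method being non-significant while Time and the interaction are significant.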


ANOVA Summary Table

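| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Method (A) | 11.25 | 1 | 11.25 | 2.8125 | ≈ 0.11 |
| Time (B) | 101.25 | 1 | 101.25 | 25.3125 | < 0.001 |
| A × B | 61.25 | 1 | 61.25 | 15.3125 | ≈ 0.001 |
| Within (Error) | 64.00 | 16 | 4.00 | | |
| Total | 238.75 | 19 | | | |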

Interpretation

Main effect of Time (B) is significant: Late > Early on average. Main effect of Method (A) is not significant at conventional levels. The interaction (A × B) is significant: Lecture improves markedly from Early→Late, while Online changes little—non-parallel lines in the interaction plot.

Figure 4: Interaction plot highlighting non-parallel lines.

Assumptions (checklist)

  • Independence of observations within and across cells.
  • Approximately normal scores within each cell.
  • Homogeneity of variances across cells (here, each cell variance ≈ 4).

One-Way ANOVA

Goal. Test whether three teaching methods lead to different average exam scores.

Design & Experiment

Twenty-four students are randomly assigned to one of three methods (n = 8 per group):

  • Group A: Active discussion
  • Group B: Structured lecture
  • Group C: Self-study

After a 2-week module, everyone takes the same 100-point exam.


Data

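| Group A | Group B | Group C |
|---|---|---|
| 72 | 78 | 65 |
| 68 | 82 | 70 |
| 75 | 80 | 66 |
| 70 | 77 | 68 |
| 69 | 79 | 67 |
| 73 | 81 | 69 |
| 71 | 83 | 64 |
| 74 | 76 | 71 |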

Figure 1: Boxplots of scores by group.

Group sizes: \(n_A=n_B=n_C=8\). Total \(N=24\).


Step 1 — Sums & Means

\(\displaystyle \begin{aligned} \text{Sums:}&\quad \sum A=572,\;\; \sum B=636,\;\; \sum C=540.\\[4pt] \text{Means:}&\quad \bar A=\tfrac{572}{8}=71.5,\;\; \bar B=\tfrac{636}{8}=79.5,\;\; \bar C=\tfrac{540}{8}=67.5.\\[4pt] \text{Grand mean:}&\quad \bar X=\tfrac{572+636+540}{24}=72.8333\ldots \end{aligned} \)


Step 2 — Within-Group Variability (sample variances)

For each group, compute \( s_g^2=\dfrac{\sum(x-\bar x_g)^2}{n_g-1} \).

  • \(s_A^2 = 6.0\)
  • \(s_B^2 = 6.0\)
  • \(s_C^2 = 6.0\)

Corresponding sums of squares within each group: \(\displaystyle SS_A=\sum(x-\bar A)^2=42,\; SS_B=42,\; SS_C=42\Rightarrow SS_{\text{within}}=42+42+42=126.0. \)

Figure 2: Group means with SEM error bars.


Step 3 — Between-Groups Variability

\(\displaystyle SS_{\text{between}}=\sum_{g} n_g(\bar x_g-\bar X)^2 =8(71.5-72.8333)^2+8(79.5-72.8333)^2+8(67.5-72.8333)^2 =597.3333\ldots \)

Total sum of squares: \(\displaystyle SS_{\text{total}}=\sum (x-\bar X)^2 = SS_{\text{between}}+SS_{\text{within}} =597.3333\ldots+126.0=723.3333\ldots \)
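The within, between, and total sums of squares can be confirmed from the data table in a few lines. This is a minimal sketch with illustrative names, using only built-in Python.

```python
# Exam scores by teaching method, copied from the data table.
groups = {
    "A": [72, 68, 75, 70, 69, 73, 71, 74],
    "B": [78, 82, 80, 77, 79, 81, 83, 76],
    "C": [65, 70, 66, 68, 67, 69, 64, 71],
}


def mean(values):
    return sum(values) / len(values)


all_scores = [x for scores in groups.values() for x in scores]
grand = mean(all_scores)

# Within: each score against its own group mean.
ss_within = sum((x - mean(scores)) ** 2
                for scores in groups.values() for x in scores)

# Between: each group mean against the grand mean, weighted by group size.
ss_between = sum(len(scores) * (mean(scores) - grand) ** 2
                 for scores in groups.values())

# Total: each score against the grand mean (computed independently).
ss_total = sum((x - grand) ** 2 for x in all_scores)

print(round(ss_within, 4), round(ss_between, 4), round(ss_total, 4))
# 126.0 597.3333 723.3333
```

Computing `ss_total` directly, rather than by addition, shows that the identity \(SS_{\text{total}}=SS_{\text{between}}+SS_{\text{within}}\) holds for these data.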

Figure 3: Partitioning variance (\(SS_{\text{total}}=SS_{\text{between}}+SS_{\text{within}}\)).


Degrees of Freedom & Mean Squares

\(\displaystyle df_{\text{between}}=k-1=3-1=2,\qquad df_{\text{within}}=N-k=24-3=21,\qquad df_{\text{total}}=N-1=23. \)

\(\displaystyle MS_{\text{between}}=\frac{SS_{\text{between}}}{df_{\text{between}}} =\frac{597.3333}{2}=298.6667,\qquad MS_{\text{within}}=\frac{SS_{\text{within}}}{df_{\text{within}}} =\frac{126.0}{21}=6.0. \)


Test Statistic & p-value

\(\displaystyle F=\frac{MS_{\text{between}}}{MS_{\text{within}}} =\frac{298.6667}{6.0}=49.7778. \)

With \(df_1=2\), \(df_2=21\), the (right-tail) p-value is \(p\approx 1.07\times10^{-8}\) (i.e., \(p<0.00000002\)).
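With exactly two numerator degrees of freedom, the right-tail area of the F distribution has a simple closed form, \(p=(1+2F/df_2)^{-df_2/2}\). This is a standard property of the F(2, df2) distribution, offered here as a quick check rather than as the textbook's own method. The sketch starts from the rounded SS values above.

```python
# Mean squares from the rounded sums of squares in Step 3.
ms_between = 597.3333 / 2
ms_within = 126.0 / 21
f_value = ms_between / ms_within  # about 49.78

df2 = 21
# Closed-form right tail; valid only because df1 == 2.
p_value = (1 + 2 * f_value / df2) ** (-df2 / 2)

print(f"F = {f_value:.4f}, p = {p_value:.2e}")
```

The result is close to \(1.07\times10^{-8}\), matching the p-value quoted above. For other numerator degrees of freedom this shortcut does not apply and a table or software is needed.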

Figure 4: F distribution curve with right-tail decision region.


ANOVA Summary Table

| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between groups | 597.3333 | 2 | 298.6667 | 49.7778 | < 0.00000002 |
| Within (error) | 126.0000 | 21 | 6.0000 | | |
| Total | 723.3333 | 23 | | | |

Conclusion

There is a statistically significant difference among the three methods’ mean scores (\(F(2,21)=49.78,\; p\ll .001\)). A post-hoc comparison (e.g., Tukey HSD) would identify which pairs differ.

Assumptions (checklist)

  • Independent observations (via random assignment).
  • Approximately normal scores within each group.
  • Homogeneity of variance (here, each group variance \(\approx 6\)).

Appendix 8 — Glossary of Key Terms

Mean (average)
Sum of all scores divided by number of scores.
Example: (6 + 8 + 10) / 3 = 8.

Median
Middle score when data are ordered.
Example: For [5, 7, 8], median = 7.

Mode
Most frequent score.
Example: For [2, 3, 3, 5], mode = 3.

Variance (s²)
Average squared deviation from the mean. For a sample, divide the sum of squared deviations by n − 1.

Standard Deviation (s)
Square root of variance. Spread of scores around the mean.

Standard Error of the Mean (SEM)
How much sample means vary.
Formula: $$SEM = \frac{s}{\sqrt{n}}$$

t-test
Compares two means.

ANOVA (F-test)
Compares three or more means.

Post Hoc Test
Used after ANOVA to find which groups differ.

Correlation (r)
Strength and direction of a linear relationship. Range: –1 to +1.

Regression
Equation that predicts Y from X.
Example: $$\hat{Y} = a + bX$$

Chi-square (χ²)
Test for categorical data (counts).

Degrees of Freedom (df)
Independent pieces of information in a test.

p-value
Probability of getting the observed result (or more extreme) if the null hypothesis is true.
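Several of the definitions above can be tried out with Python's built-in `statistics` module. The sketch below reuses the small examples from this glossary; the SEM line simply applies the formula \(s/\sqrt{n}\) to the same three scores.

```python
import math
import statistics

scores = [6, 8, 10]

mean = statistics.mean(scores)          # (6 + 8 + 10) / 3 = 8
median = statistics.median([5, 7, 8])   # middle value = 7
mode = statistics.mode([2, 3, 3, 5])    # most frequent value = 3

variance = statistics.variance(scores)  # sample variance (n - 1) = 4
sd = statistics.stdev(scores)           # square root of the variance = 2.0
sem = sd / math.sqrt(len(scores))       # standard error of the mean

print(mean, median, mode, variance, sd, round(sem, 4))
```

Note that `statistics.variance` and `statistics.stdev` divide by n − 1, while `statistics.pvariance` and `statistics.pstdev` divide by n.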


📱 QR: Interactive glossary (search symbols, formulas, definitions)

Appendix 7 — Study Tips for Statistics

Learning statistics is not about memorizing formulas — it’s about thinking with data.
Here are some strategies to make it easier.


1. Read Formulas in Two Ways

  • Symbolic: $$\bar{X} = \frac{\Sigma X}{n}$$
  • Words: “Mean = sum of scores / number of scores”

2. Practice by Hand First

  • Work out a mean or variance with a small dataset.
  • Then check with a calculator or Excel.
  • This builds intuition and confidence.

3. Draw Pictures

  • Normal curve with shaded area
  • Bar charts for group means
  • Scatterplots for correlation
    Visuals make ideas stick.

4. Watch Out for Common Mistakes

  • Mixing up SD and SEM
  • Forgetting to subtract 1 for df
  • Using a one-tailed test when two-tailed is needed

5. Use Short Sessions

  • 10–15 minutes of practice each day beats one long cram.
  • Try one formula or test per session.

6. Check Your Understanding

  • Can you explain in words what the test does?
  • Example: “t-test compares two means. ANOVA compares three or more.”

📱 QR: Online flashcards + short quiz (practice key terms & formulas)


Appendix 6 — Data Sets for Practice

```html

Working with real numbers is the best way to learn statistics. This appendix provides small “mini datasets” you can analyze by hand (or with a calculator), plus larger files for practice with spreadsheets.


Dataset Provenance (Read This First)

  • Pedagogical = small, simplified numbers chosen to make learning and checking easier.
  • Simulated = computer-generated numbers designed to resemble real data (not collected from real people).
  • Empirical = collected from real observations (only used if explicitly stated).

Note: Unless a dataset is explicitly labeled Empirical, you should treat it as Pedagogical or Simulated practice data.


Mini Datasets (In-Page)

1) Quiz Scores

Provenance: Pedagogical
n: 10
Scale: Ratio (points)
Data: 6, 7, 8, 9, 10, 7, 8, 6, 9, 10

  • Suggested Lessons:
    • Lesson 2 — The Averages: mean, median, mode
    • Lesson 3 — Variance & Standard Deviation: variance, SD, z-scores
    • Lesson 4 — The Standard Normal Curve: interpret z-scores (as a bridge)
  • Check values (optional): Mean = 8.0; SD ≈ 1.49 (sample formula, dividing by n − 1; dividing by n gives ≈ 1.41)

2) Reaction Times (ms)

Provenance: Pedagogical (human-like values)
n: 8
Scale: Ratio (milliseconds)
Units: ms
Data: 220, 250, 270, 230, 260, 280, 240, 300

  • Suggested Lessons:
    • Lesson 3 — Variance & Standard Deviation: spread, outliers, SD
    • Lesson 6 — The t-test: use as a template dataset (e.g., compare two conditions by splitting into two groups)
    • Lesson 7 — ANOVA: extend to 3+ groups by creating conditions
  • Instructor tip: reaction time data often show mild skew in real life. If you want skew, see the larger practice files below.

3) Stress Reduction Scores (Three Groups)

Provenance: Pedagogical (grouped scores)
Scale: Interval/Ratio (score units; treat as interval for ANOVA practice)
Groups:

  • Meditation (n = 3): 65, 70, 72
  • Exercise (n = 3): 68, 71, 75
  • Music (n = 3): 75, 78, 82
  • Suggested Lessons:
    • Lesson 7 — ANOVA: one-way ANOVA (three independent groups)
    • Lesson 8 — Post Hoc Tests: follow-up comparisons after ANOVA (conceptual)
    • Lesson 13 — Degrees of Freedom Cookbook: df for one-way ANOVA
  • Important note: The sample sizes are intentionally small for learning mechanics. In real studies, groups are usually larger.

Larger Practice Datasets (Download Files)

These datasets are designed for spreadsheet work, graphing, and full problem sets.

  • Exam Scores (n = 100)
    Provenance: Simulated
    Suggested Lessons: Lesson 4 (normal curve), Lesson 5 (SEM), Lesson 6 (t-test foundations)
  • Survey Data (preferences by gender/age)
    Provenance: Simulated (categorical practice)
    Suggested Lessons: Lesson 12 (chi-square), Lesson 1 (why statistics matters in decisions)
  • Simulated Medical Trial (treatment vs. control, repeated measures)
    Provenance: Simulated (instructional “trial-style” dataset; not clinical research)
    Suggested Lessons: Lesson 6 (t-test concepts), Lesson 7 (variance partitioning concepts), and for advanced learners: repeated-measures ideas (optional)

Downloads: CSV and Excel files are provided via the QR code(s) on this page (and/or direct links, if enabled on your device).

Reproducibility note (simulated files): If you revise these datasets in future editions, consider generating them with a fixed random seed so instructors and students can reproduce results across versions.


Trusted External Sources (Optional)

If you want additional datasets beyond the practice files above, the following repositories are widely used for learning and benchmarking:

  • NIST Statistical Reference Datasets (SRD)
    High-quality benchmark datasets for practice and verification (excellent for checking calculations and software).
  • UCI Machine Learning Repository
    Larger, more complex datasets. Recommended only for advanced students or enrichment projects.

Visual Reference

Figure F.1 — Example spreadsheet view of a dataset (columns such as ID, Score, Group). Use this as a template for organizing your own data before running calculations.


```
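For the Quiz Scores mini dataset in this appendix, the check values depend on which SD formula is used. Dividing by n gives about 1.41, while the sample formula (dividing by n − 1, as in Lesson 3's variance definition) gives about 1.49. The sketch below shows both with Python's `statistics` module and then converts each score to a z-score using the sample SD; that choice is an assumption made for illustration.

```python
import statistics

# Quiz Scores mini dataset (n = 10).
quiz = [6, 7, 8, 9, 10, 7, 8, 6, 9, 10]

mean = statistics.mean(quiz)        # 8
sd_sample = statistics.stdev(quiz)  # divides by n - 1
sd_pop = statistics.pstdev(quiz)    # divides by n

# z-score: how many SDs a score lies above or below the mean.
z_scores = [(x - mean) / sd_sample for x in quiz]

print(mean, round(sd_sample, 2), round(sd_pop, 2))
print([round(z, 2) for z in z_scores])
```

A score of 8 maps to z = 0, and the z-scores always sum to zero, which is a handy check on hand calculations.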

Appendix 5 — Technology Tips (On Your Phone & Laptop)

Statistics can be done with calculators, spreadsheets, or software. Here’s a quick guide.


Excel / Google Sheets

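| Task | Formula | Example |
|---|---|---|
| Mean | =AVERAGE(A1:A10) | Mean of scores in A1–A10 |
| Standard Deviation | =STDEV.S(A1:A10) | Spread of scores (sample SD) |
| t-test | =T.TEST(A1:A10,B1:B10,2,2) | Compare two groups (two-tailed, equal variances); returns the p-value |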

R (RStudio or RStudio Cloud)

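| Task | Command | Example |
|---|---|---|
| Mean | mean(x) | mean(c(6,8,10)) = 8 |
| SD | sd(x) | sd(c(6,8,10)) = 2 |
| t-test | t.test(x,y) | Compare two groups (Welch test by default; add var.equal=TRUE for the equal-variance test) |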

Python (NumPy / SciPy / Pandas)

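| Task | Command | Example |
|---|---|---|
| Mean | np.mean(x) | np.mean([6,8,10]) = 8 |
| SD | np.std(x, ddof=1) | np.std([6,8,10],ddof=1) = 2 |
| t-test | stats.ttest_ind(x,y) | Compare two groups (equal variances by default; pass equal_var=False for Welch); needs `from scipy import stats` |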
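The t-test rows in the tables above do not all run the same test by default: Excel's type 2 and SciPy's `ttest_ind` pool the two variances, while R's `t.test` uses the Welch version. The sketch below computes both by hand with the standard library, using the Meditation and Music groups from the Stress Reduction mini dataset in Appendix 6. The function name `two_sample_t` is hypothetical.

```python
import math
import statistics

meditation = [65, 70, 72]
music = [75, 78, 82]


def two_sample_t(x, y):
    """Return (t_pooled, df_pooled, t_welch, df_welch) for two independent samples."""
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)

    # Equal-variance (pooled) version: one shared variance estimate.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    df_pooled = n1 + n2 - 2

    # Welch version: separate variances and adjusted degrees of freedom.
    se_sq = v1 / n1 + v2 / n2
    t_welch = (m1 - m2) / math.sqrt(se_sq)
    df_welch = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_pooled, df_pooled, t_welch, df_welch


t_pooled, df_pooled, t_welch, df_welch = two_sample_t(meditation, music)
print(round(t_pooled, 3), df_pooled, round(t_welch, 3), round(df_welch, 2))
```

With equal group sizes the two t values coincide and only the degrees of freedom differ slightly, so the tools give nearly the same p-value here. With unequal group sizes and unequal variances the results can diverge, which is worth remembering when comparing output across tools.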

iPhone Calculator

  • Rotate sideways → scientific mode
  • Use √ for square root
  • Parentheses matter: type numerator, then divide by denominator
  • Fine for small problems, but not for full datasets

Summary

  • For quick homework: iPhone calculator
  • For assignments: Excel / Google Sheets
  • For coding: Python (Colab) or R (RStudio Cloud)

📱 QR: Open sample data in Google Sheets (ready to practice mean, SD, t-test)


Visuals

Figure E.1 — Screenshots of the same mean calculation in Sheets, R, and Python side by side.
