Story 13 — The t-test | High School Statistics Textbook

Title	Link
My science and philosophy books	Open
My theology books	Open
My books on Classics	Open
My literary work	Open

William Sealy Gosset

(June 13, 1876–October 16, 1937)

Wikipedia Aug, 2014 PD

William Sealy Gosset published

the t-test under the pen name

Student. We refer to the

distribution of this test as the

t-distribution. The t-test is a test of

inference, i.e., it allows us to infer

on the basis of our data, whether

the difference between two means

is reliable or, as we say, significant.

In what follows, we will first try to

develop the concepts needed for

understanding the logic and the

operations involved in the t-test.

We will talk about this and that and

the other. Be patient. Then we will

go over an example of the t-test.

Developing the concepts in the

t-test

As is the case with all parametric

tests that we will cover in this

book, the t-test analysis is based

on variance.

In experiments in which we have

two groups we analyze our data by

using the t-test. There are two

types of t-tests. The t-test for

independent samples (groups),

and the other for dependent or

paired samples. Here we will

consider independent samples.

What are independent samples?

you say.

Ok. We will make a small

parenthesis in order to develop the

concept of independence.

Drama

Apple-pie IQ

A psychologist has a sneaky suspicion that

the type of apple pie has an effect on

intelligence. She randomly selected 20

students and randomly assigned them to

two groups. Group 1, golden delicious

apple pie, Group 2, red delicious apple pie.

John Gluck was assigned in Group 1, and

his friend Paul Crust was assigned to

Group 2. The psychologist proceeded with

giving these subjects a pound of apple pie

to eat. Subsequently she tested their

intelligence. Each subject was allowed to

see their intelligence score. There were 20

intelligence scores, one for each subject.

These groups are independent as you see.

She analyzed her data by using a t-test for

independent groups in order to see if there

was a significant difference in intelligence

in the two groups.

Note: An unexpected event occurred during the running of the experiment. One of the subjects in Group 2 did not show up on time so the experiment was delayed for a few minutes. John Gluck, who, as you remember, was in Group 1, offered to participate in Group 2, in addition to his participation in Group 1. The experimenter did not allow this. Had she allowed John Gluck to be a subject in both groups, she would have violated the rule of independence, and she would not have been able to analyze her data by using the t-test for independent samples.

The formula for the t-test is:

We read this as follows:

T equals mean 1 minus mean 2

divided by the square root of the

variance of group 1 and group 2

divided by the number of scores

that went into the calculation of the

variance.

Faithful to our goal we must

understand the concepts in the

t-test.

First look at the numerator.

Mean 1 minus mean 2, that is the

difference of the two means.

Next look at the denominator.

The square root of the variance

is the standard deviation.

This looks like the z formula that

we considered above. Here it is again:

Yes, you say, but the numerator of

the z formula is score minus the

mean. The numerator of the t-test

formula is mean 1 minus mean 2.

Where is the mean? They are not

the same, you say.

They are the same, I say. Mean 1

minus mean 2 is the difference

between the two means. Gosset

treats this difference as a score.

Yes, you say, but then where is

the mean in the t-formula?

The mean is there! I say.

It is there but you do not see it. It

is 0. The mean is zero.

The mean is zero!

Let the drums thunder at this

point. Let the bugles sound in the

four corners of the world!

The normal (t-) distribution with 0

in the middle. In other words, a

curve with a mean of 0.

A most important point in the history of

statistics.

Let us call this curve the

curve of no difference.

I do not understand the t formula

at the gut level, you say.

Watch the ritual dance with the

t-formula.

Drama

An archetypal ceremony II

I have a difference between

the two means. I hold this

difference up, wave it in the

air, I baptize it score. Then I

wear my glasses and stick my

nose on the t curve, running

up and down the line with

standard deviations on it, and

mumble: where does this

score fall? Where does this

score fall?

I then use the z formula -

oops, the t-formula - and

find where exactly our score -

oops, our difference - falls.

This is an archetypal ceremony.

Remember? Amazing, isn’t it? The

t-formula is actually the z formula!

I promised you that you do not

need the mind-boggling array of

fear inspiring formulas. Hang on.

Here is more of the story of the t

distribution. The motive in

Gosset’s mind was to modify the

normal curve so that it could safely

be used for small samples. He

decided to make it difficult for

researchers to find significance

(i.e., to decide whether the

difference between our two means

is reliable) when the samp

les are

small. The curve he created is a

normal curve with some intriguing

qualities. The noses, or tails of the

curve lift up as the sample size

decreases. The tails of the curve

lift up. You can see that in the graph that follows.

The shorter (lower) curve

corresponds to the smaller sample.

T-CURVES

What an ingenious idea!

Some of you, few, very few,

already see what this means. It

means that it becomes more

difficult for you to find significance

because the percentage of the

curve increases in the tails so that

the magic standard deviation of

1.96 becomes larger, which in turn

makes it difficult for you to find

significance, that is to have a

“real” effect, a “true” finding so that

you can publish your experiment.

Let’s finish this story of the t-test,

a statistical test that is used so

frequently in labs across the world

daily.

So far, we have said that we have

two groups, therefore two means,

two standard deviations. We are

interested in deciding whether the

difference between the two means

is reliable, i.e., that it is not a

chance event, that it is, in a sense,

real. We find the t, which is

actually a z, that is, it tells us

where on the curve this difference

falls.

Ok, you say. So, we find the t

which is like z, then what?

If our sample is large the t

distribution is identical to the

normal distribution. In the normal

distribution, we have seen that z of

1.96 is an important mark.

Between -1.96 and +1.96 95% of

the curve falls. A score (here a

difference) that falls within these

two marking points, has a

probability of 95% to occur by

chance in a known situation in

which there is no difference, that is

in a situation in which the two

samples were drawn from the

same population. That is what our

curve of no difference graphs.

If our sample is small? you ask.

While in the normal distribution

1.96 always marks the 95% of the

curve, in the t distribution it is so

only if the sample size is very

large. With smaller samples, 1.96

increases inversely proportional to

sample size. The smaller the

sample size, the greater the

increase in 1.96.

Remember, again, Gosset’ s goal

was to make it difficult for

researchers to find significance

with small samples. He calculated

these new values of 1.96.

Drama

Mercy Mr. Gosset

Mr. Gosset, good morning. This is

Samir calling from India. I ran an

experiment with 20 subjects. How

much should I increase 1.96?

Half asleep, Gosset takes out his notes

and read the value to Samir.

The new value for z 1.96, the t, is

2.101.

Good night Mr. Samir.

Two hours later the phone rings again.

Good evening, Mr. Gosset. I am

Michel Duzed, calling from Montreal

Canada. Please give me the value of z

for an experiment with 72 subjects.

Gosset looks at his notes and says:

It is 1.994

Goodnight.

Going back to sleep is difficult for Mr. Gosset. One hour later the phone

rings again, this time from Japan.

Good day Mr. Gosset. Please give me

the new z for an experiment with 402

subjects.

1.966, Gosset says, and he hangs up.

I got to do something about this, he

says, aside from pulling the phone

from the plug. I got to do something…

I will never be able to sleep.

He did. He published his notes

with the recalculated z of 1.96.

Have you seen this in any

statistics book? All of my students

say no, actually a lot of my

colleagues say no, too. I will

disclose the secret. It is the

famous t-table found at the end of

every statistics book, including the

one you are reading now (see

Appendix). Promise to keep our secret between us.

Climax in the drama.

Samir of India wrote down the t

value:

2.101

With hands trembling he picked up

the sheet with the data analysis of

his thesis to find the result of the

t-test. He read it out aloud:

t=2.24

I made it! I made it!

he chants as he dances around in

his room. Samir will get his

Masters. The difference between

the two means of his experiment is

reliable.

He will report his finding as

significant. In his thesis he will

write:

This difference is significant

(p<0.05).

What does this notation mean?

It means that the chance that his

finding is not reliable, i.e., that it is a

chance event, is less than 5 per

cent.Scientists have agreed to accept

findings as being reliable if the p

value is less than 0.05.

Scientists have agreed to accept

findings as being reliable,

if the p value is less than 0.05.

Remember this!

Back in Canada Michel Duzed, holding the

note with the t value that Mr. Gosset just

gave her (it was 1.994, remember?),

compares it with the result of the t-test

value that she got by analyzing the data of

her experiment, which was t=1.982.

Alas! The t she computed by using the t

formula is smaller than the t Mr. Gosset

gave her.

Her eyes open wide, her face get gloomy,

and she collapses on an armchair. Michel

will not get her Doctorate. Her finding is

not significant. She will not write her

dissertation. If she were to write it, she

would report it as follows.

This finding is not significant (p>0.05).

P greater than 0.05.

This means that her difference could have

been a chance event more than 5 times in

a hundred.

Michel quits graduate school, she marries

her sweetheart and moves out of Quebec

to a job as a business consultant.

What happened to the Japanese

guy?

Standard deviation 1.96

Percent of curve 95%

Standard deviation 2.56

Percent of curve 99%

The curve of no difference, as we

said, has zero in the middle, that is

the mean is zero.

What does this curve graph?

Remember the curve in the case

of the woolen caps for the farmers

of the State of Wisconsin?

That curve was a curve

that we would be getting if we

were allowed to collect many

samples, figure out the mean of

each, and graph these means. In

effect this curve was empty.

But remember we knew

something about it. We knew the

estimate of the standard deviation

(standard error of the mean). We

calculated it from the standard

deviation of the one and only

sample mean we had.

In our present case, the t-curve is

a similar curve. It would be

graphing differences between two

means, if were we allowed to run

our experiment of two groups

many times, each time having two

means, calculating the mean of

each group and then calculating

the difference of the two means.

Because we have the standard

deviation of this curve, we can

consider our difference as a score

and engage in the ritual dance of

where my score falls. What point

on the standard deviation line

does this score (difference) lie.

We used the z formula to tell us

where on the curve our difference

lies. We said above that the

t-formula is actually a z formula, a

modified one, to take into account

the sample size. Let’s look at them

again.

z equals score minus the mean

divided by the standard deviation.

t equals score (mean 1 minus

mean 2 gives us the difference

which we consider to be a score)

divided by the standard deviation

(the standard deviation of both

groups; remember that the square

root of variance is the standard

deviation).

You see then that the t is a z.

Why is the variance of each group

divided by the number of subjects

in the group?

Remember, Gosset’s goal was to

make it difficult to researchers to

find significance if their sample

size was small.

Understanding the logic of the

t-curve, the curve of no

difference.

Drama

Me minus me equals 1

Professor Lilly Prydum, a statistician,

decided to run a simple experiment to test

a model for guessing if two means came

from one specific group, or they came

from two different groups.

Quite convoluted, you say.

She invited Memy Tallibum, Ph.D. in

Education, to be part of the experiment

and try to guess if the difference of two

means came from the same group, or not.

The subjects were 30 students from

Trenton College. They were asked to take

a test of 120 questions on Chinese

geography, mythology, and culture. The

scoring of the test was done by a

computer, and the recording of the data

was done in such a way that the subjects

remained anonymous. The experiment

lasted 35 days. Subjects were asked not

to read about China during this time

period. They were also asked not to chat

amongst themselves and not to compare

notes.

Day 1 of the experiment. Time 9:00 in the

morning. The test is given. Time allotted 1

hour. 10:00 a.m. the testing period is

over. Students are asked to take a 30

minute break. During that time the exam

papers are graded, and the score is

recorded next to each subject. Dr.

Tallibum gets a copy of the results and

studies them with care.

At 10:30 the students are called back in

again, and are given the same test, the

one they took 30 minutes earlier. At

11:30 the testing session is over. The

subjects are thanked, and asked to be

back the next day. “Please be back here

at 9:00 in the morning sharp. And

remember, no reading on China”.

The students leave, their exam papers are

graded, and the score of each student is

recorded. Each subject now has two

scores. The score from session 1 and the

score from session 2. Dr. Tallibum gets a

copy and scrutinizes it. As expected the

difference between the first session and

the second is very small, and for most

students it is zero.

The next day all students are present, Dr.

Tallibum is present, and the experiment

proceeds as planned.

Session 1, the students are given the

same test they took the previous day. Session

2, the students are given the same test.

Scores are recorded and are handed to Dr.

Tallibum who watches like a hawk. At the

end of the second day Dr. Tallibum is not

surprised to see that, once again, the

difference between the first and the

second session is very small, close to zero.

Students are reminded that the

experiment will last 35 days, and asked to

“come back tomorrow “.

On day 16 of the experiment Dr. Tallibum

is surprised to see that the difference

between the first and the second session

is substantial. Students did not score as

well in the second session. The

explanation came in the evening as she

was watching the news. Pop star Mickey

Wiggletuchs had died. Students must have

heard about this during the break between

the two sessions and were emotionally

disturbed. That influenced their

performance in the second session.

On day 33 Dr. Tallibum was shocked to

see that the difference between session 1

and session 2 was so great. Scores in the

second session were so low! Again the

explanation came in the evening news.

The stock market had crashed. Apparently

the students heard of this disaster during

the break between the two sessions. Most

likely parents called and told their

children that there would be no money to

pay the college fees.

Day 34, the day before the last day

of the experiment, rolls smoothly.

At the end of the second session,

Dr. Tallibum is asked to leave the

exam room.

Students are given the same

instructions as usual. They are told

to be present as usual, for the last

day of the experiment.

Dr. Tallibum is left in the dark, she

does not know what the students

were told.

Day 35, the last day of the experiment.

Session 1. Students are taking the test

as in the past and Dr. Tallibum is

present, as in the past. During the

break the exam papers are graded and

scores handed to Dr. Tallibum.

At this point, there is a dramatic

change in the procedure. Dr. Tallibum

is asked to leave the exam room. She

is led to the elevator and taken to the

biology laboratories on the top floor,

where there are no windows. She is

asked not to leave this lab, not to

make any phone calls, and wait there

for an hour. To make sure, she is

guarded by the graduate students of

the biology lab.

At 11:30 the second session of the

experiment is over. The students are

told that the experiment came to an

end, that there was no need for them

to come back again, and were given a

brief talk as to the purpose of the

experiment.

The papers were graded, the scores

were recorded and Professor Lilly

Prydum, the experimenter, took the

elevator to the Biology Lab and

handed the results to Dr. Tallibum.

Dr. Tallibum, she said. In the second

session we prevented you from seeing

who were the subjects. As you realize

we may have used the same subjects,

the students from Trenton College, or

we may have used another group. We

could have used workers from the

Physical Plant of the College, students

from Trenton High, or senior citizens

from the local senior citizens club.

Your task is to guess whether the

group that took the test during today’s

second session of the experiment were

the Trenton College students that you

watched for 35 days, or another,

different group.

Dr. Tallibum looked at the data sheet. It

contained the difference between session

1 and session 2 for 35 days. The difference

score of day 35 was the score in question.

Instinctively, her eyes ran up and down

the list of differences. She was trying to

find if the score of day 35 has occurred in

the past 34 days. Not only that. If it has

occurred, how often has it occurred. If she

finds that it has occurred many times, she

will conclude that the students in the

second session of today were the same

students, that is the Trenton College

students.

If the difference score of today is not on

the list, if it has never occurred in the last

34 days, then the wisest guess would be

that in the second session of today it was

not the Trenton College students that took

the test.

It will be more difficult to guess if the

difference of day 35 has occurred a few

times, very few times. That's where Dr.

Tallibum will resort to her knowledge of

statistics. Guess what. She will draw

Goddess Normal Curve and pray. Let’s

listen to her reasoning out loud:

Dr. Tallibum’s soliloquy

The solution to my problem must be in the

normal curve, as always. What do I have

here. I have a difference between two

means but I do not know if the two means

came from the same group, or from two

different groups. I want to guess wisely. I

will start by placing 0 in the middle of the

normal curve. This is the curve of no

difference. I assume that this curve

graphs differences between means that

come from a known case, a case in which

all means come from the same group, the

same people. If the difference that I am

not sure about is 0 or close to 0, I can

safely conclude that this difference comes

from the same group of people, i.e., that in

both sessions of day 35 it was the Trenton

College students that took the test. If the

difference in question is large …ay, there's

the rub. Whether it safer to say same or

not say, and suffer the slings and arrows

of rough ridicule, or to shut up and by

going against my promise to participate,

my participation end. But behold, there is

light. I can compute the standard

deviation of this no-difference curve and

reason. I can engage in the archetypal

dance.

Behold Dr. Tallibum dancing around the

biology lab to the amazement of the bio

graduate students.

I have a difference between the two

means. I hold this difference up, wave it

in the air, I baptize it score. Then I wear

my glasses and stick my nose on the

normal curve, running up and down the

line with standard deviations on it, and

mumble: where does this score fall?

Where does this score fall?

I then use the z formula and find where

exactly my score - oops, our difference -

falls. If it falls within -1.96 and +1.96 then this score occurs frequently, 95% of the time, in this case of known no difference. If this score, difference, falls beyond 1.96 then it is a rare occurrence.

In this case the wisest guess is that it

does not come from the same group, that

is, in the second session it was not the

Trenton students but another different

group of people.

I hear a question. Speak up.

Can she find out what was this

new group, you say.

Yes, the Oracle of Delphi should

be able to tell you that.

Drama

The long jump

The prototypic experimenter is standing in

the middle of the normal curve holding his

score or difference in his hand. He is

focusing on 1.96 and is about to jump

beyond it. If he succeeds in jumping over

1.96, he gets a medallion (his degree

perhaps); if he does not succeed, he lands

on his behind, and gets a kick on the

same.

Talking seriously now. All statistical tests

are based on the same logic we developed

above. So, I advise you to read my

theatrical masterpieces carefully if you

want to get the concepts, and be able to

reason in statistics, and solve the

problems with only 5 formulas.

What are these 5 formulas?

The five formulas we need to know:

currently editing the formulas