Test Reliability
Introduction
Test-retest reliability
Test-retest reliability is calculated by giving the test twice with no intervention in the interval; this can be quite a conservative measure, depending on the length of the interval between the two testings. Most tests are considered to be “reliable” over a two-week interval. There is no rationale for this except an old saw that learning of isolated facts is lost after this period. The best interval to select for your use is the time between the test and when the grades are assigned.
As mentioned above, each item on a test is assumed to be a replication of any other one. If a student gets one correct, he should have a higher probability of getting the next one correct. All pairs of items should have moderate to high phi coefficients. In the old days (before computers, that is), researchers and teachers used to calculate the split-half coefficient, the correlation of one half of the test with the other. This would be like a test-retest correlation with a zero-length time interval. There are two problems with this: first, each half is only half as long as the whole test, and second, there are very many possible halves of a 30-item test (about 7.8 × 10⁷, or roughly 78 million). A statistician named Lee Cronbach showed that his coefficient alpha is actually the average of all possible split halves, so there is no point in calculating the split half anymore.
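The count of possible split halves can be checked directly. The sketch below assumes that a "half" means exactly half of the items and that swapping the two halves does not produce a new split; the function name `count_split_halves` is made up for this illustration.

```python
from math import comb


def count_split_halves(n_items: int) -> int:
    """Count the distinct ways to split an even number of items into two equal halves."""
    if n_items % 2 != 0:
        raise ValueError("n_items must be even to form two equal halves")
    # Choose the items for one half, then divide by 2 because
    # swapping the two halves gives the same split.
    return comb(n_items, n_items // 2) // 2


halves_30 = count_split_halves(30)
print(halves_30)  # 77558760, about 7.8 x 10^7
```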
Most university testing centers can calculate Cronbach's alpha for you, but if you would like to attempt it on your own, enter the person-by-item data (zeroes and ones) into a spreadsheet, calculate the variance of the total test scores (as in any statistics book), and calculate the item variances (the variance of each column, which for a zero-one item is pq, where p is the item difficulty and q = 1 - p). Then the coefficient is simply

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_T^2}\right)$$

where k is the number of items, $s_i^2$ is the variance for item i, and $s_T^2$ is the total test variance.
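The spreadsheet calculation described above can also be sketched in a few lines of Python. This is only an illustration: the function name `cronbach_alpha` and the five-person, four-item data set are invented for the example, and population variances are used throughout so that each item variance equals pq.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for person-by-item data.

    scores: one row per person, each row a list of 0/1 item scores.
    Raises ZeroDivisionError if every person has the same total score.
    """
    k = len(scores[0])
    # Variance of each item (column); for 0/1 data this equals p * q.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total test scores (row sums).
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 students (rows) by 4 items (columns).
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(example_scores)
print(round(alpha, 3))  # 0.8
```

For this made-up data the item variances sum to 0.8 and the total test variance is 2, so alpha is (4/3)(1 - 0.8/2) = 0.8.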
This index is considered to be a measure of the “internal consistency” of the test, a measure of the cohesiveness of the set of items. It can be interpreted as a reliability coefficient because it is the average of all possible split halves and each half is considered a form of the test. So, in a manner of speaking, we also have an estimate of how well this test will correlate with another test constructed in the same way from the same domain.
If we have judges making decisions about a group of people, we have to
be concerned with inter-judge reliability, the extent to which the judges
agree with each other. With two judges and several examinees, we can calculate
the correlation between the two judges. If we have two judges who are
categorizing a number of people, we can study how often they are in agreement
or summarize this into a statistic by calculating Cohen’s kappa
coefficient.
The first coefficients that were used to measure rater agreement were,
essentially, summations of how much two responders agreed with each other.
And you’d think that indices ranging from zero percent agreement
to one-hundred percent would have been satisfactory. Jacob Cohen recognized
that you could get good agreement by chance alone (if we both flip a
coin 100 times, we’ll agree about fifty percent of the time!) and developed
a coefficient, κ (Greek kappa), which corrects for this. The formula, with
values running from 0.0 (agreement no better than chance) to 1.0 (perfect
agreement), and falling below zero only when the raters agree less often
than chance would predict, is

$$\kappa = \frac{f_o - f_e}{n - f_e}$$

where $f_o$ is the number of times the two raters agree, $f_e$ is the number
of agreements you’d expect by chance, and n is the number of times
each responded. You count to get the number of agreements, but you have
to calculate the number expected by chance.
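As a rough illustration of the correction, the kappa formula can be written as a small function and tried on the coin-flipping case. The function name `cohen_kappa` and the argument names `f_o`, `f_e`, and `n` are labels chosen for this sketch, standing for the observed agreements, the agreements expected by chance, and the number of ratings.

```python
def cohen_kappa(f_o: float, f_e: float, n: float) -> float:
    """Kappa from observed agreements, chance-expected agreements, and number of ratings."""
    if n == f_e:
        raise ValueError("kappa is undefined when chance agreement equals n")
    return (f_o - f_e) / (n - f_e)


# Two people flipping coins 100 times: about 50 agreements observed, 50 expected by chance.
coin_kappa = cohen_kappa(50, 50, 100)
# Perfect agreement on 100 ratings when 50 agreements are expected by chance.
perfect_kappa = cohen_kappa(100, 50, 100)
print(coin_kappa, perfect_kappa)  # 0.0 1.0
```

Fifty percent raw agreement looks respectable, yet kappa is zero because all of it is what chance alone would produce.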
Suppose we have two instructors, A and B, evaluating 100 students on a
three-point scale (1 = poor, 2 = ok, 3 = good). The expected agreement
between them is

$$f_e = \frac{a_1 b_1}{n} + \frac{a_2 b_2}{n} + \frac{a_3 b_3}{n}$$

where $a_1$ is the number of students to whom rater A gave a 1 and
$b_1$ is the number of students rated 1 by instructor B (and likewise for
ratings 2 and 3). In this case n = 100.
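The expected agreement can be computed from the two raters' totals alone. The sketch below uses the rating totals from this page's Figure 8 example (A gave 10, 70, and 20 students a 1, 2, and 3; B gave 20, 50, and 30); the function name `expected_agreement` is an invented label.

```python
def expected_agreement(a_totals, b_totals):
    """Number of agreements expected by chance, given each rater's totals per category."""
    n = sum(a_totals)
    if n != sum(b_totals):
        raise ValueError("both raters must rate the same number of students")
    # For each category, multiply the two raters' totals and divide by n, then add up.
    return sum(a * b for a, b in zip(a_totals, b_totals)) / n


# Totals for ratings 1, 2, 3 from the Figure 8 example.
f_e = expected_agreement([10, 70, 20], [20, 50, 30])
print(f_e)  # 43.0
```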
Working with the data in Figure 8, we note first that instructor A rated
20 students “good” while B rated 30 “good,” and
that, out of 100 students, they agreed on 75 (10 + 50 + 15). That might
seem high until we calculate the chance agreement. There are three values we
are interested in: $a_1 b_1 / n$, $a_2 b_2 / n$, and $a_3 b_3 / n$.
These are:

$$\frac{10 \times 20}{100} = 2, \qquad \frac{70 \times 50}{100} = 35, \qquad \frac{20 \times 30}{100} = 6.$$

If the raters keep the same degree of strictness (rating the same
ratios of 1’s, 2’s, and 3’s), then we would expect them
to agree on 2 + 35 + 6 or 43 students by chance alone. Our measure of
agreement, kappa, becomes

$$\kappa = \frac{75 - 43}{100 - 43} = \frac{32}{57} \approx 0.56$$
Figure 8. Ratings of 100 students by Instructor A (rows) and Instructor B (columns)

| Instructor A \ Instructor B | Poor (1) | OK (2) | Good (3) | Totals |
|---|---|---|---|---|
| Poor (1) | 10 | 0 | 0 | 10 |
| OK (2) | 5 | 50 | 15 | 70 |
| Good (3) | 5 | 0 | 15 | 20 |
| Totals | 20 | 50 | 30 | 100 |
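The whole kappa calculation for the table above can be reproduced in a short script. This is a minimal sketch: the function name `kappa_from_table` is invented here, and the table is entered with Instructor A's ratings as rows and Instructor B's as columns.

```python
def kappa_from_table(table):
    """Return (observed agreements, chance-expected agreements, kappa) for a square rating table."""
    n = sum(sum(row) for row in table)
    # Observed agreements lie on the diagonal, where both raters gave the same rating.
    f_o = sum(table[i][i] for i in range(len(table)))
    row_totals = [sum(row) for row in table]        # Instructor A's totals
    col_totals = [sum(col) for col in zip(*table)]  # Instructor B's totals
    f_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n
    return f_o, f_e, (f_o - f_e) / (n - f_e)


# Rows: Instructor A gave Poor, OK, Good; columns: Instructor B gave Poor, OK, Good.
figure_8 = [
    [10, 0, 0],
    [5, 50, 15],
    [5, 0, 15],
]
f_o, f_e, kappa = kappa_from_table(figure_8)
print(f_o, f_e, round(kappa, 2))  # 75 43.0 0.56
```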
Improving validity and reliability has implications for grading. Learn How to Improve Test Reliability and Validity
| © CET, SFSU 2003 |