
Test Reliability

This Page Includes
 Introduction
 Internal vs. External
 Time Considerations
 Objectivity Considerations
 Indices
 Inter-Judge Reliability

 

Introduction

Test reliability is the consistency of measurement that you get from an instrument. For most applications, this consistency is inversely related to the amount of random error in the process. Systematic error, say between an easy form of a test and a more difficult one, or between one evaluator and a more liberal one, can be studied and eliminated, but usually only through a more advanced analysis called generalizability theory. For our purposes, error in a test means random, non-reproducible error. You should be aware that this kind of error is found in all observations (as in witnesses to crimes or measuring the length of a wall hanging) but is more easily studied and quantified in the multiple-question, objective test. The reason is that there is more than one observation (each item is considered a measurement, however minute), and the error can be estimated from differences in the results. If we had several judges in a contest, we could study the inconsistencies, that is, the error, among them. Error can thus be detected whenever we have several measurements of the same thing, measurements which, by all rights, should not differ. This excludes, then, differences between scores at the beginning of a course and those at the end, since those are expected to differ.


Internal vs. External

Error can enter the system from within or from outside influences. Internal factors may include an examinee’s level of motivation, interest, or attention, while outside sources may include distractions in the room or evaluator judgments.


Time Considerations

If we observe a student’s achievement at one time and then again at a later date using the same technique, and assume no influential intervention such as further study or additional instruction, we should get the same score for him. Please notice the assumptions under which we are working. They are not too artificial, since we hope for some degree of consistency in our world and the people who inhabit it. Without the assumption that today’s information will be repeatable tomorrow, there would be no purpose in recording anything or even trying to remember it! In practice, it is always a good idea to estimate how long you believe the phenomenon will last in its present state.


Objectivity Considerations

Whenever you write a test question, you are communicating information. The question and the others like it which make up your test are really quick, efficient substitutes for an extended oral test or interview you would have with the student to determine where he is on the learning curve or the observations you would make as he apprentices with you. The way you phrase the question or how it might be erroneously interpreted might cause random error in the examinee as he reads the item. If you are scoring an essay test, even using the best thought-out rubric, you might be biased against a particular student, more or less willing to give the student the benefit of the doubt, or influenced by poor handwriting or poor spelling (true story from the literature).


Indices

If we can assume that there are no meaningful differences across examinees in how much error there is in a test, or if we are comfortable referring to the “average error” across examinees, then we can derive coefficients that summarize the error for a test. There is a variety of these, each one useful for a specific situation. All of them are scaled to range from 0 to 1.

Test-retest reliability

Test-retest reliability is calculated by giving the test twice with no intervention in the interval and correlating the two sets of scores; this can be quite a conservative measure, depending on the length of the interval between the two testings. Most tests are considered to be “reliable” over a two-week interval. There is no rationale for this except an old saw that learning of isolated facts is lost after this time period. The best interval to select for your use is the time between the test and when the grades are assigned.
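As a rough illustration of the test-retest idea, the sketch below correlates two sets of scores from the same students. It assumes the coefficient is an ordinary Pearson correlation; the function name `pearson_r` and the six pairs of scores are made up for the example and do not come from any real class.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six students on two administrations of one test
first_testing = [12, 15, 9, 20, 17, 11]
second_testing = [13, 14, 10, 19, 18, 10]

test_retest_r = pearson_r(first_testing, second_testing)
print(round(test_retest_r, 3))
```

A value near 1 says the students kept roughly the same standing from one testing to the next; a value near 0 says the scores are mostly random error.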

As mentioned above, each item on a test is assumed to be a replication of any other one. If a student gets one correct, he should have a higher probability of getting the next one correct. All pairs of items should have moderate to high phi coefficients. In the old days (before computers, that is), researchers and teachers used to calculate the split-half coefficient (the correlation of one half of the test with the other). This would be like a test-retest correlation with a zero-length time interval. There are two problems with this: one, each half is only half as long as the whole test, and two, there are very many ways to split a 30-item test into halves (about 7.8 × 10^7). A statistician named Lee Cronbach showed that his coefficient alpha is actually the average of all possible split halves, so there is no point in calculating the split half anymore.
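To get a feel for the two split-half problems, here is a small sketch. The first function counts the distinct ways of dividing a test into two equal halves. The second applies the Spearman-Brown step-up, a standard correction for the "half as long" problem that this page does not itself describe, so treat it as an added assumption. Both function names are invented for the example.

```python
from math import comb


def count_split_halves(k):
    """Number of distinct ways to split k items (k even) into two equal halves."""
    if k % 2 != 0:
        raise ValueError("k must be even")
    # comb(k, k // 2) chooses the items for one half; dividing by 2
    # avoids counting the same split twice (half A/B versus half B/A).
    return comb(k, k // 2) // 2


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


n_splits = count_split_halves(30)
print(n_splits)  # 77558760, roughly 7.8 x 10^7

# A hypothetical half-test correlation of 0.60 steps up to 0.75
print(round(spearman_brown(0.60), 2))
```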

Most university testing centers can calculate Cronbach's alpha for you, but if you would like to attempt it on your own, enter the person-by-item data (zeroes and ones) into a spreadsheet, calculate the total test variance (as in any statistics book), and calculate the item variances (probably one per column; for zero/one items each variance is pq, where p is the item difficulty and q = 1 - p). Then the coefficient is simply

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s^2}\right)$$

where $k$ is the number of items, $s_i^2$ is the variance for item $i$, and $s^2$ is the total test variance.
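If you would rather script the calculation than use a spreadsheet, the following is a minimal sketch, assuming the usual form of coefficient alpha: k/(k - 1) times one minus the ratio of summed item variances to total test variance. The function names and the 5-person by 4-item data set are hypothetical. The variances divide by N (not N - 1), so that a zero/one item has variance pq; whichever divisor you choose, use the same one for the items and the total.

```python
def variance(values):
    """Population variance (divide by N), so a 0/1 item has variance p * q."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def cronbach_alpha(scores):
    """scores: one row per person, one 0/1 entry per item in each row."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (each column of the person-by-item table)
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total scores (each person's row sum)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees by 4 items, 1 = correct, 0 = incorrect
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(data)
print(round(alpha, 2))  # 0.8
```

Here the item variances are 0.16, 0.24, 0.24, and 0.16 (summing to 0.80) and the total variance is 2, so alpha = (4/3)(1 - 0.80/2) = 0.80.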

This index is considered to be a measure of the “internal consistency” of the test, a measure of the cohesiveness of the set of items. It can be interpreted as a reliability coefficient because it is the average of all possible split halves and each half is considered a form of the test. So, in a manner of speaking, we also have an estimate of how well this test will correlate with another from the same domain in the same way.


Inter-Judge Reliability

If we have judges making decisions about a group of people, we have to be concerned with inter-judge reliability, the extent to which the judges agree with each other. With two judges and several examinees, we can calculate the correlation between the two judges. If we have two judges who are categorizing a number of people, we can study how often they are in agreement or summarize this into a statistic by calculating Cohen’s kappa coefficient.

The first coefficients used to measure rater agreement were, essentially, summations of how much two responders agreed with each other, and you’d think that indices ranging from zero percent agreement to one hundred percent would have been satisfactory. Jacob Cohen recognized that you could get good agreement by chance alone (if we both flip a coin 100 times, we’ll agree about fifty percent of the time!) and developed a coefficient, κ (Greek kappa), which corrects for this. The formula, with values running from 0.0 (agreement no better than chance) to 1.0 (perfect agreement), and dipping below zero only when agreement is worse than chance, is

$$\kappa = \frac{f_o - f_e}{N - f_e}$$

where $f_o$ is the number of times the two raters agree, $f_e$ is the number of agreements you’d expect by chance, and $N$ is the number of cases each rater judged. You count to get the number of agreements, but you have to calculate the number expected by chance.

Suppose we have two instructors, A and B, evaluating 100 students on a three-point scale (1 = poor, 2 = ok, 3 = good). The expected agreement between them on category 1 is

$$f_{e1} = \frac{f_{A1} \, f_{B1}}{N}$$

where $f_{A1}$ is the number of students to whom rater A gave a 1 and $f_{B1}$ is the number of students rated 1 by instructor B. In this case $N = 100$.

Working with the data in Figure 8, we note first that instructor A rated 20 students “good” while B rated 30 “good,” and that, out of 100 students, they agreed on 75 (10 + 50 + 15). That might seem high until we calculate the chance data. There are three values we are interested in: $f_{e1}$, $f_{e2}$, and $f_{e3}$. These are:

$$f_{e1} = \frac{10 \times 20}{100}, \qquad f_{e2} = \frac{70 \times 50}{100}, \qquad f_{e3} = \frac{20 \times 30}{100}$$

or $f_{e1} = 2$, $f_{e2} = 35$, and $f_{e3} = 6$. If the raters keep the same degree of strictness (rating the same ratios of 1’s, 2’s, and 3’s), then we would expect them to agree on 2 + 35 + 6 or 43 students by chance alone. Our measure of agreement, kappa, becomes

$$\kappa = \frac{75 - 43}{100 - 43} = \frac{32}{57} \approx 0.56$$

Figure 8: Comparison of Student Scores given by Instructors A and B
(rows are Instructor A's ratings, columns are Instructor B's ratings)

Instructor A | B: Poor (1) | B: OK (2) | B: Good (3) | Totals
Poor (1)     | 10          | 0         | 0           | 10
OK (2)       | 5           | 50        | 15          | 70
Good (3)     | 5           | 0         | 15          | 20
Totals       | 20          | 50        | 30          | 100
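The kappa arithmetic for Figure 8 can be checked with a short script. This is only a sketch: the function name `cohen_kappa` and the list-of-rows layout (rows for Instructor A, columns for Instructor B) are choices made for this example, and the formula assumed is kappa = (f_o - f_e) / (N - f_e), with each chance count taken as row total times column total over N.

```python
def cohen_kappa(table):
    """table[i][j] = number of cases rater A put in category i and rater B in j.

    Returns (observed agreements, chance-expected agreements, kappa).
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]  # rater A's totals
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]  # rater B's
    # Observed agreements lie on the diagonal of the table
    f_o = sum(table[i][i] for i in range(size))
    # Chance agreements: (A's total * B's total) / N for each category
    f_e = sum(row_totals[i] * col_totals[i] / n for i in range(size))
    return f_o, f_e, (f_o - f_e) / (n - f_e)


# Counts from Figure 8 (rows: Instructor A, columns: Instructor B)
figure_8 = [
    [10, 0, 0],   # A rated Poor (1)
    [5, 50, 15],  # A rated OK (2)
    [5, 0, 15],   # A rated Good (3)
]

f_o, f_e, kappa = cohen_kappa(figure_8)
print(f_o, f_e, round(kappa, 2))  # 75 43.0 0.56
```

The script reproduces the hand calculation: 75 observed agreements, 43 expected by chance, and kappa of about 0.56.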

Improving validity and reliability has implications for grading. Learn How to Improve Test Reliability and Validity

 

© CET, SFSU 2003