
Testing Elements (Some Measurement Theory)

This Page Includes
  Introduction
  Introduction to True Score Theory
  True Score Theory - Domain Sampling
  True Score Theory - Concerns
  Activity
  Item Response Theory (IRT)
  Activity
  Unique uses of IRT
  Strengths & weaknesses of IRT
  Implications for the Classroom Test
  Test Content

 


Introduction to Measurement Models

Coincident with the development of the short, objective test item (multiple choice, true/false, etc.), work began on objective ways to study these tests. In most cases, one point was given if the examinee answered the item correctly and no point if he answered incorrectly, and the examiners simply added up the number of correct answers, so the total score became the focus. It was readily noted that the total score is a linear sum of the dichotomous item scores, and Classical (linear) Test Theory (CTT) was born.

In 1952 Frederick Lord published a monograph on another model, and in 1960 Georg Rasch, a Danish mathematician, followed with a comparable work on a variation derived from different assumptions. Both are far more sophisticated and flexible than CTT, and both have led to advances in test theory and applications not possible with CTT. The approach is called Item Response Theory (IRT) because the item is the basic unit of analysis. Its more elaborate versions call for computers and a good deal of study before they can be applied conveniently.

Both theories are in current use today and seem to have found their own, somewhat separate niches. Classical theory lends itself to introductory and day-to-day application, especially in the classroom. IRT is used by large testing companies and organizations that want to use computer adaptive testing or develop large scale testing programs.


True Score Theory

There are several versions and derivations of the classical model, but they all seem to have started with Harold Gulliksen (Theory of Mental Tests, 1950), who gives historical credit to numerous predecessors dating throughout the first half of the twentieth century. He assumed that the gross, observed score is made up of two components, T and E. T is the relatively stable trait being measured and is called the true score; E is an error component and is assumed to be a random process.

Errors can come from several places. Testing conditions are seldom optimal, and there is always the possibility that the room, its temperature, or other students are momentarily creating distractions. The person taking the test might not feel well or have personal problems that affect his motivation. Even in objective tests, it is possible for scoring errors to occur, either by a tired or distracted professor or a dirty optical scanner. It is also possible that the test items are not representative of the domain and give a certain degree of slant to the estimate of the students’ real score. It is possible that the ideas taught in class, read about in a text, or discussed with friends are slowly being forgotten or remembered incorrectly.

Activity


Errors are considered random either across individual examinees for a given item or across items for a given examinee. One example might be a late night out, or perhaps a late night studying, and another might be a non-technical word that one examinee blocks on when reading an item. What other sources of error might reduce the ability of the test to come up with the same results on two different occasions?

The model is then simply, where i indicates an individual,

$X_i = T_i + E_i$.

This says that any observation can be decomposed into a systematic and a random component and that these are added to get the observation. With minimal assumptions, a series of important and useful statements and formulas can be derived.

Several of these are:

  1. The correlation between the true score and the error component is zero, that is,

    $r_{TE} = 0$.

  2. In the long term, the error scores cancel out giving an average of zero, that is,

    $M_E = 0$.

  3. In the long run, we expect the observed scores to equal the true score, that is,

    E[X] = T.

  4. The variability of the observed score can be decomposed into systematic variance and error variance, that is,

    $S_X^2 = S_T^2 + S_E^2$.

  5. The Pearson correlation between two forms of the test can be interpreted as the proportion of systematic variance in the test, that is,

    $r_{XX} = \frac{S_T^2}{S_X^2}$.

  6. The standard deviation of the error scores can be estimated easily from observed data. This is called the standard error of the test and can be calculated by

    $S_E = S_X \sqrt{1 - r_{XX}}$.

  7. The standard error is assumed to be the same for each examinee and can be used to put upper and lower boundaries on your estimate of his T score:

    Upper limit = $X_i + 2S_E$
    Lower limit = $X_i - 2S_E$
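
The statements above can be checked numerically. The following is a minimal simulation sketch, not part of the original module: the sample size, the score mean, and the two standard deviations are assumed values chosen only for illustration, and the errors are assumed to be normally distributed. It builds two parallel forms that share the same true scores but have independent random errors.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)            # fixed seed so the run is repeatable
n = 20000                         # hypothetical number of examinees
true_sd, error_sd = 8.0, 4.0      # assumed values, for illustration only

T = [rng.gauss(50, true_sd) for _ in range(n)]     # true scores
E1 = [rng.gauss(0, error_sd) for _ in range(n)]    # random error, form 1
E2 = [rng.gauss(0, error_sd) for _ in range(n)]    # random error, form 2
X1 = [t + e for t, e in zip(T, E1)]                # observed scores, form 1
X2 = [t + e for t, e in zip(T, E2)]                # observed scores, form 2

r_te = pearson(T, E1)                              # statement 1: close to 0
mean_e = statistics.fmean(E1)                      # statement 2: close to 0
var_x = statistics.pvariance(X1)                   # statement 4, left side
var_sum = statistics.pvariance(T) + statistics.pvariance(E1)  # right side
r_xx = pearson(X1, X2)                             # statement 5: two forms
ratio = statistics.pvariance(T) / var_x            # true variance / observed
s_e = math.sqrt(var_x) * math.sqrt(1 - r_xx)       # statement 6

print(f"r_TE = {r_te:.3f}, mean error = {mean_e:.3f}")
print(f"var(X) = {var_x:.1f}, var(T) + var(E) = {var_sum:.1f}")
print(f"r_XX = {r_xx:.3f}, var(T)/var(X) = {ratio:.3f}, S_E = {s_e:.2f}")
```

With a sample this large, the correlation between the two forms should land close to the ratio of true to observed variance, and the estimated standard error should land close to the error standard deviation that was built into the simulation.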

Activity


Suppose we find that on a final examination, $S_T = 8$ and $S_X = 9$. Calculate the reliability for this test. How much of the “information” on the test would you be likely to reproduce on a similar test? (ans: $r_{XX} = .79$)


If an examinee gets an observed score of 30 (i.e., $X = 30$) on this test, within what range would you expect to find his T score? ($S_E = 4.1$; ans: 21.8 – 38.2)
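
For readers who want to check their arithmetic, the two answers can be worked out from statements 5 through 7 above, using only the values given in the activity:

```latex
\begin{align*}
r_{XX} &= \frac{S_T^2}{S_X^2} = \frac{8^2}{9^2} = \frac{64}{81} \approx .79 \\
S_E &= S_X \sqrt{1 - r_{XX}} = 9\sqrt{1 - .79} \approx 9(.458) \approx 4.1 \\
X \pm 2S_E &= 30 \pm 2(4.1) = 30 \pm 8.2 \quad\Rightarrow\quad 21.8 \text{ to } 38.2
\end{align*}
```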

 


Domain Sampling

There are very few practical assumptions that you have to make in order to apply these very powerful ideas. Of course, the formulas can be calculated whether or not you meet the assumptions, but the inferences about your test will rest on how closely you meet several criteria. First, since we are talking about objective test items, we assume that it is possible to write many of them. In fact, it is best to be able to assume that, if you and your colleagues had the time and inclination,

  1. you could write hundreds of them, creating a Domain (that is, a population of items from which you can draw)
  2. the items are fairly interchangeable (after all, they will be reduced to 1’s and 0’s and then added up to get a total score),
  3. either you can randomly sample from the domain to create one of many possible “parallel” tests or any test you write cannot be distinguished from such a randomly selected one.


If you can meet these assumptions, then it follows that

  1. you are not restricted to discussing only the items on a particular test,
  2. you can generalize your conclusions about the test to other versions
  3. you can generalize from a student’s score to a score on any test you could generate from your Domain.
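
The idea of drawing "parallel" tests from a Domain can be sketched in a few lines. This is only an illustration under stated assumptions: the domain size, the form length, the number of examinees, and the way each simulated examinee "knows" a random share of the items are all hypothetical and not taken from the module.

```python
import random

rng = random.Random(42)                 # fixed seed so the run is repeatable
DOMAIN_SIZE, FORM_LENGTH = 500, 50      # assumed sizes, for illustration only

domain = list(range(DOMAIN_SIZE))       # item IDs standing in for written items


def draw_form(rng):
    """Randomly sample one test form from the domain, without replacement."""
    return rng.sample(domain, FORM_LENGTH)


def simulate_examinee(rng):
    """Return the set of domain items a hypothetical examinee can answer."""
    mastery = rng.uniform(0.3, 0.9)     # share of the domain the examinee knows
    return {item for item in domain if rng.random() < mastery}


def proportion_correct(known, form):
    """Score a form as the proportion of its items the examinee knows."""
    return sum(item in known for item in form) / len(form)


form_a, form_b = draw_form(rng), draw_form(rng)
examinees = [simulate_examinee(rng) for _ in range(200)]

# For each examinee: score on form A, score on form B, score on the whole domain.
rows = [
    (proportion_correct(k, form_a), proportion_correct(k, form_b), len(k) / DOMAIN_SIZE)
    for k in examinees
]
mean_gap_forms = sum(abs(a - b) for a, b, _ in rows) / len(rows)
mean_gap_domain = sum(abs(a - d) for a, _, d in rows) / len(rows)

print(f"average |form A - form B|  = {mean_gap_forms:.3f}")
print(f"average |form A - domain|  = {mean_gap_domain:.3f}")
```

Because both forms are random samples from the same domain, a score on either form serves as an estimate of the score on the other and of the score on the domain as a whole, which is the generalization the three points above describe.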


Strengths and weaknesses
The particular strength that allows True Score Theory to maintain its popularity and usefulness is its simplicity. The algebra is straightforward and many universities have software to analyze tests using it. Software is available for the desktop computer, and with a little study, the concepts become second nature.

The principal weakness of True Score Theory is that certain pieces of information do not readily generalize to other situations. For example, a score of 45 on Test Alpha is not the same as a 45 on Test Beta or for another group of examinees unless the two tests are comparable in

  1. over-all difficulty,
  2. length, that is, the number of items on the test,
  3. the number of items attempted by each examinee.

In addition, if the items are not more or less homogeneous, then replacing one group of items with another may change the statistical characteristics of the test and possibly any decisions you make using that test. The difficulty of an item is defined by the group of examinees who took it and is thus group-dependent. A question that is determined to be “easy” based on data from one group may be estimated to be difficult if given to another group.
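
To make the group dependence concrete, here is a small sketch. It assumes the usual classical index of item difficulty, the proportion of examinees who answer the item correctly (the page does not define the index explicitly), and the two response lists are invented for illustration.

```python
def item_difficulty(responses):
    """Classical item difficulty: proportion answering correctly (1 = correct)."""
    return sum(responses) / len(responses)


# Hypothetical responses to the SAME item from two different groups.
advanced_section = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
intro_section = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

p_advanced = item_difficulty(advanced_section)   # 8 of 10 correct
p_intro = item_difficulty(intro_section)         # 3 of 10 correct

print(f"advanced section: p = {p_advanced:.1f}")  # 0.8, looks "easy"
print(f"intro section:    p = {p_intro:.1f}")     # 0.3, looks "difficult"
```

The item has not changed, only the group has, yet the statistic that describes its difficulty moves from 0.8 to 0.3.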


 

 


Item Response Theory

As the name suggests, Item Response Theory (IRT) is a series of models that function at the level of the individual item. Specifically, it is a set of equations with up to three parameters that model a person's response as a function of his position on a theoretical continuum. We first assume that there exists a latent trait, a single theoretical continuum that describes the characteristic of the person we want to measure. Then, depending on other issues, we include one or more parameters. The three parameters are

  1. item difficulty, the point location of the item on the continuum (b);
  2. item discrimination, the ability of the item to divide the continuum into two clearly separate parts (a);
  3. a guessing value, to correct for examinees' guessing (c).

In the most efficient model only the first parameter is used, and the other two are included to increase the model's fit to the test data. The model's characteristics are compared to the ability of the examinee (χ) in a logistic equation:

$P(\chi) = \frac{1}{1 + e^{-a(\chi - b)}}$

Here, P means probability and the graph of the equation is an ogive (an S-shaped curve) starting near zero and increasing slowly, then faster, then more slowly toward a value of 1. This is depicted in Figure 1 (below).

The model has a nice interpretation. Ignoring the "a" parameter (discrimination), and looking at the formula, you can see that the working part is simply (χ - b), the numerical difference between the person's proficiency, χ, and the item's difficulty, b. If χ is greater than b, that is, if the person's proficiency is greater than the item's difficulty, the person is more likely than not to answer the item correctly, and the larger the difference, the greater the probability, P, that this will happen. If b is greater than χ, the item is "stronger" than the person, and the person is more likely than not to get the item wrong. The location of the curve on the X-axis, then, determines the "difficulty" of the item: it is the place where a person has a probability of success of .5. The pitch of the curve's rise determines how well the item can "discriminate" between locations to the left of the difficulty and to the right: flat curves do not discriminate well while sharply increasing ones do.
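
The logistic equation is easy to experiment with. The sketch below is an illustration rather than the module's own code: the function name `p_correct` is hypothetical, and the way the guessing value c is folded in (as a lower floor on the curve) is the common three-parameter form, which is an assumption here because the page gives the formula only for a and b. With c = 0 the function reduces to the equation above.

```python
import math


def p_correct(chi, b, a=1.0, c=0.0):
    """Probability of a correct response under the logistic model.

    chi: examinee proficiency      b: item difficulty
    a:   item discrimination       c: guessing value (assumed lower floor)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (chi - b)))


# Proficiency equal to difficulty gives a probability of exactly .5.
print(p_correct(0.0, b=0.0))                       # 0.5

# The further proficiency is above difficulty, the higher the probability.
for chi in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(chi, round(p_correct(chi, b=0.0), 3))

# A larger "a" makes the curve steeper around the difficulty.
print(p_correct(1.0, b=0.0, a=1.0) < p_correct(1.0, b=0.0, a=2.0))   # True
```

Raising b shifts the whole curve to the right (a more difficult item), while raising a sharpens the rise around b (a more discriminating item).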

Figure 1: Graphical depiction of the IRT model.

Activity


Draw a vertical line up from a proficiency level of zero until you hit the curve. Draw a horizontal line left until you hit the Probability axis. You should be close to P = .5. The proficiency level at which P = .5 (here, zero) is the difficulty of this item. What would the graph look like for a more difficult item? Draw it.

Look at the slope of the graph. This is the item discrimination, and for this item it is somewhat moderate. What would the graph for a highly discriminating item look like? Draw it.


Some Unique Uses of IRT

Because the basic work is done at the item level, tests are constructed by compiling items into a whole. Items are identified by their parameters and the resulting test will have characteristics determined by the items used. And so tests can be built with specifics determined beforehand, either to meet certain characteristics or to be sensitive at certain levels of the latent variable.

It can be proven that we can mathematically estimate the parameters of an item independent of what other items are used in a test and independent of the proficiency of the group of examinees used to estimate them. This parameter invariability is a powerful characteristic because it leads to some exceptional conclusions.

First, we can locate a person’s proficiency with very few well-chosen items: there is no point in having the examinee take items that he has a probability of 0 or 1 of passing.

Second, we do not have to give the same items to every examinee. We give each examinee only the items that are appropriate for him, that is, items on which he has close to a 50/50 chance of answering correctly.

Third, we don’t have to be concerned about selecting the appropriate population for estimating our item parameters.

Fourth, we can identify biased items, or items with other negative characteristics, by how the parameters vary across groups. Since they are not supposed to vary, any group differences are indicative of some problem.
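
The first two points are the basis of adaptive testing, and the selection step can be sketched briefly. This is a minimal illustration under stated assumptions: it uses the one-parameter form of the model, the item bank and its difficulty values are invented, and the function names are hypothetical. How the proficiency estimate is updated after each response is deliberately left out, since the page does not describe an estimation method.

```python
import math


def rasch_p(chi, b):
    """One-parameter model: probability of success given proficiency and difficulty."""
    return 1.0 / (1.0 + math.exp(-(chi - b)))


def next_item(chi_estimate, bank, used):
    """Pick the unused item whose success probability is closest to .5."""
    candidates = [name for name in bank if name not in used]
    return min(candidates, key=lambda name: abs(rasch_p(chi_estimate, bank[name]) - 0.5))


# Hypothetical item bank: item name -> calibrated difficulty b.
bank = {"q1": -2.0, "q2": -0.5, "q3": 0.0, "q4": 0.8, "q5": 2.5}

first = next_item(0.0, bank, used=set())      # estimate 0.0 -> "q3"
second = next_item(0.6, bank, used={"q3"})    # estimate 0.6 -> "q4"
print(first, second)

# An item far from the examinee's level tells us little: success is very unlikely.
print(round(rasch_p(0.0, bank["q5"]), 3))
```

Items such as q1 and q5 are never chosen for an examinee near zero, because the outcome is nearly certain in advance and so adds little information about where the examinee sits on the continuum.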



Strengths and Weaknesses
The strengths of IRT are directly derived from the characteristic of parameter invariability and have been listed above. The one parameter model, called the Rasch model, can be easily studied and applied on a standard personal computer and there are good programs available. The problem is that this restricts the user a bit since he has to remove and rewrite items that do not fit the model.

If you want to tackle it, there are programs, not so user-friendly, for working with the two and three parameter models. These models need large groups of examinees (more than 500, often) to calibrate the parameters of the equations and so are inappropriate for classroom use. They are most often used by theoretical researchers and large testing companies such as Educational Testing Service.


Implications for the Classroom Test

  1. When you count the “number right” on a test you have assumed that each question on the test has an equal impact on the total score.
  2. The T score reflects consistency only. It may be way off base, and could be measuring unintended attributes or repeatable “consistent error” as opposed to random error, E.
  3. A test will always have some error (what measurement doesn’t?).
  4. The correlation between the error score and the T score (rTE = 0) implies that there is no pattern to the errors and so there could be as much inaccuracy in a lower score as there is in a higher one.
  5. The fact that, in the long run, E[X] = T implies that longer tests will be more accurate measures of T than will shorter ones.
  6. On two separate occasions, or on two parallel tests, the test scores will agree to the extent that they share the same consistent information. This is not necessarily meaningful information.
  7. There is only one standard error (SE) on a test, and this is assumed to be the same for all examinees and all possible observed scores on that test.
  8. Using the standard error to establish limits of accuracy can be used to add caution to assigning letter grades, especially for those examinees near the cut points between letter grades.
  9. One good way to develop a Domain of test items is to store similar items in computer files as “item pools” or “item banks” and to routinely update and upgrade the items as you use them and calculate data on them.
  10. To help assure that any given test has similar difficulties, keep records on items in the item pool, and check the item statistics before typing up the test. To assure that all students have taken all items, have the students respond on the test itself and then transfer their answers to an answer sheet. Encourage students to write on the tests and observe their test papers for tell-tale marks which could give you important information on their thinking.
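
Point 8 above lends itself to a short sketch. The standard deviation and reliability below are the values from the activity earlier on this page; the grade cut points and the function names are hypothetical and are used only to show the idea of flagging scores whose error band reaches a grade boundary.

```python
import math


def standard_error(s_x, r_xx):
    """Standard error of the test from the observed SD and the reliability."""
    return s_x * math.sqrt(1.0 - r_xx)


def cuts_within_band(score, s_e, cut_points, width=2.0):
    """Return the cut points that fall inside score +/- width * s_e."""
    lower, upper = score - width * s_e, score + width * s_e
    return [cut for cut in cut_points if lower <= cut <= upper]


s_e = standard_error(9.0, 0.79)          # about 4.1, as in the activity
cuts = [20, 28, 36, 44]                  # hypothetical letter-grade boundaries

flagged = cuts_within_band(30, s_e, cuts)
print(round(s_e, 1), flagged)            # 4.1 [28, 36]

# A score far from every boundary is not flagged.
print(cuts_within_band(60, s_e, cuts))   # []
```

An observed score of 30 has a band of roughly 21.8 to 38.2, which spans two of the assumed boundaries, so the letter grade for that examinee deserves a second look before it is assigned.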


Test Content

If you reread the preceding module, you’d notice that much of the discussion appears to relate to knowledge of intellectual content in traditional classroom situations. And you’d be right. But professional, full-time measurement specialists have easily extended the strategies and models to a wide range of situations. Measuring attitudes was one of the primary applications until the advent of the personal computer and sophisticated programs for analyzing data.


Skipping a lot of history and development of the technology, we can describe contemporary applications. The desktop personal computer can be used to articulate speech with written word to simulate foreign language situations. Photographs and complex scenarios can be reproduced and bundled into software with which the student can interact. Most of us do not have the software skills to put all this together, so we have to search out a good audio-visual person. The advantages of this are that we have a product that can test students without our direct participation, so any of the problems with observer bias or your time commitment (except for planning and developing the test!) are minimized. Also, from a measurement perspective, we have standardization of the product (either in a contrived, artificial situation or in a situation with photographic accuracy and naturalistic reality) so that all students get the same test under the same conditions. This is not always possible without a lot of planning and training on the part of the observer.

Some very good ideas for developing quantitative observation and evaluation schemas can be found in the industrial psychology and the physical education literature. Most of us would not be interested in the psycho-motor aspects of our techniques, but there is material that could be adapted from the evaluation of dance movements or the gross movements of sports. The decisions you would have to make before you start involve defining the behavior to be observed (fundamental actions, perceptual skills, learned physical activities, skilled activities) and outlining a context within which to place the behaviors. In the definition, you have to delineate the duration, latency, frequency, or amplitude of the behavior. Then you have to determine how much of the behavior to observe, how to rate the behavior, and how much of the behavior must be observed for a “pass” grade. This is not an easy task, and it is one that most classroom instructors have not dealt with.

 

 

© CET, SFSU 2003