Testing Elements (Some Measurement Theory)
The model is then simply, where i indicates an individual,
Xi = Ti + Ei.
This says that any observation can be decomposed into a systematic and a random component and that these are added to get the observation. With minimal assumptions, a series of important and useful statements and formulas can be derived.
Several of these are:
rTE = 0.
In the long term, the error scores cancel out giving an average of zero, that is,
ME = 0.
In the long run, we expect the observed scores to equal the true score, that is,
E[X] = T.
σ²X = σ²T + σ²E.
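The statements above can be checked with a small simulation. This is only a sketch under stated assumptions: the true scores and errors are drawn from normal distributions with made-up means and spreads, the function name `simulate_scores` is hypothetical, and the last relation is taken to be the usual variance decomposition (observed variance = true variance + error variance).

```python
import random
import statistics


def simulate_scores(n=20000, true_mean=50.0, true_sd=10.0, error_sd=4.0, seed=0):
    """Simulate X = T + E with error drawn independently of the true score.

    All numeric defaults are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / (ssx * ssy) ** 0.5


true_scores, errors, observed = simulate_scores()

r_te = pearson_r(true_scores, errors)        # should be close to 0
mean_error = statistics.fmean(errors)        # should be close to 0
var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(observed)       # should be close to var_t + var_e

print(f"rTE = {r_te:.3f}, ME = {mean_error:.3f}")
print(f"var X = {var_x:.1f}, var T + var E = {var_t + var_e:.1f}")
```

With a finite sample the quantities are only approximately zero or equal; the agreement tightens as `n` grows.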

The standard error is assumed to be the same for each examinee and can be used to put upper and lower boundaries on your estimate of his T score:
Upper limit = Xi + 2SE
Lower limit = Xi - 2SE
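The two limits are simple enough to compute by hand, but a short sketch makes the arithmetic explicit. The function name `true_score_band` and the `width` parameter are hypothetical; the multiplier of 2 comes from the formulas above, and the sample values (X = 30, SE = 4.1) are the ones used in the exercise later in this section.

```python
def true_score_band(observed, standard_error, width=2.0):
    """Return (lower, upper) limits for the true score around an observed score.

    width=2.0 mirrors the "2SE" used in the text; other multipliers are possible.
    """
    lower = observed - width * standard_error
    upper = observed + width * standard_error
    return lower, upper


lower, upper = true_score_band(30, 4.1)
print(f"{lower:.1f} to {upper:.1f}")  # prints "21.8 to 38.2"
```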
There are very few practical assumptions that you have to make in order to apply these very powerful ideas. Of course, the formulas can be calculated whether or not you meet the assumptions, but the inferences about your test will rest on how closely you meet several criteria. First, since we are talking about objective test items, we assume that it is possible to write many of them. In fact, it is best to be able to assume that, if you and your colleagues had the time and inclination,
If you can meet these assumptions, then it follows that
Strengths and weaknesses
The particular strength that allows True Score Theory to maintain its
popularity and usefulness is its simplicity. The algebra is straightforward
and many universities have software to analyze tests using it. Software
is available for the desktop computer, and with a little study, the concepts
become second nature.
The principal weakness of True Score Theory is that certain pieces of information do not readily generalize to other situations. For example, a score of 45 on Test Alpha is not the same as a 45 on Test Beta or for another group of examinees unless the two tests are comparable in
In addition, if the items are not more or less homogeneous, then replacing one group of items with another may change the statistical characteristics of the test and possibly any decisions you make using that test. The difficulty of an item is defined by the group of examinees who took it and is thus group-dependent. A question that is determined to be “easy” based on data from one group may be estimated to be difficult if given to another group.
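A small sketch can illustrate this group dependence. It assumes the usual classical index of item difficulty, the proportion of examinees who answer the item correctly (the text does not name the index), and the two response lists are invented data for illustration only.

```python
def proportion_correct(responses):
    """Classical item difficulty: share of examinees who got the item right.

    responses is a list of 1 (correct) and 0 (incorrect) for ONE item.
    """
    return sum(responses) / len(responses)


# Invented responses to the same item from two different groups.
group_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 8 of 10 correct
group_b = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # 3 of 10 correct

p_a = proportion_correct(group_a)
p_b = proportion_correct(group_b)
print(f"group A: {p_a:.1f}, group B: {p_b:.1f}")  # prints "group A: 0.8, group B: 0.3"
```

The item itself is unchanged, yet it looks easy for the first group and difficult for the second, which is the sense in which the statistic is group-dependent.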
Suppose an examinee gets an observed score of 30 (i.e., X = 30) on this test. Within what range would you expect his T score to fall? (SE = 4.1; ans: 21.8 to 38.2)
As the name suggests, Item Response Theory (IRT) is a series of models that function at the level of the individual item. Specifically, it is a set of equations with up to four parameters that model the response of a person to a theoretical continuum. We first assume that there exists a latent trait, a single theoretical continuum that describes the characteristic of the person we want to measure. Then, depending on other issues, we include one or more parameters. The four parameters are
In the most efficient model only the first parameter is used, and the other two are included to increase the model's fit to the test data. The model's characteristics are compared to the ability of the examinee (χ) in a logistic equation:

Here, P means probability, and the graph of the equation is an ogive (an S-shaped curve) starting near zero and increasing slowly, then faster, then more slowly toward a value of 1. This is depicted in Figure 1 (below).
The model has a nice interpretation. Ignoring the "a" parameter (discrimination) and looking at the formula, you can see that the working part is simply (χ - b), the numerical difference between the person's proficiency, χ, and the item's difficulty, b. If χ is greater than b, that is, if the person's proficiency is greater than the item's difficulty, the person is more likely than not to answer the item correctly, and the larger the difference, the greater the probability, P, that this will happen. If b is greater than χ, the item is "stronger" than the person, and the person is more likely than not to get the item wrong. The location of the curve on the X-axis, then, determines the "difficulty" of the item: it is the place where a person has a probability of success of .5. The pitch of the curve's rise determines how well the item can "discriminate" between locations to the left of the difficulty and to the right: flat curves do not discriminate well while sharply increasing ones do.
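The interpretation can be tried out numerically. The sketch below assumes the common two-parameter logistic form, P = 1 / (1 + exp(-a(χ - b))), which matches the description here but is not necessarily the exact equation the author used; the function name `p_correct` and the sample values are hypothetical.

```python
import math


def p_correct(proficiency, difficulty, discrimination=1.0):
    """Probability of a correct response under an assumed two-parameter logistic model.

    proficiency    -> the person's location (χ in the text)
    difficulty     -> the item's location (b in the text)
    discrimination -> the steepness of the curve (a in the text)
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (proficiency - difficulty)))


# Where proficiency equals difficulty, the chance of success is exactly .5.
print(p_correct(0.0, 0.0))  # prints 0.5

# Tabulate a flat curve and a steep curve for the same item difficulty.
for chi in (-2.0, -1.0, 0.0, 1.0, 2.0):
    flat = p_correct(chi, 0.0, discrimination=0.5)
    steep = p_correct(chi, 0.0, discrimination=2.0)
    print(f"chi = {chi:+.1f}   flat: {flat:.2f}   steep: {steep:.2f}")
```

In the table, the steep item separates people just below and just above its difficulty much more sharply than the flat one.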
Figure 1: Graphical depiction of the IRT model
Look at the slope of the graph. This is the item discrimination, and for this item it is moderate. What would the graph for a highly discriminating item look like? Draw it.
Because the basic work is done at the item level, tests are constructed by compiling items into a whole. Items are identified by their parameters and the resulting test will have characteristics determined by the items used. And so tests can be built with specifics determined beforehand, either to meet certain characteristics or to be sensitive at certain levels of the latent variable.
It can be proven that we can mathematically estimate the parameters of an item independent of what other items are used in a test and independent of the proficiency of the group of examinees used to estimate them. This parameter invariability is a powerful characteristic because it leads to some exceptional conclusions.
First, we can locate a person’s proficiency with very few well-chosen items: there is no point in having the examinee take items that he has a probability of 0 or 1 of passing.
Second, we do not have to give the same items to every examinee. We give each examinee only the items that are appropriate for him, that is, items on which he has close to a 50/50 chance of answering correctly.
Third, we don’t have to be concerned about selecting the appropriate population for estimating our item parameters.
Fourth, we can identify biased items, or items with other negative characteristics, by how the parameters vary across groups. Since they are not supposed to vary, any group differences are indicative of some problem.
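Two of these conclusions lend themselves to a short sketch: choosing the next item for an examinee, and flagging items whose parameters differ across groups. Everything here is an illustrative assumption: the one-parameter logistic form, the function names, the sample difficulties, and the `tolerance` of 0.5 are not from the text, and real programs use more careful statistical criteria.

```python
import math


def p_correct(proficiency, difficulty):
    """Assumed one-parameter logistic model: only item difficulty is used."""
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))


def pick_next_item(proficiency, difficulties, already_given=()):
    """Return the index of the unused item whose success chance is closest to .5.

    Raises ValueError if every item has already been given.
    """
    candidates = [i for i in range(len(difficulties)) if i not in already_given]
    return min(candidates,
               key=lambda i: abs(p_correct(proficiency, difficulties[i]) - 0.5))


def flag_unstable_items(difficulties_group_1, difficulties_group_2, tolerance=0.5):
    """Return indexes of items whose difficulty estimates differ across groups.

    tolerance is an arbitrary illustrative cutoff, not a recommended value.
    """
    pairs = zip(difficulties_group_1, difficulties_group_2)
    return [i for i, (b1, b2) in enumerate(pairs) if abs(b1 - b2) > tolerance]


item_bank = [-2.0, -0.5, 0.3, 1.5]  # invented item difficulties
first = pick_next_item(0.2, item_bank)
second = pick_next_item(0.2, item_bank, already_given={first})
print(first, second)  # prints "2 1"

suspect = flag_unstable_items([-1.0, 0.0, 1.0], [-0.9, 0.8, 1.1])
print(suspect)  # prints "[1]"
```

An examinee located at 0.2 is first given the item at 0.3, then the item at -0.5; the very easy and very hard items tell us little about him. In the second call, only the middle item shifts enough between groups to be flagged for review.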
Strengths and Weaknesses
The strengths of IRT are directly derived from the characteristic of parameter
invariability and have been listed above. The one parameter model, called
the Rasch model, can be easily studied and applied on a standard personal
computer and there are good programs available. The problem is that this
restricts the user a bit since he has to remove and rewrite items that
do not fit the model.
If you want to tackle it, there are programs, not so user-friendly, for working with the two- and three-parameter models. These models need large groups of examinees (often more than 500) to calibrate the parameters of the equations and so are inappropriate for classroom use. They are most often used by theoretical researchers and large testing companies such as Educational Testing Service.
If you reread the preceding module, you’d notice that much of the discussion appears to relate to knowledge of intellectual content in traditional classroom situations. And you’d be right. But professional, full-time measurement specialists have easily extended the strategies and models to a wide range of situations. Measuring attitudes was one of the primary applications until the advent of the personal computer and sophisticated programs for analyzing data.
Skipping a lot of history and development of the technology, we can describe contemporary applications. The desktop personal computer can be used to articulate speech with written word to simulate foreign language situations. Photographs and complex scenarios can be reproduced and bundled into software with which the student can interact. Most of us do not have the software skills to put all this together, so we have to search out a good audio-visual person. The advantages of this are that we have a product that can test students without our direct participation, so any of the problems with observer bias or your time commitment (except for planning and developing the test!) are minimized. Also, from a measurement perspective, we have standardization of the product (either in a contrived, artificial situation or in a situation with photographic accuracy and naturalistic reality) so that all students get the same test under the same conditions. This is not always possible without a lot of planning and training on the part of the observer.
Some very good ideas for developing quantitative observation and evaluation schemas can be found in the industrial psychology and the physical education literature. Most of us would not be interested in the psycho-motor aspects of our techniques, but there is material that could be adapted from the evaluation of dance movements or the gross movements of sports. The decisions you would have to make before you start involve defining the behavior to be observed (fundamental actions, perceptual skills, learned physical activities, skilled activities) and outlining a context within which to place the behaviors. In the definition, you have to delineate the duration, latency, frequency, or amplitude of the behavior. Then you have to determine how much of the behavior to observe, how to rate the behavior, and how much of the behavior must be observed for a “pass” grade. This is not an easy task, and it is one that most classroom instructors have not dealt with.
© CET, SFSU 2003