Test Validation
We use the term “test validation” to describe the process of gathering evidence to support the use of the test as an interpretation of the underlying construct and also to imply that we never have a valid test but are constantly working to establish and strengthen valid interpretations. To the extent we have accumulated evidence defending the test’s interpretation and usefulness for its intended purpose, we have supported our arguments for its validity. This evidence can come from a wide variety of sources, which have been catalogued for convenience.
Content Evidence
The “content” of a test includes such material as the theme, wording, and format of the items as well as any directions to the examinee or scorer. Don’t forget to study the representativeness of the items, that is, the coverage of the domain, text, or lecture and the relative weights of these across the items. Rely on your own judgment and that of your colleagues.
If the test involves behaviors, specify your “testing” conditions and evaluation criteria beforehand. If you use outside judges, other instructors or teaching assistants to help you evaluate student skills, make sure they are using the criteria that you supplied and not their own. This assumes, of course, that the criteria you supplied are concordant with the construct. See Figure 5 for some examples of this.
| Conditions | Student Behavior | Evaluation Criteria |
|---|---|---|
| Given the 24 steps used in isolating DNA from yeast cells, without the use of text or other aids | Arrange the 24 steps in their proper order | |
| Given a liquid, an analytic balance, and a graduated cylinder and no directions | Calculate the density of the liquid | One point for each step recalled; calculation/measurement at each step accurate to +/- x% |
| Given a mixture of two known liquids with different boiling points and | Correctly perform a vacuum distillation to separate the liquids | Liquids separated with less than 10% cross contamination |
| Pre-experimental preparation in lab | Spontaneously wears safety glasses | Recorded at each lab session |
A good deal has been written about how to study the internal structure of a test. Here, we are talking about the items and the pattern of correlations among them. If all the items are measuring the same construct, they should have moderately high correlations among themselves (phi coefficients) and with the total score (point-biserials). If the inter-item correlations are low, there is a chance that you are testing more than one variable. If the point-biserials are low, rewrite the item(s) in question or investigate whether they measure more than one variable.
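As a rough illustration of the item statistics just described, the sketch below computes phi coefficients and point-biserials for a small set of right/wrong items. The response matrix and all names (`pearson`, `responses`, `phi`, `point_biserial`) are hypothetical. It relies on the fact that both statistics are Pearson correlations applied to 0/1 data. Note that the total score here includes the item itself, which inflates the point-biserial somewhat on short tests.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when every score is identical
    return cov / sqrt(var_x * var_y)

# Hypothetical data: rows are students, columns are items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

items = list(zip(*responses))              # one tuple of 0/1 scores per item
totals = [sum(row) for row in responses]   # total score per student

# Phi: Pearson correlation between two dichotomous items.
phi = {
    (i, j): pearson(items[i], items[j])
    for i in range(len(items))
    for j in range(i + 1, len(items))
}

# Point-biserial: Pearson correlation between a dichotomous item and the total score.
point_biserial = [pearson(item, totals) for item in items]

for index, r in enumerate(point_biserial):
    print(f"item {index}: point-biserial = {r:.2f}")
```

In this made-up data set the last item has a much lower point-biserial than the first, which is the kind of pattern that would flag it for rewriting or closer inspection.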
For example, the following question is intended to measure deductive thinking,
but is also measuring the student’s knowledge of the rowan (European
mountain ash) and snakeroot (rauwolfia) and of common and Latin names:
“A traveler just back from Europe relates that he was cured of acne
with a poultice made from rowan berries and rauwolfia. What is the most
reasonable hypothesis that can be stated from this claim?”
If an item is answered differently for groups of similar achievement,
then there is evidence for differential item functioning (DIF). While
the computer programs for this analysis are complex and not readily available,
a rough approximation for classroom use would be to group students in
categories based on test score and, within each category, compare the
item statistics for the two groups.
| | Group 1 | Group 2 |
|---|---|---|
| Fourth Quartile | item statistics for members of group 1 in upper 25% of the class | item statistics for members of group 2 in upper 25% of the class |
| Third Quartile | | |
| Second Quartile | | |
| First Quartile | | |
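The rough classroom approximation described above can be sketched as follows. This is not a formal DIF procedure; it only fills in the quartile-by-group table with the proportion answering one item correctly. The student records, the key names (`group`, `total`, `item`), and the functions `quartile_of` and `dif_table` are all hypothetical, and a real class would need far more than one student per cell for the comparison to mean anything.

```python
from collections import defaultdict

def quartile_of(score, all_scores):
    """Return 1-4 by rank among all scores (4 = top 25% of the class)."""
    below = sum(1 for s in all_scores if s < score)
    return min(4, 1 + (4 * below) // len(all_scores))

def dif_table(students):
    """Proportion answering the item correctly, keyed by (quartile, group)."""
    all_totals = [s["total"] for s in students]
    cells = defaultdict(list)
    for s in students:
        q = quartile_of(s["total"], all_totals)
        cells[(q, s["group"])].append(s["item"])
    return {key: sum(scores) / len(scores) for key, scores in cells.items()}

# Hypothetical records: total test score and the 0/1 score on the item under study.
students = [
    {"group": "Group 1", "total": 8, "item": 1},
    {"group": "Group 2", "total": 7, "item": 0},
    {"group": "Group 1", "total": 6, "item": 1},
    {"group": "Group 2", "total": 5, "item": 0},
    {"group": "Group 1", "total": 4, "item": 1},
    {"group": "Group 2", "total": 3, "item": 1},
    {"group": "Group 1", "total": 2, "item": 0},
    {"group": "Group 2", "total": 1, "item": 0},
]

table = dif_table(students)
for quartile in (4, 3, 2, 1):
    g1 = table.get((quartile, "Group 1"))
    g2 = table.get((quartile, "Group 2"))
    print(f"quartile {quartile}: Group 1 = {g1}, Group 2 = {g2}")
```

A consistent gap between the two groups within the same quartile, as in the upper two quartiles of this invented data, is the signal that the item deserves a closer look.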
Much of the strength of a validity argument comes from seeing how the test relates to other measures. There are two patterns we look at: convergent arguments (how does the test relate to similar measures or other related outcomes, like grades?) and divergent arguments (how does the test relate to extraneous measures, that is, constructs it is not supposed to measure?).
Convergent Evidence
Convergent evidence includes correlations with some outside criteria,
using either the Pearson index or a prediction equation like the regression
equation described above (in “Regression”). We are looking
at the consistency of agreement among our data. For the classroom test,
we could look at other grades, college major (should students outside
the major area of the test do as well as majors?), SAT scores, or other
measures used in your class.
Divergent Evidence
There are variables that we do not want our test to correlate with. This
subject is not often studied directly in non-published tests or outside
research situations. In the classroom, we usually look at the issue in
a somewhat casual manner. We do not want a test of scientific induction
to measure vocabulary or a test of mathematical reasoning to measure perseverance.
Close observation of your students and a well-planned and well-written
test can increase the divergence between your test and extraneous variables.
On a test of astronomy, how would a student’s background in the classics bias a test in his favor?
When the test acts in ways that are unexpected or destructive to an individual or group, we may have a signal that there is a source of invalidity in the content of the items. Thus, while group differences do not indicate a biased test directly, these differences may be instructive in finding a source of bias, should it exist. Suppose that a group of nursing students did exceptionally poorly one semester on a midterm. If this had never happened before, and if the other students did as well as others in the past, we might suspect that they had had a large number of tests or major assignments concurrent with your test.
Any of the methods discussed above can be quantified with rudimentary statistics, usually the ones already described. If you are concerned with variables that exist simultaneously with your test, they are concurrent criteria and can be measured with the correlations described above. If you are concerned with measures taken in the future, they are predictive criteria, and you can use the regression model. Accuracy of prediction can be quantified by calculating Y – Y’, the difference between what the student actually received (Y) and what your test predicted for the student (Y’). Studying this over several semesters can yield evidence of consistency in how the test behaves and how it can be interpreted, that is, evidence of validity generalization.
Improving validity and reliability has implications for grading. Learn How to Improve Test Reliability and Validity
© CET, SFSU 2003