Mathematical Underpinnings



Test Validation

This Page Includes
  Content Evidence
  Evidence from Response Processes
  Evidence from Internal Structure
  Evidence from Outside Relationships
  Evidence About the Consequences of Testing

We use the term “test validation” to describe the process of gathering evidence that supports interpreting the test as a measure of the underlying construct. The term also implies that we never simply have a valid test; we are constantly working to establish and strengthen valid interpretations. To the extent that we have accumulated evidence defending the test’s interpretation and its usefulness for its intended purpose, we have supported our arguments for its validity. This evidence can come from a wide variety of sources, which are catalogued below for convenience.


Content Evidence

The “content” of a test includes such material as the theme, wording, and format of the items as well as any directions to the examinee or scorer. Don’t forget to study the representativeness of the items, that is, the coverage of the domain, text, or lecture and the relative weights of these across the items. Rely on your own judgment and that of your colleagues.

activity


Define “representativeness” in terms of the Table of specifications from Module II.

 


Evidence from Response Processes

Whether the examinee must recall information, think through a problem, or adjust a laboratory instrument, you should have some evidence that the appropriate process is being used. After the tests have been scored and returned to the students, class discussion can help with finding out how accurately or mistakenly students are responding to the items. Asking the students to write on the tests, to make notes, and to write reactions to the alternative answers is another source of information.

activity


Should you change any students’ grades if, during class discussion, a student argues successfully that an answer you had not considered is, indeed, correct? What happens to grades if everybody got the answer to one of your questions?

 

If the test involves behaviors, specify your “testing” conditions and evaluation criteria beforehand. If you use outside judges, other instructors, or teaching assistants to help you evaluate student skills, make sure they are using the criteria that you supplied and not their own. This assumes, of course, that the criteria you supplied are concordant with the construct. See Figure 6 for some examples of this.


Figure 6. The relationship between student performance and test evaluation criteria.

| Conditions | Student Behavior | Evaluation Criteria |
|---|---|---|
| Given the 24 steps used in isolating DNA from yeast cells, without the use of text or other aids | Arrange the 24 steps in their proper order | 1. The first five steps must all fall somewhere within the first 5 ranks; 2. The last five steps must all fall somewhere within the last 5 ranks; 3. For Steps 6 through 19, one point for every step within two places of its correct position |
| Given a liquid, an analytic balance, and a graduated cylinder and no directions | Calculate the density of the liquid | One point for each step recalled; calculation/measurement at each step accurate to +/- x% |
| Given a mixture of two known liquids with different boiling points and | Correctly perform a vacuum distillation to separate the liquids | Liquids separated with less than 10% cross contamination |
| Pre-experimental preparation in lab | Spontaneously wears safety glasses | Recorded at each lab session |


Evidence from Internal Structure

A good deal has been written about how to study the internal structure of a test. Here, we are talking about the items and the pattern of correlations among them. If all the items are measuring the same construct, they should have moderately high correlations among themselves (phi coefficients) and with the total score (point-biserial correlations). If the inter-item correlations are low, there is a chance that you are testing more than one variable. If the point-biserials are low, rewrite the item(s) in question or investigate whether they measure more than one variable.
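
As a rough illustration of the item statistics just described, the sketch below computes phi coefficients and point-biserial correlations in Python. The response matrix is hypothetical, and the helper name `pearson` is ours. It relies on the fact that both statistics are Pearson correlations: phi is Pearson's r between two right/wrong items, and the point-biserial is Pearson's r between one right/wrong item and the total score. Note that the total score here still includes the item being examined, which inflates the point-biserial somewhat on short tests.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a variable has no variance
    return sxy / sqrt(sxx * syy)

# Hypothetical data: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

items = list(zip(*responses))             # one tuple of 0/1 scores per item
totals = [sum(row) for row in responses]  # total score per student

# Phi coefficient for every pair of items.
phi = {
    (i, j): pearson(items[i], items[j])
    for i in range(len(items))
    for j in range(i + 1, len(items))
}

# Point-biserial correlation of each item with the total score.
point_biserial = [pearson(item, totals) for item in items]

for pair, r in phi.items():
    print("items", pair, "phi =", round(r, 2))
for i, r in enumerate(point_biserial):
    print("item", i, "point-biserial =", round(r, 2))
```

Items whose phi coefficients with the other items are near zero or negative, or whose point-biserial is low, are the ones to inspect first.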


For example, the following question is intended to measure deductive thinking, but is also measuring the student’s knowledge of the rowan (European mountain ash) and snakeroot (rauwolfia) and of common and Latin names:


“A traveler just back from Europe relates that he was cured of acne with a poultice made from rowan berries and rauwolfia. What is the most reasonable hypothesis that can be stated from this claim?”


If an item is answered differently for groups of similar achievement, then there is evidence for differential item functioning (DIF). While the computer programs for this analysis are complex and not readily available, a rough approximation for classroom use would be to group students in categories based on test score and compare the item statistics for the two groups.
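
The rough classroom approximation described above might be sketched as follows. This is only an illustration under stated assumptions: the student records, the dictionary keys (`group`, `total`, `items`), and the function names are all hypothetical, the item statistic is simply the proportion answering correctly, and students with tied total scores are assigned to quartiles in the arbitrary order in which they were listed.

```python
from collections import defaultdict

def quartile(rank, n):
    """Map a 0-based rank (0 = lowest score) among n students to quartile 1..4."""
    return min(4, rank * 4 // n + 1)

def dif_table(students, item):
    """Proportion answering `item` correctly, by score quartile and group.

    `students` is a list of dicts with hypothetical keys:
    'group' (label), 'total' (test score), 'items' (list of 0/1 scores).
    """
    ranked = sorted(students, key=lambda s: s["total"])
    cells = defaultdict(list)
    for rank, s in enumerate(ranked):
        cells[(quartile(rank, len(ranked)), s["group"])].append(s["items"][item])
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}

# Hypothetical class of eight students in two groups, one item shown.
students = [
    {"group": "A", "total": 10, "items": [0]},
    {"group": "B", "total": 12, "items": [0]},
    {"group": "A", "total": 20, "items": [1]},
    {"group": "B", "total": 22, "items": [0]},
    {"group": "A", "total": 30, "items": [1]},
    {"group": "B", "total": 32, "items": [0]},
    {"group": "A", "total": 40, "items": [1]},
    {"group": "B", "total": 42, "items": [1]},
]

table = dif_table(students, item=0)
for q in (4, 3, 2, 1):
    print("quartile", q, "A:", table.get((q, "A")), "B:", table.get((q, "B")))
```

A large gap between the two groups within the same quartile suggests that students of similar overall achievement are answering the item differently, which is the pattern worth investigating. With real class sizes, each cell will often hold only a few students, so treat the comparison as a prompt for closer inspection rather than a formal test.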

Figure 7. A chart for computing DIF

| | Group 1 | Group 2 |
|---|---|---|
| Fourth Quartile | item statistics for members of group 1 in upper 25% of the class | item statistics for members of group 2 in upper 25% of the class |
| Third Quartile | | |
| Second Quartile | | |
| First Quartile | | |


Evidence from Outside Relationships

Much of the strength of a validation comes from seeing how the test relates to other measures. There are two patterns we look at: convergent evidence (how does the test relate to similar measures or other related outcomes, like grades?) and divergent evidence (how does the test relate to extraneous measures, that is, constructs it is not supposed to measure?).


Convergent Evidence
Convergent evidence includes correlations with some outside criteria, using either the Pearson index or a prediction equation like the regression equation described above (in “Regression”). We are looking at the consistency of agreement among our data. For the classroom test, we could look at other grades, college major (should students outside the major area of the test do as well as majors?), SAT scores, or other measures used in your class.

Divergent Evidence
There are variables that we do not want our test to correlate with. This subject is not often studied directly in non-published tests or outside research situations. In the classroom, we usually look at the issue in a somewhat casual manner. We do not want a test of scientific induction to measure vocabulary or a test of mathematical reasoning to measure perseverance. Close observation of your students and a well-planned and well-written test can increase the divergence between your test and extraneous variables.

activity


What outside sources could artificially increase some students’ scores on, say, a test on Shakespeare? What variables should correlate with these scores?

On a test of astronomy, how would a student’s background in the classics bias the test in that student’s favor?

 

 

Evidence About the Consequences of Testing

When the test acts in ways that are unexpected or destructive to an individual or group, we may have a signal that there is a source of invalidity in the content of the items. Thus, while group differences do not indicate a biased test directly, these differences may be instructive in finding a source of bias, should it exist. Suppose that a group of nursing students did exceptionally poorly one semester on a midterm. If this had never happened before, and if the other students did as well as others in the past, we might suspect that they had had a large number of tests or major assignments concurrent with your test.


A Last Word

Any of the methods discussed above can be quantified with rudimentary statistics, usually the correlation and regression statistics described earlier. If you are concerned with variables that exist simultaneously with your test, they are concurrent criteria and can be measured with correlations. If you are concerned with measures in the future, we refer to predictive criteria and use the regression model. Accuracy of prediction can be quantified by calculating Y – Y’, the difference between what the student actually received (Y) and what your test predicted for that student (Y’). Studying this over several semesters can yield evidence of consistency in how the test behaves and how it can be interpreted, that is, evidence of validity generalization.

Improving validity and reliability has implications for grading. Learn how to improve test reliability and validity.

© CET, SFSU 2003