
Mathematical Underpinnings



Introduction


Rationale for Reliability, Validity, and their Relationship

This Page Includes
Validity Reliability Activity
Use of Classroom tests
Breadth and Depth Activity
Definitions : Reliability
Definitions : Validity

 

The criteria we use are derived from positivist scientific methods and fall into two major classes: those relating to consistency (reliability) and those relating to interpretability (validity). Be aware that these terms also have everyday meanings that do not correspond to testing usage. Specifically, reliability is the consistency of the data across time, test forms, or judges; unlike the everyday usage, it does not imply trustworthiness. A stopped clock is reliable: every time you look at it you get the same information, even though that information is valid only twice in twenty-four hours.

Validity is the degree to which the inferences from the test are accurate, useful, or otherwise meaningful. Current usage refers to the validity of these inferences and not to the test directly (so, you don't say the "test is valid" but say that "it is valid for a specific purpose").
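The stopped-clock example can be checked with a few lines of code. This is only an illustrative sketch: the frozen time of 3:15 and the once-a-minute checking schedule are assumptions made for the example, not part of the original discussion.

```python
# A stopped 12-hour clock, checked once a minute for a full day.
MINUTES_PER_DAY = 24 * 60
DIAL_MINUTES = 12 * 60      # a 12-hour dial repeats every 720 minutes
STOPPED_AT = 3 * 60 + 15    # hypothetical: hands frozen at 3:15


def stopped_clock_reading(true_minute):
    """The stopped clock ignores the true time and always shows the same value."""
    return STOPPED_AT


readings = [stopped_clock_reading(m) for m in range(MINUTES_PER_DAY)]

# Consistency (reliability): how many distinct readings did we get?
distinct_readings = len(set(readings))

# Accuracy (validity of the reading as "the current time"):
# on how many minutes did the reading match the true dial time?
correct_minutes = sum(
    1 for m, shown in enumerate(readings) if shown == m % DIAL_MINUTES
)

print(f"distinct readings: {distinct_readings}")
print(f"minutes correct:   {correct_minutes} of {MINUTES_PER_DAY}")
```

One distinct reading means perfect consistency, yet the reading matches the true time on only two of the 1,440 minutes checked.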


Validity Reliability Activity

For each of the following statements, decide whether it is reliable, valid, both, or neither.
Tomorrow’s weather forecast
I answer “My name is ‘Santa Claus’” when you ask my name.
All thirty kids in Mr. Philpott’s class voting McDonalds #1
My score on a physics test written in Sanskrit (which I do not know)
A math test based on Chapter 5 of the text and the lecture but given to the wrong class

Reliability was defined theoretically in Module I as the correlation between two observations or, equivalently, the proportion of consistent variance in the observed scores. The non-reliable information in any measurement is therefore treated as random and is by definition not repeatable or replicable. Because random error cannot correlate with anything, no outside measure, Y, can correlate with X more highly than X correlates with its own consistent component, and that correlation is the square root of the reliability. So it must be true that

$r_{XY} \le \sqrt{r_{XX}}$

or that the measured validity of a test can never exceed the square root of its reliability. In other words, do not expect to know whether the test is giving you meaningful information unless you can repeat it. This is not to say that a given test isn't yielding valid information; it's just that you won't be able to tell in any empirical way. For this reason, and because it's easier, we always start evaluating a test by calculating its reliability. We'll discuss how to do this below.
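A small simulation can make the link between reliability and measured validity concrete. It is a sketch under the classical true-score model, which is an assumption here: each observed score is a true score plus independent random error. Under that model the ceiling on r_XY is the square root of r_XX. The sample size, error sizes, seed, and the names `form_a`, `form_b`, and `criterion` are all hypothetical choices for the illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2003)   # fixed seed so the run is repeatable
n_students = 20000

# Assumed model: observed score = true score + independent random error.
true_scores = [rng.gauss(0, 1) for _ in range(n_students)]
form_a = [t + rng.gauss(0, 1) for t in true_scores]       # test X, first form
form_b = [t + rng.gauss(0, 1) for t in true_scores]       # test X, parallel form
criterion = [t + rng.gauss(0, 0.5) for t in true_scores]  # outside measure Y

r_xx = pearson(form_a, form_b)     # reliability; theoretical value 1 / (1 + 1) = 0.5
r_xy = pearson(form_a, criterion)  # measured validity; theoretical value about 0.63
ceiling = math.sqrt(r_xx)          # theoretical value about 0.71

print(f"reliability r_XX    = {r_xx:.3f}")
print(f"validity    r_XY    = {r_xy:.3f}")
print(f"ceiling sqrt(r_XX)  = {ceiling:.3f}")
```

With these settings the measured validity should land near 0.63, below the ceiling of about 0.71. Making the test noisier, by raising the error standard deviation in `form_a` and `form_b`, lowers both the reliability and the highest validity the test can show.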


Interpretations and Uses of Classroom Tests

This module will focus on the concepts and measures most applicable to classroom tests. We should also acknowledge that the concepts are meaningful and useful for any observation and they should not be ignored if you will be using any other evaluation device: repeatability and valid inferences must also be criteria used to evaluate portfolios, judgment of art or physical feats, or other more unstructured student products. Note that we are talking about the judgments of the work by an evaluator; so, if two judges give different scores to a skater’s triple axel, we have no reliability and hence cannot know which, if either, of the two judges is correct. If they both gave the same score, we could proceed to study the credibility (validity) of their scores.
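The two-judge situation can be sketched numerically. The scores below are invented for illustration, and both summaries used here (the proportion of exact agreements and the correlation between judges) are assumptions about how one might quantify consistency, not a procedure prescribed by this module.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores from two judges for the same eight performances.
judge_1 = [5.5, 6.0, 4.5, 5.0, 6.0, 3.5, 5.5, 4.0]
judge_2 = [5.5, 5.5, 4.5, 5.5, 6.0, 4.0, 5.5, 4.0]

# Proportion of performances on which the judges gave identical scores.
exact_agreement = sum(a == b for a, b in zip(judge_1, judge_2)) / len(judge_1)

# Do the judges at least rank the performances in a similar way?
inter_judge_r = pearson(judge_1, judge_2)

print(f"exact agreement:         {exact_agreement:.3f}")
print(f"inter-judge correlation: {inter_judge_r:.3f}")
```

The two numbers answer different questions. Judges can order the performances almost identically and still disagree on the exact score for several of them, so it is worth deciding which kind of consistency matters for your purpose before interpreting either figure.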

Whenever you design a test, you will be making judgments about how much breadth of content to cover and how much depth of understanding students must demonstrate. At one extreme, a multiple-choice test of rote knowledge can cover a wide breadth of content with very little depth. At the other, a one-hour essay on a single narrow topic could require quite sophisticated in-depth understanding. The range of knowledge, skills, abilities, and other processes to be assessed is limited only by your own creativity. There are certainly essay examinations that simply ask students to list information, with no depth at all, even though the format appears to be open-ended and unstructured. And you can be assured that there are multiple-choice questions that can make students think in depth. The rest of this module will cover topics germane to all observations of student behavior even though it will be easier to apply the concepts to the multiple-choice format.


Breadth and Depth Activity


For each of the following observations, decide whether it is concerned with breadth, depth, or both.
report on a crime investigation
a therapist’s report to a judge
knowledge of the Cyrillic alphabet
ability to analyze a film
ability to reproduce the periodic table

Distinguish from Other Constructs

If you have decided, after considering some of the issues in Module I, that you will allow the test to consist of more than one construct, be sure that the amalgamation of the constructs will be meaningful for your purposes, such as assigning course grades. (We refer to any unobservable human trait which has a physical counterpart, such as knowledge/test score, anxiety/lie detector test, or work ethic/on-the-job effort, as a construct.) Specifically, try to write the test so that extraneous or confounding variables do not enter the process. For example, if the reading level (another construct) is too complex or uses words, metaphors, or cultural references not familiar to many of the students, you are measuring reading and enculturation first and knowledge of your content only second. Science or statistics tests often end up measuring computational accuracy and not the material itself. This extraneous information is called construct-irrelevant variance. Because it is systematic rather than random, it can increase the reliability of your test while decreasing its validity.
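The effect of construct-irrelevant variance on reliability and validity can be illustrated with a simulation. Everything in it is an assumption made for the sketch: scores are modeled as knowledge plus (optionally) reading ability plus fresh random error, all with unit variance, and "validity" is taken as the correlation with the simulated knowledge trait, which is available only because the data are simulated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(106)    # fixed seed so the run is repeatable
n_students = 20000

knowledge = [rng.gauss(0, 1) for _ in range(n_students)]  # the intended construct
reading = [rng.gauss(0, 1) for _ in range(n_students)]    # construct-irrelevant trait


def administer(include_reading):
    """One administration: knowledge, optional reading load, fresh random error."""
    return [
        k + (r if include_reading else 0.0) + rng.gauss(0, 1)
        for k, r in zip(knowledge, reading)
    ]


clean_a, clean_b = administer(False), administer(False)
loaded_a, loaded_b = administer(True), administer(True)

reliability_clean = pearson(clean_a, clean_b)     # theoretical value 1/2
reliability_loaded = pearson(loaded_a, loaded_b)  # theoretical value 2/3
validity_clean = pearson(clean_a, knowledge)      # theoretical value about 0.71
validity_loaded = pearson(loaded_a, knowledge)    # theoretical value about 0.58

print(f"reliability: clean {reliability_clean:.3f}, reading-loaded {reliability_loaded:.3f}")
print(f"validity:    clean {validity_clean:.3f}, reading-loaded {validity_loaded:.3f}")
```

The reading-loaded test repeats itself better because reading ability is stable from one administration to the next, but it tracks knowledge less closely because part of every score now reflects reading.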

Relate to Other Variables

When you design your test, there are certain areas that could be used to give you direction. First, of course, is the content of the text and the lecture or laboratory. Second are the reasons for giving the test and the implications for the student’s career or future coursework. If you focus on what would make a “good” test or what characteristics you would be proud to show other professionals, these criteria can drive your judgments as you write. What kind of student should get a high score on this test and what other kinds of characteristics should that student have? Who should get low or mediocre test scores? Which questions are directed at the “A” students? Why?

activity


Should you consider a person’s past history when you assign grades? For example, suppose a student is on probation? Should you adjust the test if the class consists of non-majors?

 

Internal and External Criteria
When you look for specific criteria, it becomes apparent that reliability tends to need internal criteria—that is, standards within the make-up of the test—and validity tends to look toward outside standards. This is not one-hundred percent true as we’ll see, but is a reasonable generalization and an alert that we should look both within the test and without for ideas.

Definitions


Reliability

We will define a test as a sample of behavior from a specific domain, of whatever nature, which is designed to elicit information about how a specified group reacts within that domain. A test “domain” is the (possibly infinite) set of stimuli to which the students could respond, the questions you could ask, or the situations you could design for the test. Most tests, although not all, come with a procedure for quantifying these behaviors so that the resulting numbers, or “scores,” can be interpreted. On most multiple-choice tests, for example, we give one point for each correct answer. Reliability is the consistency among repeated observations of the same behaviors on the same or similar groups.
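The one-point-per-correct-answer rule is simple enough to write down directly. The answer key, student labels, and responses below are hypothetical and serve only to show the scoring procedure.

```python
# Hypothetical six-item answer key and student responses.
ANSWER_KEY = ["b", "d", "a", "c", "a", "b"]

responses = {
    "student_1": ["b", "d", "a", "c", "a", "b"],
    "student_2": ["b", "d", "a", "a", "c", "b"],
    "student_3": ["b", "a", "c", "c", "d", "a"],
    "student_4": ["a", "c", "c", "b", "d", "a"],
}


def score(answers, key=ANSWER_KEY):
    """Give one point for each answer that matches the key."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)


scores = {student: score(answers) for student, answers in responses.items()}

for student, total in scores.items():
    print(f"{student}: {total} of {len(ANSWER_KEY)}")
```

Because this scoring rule involves no judgment, two scorers working from the same key will always arrive at the same totals. Any inconsistency in the scores therefore has to come from the items or the students, not from the scoring step.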

Figure 1: A Possible Model of Reliability

 

activity


Could there be unreliability in the Test Domain itself? How about between the Test Domain and Test A or Test B? Where would error, lack of reliability, come from in scoring?


Validity

Validity is far more fundamental to the quality of a test although, as we discussed, you can’t demonstrate it without reliable information. When we speak of validity we are referring to the degree of support that theory and evidence provide for the interpretation of the test scores. Thus, in general, it parallels the model of reliability in Figure 1, with one important addition. We have to account for meaning in the process. Figure 2 shows the difference.

Figure 2: A Possible Model of Validity

A good and precise definition of the construct will clarify exactly which behaviors are implied by the construct and which are not. Try to give this at least some thought before writing items, both to minimize construct under-representation vis-à-vis the curriculum, text, or course outline and to minimize construct over-representation, that is, measuring such extraneous variables as reading difficulty. Perhaps the best way to do this in a classroom setting is by creating a table of specifications or a rubric as discussed in Module 2.
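One way to use a table of specifications is to compare a draft test against it cell by cell. The sketch below assumes a blueprint keyed by topic and cognitive level; the topics, levels, item counts, and draft items are all hypothetical, and the layout is not necessarily the one presented in Module 2.

```python
from collections import Counter

# Hypothetical blueprint: planned number of items per (topic, cognitive level).
blueprint = {
    ("fractions", "recall"): 2,
    ("fractions", "application"): 2,
    ("decimals", "recall"): 2,
    ("decimals", "application"): 2,
}

# Hypothetical draft test: each item tagged with the cell it is meant to measure.
draft_items = [
    ("fractions", "recall"),
    ("fractions", "recall"),
    ("fractions", "recall"),
    ("fractions", "application"),
    ("decimals", "recall"),
    ("decimals", "recall"),
    ("reading speed", "recall"),   # not in the blueprint at all
]

written = Counter(draft_items)

# Cells with fewer items than planned: candidates for under-representation.
under_represented = {
    cell: planned - written[cell]
    for cell, planned in blueprint.items()
    if written[cell] < planned
}

# Items that measure something the blueprint never asked for.
off_blueprint = sorted(cell for cell in written if cell not in blueprint)

print("items still needed:", under_represented)
print("items outside the blueprint:", off_blueprint)
```

In this draft the application cells come up short and one item falls outside the plan, which are the two problems the paragraph above warns about.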

 


© CET, SFSU 2003