Introduction: Rationale for Reliability, Validity, and their Relationship
The criteria we use are derived from positivist scientific methods and fall into two major classes: those relating to consistency (reliability) and those relating to interpretability (validity). Be aware that these terms also have lay interpretations that do not correspond to testing usage. Specifically, reliability is the consistency of the data across time, test forms, or judges and does not imply trustworthiness, as it does in everyday usage. A stopped clock is reliable: every time you look at it you get the same information, even though the data are valid only twice in twenty-four hours.
Validity is the degree to which the inferences from the test are accurate,
useful, or otherwise meaningful. Current usage refers to the validity
of these inferences and not to the test directly (so, you don't say the
"test is valid" but say that "it is valid for a specific purpose").
| Statements | Reliable | Valid | Both | Neither |
|---|---|---|---|---|
| Tomorrow’s weather forecast | | | | |
| I answer “My name is ‘Santa Claus’” when you ask my name. | | | | |
| All thirty kids in Mr. Philpott’s class voting McDonalds #1 | | | | |
| My score on a physics test written in Sanskrit (which I do not know) | | | | |
| A math test based on Chapter 5 of the text and the lecture but given to the wrong class | | | | |
Now, reliability was defined theoretically in Module I as the correlation between two observations or the proportion of consistent variance in the observed scores. The non-reliable information in any measurement, then, is considered random information and is by definition not repeatable or replicable. Random information cannot correlate with anything, so no outside measure, Y, can correlate with X more strongly than X correlates with its own consistent component. In classical test theory that correlation is the square root of the reliability, so it must be true that
$$r_{XY} \le \sqrt{r_{XX}}$$
or that the measured validity of a test is always capped by the reliability of that test: an unreliable test cannot show high validity. In other words, do not expect to know if the test is giving you meaningful information unless you can repeat it. This is not to say that any given test isn't yielding valid information; it's just that you won't be able to tell in any empirical way. For this reason, and because it's easier, we always start evaluating a test by calculating its reliability. We'll discuss how to do this below.
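The ceiling that reliability places on validity can be illustrated with a small simulation. This is only a sketch under the classical test theory assumption that an observed score is a true score plus independent random error, for which the ceiling on a validity coefficient is the square root of the reliability. The function names (`pearson`, `simulate`), the error sizes, and the nearly ideal criterion Y are all hypothetical choices made for the illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(error_sd, n=20000, seed=0):
    """Return (reliability, validity) for one simulated test.

    Assumed setup: each student has a true score T ~ N(0, 1).
    Two parallel forms X1 and X2 add independent random error with
    standard deviation `error_sd`. The criterion Y is T plus a small
    amount of noise, i.e. an almost perfect outside measure.
    """
    rng = random.Random(seed)
    true = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [t + rng.gauss(0, error_sd) for t in true]
    x2 = [t + rng.gauss(0, error_sd) for t in true]
    y = [t + rng.gauss(0, 0.1) for t in true]
    r_xx = pearson(x1, x2)  # reliability: correlation of parallel forms
    r_xy = pearson(x1, y)   # validity: correlation with the criterion
    return r_xx, r_xy


for sd in (0.25, 0.5, 1.0, 2.0):
    r_xx, r_xy = simulate(sd)
    print(f"error_sd={sd:4}: r_XX={r_xx:.2f}  r_XY={r_xy:.2f}  "
          f"sqrt(r_XX)={math.sqrt(r_xx):.2f}")
```

As the random error grows, both coefficients fall together, and the validity coefficient stays at or below the square root of the reliability apart from sampling noise, even though the criterion here is nearly perfect.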
This module will focus on the concepts and measures most applicable to classroom tests. We should also acknowledge that the concepts are meaningful and useful for any observation and they should not be ignored if you will be using any other evaluation device: repeatability and valid inferences must also be criteria used to evaluate portfolios, judgment of art or physical feats, or other more unstructured student products. Note that we are talking about the judgments of the work by an evaluator; so, if two judges give different scores to a skater’s triple axel, we have no reliability and hence cannot know which, if either, of the two judges is correct. If they both gave the same score, we could proceed to study the credibility (validity) of their scores.
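For judged performances, consistency between evaluators can be checked before anything is said about validity. The sketch below uses made-up scores from two hypothetical judges and reports three simple indices: the proportion of exact agreement, the mean absolute difference, and the correlation between the judges. The data and the function names are assumptions for illustration, and other indices (for example, Cohen's kappa) may suit a given rubric better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def exact_agreement(a, b):
    """Proportion of performances given identical scores by both judges."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_abs_difference(a, b):
    """Average size of the disagreement between the two judges."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical scores for the same six performances.
judge_a = [5.5, 6.0, 4.5, 5.0, 6.0, 3.5]
judge_b = [5.5, 6.0, 4.0, 5.0, 5.5, 3.5]

print("exact agreement:     ", round(exact_agreement(judge_a, judge_b), 2))
print("mean abs. difference:", round(mean_abs_difference(judge_a, judge_b), 2))
print("correlation:         ", round(pearson(judge_a, judge_b), 2))
```

A high correlation with low exact agreement would suggest that the judges rank performances similarly but use the scale differently, which is a reliability problem to resolve before interpreting the scores.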
Whenever you design a test, you will be making judgments about how much
breadth of content to cover and how much depth of understanding students must
demonstrate. At one extreme, a multiple-choice test of rote knowledge
can cover a wide breadth of content with very little depth. At the other,
a one-hour essay on a single narrow topic could require quite sophisticated
in-depth understanding. The range of knowledge, skills, abilities, and
other processes to be assessed is limited only by your own creativity.
There are certainly essay examinations that simply ask students to list
information—no depth here, even though the format appears to be
open-ended and unstructured. And you can be assured that there are multiple-choice
questions that can make students think in depth. The rest of this module
will cover topics germane to all observations of student behavior even
though it will be easier to apply the concepts to the multiple-choice
format.
| Observation | Breadth | Depth | Both |
|---|---|---|---|
| report on a crime investigation | | | |
| a therapist’s report to a judge | | | |
| knowledge of the Cyrillic alphabet | | | |
| ability to analyze a film | | | |
| ability to reproduce the periodic table | | | |
If you have decided, after considering some of the issues in Module I, that you will allow the test to consist of more than one construct, be sure that the amalgamation of the constructs will be meaningful for your purposes, such as assigning course grades. (We refer to any unobservable human trait which has a physical counterpart, such as knowledge/test score, anxiety/lie detector test, or work ethic/on-the-job effort, as a construct.) Specifically, try to write the test so that extraneous or confounding variables do not enter the process. For example, if the reading level (another construct) is too high, or the test uses words, metaphors, or cultural references not familiar to many of the students, you are first measuring reading and enculturation and only secondly measuring knowledge of your content. Science or statistics tests can often end up measuring computational accuracy rather than the material itself. This extraneous information is called construct-irrelevant variance. Because it is systematic rather than random, it can increase the reliability of your test while decreasing its validity.
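The effect of construct-irrelevant variance on the two criteria can be sketched with a simulation. Everything here is an assumption made for illustration: content knowledge and reading ability are treated as independent standard-normal traits, each test form adds fresh random error, and "validity" is taken as the correlation between the score and the intended construct, which can be computed in a simulation but is not directly observable in a real classroom. The names `simulate` and `reading_weight` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(reading_weight, n=20000, seed=1):
    """Return (reliability, validity) for a test contaminated by reading.

    score = knowledge + reading_weight * reading + random error
    """
    rng = random.Random(seed)
    knowledge = [rng.gauss(0, 1) for _ in range(n)]
    reading = [rng.gauss(0, 1) for _ in range(n)]

    def one_form():
        # Same students and traits, new random error on every form.
        return [k + reading_weight * r + rng.gauss(0, 1)
                for k, r in zip(knowledge, reading)]

    form_a, form_b = one_form(), one_form()
    reliability = pearson(form_a, form_b)   # consistency across forms
    validity = pearson(form_a, knowledge)   # link to the intended construct
    return reliability, validity


for w in (0.0, 0.5, 1.0, 1.5):
    rel, val = simulate(w)
    print(f"reading_weight={w}: reliability={rel:.2f}  validity={val:.2f}")
```

Under these assumptions, increasing the reading load makes the scores more repeatable, since reading ability is stable across forms, while weakening their relationship to the content knowledge the test was meant to measure.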
When you design your test, there are certain areas that could be used to give you direction. First, of course, is the content of the text and the lecture or laboratory. Second are the reasons for giving the test and the implications for the student’s career or future coursework. If you focus on what would make a “good” test or what characteristics you would be proud to show other professionals, these criteria can drive your judgments as you write. What kind of student should get a high score on this test, and what other characteristics should that student have? Who should get low or mediocre test scores? Which questions are directed at the “A” students? Why?
Internal and External Criteria
When you look for specific criteria, it becomes apparent that reliability
tends to need internal criteria—that is, standards within the make-up
of the test—and validity tends to look toward outside standards.
This is not one-hundred percent true as we’ll see, but is a reasonable
generalization and an alert that we should look both within the test and
without for ideas.
We will define a test as a sample of behavior from a specific domain, of whatever nature, which is designed to elicit information about how a specified group reacts within that domain. A test “domain” is the (possibly infinite) set of stimuli to which the students could respond, the questions you could ask, or the situations you could design for the test. Most tests, although not all, come with a procedure for quantifying these behaviors so that the resulting numbers, or “scores,” can be interpreted. On most multiple-choice tests, for example, we give one point for each correct answer. Reliability is the consistency among repeated observations of the same behaviors on the same or similar groups.
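The one-point-per-correct-answer rule for multiple-choice tests can be written down as a short scoring procedure. This is a minimal sketch; the answer key, the student responses, and the function name `score_test` are invented for the example.

```python
def score_test(responses, key):
    """Number-correct score: one point for each response matching the key."""
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    return sum(r == k for r, k in zip(responses, key))


# Hypothetical five-item answer key and three students' responses.
key = ["B", "D", "A", "C", "A"]
students = {
    "student_1": ["B", "D", "A", "C", "A"],
    "student_2": ["B", "A", "A", "C", "D"],
    "student_3": ["C", "A", "B", "C", "D"],
}

scores = {name: score_test(answers, key) for name, answers in students.items()}
print(scores)  # {'student_1': 5, 'student_2': 3, 'student_3': 1}
```

Scores produced this way are the observations whose consistency (reliability) and interpretation (validity) are then evaluated.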
Figure 1: A Possible Model of Reliability
Validity is far more fundamental to the quality of a test although, as we discussed, you can’t have it without reliable information. When we speak of validity we are referring to the degree of support that theory and evidence provide for the interpretation of the test scores. Thus, in general, it parallels the model of reliability in Figure 1, with one important addition. We have to account for meaning in the process. Figure 2 shows the difference.
Figure 2: A Possible Model of Validity
A good and precise definition of the construct will clarify exactly which behaviors are implied by the construct and which are not. Try to give this at least some thought before writing items in order to minimize both construct under-representation vis-à-vis the curriculum, text, or course outline and construct over-representation, that is, measuring such extraneous variables as reading difficulty. Perhaps the best way to do this in a classroom setting is by creating a table of specifications or a rubric as discussed in Module 2.
© CET, SFSU 2003