How to Improve Test Reliability and Validity
Implications for Grading

Information about a test’s reliability and validity would be of only
academic interest if we could not improve a test we are not satisfied
with. The first hurdle is one of interpretation: how large
or small can a coefficient be and still be useful? The answer depends
on the use to which the test will be put. For group decisions, the measure
can indeed be rough, but where individual students are concerned, we need
more precision. Let’s assume that our principal purpose is to assign
letter grades, A through D, say.
Reliability of the data will affect the precision and repeatability of
our grade assignments. If the test is purely random, that is, with a reliability
of zero, there is no consistency whatsoever from one grade assignment
to another. With a reliability of one, grade assignments would repeat
perfectly. With reliability somewhere in between, the consistency of the assignment
depends, in general, on the standard error of measurement, and in any
specific case, on the standard error and how close the student
is to a cut point. For example, if a student scores one point above the cut score
for a “B” on your test and the standard error of measurement is 5 points, that student
has a substantial (about 42%) chance of being classified as “C” next time,
assuming normally distributed measurement error around the observed score.
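The 42% figure can be reproduced with a short calculation. The sketch below is illustrative only: it assumes a retest score is normally distributed around the current observed score with a standard deviation equal to the standard error of measurement, which appears to be the model behind the example. The function names and the score SD of 10 with reliability .75 are hypothetical values chosen so that the standard error comes out to 5 points.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1.0 - reliability)


def chance_of_lower_grade(points_above_cut: float, sem: float) -> float:
    """Probability that a retest score falls below the cut score.

    Assumption: the retest score is normally distributed around the
    current observed score, with standard deviation equal to the SEM.
    """
    return NormalDist(mu=0.0, sigma=sem).cdf(-points_above_cut)


# Hypothetical test statistics: SD = 10 and reliability = .75 give SEM = 5.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.75)
prob = chance_of_lower_grade(points_above_cut=1.0, sem=sem)
print(f"SEM = {sem:.1f}, chance of dropping a grade = {prob:.0%}")
# prints: SEM = 5.0, chance of dropping a grade = 42%
```

Under the same assumptions, a student five points above the cut would still have roughly a 16% chance of dropping a grade, which is why students near a boundary deserve extra care.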
If you are not sure of the reliability of a test or its standard error,
you must be very careful in assigning grades, especially for those students
near the cut scores (the boundaries between grades). For these students,
even if you know your test’s statistics, you must have more than
one source of information to help you fine-tune your decision.
Improving reliability
The most straightforward ways to improve your test’s reliability
follow directly from the above discussion.
First, calculate the item-test
correlations and rewrite or reject any items whose correlations are too low.
There is no official cutoff here and convenience often rules, but it is safe to say
that any item whose point-biserial correlation with the total test score
is below r = .25 should be studied.
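As a rough illustration of this first step, the sketch below computes each item's point-biserial correlation with the total score and flags items under .25. The point-biserial is simply the Pearson correlation between a 0/1 item score and the total. The response matrix and the function names are made up for the example, and the total here includes the item itself; some analysts prefer to exclude the item from the total first.

```python
from math import sqrt
from statistics import mean


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0  # e.g. every student answered the item the same way
    return cov / sqrt(ss_x * ss_y)


def item_total_correlations(responses: list[list[int]]) -> list[float]:
    """Point-biserial correlation of each 0/1 item with the total score."""
    totals = [float(sum(row)) for row in responses]
    n_items = len(responses[0])
    return [
        pearson([float(row[i]) for row in responses], totals)
        for i in range(n_items)
    ]


# Hypothetical data: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

correlations = item_total_correlations(responses)
flagged = [i + 1 for i, r in enumerate(correlations) if r < 0.25]
print("Items to study:", flagged)
```

With these invented responses, only the fourth item falls below the threshold: students with low totals answer it correctly about as often as students with high totals.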
Second, look at the items that did correlate
well and write more like them. The longer the test, the higher the reliability,
up to a point: each additional item adds less than the one before.
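The relationship between test length and reliability is usually estimated with the Spearman-Brown formula. The sketch below assumes the added items are parallel to the existing ones (that is, "more like them"); the starting reliability of .60 and the function names are illustrative.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the test is made `length_factor` times
    as long, assuming the new items behave like the existing ones."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer the test must be to reach `target` reliability."""
    return (target * (1.0 - reliability)) / (reliability * (1.0 - target))


# Hypothetical starting point: a test with reliability .60.
for k in (1, 2, 3, 4):
    print(f"{k}x length -> predicted reliability {spearman_brown(0.60, k):.2f}")

print(f"Factor needed for .90: {length_factor_needed(0.60, 0.90):.1f}")
```

Doubling this hypothetical test raises the predicted reliability from .60 to .75, but it would take a test about six times as long to reach .90, which shows the diminishing returns.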
Another point to keep in mind is restriction of range: a narrow
range of scores depresses correlation-based statistics, including the correlations
between test forms. Another way of looking at the issue is that with a homogeneous
group of people, you will not have as great a degree of accuracy in grade
decisions (say, A through D) as you would with a very heterogeneous group.
Heterogeneity here refers only to the total test score.
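The effect of group homogeneity can be illustrated with a standard classical-test-theory adjustment. The sketch below assumes the standard error of measurement is the same in both groups, so only the spread of scores changes. The numbers and the function name are hypothetical.

```python
def reliability_in_other_group(
    reliability: float, sd_original: float, sd_new: float
) -> float:
    """Estimated reliability in a group with a different score spread.

    Assumption: the error variance (SEM squared) is identical in both
    groups, so reliability = 1 - error variance / observed variance.
    The result can fall below zero if the new group is extremely narrow.
    """
    error_variance = sd_original**2 * (1.0 - reliability)
    return 1.0 - error_variance / sd_new**2


# Hypothetical case: reliability .90 in a mixed group with SD = 10,
# re-estimated for a homogeneous class with SD = 5.
restricted = reliability_in_other_group(0.90, sd_original=10.0, sd_new=5.0)
print(f"Estimated reliability in the homogeneous group: {restricted:.2f}")
```

In this example, halving the standard deviation drops the estimated reliability from .90 to .60 even though the test itself has not changed.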
If the content sampled from the domain is restricted, both validity and
reliability suffer. If the test is too long or appears too difficult,
the examinees will be tempted to guess, and this will increase error directly.
Improving validity
Increasing validity is an ongoing challenge. Here are some tips to help improve test validity.
- Clarify your test construct. Write down what you expect of the students.
If you can’t verbalize it, you can’t test it.
- Match the table of specifications with the test. Better yet, ask
another member of the faculty to do this for you. Do not take any of
this personally.
- Try rewording some of the items, especially in light of class discussion.
Listen to your students!
- Run a differential item functioning (DIF) analysis and adjust or remove items which fail.
- Compare the test to other data which might be available.
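For the DIF tip above, one widely used approach is the Mantel-Haenszel procedure, sketched below. The text does not say which DIF method is intended, so treat this as one possible choice. Students are first grouped into strata by total score, and within each stratum the odds of answering the item correctly are compared between a reference group and a focal group. The counts and function names are invented for illustration, and the sketch omits the significance test that normally accompanies the statistic.

```python
from math import log

# One stratum = students with the same (or similar) total score.
# Counts: (reference correct, reference incorrect,
#          focal correct, focal incorrect)
Stratum = tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: list[Stratum]) -> float:
    """Common odds ratio across strata; 1.0 means no DIF on this index."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator


def mh_d_dif(odds_ratio: float) -> float:
    """Delta-scale index (-2.35 * ln of the odds ratio).
    Negative values mean the item favors the reference group."""
    return -2.35 * log(odds_ratio)


# Hypothetical counts for two items, each split into score strata.
fair_item = [(10, 10, 5, 5), (20, 5, 8, 2)]
suspect_item = [(20, 10, 10, 20)]

for name, strata in (("fair", fair_item), ("suspect", suspect_item)):
    ratio = mantel_haenszel_odds_ratio(strata)
    print(f"{name}: odds ratio {ratio:.2f}, MH D-DIF {mh_d_dif(ratio):.2f}")
```

Values of the delta index near zero suggest the item behaves the same way for both groups once overall ability is matched. Large absolute values mark items worth rewording or removing.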