Have you ever had an EMS student who complained about a test question you created? Are you looking for a better way to assess test question validity? What if you could determine the validity of a test question without looking at the question? This article will illustrate how discriminatory item analysis allows instructors to use data, rather than feelings, to validate test questions.
Common test question fallacies
Fallacies:
- If most of the class missed an item, it must be a difficult question.
- If most of the class answered an item correctly, it must be an easy question.
Without applying discriminatory item analysis, it is dangerous to jump to these conclusions. There are many situations where most of the class missed an item but it is statistically valid. There are also times when most of the class answered an item correctly, but it is statistically invalid.
Fallacies:
- It takes multiple administrations of an item to validate a question.
- It is difficult to validate an item the first time it is administered.
Discriminatory item analysis can be applied after the first administration of an item without compromising the validity of the data. That said, cumulative data from repeated administrations can further strengthen an item's validity.
Common test writing definitions
Before delving into discriminatory item analysis, let us define the elements commonly associated with test question construction.
- Item and/or test item. All the components of a test question
- Stem. The test question
- Distractor. The wrong answers within a multiple-choice question
- Key. The correct answer within a multiple-choice question
- Item analysis. The statistical methods used to assess a test item/question
Calculating item difficulty
Item difficulty is the basic indicator for determining the difficulty level of an item. It is calculated by dividing the number of people who answered the item correctly by the number of people who attempted to answer it. For example, if 78 people answered the item correctly out of 100 people who attempted it, the item has a difficulty of 0.78.
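To make the arithmetic concrete, here is a minimal sketch in Python (the function and variable names are illustrative, not taken from any testing package):

```python
def item_difficulty(num_correct: int, num_attempted: int) -> float:
    """Difficulty index: the proportion of test-takers who answered correctly."""
    if num_attempted <= 0:
        raise ValueError("At least one participant must attempt the item.")
    return num_correct / num_attempted

# 78 correct answers out of 100 attempts yields a difficulty of 0.78
print(item_difficulty(78, 100))  # 0.78
```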
The difficulty index ranges from 0.00 (no one answered correctly) to 1.00 (everyone answered correctly); the higher the index, the easier the item.
The goal is to strive for an item difficulty between 0.40-0.60 for most questions within an exam. It is important to recognize that item difficulty, without discriminatory item analysis applied, can be fuzzy for the following reasons:
- The item might actually be difficult or easy
- The stem might be poorly written
- The question might be convoluted, complex or difficult to follow
- The distractors might be poorly written
Demographic groups
Discriminatory item analysis starts by breaking all the participants who attempted the exam into three demographic groups based upon their test scores (a grouping sketch in code follows this list):
- Upper percentile group. The participants with the highest test scores, comprising 27% of the group (or as close to 27% as the class size allows).
- Lower percentile group. The participants with the lowest test scores. This group should match the size of the upper percentile group; if the split is uneven, the extra participant should go to the upper percentile group.
- Middle percentile group. The remaining participants, whose test scores fall between the upper and lower percentile groups.
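Here is a rough sketch of that grouping in Python, with the 27% cutoff and the uneven-split weighting described above (all names are illustrative):

```python
import math

def split_percentile_groups(scores):
    """Split (participant, score) pairs into upper, middle and lower groups.

    The upper group takes the top ~27% of scorers, rounding up so that an
    uneven split is weighted toward the upper group. The lower group mirrors
    the upper group's size from the bottom of the ranking.
    """
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    upper_size = math.ceil(len(ranked) * 0.27)
    lower_size = min(upper_size, len(ranked) - upper_size)
    upper = ranked[:upper_size]
    lower = ranked[len(ranked) - lower_size:]
    middle = ranked[upper_size:len(ranked) - lower_size]
    return upper, middle, lower
```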
Discriminatory item analysis
Discriminatory item analysis refers to how well the data can differentiate between high and low performers for a given test. The general premise behind discriminatory item analysis is that students with higher test scores probably understand the educational concepts better than students with lower scores.
Implementing discriminatory item analysis will help test administrators eliminate much of the fuzziness associated with utilizing item difficulty alone. Once participant scores have been broken up into their three respective groupings, you are ready to analyze the data.
Statistically valid items
It is reasonable to conclude that students who score highest on a test probably understand the concepts better than students who score poorly. Since this is the case, students with the highest test scores should be answering questions correctly, and students with the lowest scores should be answering questions incorrectly. Statistical validity occurs whenever more participants within the upper percentile group correctly answer an item when compared to the lower percentile group.
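One common way to quantify this comparison is an upper-lower discrimination index: the proportion of the upper group answering correctly minus the proportion of the lower group answering correctly. A minimal sketch (names are illustrative):

```python
def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
    """Proportion correct in the upper group minus proportion correct in the
    lower group. Positive means high scorers outperformed low scorers; zero
    means the item is flat; negative is a red flag."""
    return upper_correct / upper_total - lower_correct / lower_total

# 24 of 27 upper-group students correct vs. 12 of 27 lower-group students
print(round(discrimination_index(24, 27, 12, 27), 2))  # 0.44
```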
Statistically flat items
Statistically flat items occur when equal numbers of participants in the upper and lower percentile groups miss the item. As a rule of thumb, this becomes an issue when greater than 20% of all participants miss the item. A statistically flat item is a problem because it provides no useful data for differentiating between high and low performers.
Statistically invalid items
Analyzing data for statistically invalid items is a somewhat more involved process. A test item is considered statistically invalid in these circumstances (the sketch after this list combines the rules):
- Any question where the upper percentile group misses the item more often than the lower percentile group. There should never be a time when the lower performers score better than the high performers.
- When 50% or more of all participants missed the question and 50% or more of the upper percentile group missed the question. This is referred to as the 50/50 rule. Just because more than half of the participants missed an item doesn't mean it is a bad question; what makes it bad is that more than half of the high performers missed it. Unless you intended the question to be very difficult, you should never expect more than half of your high performers to miss an item.
- Any statistically flat question where greater than 20% of the participants missed the item. An item should discriminate between high and low performers; if it doesn't, it needs to be reworked until it does.
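Pulling those rules together, a classification sketch might look like the following. The thresholds mirror the rules above; the function name and signature are illustrative, and equal-sized upper and lower groups (per the 27% split) are assumed so that raw miss counts are comparable:

```python
def classify_item(upper_correct, upper_total, lower_correct, lower_total,
                  all_correct, all_total):
    """Classify an item as 'invalid', 'flat' or 'valid' per the rules above."""
    upper_missed = upper_total - upper_correct
    lower_missed = lower_total - lower_correct
    overall_miss_rate = (all_total - all_correct) / all_total

    # Rule 1: lower performers should never beat the high performers.
    if upper_missed > lower_missed:
        return "invalid"
    # Rule 2, the 50/50 rule: half or more of everyone missed the item AND
    # half or more of the upper percentile group missed it.
    if overall_miss_rate >= 0.5 and upper_missed / upper_total >= 0.5:
        return "invalid"
    # Rule 3: flat items are invalid once more than 20% of participants
    # missed the item; below that threshold they are merely flat.
    if upper_missed == lower_missed:
        return "invalid" if overall_miss_rate > 0.2 else "flat"
    return "valid"
```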
Items warranting closer review
There are some items that may have issues with their construction and warrant a closer review. These items tend to fall within these circumstances (a companion sketch follows the list):
- Statistically flat questions where less than 20% of the participants missed the question
- Questions where greater than 50% of the participants missed the question
- Statistically valid but difficult questions; any question that greater than 30% of the participants missed should be categorized as difficult
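As a companion to the classifier sketched earlier, the review criteria could be flagged like this (thresholds mirror the list above; names are illustrative):

```python
def review_flags(classification, overall_miss_rate):
    """Return the reasons a statistically acceptable item deserves review."""
    flags = []
    if classification == "flat" and overall_miss_rate < 0.2:
        flags.append("flat, but under the 20% miss threshold")
    if classification == "valid" and overall_miss_rate > 0.5:
        flags.append("valid, but more than half of participants missed it")
    if classification == "valid" and overall_miss_rate > 0.3:
        flags.append("valid, but difficult (more than 30% missed it)")
    return flags
```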
More than a grade
There is more to an exam than the letter grade participants receive. Discriminatory item analysis helps identify the students who truly understand the information. It also helps direct remediation by identifying specifically where lower-performing students are missing concepts. Most importantly, discriminatory item analysis helps instructors minimize the emotional burden associated with grading and defending the items on their exams.