Analysing and Improving Questions

“What don’t die can’t live. What don’t live can’t change. What don’t change can’t learn.”
– Terry Pratchett, Lords and Ladies

The previous posts considered why we use MCQs, how to design them, and some of the practical issues that arise when using them in assessment. There is, however, a step that sits between writing questions and re-using them, which is analysing how they performed.

Most problems with MCQs are not obvious at the point of writing. They appear only after students have answered them.

A question that seems clear and fair may turn out to be too easy, too difficult, or one that stronger students answer incorrectly while weaker students answer correctly. Without item analysis, these issues remain largely invisible.

The purpose of analysing questions is therefore not statistical sophistication but quality control.

The underlying question is simple: did the question behave as a good question should?

Facility

The most straightforward measure is the facility index, $F$ (sometimes called the difficulty index, although a higher value means an easier item).

It is simply the proportion of students who answered the item correctly:

$$F=\frac{R}{N}$$

where $R$ is the number of students selecting the correct answer, and $N$ is the total number of students.

The value ranges from 0 to 1. A value close to 1 indicates an easy question; a value close to 0 indicates a difficult one.
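As a minimal sketch of the calculation, assuming each student's response to an item has already been marked as 1 (correct) or 0 (incorrect), the facility index can be computed as follows. The function name `facility` and the sample data are illustrative only.

```python
def facility(item_scores):
    """Return the facility index F = R / N for a single item.

    item_scores: list of 0/1 marks, one per student (1 = correct).
    """
    n = len(item_scores)
    if n == 0:
        raise ValueError("need at least one response")
    r = sum(1 for score in item_scores if score == 1)  # R: number correct
    return r / n


# Hypothetical marks from ten students on one item.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(f"F = {facility(responses):.2f}")  # F = 0.70
```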

Extremes are not automatically problematic. Some topics warrant straightforward questions; others justify demanding ones. Difficulty becomes an issue when most questions cluster at either end of the scale, as such items contribute little to differentiating between students.

In many contexts, items with facility values somewhere between approximately 0.3 and 0.8 tend to function most usefully, though this depends on cohort and purpose.

Discrimination

If facility tells us how many students answered correctly, discrimination tells us who did.

A good item should be answered correctly more frequently by higher-performing students.

A simple discrimination index, $D$ , compares performance between high- and low-scoring groups:

$$D=\frac{H-L}{N}$$

where $H$ is the number of correct responses in the top-scoring third, $L$ is the number of correct responses in the bottom-scoring third, and $N$ here is the number of students in each third (not the whole cohort, as in the facility formula).

Values range between −1 and +1. Positive values indicate that stronger students were more likely to answer correctly. Values close to zero suggest the item does little to distinguish performance levels. Negative values mean that weaker students outperformed stronger ones on the item, which usually indicates ambiguity, a flawed distractor, or occasionally an error in the answer key.

Items with values below +0.2 should be either improved or removed.
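The index can be sketched in a few lines, assuming we have each student's 0/1 mark on the item and their total test score. How to handle tied totals and cohorts that do not divide evenly by three is not specified above; this sketch simply uses integer division for the group size and breaks ties by the original order, both of which are assumptions.

```python
def discrimination_index(item_scores, total_scores):
    """Return D = (H - L) / N using the top and bottom thirds by total score.

    item_scores:  0/1 marks on the item, one per student.
    total_scores: total test scores for the same students, in the same order.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must be the same length")
    group_size = len(total_scores) // 3  # N: students in each third
    if group_size == 0:
        raise ValueError("need at least three students")
    # Rank students from lowest to highest total score (ties keep input order).
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low_correct = sum(item_scores[i] for i in order[:group_size])    # L
    high_correct = sum(item_scores[i] for i in order[-group_size:])  # H
    return (high_correct - low_correct) / group_size


# Hypothetical cohort of nine students.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10]
item = [0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"D = {discrimination_index(item, totals):.2f}")  # D = 0.67
```

Here one of the three weakest students and all three of the strongest answered correctly, so $D = (3 - 1)/3$.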

An alternative statistic is the point-biserial correlation, which measures the relationship between success on the item and overall test score. The interpretation is similar: higher positive values indicate better alignment with overall performance.
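Because the point-biserial correlation is numerically the Pearson correlation between a 0/1 item mark and the total score, a minimal sketch needs nothing beyond the standard library. Note one assumption: the total used here includes the item itself, which slightly inflates the value on short tests; some analysts subtract the item's mark from the total first.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item marks and total test scores."""
    if len(item_scores) != len(total_scores):
        raise ValueError("inputs must be the same length")
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        # Everyone got the same mark on the item, or the same total.
        raise ValueError("correlation is undefined without variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: the two strongest students got the item right.
r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
print(f"r_pb = {r_pb:.3f}")
```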

Distractor Frequency

MCQs are distinctive in that incorrect options also carry information.

A distractor should attract students who do not yet fully understand the material. If an option is rarely selected, it is not functioning as a distractor at all.

A commonly used heuristic is that options chosen by fewer than around 5% of students are unlikely to be contributing meaningfully to the item’s effectiveness (Tarrant & Ware, 2012).

Reviewing distractor frequency is often the quickest way to improve a question. Weak distractors can usually be revised without rewriting the entire stem.
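A quick way to review distractors is to tabulate how often each option was chosen and flag the incorrect ones that fall below the threshold. The sketch below assumes responses are recorded as option letters; the function name `distractor_report`, the default 5% threshold, and the sample data are illustrative.

```python
from collections import Counter


def distractor_report(choices, options, key, threshold=0.05):
    """Return {option: (share, flagged)} for one item.

    choices:   the option selected by each student, e.g. "A".
    options:   all options offered, so unchosen ones still appear.
    key:       the correct option.
    threshold: distractors chosen by a smaller share are flagged.
    """
    if not choices:
        raise ValueError("need at least one response")
    counts = Counter(choices)
    n = len(choices)
    report = {}
    for option in options:
        share = counts.get(option, 0) / n
        flagged = option != key and share < threshold
        report[option] = (share, flagged)
    return report


# Hypothetical item answered by 40 students; "A" is the key.
choices = ["A"] * 25 + ["B"] * 10 + ["C"] * 4 + ["D"] * 1
for option, (share, flagged) in distractor_report(choices, "ABCD", "A").items():
    note = "  <- rarely chosen, consider revising" if flagged else ""
    print(f"{option}: {share:.1%}{note}")
```

In this example only option D (chosen by one student in forty) falls below the threshold.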

Reliability

Individual items matter, but assessment operates at the level of the whole paper.

Reliability, $\alpha$, concerns the consistency of the assessment as a set of items. The most widely used indicator is Kuder-Richardson Formula 20, or KR-20 (a special case of Cronbach’s alpha for dichotomous scores):

$$\alpha=\frac{K}{K-1}\left(1-\frac{\displaystyle\sum_{i=1}^{K}{p_i\left(1-p_i\right)}}{\sigma_T^2}\right)$$

where $K$ is the number of items, $p_i$ is the proportion of students answering item $i$ correctly, and $\sigma^2_T$ is the variance of the students’ total test scores.

Conceptually, $\alpha$ reflects how well items function together as a coherent measure. Values around 0.8 are commonly regarded as appropriate for summative assessment, though interpretation always depends on context. If questions can earn partial marks (i.e. not just 0 or 1), the general form of Cronbach’s alpha should be used instead:

$$\alpha=\frac{K}{K-1}\left(1-\frac{\displaystyle\sum_{i=1}^{K}{\sigma_i^2}}{\sigma_T^2}\right)$$

where $\sigma_i^2$ is the variance of the mark of each question.
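Both formulas can be sketched directly from a table of marks with one row per student and one column per item. One assumption is worth stating: the sketch uses population variances (dividing by the number of students) throughout, since $p_i(1-p_i)$ is the population variance of a 0/1 item; with that choice the two functions agree exactly on dichotomous data.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of marks (rows = students, columns = items)."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    total_var = pvariance([sum(row) for row in scores])  # variance of totals
    if total_var == 0:
        raise ValueError("total scores show no variation")
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr20(scores):
    """KR-20 for a table of 0/1 marks (rows = students, columns = items)."""
    n, k = len(scores), len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores show no variation")
    p = [sum(row[i] for row in scores) / n for i in range(k)]  # facility per item
    return (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / total_var)


# Hypothetical marks: five students, four items.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(marks):.2f}")            # KR-20 = 0.80
print(f"alpha = {cronbach_alpha(marks):.2f}")  # alpha = 0.80
```

For these marks the item variances sum to 0.8 and the total-score variance is 2, giving $\frac{4}{3}\left(1-\frac{0.8}{2}\right)=0.8$.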

Poorly discriminating items tend to reduce reliability. Over time, routine item review strengthens both question banks and overall assessment quality.

Summary

Item analysis does not replace academic judgement. It supports it.

Some typical patterns are familiar:

High facility with low discrimination: the item may be too obvious or testing recall only.
Low facility with low discrimination: possible ambiguity or misalignment with teaching.
Negative discrimination: check the key and wording immediately.
Unused distractors: revise alternatives.
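These patterns can be turned into a simple screening step that produces prompts for a human reviewer. The cut-offs below (facility above 0.8 or below 0.3, discrimination below 0.2) are assumptions borrowed from the rough guidelines earlier in this post, and the function name `review_prompts` is illustrative; the output is a list of things to look at, not a decision.

```python
def review_prompts(facility, discrimination, unused_distractors=0):
    """Return a list of review prompts for one item (empty if none apply)."""
    prompts = []
    if discrimination < 0:
        prompts.append("negative discrimination: check the key and wording")
    elif discrimination < 0.2:
        if facility > 0.8:
            prompts.append("high facility, low discrimination: "
                           "may be too obvious or testing recall only")
        elif facility < 0.3:
            prompts.append("low facility, low discrimination: "
                           "possible ambiguity or misalignment with teaching")
        else:
            prompts.append("low discrimination: review the item")
    if unused_distractors > 0:
        prompts.append("unused distractors: revise alternatives")
    return prompts


# Hypothetical item statistics.
for prompt in review_prompts(facility=0.95, discrimination=0.05,
                             unused_distractors=1):
    print(prompt)
```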

None of these statistics determines what to do. They simply prompt review.

Over time, analysing item performance allows question banks to be built on evidence rather than assumption. Questions that behave well can be reused with confidence; those that do not can be improved or retired.

The process is not primarily statistical. It is reflective. We are asking whether items behave in a way consistent with our academic standards: that students who understand the material are more likely to succeed, and that errors reveal something meaningful about misunderstanding.

Seen in this way, item analysis is not about producing elegant statistics. It is about ensuring that assessment decisions are based on evidence rather than confidence. The numbers do not replace judgement. They make it more informed.

References

T. M. Haladyna and M. C. Rodriguez (2013) Developing and Validating Test Items, Routledge, New York.

Tom Rodgers
