In recent years, automated item generation (AIG) has been increasingly used to create multiple-choice questions (MCQs) for the assessment of health professionals (Gierl, Lai, & Turner, 2012; Lai, Gierl, Byrne, Spielman, & Waldschmidt, 2016). This move is due in part to changes in the assessment landscape that have led educators to seek ways to provide more frequent testing opportunities. For example, the introduction of competency-based education models, which require multiple data points to make meaningful decisions about competence, has increased the need for test items to support more frequent and tailored assessments (Lockyer et al., 2017). Similarly, progress testing, which is gaining in popularity, requires a large number of test items to allow for the creation of multiple test forms (Albanese & Case, 2016). Finally, new content is also needed to attenuate the impact of surreptitious sharing of test items among learners through social networks (Monteiro, Silva-Pereira, & Severo, 2018).
The aforementioned changes have serendipitously led to several advances in item development, including for MCQs (Pugh, De Champlain, & Touchie, 2019), of which one of the most promising has been AIG. In brief, AIG relies on computer algorithms to generate a large number of MCQs by inputting and coding information derived from a cognitive model (Gierl et al., 2012). The cognitive model approach requires content experts to deconstruct and document their thought processes before developing the test item (Pugh, De Champlain, Gierl, Lai, & Touchie, 2016). In doing so, content experts are forced to articulate the factors that would lead them down a series of different paths to solve a clinical problem. For example, if a clinician is asked to articulate their approach to a patient presenting with hyponatremia, they will identify the factors that allow them to diagnose and manage the patient. These factors may include historical features (e.g., recent fluid intake/losses or medication use), physical examination findings (e.g., volume status), and laboratory results (e.g., urine sodium). Different diagnoses would be associated with different sets of presenting features (i.e., variables). In other words, the diagnosis and management will be very different in a patient who is taking a selective serotonin reuptake inhibitor, is clinically euvolemic, and has a high urine sodium (i.e., syndrome of inappropriate antidiuretic hormone secretion) versus a patient who has a history of vomiting, is clinically hypovolemic, and has a very low urine sodium (i.e., dehydration). The resulting model accounts for these differences and can be translated into code to generate MCQs through linear optimization (Gierl & Lai, 2013).
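To make this idea concrete in computational terms, consider the following minimal sketch, written in Python for illustration only; the scenario variables, stem template, and function names are assumptions made for this example and are not drawn from the MCC's generation software or from any published cognitive model. It shows how presenting features can be encoded as model variables, each combination mapped to the diagnosis it implies, and item stems produced by systematically crossing the variables.

```python
from itertools import product

# Purely illustrative sketch (not actual AIG software): each clinical
# "scenario" pairs a set of presenting features with the diagnosis
# (the keyed answer) that those features imply.
SCENARIOS = [
    {
        "history": "is taking a selective serotonin reuptake inhibitor",
        "volume_status": "clinically euvolemic",
        "urine_sodium": "high",
        "answer": "Syndrome of inappropriate antidiuretic hormone secretion",
    },
    {
        "history": "has a three-day history of vomiting",
        "volume_status": "clinically hypovolemic",
        "urine_sodium": "very low",
        "answer": "Hypovolemic hyponatremia (dehydration)",
    },
]

# Incidental variables that can vary without changing the keyed answer,
# multiplying the number of distinct item stems.
AGES = [34, 58, 71]
SEXES = ["man", "woman"]

STEM_TEMPLATE = (
    "A {age}-year-old {sex} presents with a serum sodium of 124 mmol/L. "
    "The patient {history}, is {volume_status}, and has a {urine_sodium} "
    "urine sodium. What is the most likely diagnosis?"
)


def generate_items():
    """Yield (stem, answer) pairs for every combination of model variables."""
    for scenario, age, sex in product(SCENARIOS, AGES, SEXES):
        stem = STEM_TEMPLATE.format(
            age=age,
            sex=sex,
            history=scenario["history"],
            volume_status=scenario["volume_status"],
            urine_sodium=scenario["urine_sodium"],
        )
        yield stem, scenario["answer"]


if __name__ == "__main__":
    items = list(generate_items())
    print(f"Generated {len(items)} candidate items")
    print(items[0][0])
```

In this toy model, two scenarios crossed with three ages and two sexes yield twelve distinct stems, illustrating how a single coded cognitive model can generate many item variants.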
One of the most apparent advantages of AIG is that it allows a large number of test items to be developed in a relatively short period of time. In fact, one cognitive model, developed and coded over a 2–3-h period, can lead to the generation of dozens or even hundreds of MCQs (Gierl et al., 2012). This may be very appealing to educators and organizations, such as those introducing progress testing or competency-based models, whose need for content exceeds their capacity to develop items using traditional methods. In addition, because AIG produces items that look similar but require different thought processes to arrive at different answers, the impact of sharing recalled items among test-takers may be attenuated.
Another potential advantage of AIG is that it may be more likely to result in items that assess clinical reasoning or application of knowledge rather than factual recall, because of its reliance on cognitive models. Cognitive models, by design, force item writers to focus on problem conceptualization. This is important as educators strive to better understand and assess examinees’ cognitive processes. Although once thought to be useful only in the assessment of lower-order skills (i.e., recall of facts), well-constructed MCQs have been shown to be beneficial in assessing clinical reasoning (Coderre, Harasym, Mandin, & Fick, 2004; Heist, Gonzalo, Durning, Torre, & Elnicki, 2014; Skakun, Maguire, & Cook, 1994). In fact, examinees have been shown to use both system I (automatic, non-analytic) and system II (analytic) cognitive processes when answering MCQs, which aligns with the processes that clinicians use in practice (Surry, Torre, & Durning, 2017). However, to date, there are no studies demonstrating that items developed using AIG do in fact target these higher-order skills.
Despite the many advantages of AIG, there is some concern that items generated using this method may not be of the same quality as those developed using traditional methods (in which each item undergoes rigorous review by a committee of content experts). Psychometrically, results from pretest items on a high-stakes examination suggest that AIG-developed items for health professionals display properties similar to those of traditionally developed MCQs (Gierl et al., 2016). From a content expert perspective, a preliminary study compared the quality of 15 MCQs developed using AIG with that of items developed using traditional methods on eight pre-defined quality metrics (Gierl & Lai, 2013). The items were comparable on seven of the eight metrics; however, the quality of the distractors (i.e., the incorrect options for MCQs) was significantly worse for items generated using AIG.
In response to this concern about quality, much effort has been devoted to developing an approach to improve the quality of MCQ distractors generated using AIG. This approach provides content experts with a framework for systematically developing a list of plausible distractors at the level of the cognitive model. In practice, it has led to the generation of high-quality distractors for MCQs, as evidenced by difficulty level and discrimination indices (Lai et al., 2016). However, although these distractors appear psychometrically sound, no follow-up studies to date have examined their quality from the perspective of content experts.
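Extending the hypothetical sketch above, one way such a framework could be operationalized is to attach a pool of plausible incorrect options to each diagnosis implied by the cognitive model, so that every generated variant samples distractors consistent with its presenting features. The diagnoses, distractor pools, and function names below are illustrative assumptions only, not the framework described by Lai et al. (2016).

```python
import random

# Hedged sketch only: distractor pools are attached to the cognitive model
# (one pool per implied diagnosis) rather than to individual items, so every
# generated variant draws plausible incorrect options for its scenario.
DISTRACTOR_POOLS = {
    "Syndrome of inappropriate antidiuretic hormone secretion": [
        "Primary polydipsia",
        "Hypothyroidism",
        "Adrenal insufficiency",
        "Cerebral salt wasting",
    ],
    "Hypovolemic hyponatremia (dehydration)": [
        "Diuretic-induced hyponatremia",
        "Adrenal insufficiency",
        "Syndrome of inappropriate antidiuretic hormone secretion",
        "Renal salt wasting",
    ],
}


def build_options(answer: str, n_distractors: int = 3, seed: int = 0):
    """Return a shuffled option list containing the key plus sampled distractors."""
    rng = random.Random(seed)
    distractors = rng.sample(DISTRACTOR_POOLS[answer], n_distractors)
    options = distractors + [answer]
    rng.shuffle(options)
    return options


if __name__ == "__main__":
    print(build_options("Hypovolemic hyponatremia (dehydration)"))
```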
The Medical Council of Canada (MCC) develops and administers a written examination (the MCC Qualifying Examination, Part I), which is one of the requirements for full licensure to practice medicine in Canada. Approximately three quarters of this examination consists of MCQs. In the past few years, we have augmented our MCQ content development by introducing AIG (Gierl et al., 2012).
The purpose of this study was to evaluate the quality of items generated using AIG, as judged by a panel of experts, compared with items developed using traditional methods. Specifically, this study (1) compared the constructs (i.e., knowledge versus application of knowledge) assessed by items developed using AIG versus traditional methods, and (2) compared the quality of items developed using each approach. We hypothesized that AIG would result in items of quality comparable to those developed using traditional methods but that, because of its reliance on cognitive models, AIG would yield items better suited to assessing higher-order skills.