Assessing Comprehension beyond Recognition: A Many-Facet Rasch Measurement Approach
DOI: https://doi.org/10.47604/ijl.3600

Keywords: Rasch Model, Many-Facet Rasch Measurement, Comprehension Assessment, Summaries, Multiple-Choice Questions

Abstract
Purpose: Comprehension is commonly assessed through single-task tests, particularly multiple-choice questions (MCQs). Although MCQs offer many advantages, a growing number of researchers have raised concerns that such measures may overestimate the degree of understanding because they rely on recognition. A less frequently used alternative is the summary, which is assumed to reflect a higher level of comprehension because it requires learners to select and integrate information in order to create a mental representation of the input. This study proposes a combined comprehension measure that integrates summaries (assessing global comprehension) and MCQs (targeting detail-level comprehension) into a single measurement system. The purpose of the study is to collect validity evidence for the use of the combined measure through many-facet Rasch measurement (MFRM).
Methodology: Listening data were collected longitudinally from 290 Japanese EFL high school students over three separate waves, each involving three measurement points (i.e., Pretest, Posttest 1, and Posttest 2). Comprehension was assessed twice at each measurement point through the combined measure, which consisted of a summary and a set of five MCQs administered in paper-and-pencil format. The summaries were rated by two expert raters using a five-point rating scale in tandem with a list of main ideas and details previously extracted from the target texts. The MCQs were scored dichotomously. All three waves of data were linked through a Rasch stacking design and analyzed using MFRM with three facets: persons, items, and raters. Under the theoretical assumption that summaries are more difficult and entail a higher level of comprehension than MCQs, the summaries were given double weight when estimating learner ability.
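For context, the three-facet model described above corresponds to the standard rating-scale formulation of many-facet Rasch measurement introduced by Linacre. A minimal sketch of that formulation, offered here as an illustration rather than as the study’s exact specification, is:

\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where B_n is the ability of person n, D_i the difficulty of item i (summary or MCQ), C_j the severity of rater j, and F_k the threshold between adjacent categories k-1 and k of the five-point summary scale. For the dichotomously scored MCQs the expression reduces to the basic Rasch model with a single threshold, and the double weighting of the summaries can be understood as letting each summary observation count twice toward the estimation of B_n.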
Findings: The Wright map confirmed a hierarchy of item difficulty consistent with the theoretical expectation that the summaries were more difficult than the MCQs, providing support for the weighting scheme. The persons showed acceptable fit to the Rasch model, with most participants falling within the recommended parameters. Similarly, all summaries and multiple-choice items fit the Rasch model’s expectations, with the exception of two multiple-choice items that were slightly above the recommended criteria. The analysis revealed fair person reliability and excellent item reliability, suggesting that the replicability of the person ability hierarchy was fair and that of the item difficulty hierarchy was high. In addition, rater severity did not negatively impact the measurement, and the response thresholds suggested that the rating scale functioned as intended. These findings indicate that the combined measure is a valid instrument for comprehension assessment.
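As background, the fit and reliability indices referred to above follow standard Rasch definitions; the numeric range mentioned below reflects commonly recommended mean-square guidelines rather than values reported in this abstract:

\text{Outfit MSQ} = \frac{1}{N}\sum_{n} z_{ni}^{2}, \qquad \text{Infit MSQ} = \frac{\sum_{n} W_{ni}\, z_{ni}^{2}}{\sum_{n} W_{ni}}

where z_{ni} is the standardized residual for person n on item i and W_{ni} is its model variance. Both statistics have an expected value of 1.0, and values roughly between 0.5 and 1.5 are generally regarded as productive for measurement. Rasch reliability, computed separately for the person and item facets, is the proportion of observed variance not attributable to measurement error, R = (SD_obs^2 - MSE) / SD_obs^2, which is why person and item reliability can differ as they do here.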
Unique Contribution to Theory, Practice and Policy: Practically, this study contributes to the literature by providing a combined measure that mitigates the weaknesses of summaries and MCQs when used separately. In addition, it demonstrates how MFRM can model productive and receptive tasks, which may be weighted differently, within a single measurement system. Regarding policy, this study advocates for tests that move beyond single tasks to provide a more precise picture of learners’ levels of comprehension across different educational settings.
License
Copyright (c) 2026 Bartolo Bazan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution (CC-BY) 4.0 License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.