Assessing Comprehension beyond Recognition: A Many-Facet Rasch Measurement Approach
DOI: https://doi.org/10.47604/ijl.3600

Keywords: Rasch Model, Many-Facet Rasch Measurement, Comprehension Assessment, Summaries, Multiple-Choice Questions

Abstract
Purpose: Comprehension is commonly assessed through single-task tests, particularly multiple-choice questions (MCQs). Although MCQs offer many advantages, a growing number of researchers have raised concerns that such measures may overestimate the degree of understanding because they rely on recognition. A less frequently used alternative is the summary, which is assumed to reflect a higher level of comprehension because it requires learners to select and integrate information in order to create a mental representation of the input. This study proposes a combined comprehension measure that integrates summaries (assessing global comprehension) and MCQs (targeting detail-level comprehension) into a single measurement system. The purpose of the study is to collect validity evidence for the use of the combined measure through many-facet Rasch measurement (MFRM).
Methodology: Listening data were collected longitudinally from 290 Japanese EFL high school students over three separate waves, each involving three measurement points (i.e., Pretest, Posttest 1, and Posttest 2). Comprehension was assessed twice at each measurement point through the combined measure, which consisted of a summary and a set of five MCQs administered in paper-and-pencil format. The summaries were rated by two expert raters using a five-point rating scale in tandem with a list of main ideas and details previously extracted from the target texts. The MCQs were scored dichotomously. All three waves of data were linked through a Rasch stacking design and analyzed using MFRM with three facets: persons, items, and raters. Under the theoretical assumption that summaries are more difficult and entail a higher level of comprehension than MCQs, the summaries were given double weight when estimating learner ability.
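For context, the three-facet model described above corresponds to the standard rating-scale formulation of many-facet Rasch measurement introduced by Linacre. A minimal sketch of that formulation, offered here as an illustration rather than as the study’s exact specification, is:

\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where B_n is the ability of person n, D_i the difficulty of item i (summary or MCQ), C_j the severity of rater j, and F_k the threshold between adjacent categories k-1 and k of the five-point summary scale. For the dichotomously scored MCQs the expression reduces to the basic Rasch model with a single threshold, and the double weighting of the summaries can be understood as letting each summary observation count twice toward the estimation of B_n.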
Findings: The Wright map confirmed a hierarchy of item difficulty consistent with the theoretical expectation that the summaries were more difficult than the MCQs, providing support for the weighting scheme. The persons showed acceptable fit to the Rasch model, with most participants falling within the recommended parameters. Similarly, all summaries and multiple-choice items fit the Rasch model’s expectations, with the exception of two multiple-choice items that were slightly above the recommended criteria. The analysis revealed fair person reliability and excellent item reliability, suggesting that the replicability of the person ability hierarchy was fair and that of the item difficulty hierarchy was high. In addition, rater severity did not negatively impact the measurement, and the response thresholds suggested that the rating scale functioned as intended. These findings indicate that the combined measure is a valid instrument for comprehension assessment.
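As background, the fit and reliability indices referred to above follow standard Rasch definitions; the numeric range mentioned below reflects commonly recommended mean-square guidelines rather than values reported in this abstract:

\text{Outfit MSQ} = \frac{1}{N}\sum_{n} z_{ni}^{2}, \qquad \text{Infit MSQ} = \frac{\sum_{n} W_{ni}\, z_{ni}^{2}}{\sum_{n} W_{ni}}

where z_{ni} is the standardized residual for person n on item i and W_{ni} is its model variance. Both statistics have an expected value of 1.0, and values roughly between 0.5 and 1.5 are generally regarded as productive for measurement. Rasch reliability, computed separately for the person and item facets, is the proportion of observed variance not attributable to measurement error, R = (SD_obs^2 - MSE) / SD_obs^2, which is why person and item reliability can differ as they do here.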
Unique Contribution to Theory, Practice and Policy: Practically, this study contributes to the literature by providing a combined measure that mitigates the weaknesses of summaries and MCQs when used separately. In addition, it demonstrates how MFRM can model productive and receptive tasks, which may be weighted differently, within a single measurement system. Regarding policy, this study advocates for tests that move beyond single tasks to provide a more precise picture of learners’ levels of comprehension across different educational settings.
License
Copyright (c) 2026 Bartolo Bazan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution (CC-BY) 4.0 License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.