VLI 11(2): Uehara et al. (2022)

Developing a Discipline-Specific Corpus and High-Frequency Word List for Science and Engineering Students in Graduate School
Suwako Uehara (a), Hibiya Haraki (a), and Stuart McLean (b)
(a) The University of Electro-Communications; (b) Momoyama Gakuin University
doi: https://doi.org/10.7820/vli.v11.2.uehara
Japanese graduate students in science and engineering need to read academic research in their second language (L2), a task that can be challenging. Previous research has shown a strong correlation (0.78) between vocabulary size and reading comprehension (McLean et al., 2020), so providing high-frequency word lists could enhance comprehension. In this work-in-progress, 1.35 million tokens of professor-recommended reading materials were used to investigate three questions: (1) a method for creating a vocabulary list that would benefit science majors in graduate school; (2) procedures for creating a corpus and a high-frequency word list efficiently; and (3) the steps required to create a cleaner corpus. This paper outlines (1) a systematic, literature-informed method that includes input from professors in the field; (2) the combined use of a tailored MATLAB script and AntConc (Anthony, 2022), which generated the corpus and high-frequency word list efficiently; and (3) repeated comparison of the original PDFs with the matching text files, followed by additions to the MATLAB script to resolve the specific issues found, which produced a cleaner text. The proposed method can be applied in other contexts to enhance the generation of high-frequency word lists.

