Evaluation of L2 Listening Comprehension: in Pursuit of New Measurement Units

Purpose. Second language (L2) listening comprehension remains one of the most under researched problems in field largely due to the corresponding methodological difficulties. This study focuses on the development of such a measurement unit that considers the actual grammatical completeness and does not distort the factual semantics of the perceived text – listening comprehension unit. Methods and procedure of research. The concept of listening comprehension


Introduction
Second language (L2) listening comprehension remains one of the most under-researched problems in the L2 learning field. Many early and modern researchers agree on the critical role of listening comprehension in L2 learning and on the methodological difficulties of studying the phenomenon (Chang, 2012; El-dali, 2017; J. Field, 2009; Goh, 2016; Graham et al., 2011; Hansen & Jensen, 1994; Joyce, 2019; Long, 1985; Ockey & Wagner, 2018; Wang & Treffers-Daller, 2017; Yeldham & Gruba, 2016). Aptly described by Yeldham as the "most ephemeral of the macro-skills" (Yeldham, 2017), L2 listening comprehension is still debated as to whether it is a language problem or a listening problem (El-dali, 2017). This uncertainty carries over into the empirical domain, producing a variety of research techniques (Yeldham, 2017), none of which seems to offer a holistic understanding of L2 listening comprehension. Hence the issue of appropriate measurement units is of particular methodological importance and interest.
In this regard, a wide gamut of measurement units is used across L2 studies and adjacent research areas: words (Yang et al., 2020), sentences (Bardovi-Harlig, 1992), propositions (Sato, 1988), clauses (Taboada & Zabala, 2008), T-units (Sotillo, 2000), C-units (Pica et al., 1989), AS-units (Foster et al., 2000), prosodic units (Izre'el et al., 2020), idea units (Richards et al., 2016), discourse units (Houtkoop & Mazeland, 1985), turn-constructional units (Liddicoat, 2004), etc.
These approaches reflect the elusive, multifaceted nature of comprehension, and discussions on this matter are likely to continue. However, alongside the theoretical question of how comprehension occurs, the practical question of what exactly is comprehended in each specific pedagogical, communicative or interactive situation remains relevant. This is especially important at more advanced stages of language acquisition, when the challenge of mastering L2 grammatical complexity recedes somewhat into the background and further growth is increasingly associated with students' cognitive sensitivity to the semantic nuances of the language, their ability to understand subtle humor and puns, and their capacity to take many linguacultural factors into account. In our opinion, a possible solution lies in moving beyond the predominantly linguistic model of T-units (Bardovi-Harlig, 1992; Gaies, 1980; Hunt, 1977; Sotillo, 2000; Taboada & Zabala, 2008) towards the concept of idea units, founded mostly on a cognitive basis (Cocklin et al., 1984; Dunlosky et al., 2011; Granaas, 1985; Kroll, 1977; Vandergrift & Goh, 2011; Yeldham & Gruba, 2016). The syntactic emphasis in identifying T-units might "suppress" or involuntarily "correct" some implied semantics, which seems counterproductive when assessing listening comprehension. An idea unit, on the other hand, might "omit" or "misread" grammatical elements. Thus, a measurement unit is needed that takes into account the actual grammatical completeness of the perceived text and does not distort its factual semantics. The main goal of this article is to present the results of our attempt to create such a unit as a new technique for assessing L2 listening comprehension.

Materials and Methods
To develop a theoretical model of the above technique, we proceeded from the following definition of an idea unit: a "chunk of information corresponding to a cognitive or psychological reality of the speaker" (Kroll, 1977). We take the structural unit of this reality to be a semantic one. How might such a semantic unit be organized? To answer this question, we used a structural-ontological approach (Shymko, 2019b). The approach has been developed and tested in several research areas (philosophy of mind and AI, psycholinguistics, organizational studies) as a methodological tool for the systemic description of a multidisciplinary subject matter field (Shymko, 2018, 2019a, 2019c; Shymko & Babadzhanova, 2020). The description involves constructing a structural-ontological matrix that reflects the system-forming factors (i.e., the primary process and the material of the system under study) and allows one to conceptualize theoretical hypotheses and build an appropriate empirical research architecture. Here we omit the procedural aspects, which are detailed in the publications mentioned above, and move directly to the visualization relevant to this study (Fig. 1).
We considered the listening comprehension unit (LC unit) as a system whose material component is the subject (the horizontal axis of the matrix, Fig. 1). The subject can be any entity: animate or inanimate, human or organizational, tangible or abstract, singular or collective, natural or artificial, real or imaginary, well defined or vague, specific or implied, etc. The subject is who or what is being talked about in a particular fragment of the message; the author of direct or quoted speech is also treated as a subject. The subject can appear alone in a text fragment or have some characteristics (in the broadest possible sense: have any properties and peculiarities, carry out or undergo actions, have certain experiences, expectations, intentions, origin or focus, be located in space and time, etc.). Typically, the subject is a noun or pronoun and may consist of one or more words. The subject can also be the name of an organization, a field or type of activity, any kind of situation or state of affairs, etc. Some characteristics of one subject can be detected in semantic interaction with another one. In this case, the latter subject acts as a semantic actualizer of the characteristics of the former. For example, in the sentence "She described the security operation underway in a city", the subject is "she" and the actualizer is "security operation". Note that a subject in one fragment can become an actualizer in another fragment and vice versa. Primarily, the textual size of an LC unit is determined by one subject and one actualizer. Thus, if the properties of one subject are revealed in the text by two or more actualizers, the number of units is multiplied accordingly. As the primary process of the system, we considered the semantic unfolding of the subject and the actualizer in the text fragment (the vertical axis).
On the one hand, such unfolding presupposes endowing the subject and the actualizer with various properties and characteristics, as well as reporting on their activity; we categorized these semantic parameters as specification. On the other hand, the subject and the actualizer can be presented in a text fragment as having spatial and/or temporal characteristics, which we classified as localization. Specifiers and localizers also take part in sizing the unit. If a text fragment is replete with them, we consider two possibilities: (a) we keep synonymously close specifiers or localizers within one unit (see the example in line 32, Table A1); (b) in other cases, we form a new unit in which, when calculating points, we do not take into account the subject, the actualizer or any previously counted specifiers and localizers (see the example in rows 12-13, Table A1). The anaphora problem is resolved in the same way as in step (b).
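Rule (b) can be read as a bookkeeping step over the units of one fragment. The sketch below is one possible formalization under our own assumptions, not the author's protocol: a unit is a dict from hypothetical slot names (such as `subject` or `actualizer_localizer`) to text, the first unit of a fragment is scored in full, and each continuation unit scores only specifiers and localizers that have not yet been counted. The example sentence and its parse are invented for illustration.

```python
# Hypothetical formalization of sizing rule (b); slot names are our own.
Unit = dict[str, str]  # slot name -> text taken from the transcript

CORE_SLOTS = {"subject", "actualizer"}


def countable_slots(units: list[Unit]) -> list[list[str]]:
    """Return, for each unit of one fragment, the slots that still earn a point.

    The first unit is scored in full. In every continuation unit the subject
    and the actualizer are skipped, and so is any specifier or localizer whose
    text was already counted in an earlier unit.
    """
    counted_texts: set[str] = set()
    result: list[list[str]] = []
    for index, unit in enumerate(units):
        fresh: list[str] = []
        for slot, text in unit.items():
            if index > 0 and slot in CORE_SLOTS:
                continue  # subject and actualizer are not re-scored
            key = text.strip().lower()
            if key in counted_texts:
                continue  # this component was already taken into account
            counted_texts.add(key)
            fresh.append(slot)
        result.append(fresh)
    return result


# Invented sentence: "The mayor, re-elected in March, criticised the budget cuts
# in the council yesterday", split into two units for illustration only.
fragment = [
    {"subject": "the mayor", "subject_specifier": "criticised",
     "actualizer": "budget cuts", "actualizer_localizer": "in the council yesterday"},
    {"subject": "the mayor", "actualizer": "budget cuts",
     "subject_specifier": "re-elected", "subject_localizer": "in March"},
]
print(countable_slots(fragment))
```

Here the second unit contributes only its new specifier and localizer, so at most two further points can be earned from it.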
A unit can be complete (i.e., consist of a subject, an actualizer and their specifiers and localizers) or partial. A partial unit may comprise as little as one of the six components of the structural-ontological matrix (Fig. 1), provided that it is semantically autonomous in the text. For example, "Twilight. It is getting dark all around the valley" contains two units. The first, "twilight", consists of the subject alone. In the second unit, the subject is "it", the specifier is "getting dark", and the localizer is "all around the valley".
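To make the distinction between complete and partial units concrete, the sketch below represents a unit as a mapping from component names to text. The six slot names are an assumption: we read the matrix in Fig. 1 as crossing the subject and the actualizer with the entity itself, its specification and its localization. Since the figure is not reproduced here, this layout and the helper names are illustrative, not confirmed details of the technique.

```python
# Assumed slot names for the six components of the matrix (our reading of Fig. 1).
SLOTS = (
    "subject", "subject_specifier", "subject_localizer",
    "actualizer", "actualizer_specifier", "actualizer_localizer",
)


def make_unit(**components: str) -> dict[str, str]:
    """Build a complete or partial LC unit from one to six named components."""
    unknown = sorted(set(components) - set(SLOTS))
    if unknown:
        raise ValueError(f"unknown components: {unknown}")
    if not components:
        raise ValueError("a unit needs at least one component")
    return dict(components)


def max_points(unit: dict[str, str]) -> int:
    """One point per component present, hence at most six per unit."""
    return len(unit)


def is_complete(unit: dict[str, str]) -> bool:
    """True only if all six components are present."""
    return all(slot in unit for slot in SLOTS)


# "Twilight. It is getting dark all around the valley" -> two partial units.
units = [
    make_unit(subject="twilight"),
    make_unit(subject="it", subject_specifier="getting dark",
              subject_localizer="all around the valley"),
]
print([max_points(u) for u in units])  # [1, 3]
```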
When calculating results, each structural component of a unit corresponds to one point, so the maximum possible score for one unit is six. We believe that the convergence of the structural components of the unit constitutes the semantic roles (Donohue & Wichmann, 2008) of the subject and the actualizer (Fig. 1). Recognition of these roles is an important qualitative indicator of listening comprehension. Therefore, semantic role errors were treated as gross errors and, like omissions, scored zero points. Other, non-critical inaccuracies in content comprehension were scored at half a point for the corresponding component, provided that they did not essentially distort the meaning of the fragment.
To validate the described technique, we addressed two research questions: RQ 1. Does this technique measure and differentiate levels of L2 listening comprehension? RQ 2. How effective are the proposed LC units compared to idea units? The participants were 38 undergraduate and graduate students (31 female and 7 male) with 4 to 8 semesters of study, majoring in English Philology and Translation at the Department of Foreign Philology, Translation and Teaching Methods, Hryhorii Skovoroda University in Pereiaslav, Ukraine. All participants were native speakers of Ukrainian. The study was conducted two to three weeks after the participants had taken the annual IELTS exam. Based on their IELTS listening band scores, the participants were divided into three groups: B1, B2 and C1. This differentiation took published benchmarking results into account (Hidri, 2020).
An excerpt from an episode of the BBC Global News Podcast (Miles, 2021), 1 minute 23 seconds long, was used to collect the data. Students were instructed to reproduce the content they heard as completely and accurately as possible and were warned to avoid paraphrasing and adding their own contributions, comments, speculations, etc. They then listened to the excerpt once and were allowed to take notes. The participants' verbal reports (Yeldham, 2017) were recorded immediately after listening, using a protocol based on a table with the excerpt transcription parsed into LC units (Appendix A). When necessary, the experimenters asked clarifying questions to make sure that the semantic roles of subjects and actualizers had been correctly understood. Each structural component of every LC unit scored 1 point if it was accurately reproduced in the verbal report. Insignificant errors reduced the score to 0.5 points. Significant errors, omissions and the appearance of irrelevant semantics scored 0 points.
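The LC-unit scoring rules (1, 0.5 or 0 points per component, with semantic role errors counted as gross) reduce to a small lookup once a rater has labelled each component. In the sketch below the rating labels and function names are hypothetical; they merely encode the rules as stated, and deciding which label applies remains the rater's task.

```python
# Hypothetical rating labels; the numeric values follow the scoring rules above.
POINTS = {
    "accurate": 1.0,     # component reproduced accurately
    "minor": 0.5,        # non-critical inaccuracy, meaning not essentially distorted
    "role_error": 0.0,   # semantic role error, treated as gross
    "significant": 0.0,  # other significant error
    "omitted": 0.0,      # component omitted
    "irrelevant": 0.0,   # irrelevant semantics introduced
}
MAX_COMPONENTS = 6


def score_unit(ratings: dict[str, str]) -> float:
    """Sum the points of the rated components of one LC unit."""
    if not 1 <= len(ratings) <= MAX_COMPONENTS:
        raise ValueError("a unit has between one and six components")
    return sum(POINTS[label] for label in ratings.values())


def score_protocol(protocol: list[dict[str, str]]) -> float:
    """Total LC-unit score of one participant's verbal report."""
    return sum(score_unit(unit) for unit in protocol)


# Invented ratings for three units of one report.
protocol = [
    {"subject": "accurate"},
    {"subject": "accurate", "subject_specifier": "minor", "subject_localizer": "omitted"},
    {"subject": "role_error", "actualizer": "accurate"},
]
print(score_protocol(protocol))  # 3.5
```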
In addition, the participants' verbal reports were processed and evaluated using a second protocol for calculating idea unit scores (the breakdown into these units can also be found in Appendix A). The following rating system was used: 3 points if the content was reproduced accurately and completely; 2 points if the content was reproduced either somewhat incompletely or with insignificant distortions of semantics; 1 point if gross errors in understanding were made and the content of the unit was reproduced only partially, to the detriment of unit comprehension; 0 points if the content was omitted or replaced by irrelevant content.
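For comparison, the idea-unit rating scale can be expressed as a lookup table in the same way. The label strings and `idea_unit_score` are hypothetical names; the function returns the points scored together with the maximum possible (three points per idea unit).

```python
# Hypothetical labels for the four levels of the idea-unit rating scale.
IDEA_UNIT_POINTS = {
    "accurate_complete": 3,      # reproduced accurately and completely
    "incomplete_or_minor": 2,    # somewhat incomplete or slightly distorted
    "gross_error_partial": 1,    # gross errors, only partially reproduced
    "omitted_or_irrelevant": 0,  # omitted or replaced by irrelevant content
}


def idea_unit_score(ratings: list[str]) -> tuple[int, int]:
    """Return (points scored, maximum possible) for one verbal report."""
    scored = sum(IDEA_UNIT_POINTS[label] for label in ratings)
    return scored, 3 * len(ratings)


print(idea_unit_score(["accurate_complete", "gross_error_partial", "omitted_or_irrelevant"]))  # (4, 9)
```

Reporting the maximum alongside the score makes it easy to express results as a proportion if transcripts of different length are compared.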
Statistical analysis of the results was performed in IBM SPSS Statistics V26. All materials, data, audio recordings, transcriptions and protocols associated with this study are openly available in the Harvard Dataverse Repository (Name deleted to maintain the integrity of the review process).

Descriptive Statistics and Normality Tests
The sample as a whole was relatively young (M = 21.42 years, SD = 1.97) and predominantly female (81.6%). The listening comprehension scores for the excerpt were tested for normality (Fig. 2).

Figure 2 Normal Q-Q Plots of Score Distributions (a) Idea Units (b) Listening Comprehension Units
In both cases, the data appear to be approximately normally distributed. Along with the plots, this is supported by the skewness (-.169 for idea units and .089 for LC units) and kurtosis values (-1.251 and -1.240, respectively). The Kolmogorov-Smirnov test also did not reject the null hypothesis of normality (p = .146 for idea units and p = .200 for LC units).
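Assuming the skewness and kurtosis values come from SPSS descriptive output, they are the bias-adjusted sample statistics (often written G1 and G2). A standard-library sketch of these formulas, with their usual standard errors, is shown below for readers recomputing them from the open data; the function names are ours. The Kolmogorov-Smirnov test with Lilliefors correction is not reimplemented here (it is available, for instance, as `lilliefors` in `statsmodels.stats.diagnostic`).

```python
from math import sqrt
from statistics import fmean


def _central_moment(xs: list[float], order: int) -> float:
    """Population central moment of the given order."""
    mean = fmean(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def skewness(xs: list[float]) -> float:
    """Bias-adjusted sample skewness G1 (needs n >= 3)."""
    n = len(xs)
    g1 = _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5
    return g1 * sqrt(n * (n - 1)) / (n - 2)


def excess_kurtosis(xs: list[float]) -> float:
    """Bias-adjusted sample excess kurtosis G2 (needs n >= 4)."""
    n = len(xs)
    g2 = _central_moment(xs, 4) / _central_moment(xs, 2) ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


def se_skewness(n: int) -> float:
    """Standard error of G1 for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n: int) -> float:
    """Standard error of G2 for a sample of size n."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


print(round(skewness([1, 2, 3, 4, 5]), 6), round(excess_kurtosis([1, 2, 3, 4, 5]), 6))  # 0.0 -1.2
```

Dividing each statistic by its standard error gives a rough indication of how far the distribution departs from symmetry or normal tail weight for a sample of this size.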
According to the IELTS results, the B1 group included 12 students (31.6%), and the B2 and C1 groups each included 13 students (34.2%). The corresponding descriptive statistics are given in Table 1.

One-Way ANOVA Test
One-way analyses of variance were conducted to evaluate the null hypothesis that listening comprehension scores, measured with idea units and with LC units respectively, do not differ across the IELTS-based groups of L2 learners. The normality assumption for both measures was checked above.
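For readers who wish to recompute the tests from the open data, the classic one-way ANOVA F ratio can be sketched with the Python standard library alone. This is an illustrative reimplementation, not the SPSS procedure used in the study; `one_way_anova` is a hypothetical helper name, and the p-value, which needs the F distribution, is left to a statistics package (for example, `scipy.stats.f_oneway` returns both F and p).

```python
from statistics import fmean


def one_way_anova(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a classic one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


print(one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # (27.0, 2, 6)
```

With group sizes of 12, 13 and 13, the helper yields df = (2, 35), matching the degrees of freedom reported for both measures.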

Idea Units
The assumption of homogeneity of variances was tested and found tenable using Levene's test, F (2, 35) = 1.224, p = .306. The ANOVA was significant, F (2, 35) = 16.259, p < .001. Thus, there are grounds to reject the null hypothesis and conclude that idea-unit scores differ significantly across the IELTS groups. Post hoc comparisons (Table 2) to evaluate pairwise differences among group means were conducted using the Tukey HSD test, since equal variances were tenable. As can be seen, the tests revealed significant pairwise differences between the mean idea-unit scores of the B1 group and those of the B2 and C1 groups, p < .005. However, the B2 and C1 groups did not differ significantly from each other, p > .05.
© Shymko Vitalii
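Levene's test can be reproduced, assuming the mean-centred variant was used (SPSS's "Based on Mean" row), by running the same ANOVA on absolute deviations from each group mean. A minimal standard-library sketch follows; `levene_mean` is a hypothetical name. Note that `scipy.stats.levene` defaults to `center='median'` (the Brown-Forsythe variant), so `center='mean'` would be needed to match this version. The Tukey HSD comparisons are not reimplemented here.

```python
from statistics import fmean


def _anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Classic one-way ANOVA F ratio with its degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k


def levene_mean(groups: list[list[float]]) -> tuple[float, int, int]:
    """Mean-centred Levene statistic: an ANOVA on |x - group mean|."""
    deviations = [[abs(x - fmean(g)) for x in g] for g in groups]
    return _anova_f(deviations)


w, df1, df2 = levene_mean([[1, 2, 3], [2, 4, 6], [10, 11, 12]])
print(round(w, 4), df1, df2)  # 0.6667 2 6
```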

LC Units
The assumption of homogeneity of variances was violated according to Levene's test, F (2, 35) = 3.534, p = .040. We therefore used an alternative F statistic to determine significance (A. Field, 2013): Welch's F (2, 20.61) = 48.566, p < .001. The Games-Howell post hoc procedure was used, since the homogeneity of variance assumption was not met. The results (Table 3) show significant pairwise differences between the mean LC-unit scores of all three groups, p < .05.
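Welch's F and the Games-Howell pairwise statistic can be sketched in the same way. The code below is our own illustration of the standard formulas, with hypothetical function names: `welch_anova` returns Welch's F with its non-integer df2 (as in the reported F (2, 20.61)), and `games_howell_pair` returns, for one pair of groups, the studentized-range statistic q and its Welch-Satterthwaite df. Converting q into a p-value requires the studentized range distribution (available, for example, as `scipy.stats.studentized_range` in recent SciPy), which is omitted here.

```python
from math import sqrt
from statistics import fmean, variance


def welch_anova(groups: list[list[float]]) -> tuple[float, int, float]:
    """Return (F, df1, df2) of Welch's heteroscedastic one-way ANOVA.

    Each group needs at least two observations and non-zero variance.
    """
    k = len(groups)
    weights = [len(g) / variance(g) for g in groups]  # n_i / s_i^2
    total_w = sum(weights)
    means = [fmean(g) for g in groups]
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_w
    a = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (len(g) - 1) for w, g in zip(weights, groups))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    return a / b, k - 1, (k ** 2 - 1) / (3 * lam)


def games_howell_pair(x: list[float], y: list[float]) -> tuple[float, float]:
    """Studentized-range statistic q and Welch-Satterthwaite df for one pair."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    q = abs(fmean(x) - fmean(y)) / sqrt((vx + vy) / 2)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return q, df


f, df1, df2 = welch_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(f, 4), df1, round(df2, 4))  # 23.1429 2 4.0
```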

Discussion and Conclusions
Before discussing the results, we would like to note the most essential limitations of our study, namely the quantitative and qualitative characteristics of the sample. For more confident conclusions about the validity of LC units, it is necessary not only to increase the number of participants but also to diversify the sample in a balanced manner according to criteria such as gender, age, duration of L2 learning, and the frequency and volume of listening practice. It is also necessary to study the effectiveness of LC units on longer audio texts and in situations of real communication. It would also be informative to compare LC units not only with idea units but also with T-units and other instruments for evaluating L2 listening comprehension. Of particular interest is the validity of LC units for students from various ethnolinguistic groups studying various foreign languages.
Returning to the research questions and considering the limitations outlined above, we believe it reasonable to draw the following tentative conclusions. First, LC units can be used to measure and differentiate levels of L2 listening comprehension of short texts. Second, our study indicates a higher differential sensitivity of LC units compared to idea units in diagnosing L2 listening comprehension levels: LC units distinguished all three IELTS groups, whereas idea units did not distinguish the B2 group from the C1 group. In our opinion, another advantage of LC units is that they allow one to look differently at the semantic structure of the listening comprehension process. In turn, this might open new psycholinguistic, pedagogical and didactic perspectives in the field.
Author contribution statement. The author is the sole contributor to all parts of this paper.

ADHERENCE TO ETHICAL STANDARDS
Consent for publication. The author approves this submission and, conditional upon the editorial board's decision following peer review, consents to the publication of the current work. The work has not been submitted to other journals for consideration.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Table A1
Audio Notes: