Title: The Concise Encyclopedia of Applied Linguistics
Author: Carol A. Chapelle
Publisher: John Wiley & Sons Limited
Genre: Linguistics
isbn: 9781119147374
engagement with the interaction: displaying understanding of interlocutor talk;
turn organization: providing responses without excessive pausing.
A Rasch analysis showed that the test spread test takers out well and that the criteria functioned independently and were easy for raters to implement. Youn's study was a significant step forward as it was the first that clearly demonstrated the feasibility of assessing interactional competence.
Ikeda (2017) also investigated measurement of interactional competence but employed three role plays and three monologues with six rating criteria. Similar to Youn, he found a good spread of test takers and high inter‐rater reliability. There was significant overlap between scores on the monologic and dialogic tasks, raising the possibility of capturing a large amount of variance attributable to interactional competence with monologue tasks, which would greatly increase practicality.
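One common way to quantify this kind of overlap between two sets of scores is the squared Pearson correlation, which gives the proportion of variance the two measures share. The sketch below illustrates that calculation only. The source does not state which statistic Ikeda used, the function names are hypothetical, and the scores are invented for illustration rather than taken from any study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


def shared_variance(xs, ys):
    """Proportion of variance shared by two score sets (r squared)."""
    return pearson_r(xs, ys) ** 2


# Invented illustrative scores for five test takers (NOT data from any study).
monologue_scores = [2.0, 3.0, 3.5, 4.0, 5.0]
dialogue_scores = [2.5, 3.0, 3.0, 4.5, 5.0]

print(round(shared_variance(monologue_scores, dialogue_scores), 2))
```

A value close to 1 would suggest that the monologue tasks rank test takers much as the role plays do. It would not by itself show that the two task types measure the same construct.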
Focusing on another aspect of interaction, Galaczi (2014) described differences in topic management, listener contributions, and turn‐taking management between learners at different levels of the Common European Framework of Reference (Council of Europe, 2001). She found that these interactional abilities improved with increasing proficiency and argued for their greater inclusion in rating scales. It must be noted that a feature like “topic management” was more likely to figure prominently in Galaczi's data, which involved test taker dyads discussing a set topic, than in Youn's and Ikeda's work, where interactions were based around requests.
Two other interaction‐focused assessment studies have been conducted which did not situate themselves in an interactional competence framework. Grabowski (2009, 2013) employed role plays and rated test taker performance based on criteria derived from Purpura's (2004) model of communicative language ability. Timpe (2013) employed Skype‐delivered role plays as part of a larger testing battery of intercultural competence (Byram, 1997). She scored test taker performance on two large holistic criteria, discourse management and pragmatic competence.
Challenges in Testing L2 Pragmatics
Fundamentally, tests of L2 pragmatics have the same requirements and pose the same development challenges as other language tests. They must be standardized to allow comparisons between test takers, they must be reliable to ensure precise measurement, they must be practical so that they do not overtax resources, and, above all, they must allow defensible inferences to be drawn from scores that can inform real‐world decisions (Messick, 1989; Kane, 2006). Some of these requirements are particularly difficult to meet for tests of pragmatics, which probably accounts for their very limited uptake.
Most importantly, practicality is a serious challenge for testing pragmatics. While some instruments in the speech act tradition were designed to be administered online and to allow automatic scoring (Roever, 2005; Itomitsu, 2009; Roever et al., 2014), tests under the interactional competence construct by their very nature include interaction and therefore currently require time‐ and resource‐intensive involvement of a live interlocutor and scoring by raters. Work is underway to assess interaction through the use of intelligent agents backed by automatic speech recognition engines (Suendermann‐Oeft et al., 2017; Litman, Strik, & Lim, 2018), but this work is still in its infancy and requires nothing short of modeling language users' commonsense members' knowledge (Garfinkel, 1967), which is a daunting prospect. While other aspects of pragmatics, especially some pragmalinguistic abilities, are more easily measurable, it would be a case of serious construct underrepresentation to include only them and then argue that “pragmatics” as a whole is being measured. However, it would be much more feasible for tests that already include face‐to‐face speaking components, such as the International English Language Testing System (IELTS) or the American Council on the Teaching of Foreign Languages (ACTFL) oral proficiency interviews (OPI), to alter their tasks, procedures, and rating scales to measure interactional aspects of pragmatics.
The issue of practicality is further complicated by different types of interactional activities making different abilities visible. For example, two test takers discussing a set topic, as in Galaczi's (2014) study, will by necessity demonstrate their management of topical talk and allow conclusions as to relevant abilities, such as extending interlocutor contributions and managing topic changes. However, these abilities are much less transparent in role plays such as Youn's (2013, 2015), which are more suitable for making test takers' ability to do preference organization visible. This raises the specter that a test would need to involve several different interactional activities, compounding the practicality problem, though research will need to show whether conducting separate measurements of different interactional abilities is necessary.
However, even if the practicality issue can be resolved, measuring interactional aspects of pragmatic competence is not an easy endeavor. Two related challenges are the co‐constructed nature of interaction (Jacoby & Ochs, 1995) and the standardization of the test. While tests need to be standardized to allow comparison between test taker performances, this is chronically difficult for spoken interactions, which have their own dynamic (Heritage, 1984; Kasper, 2006) and can unfold in unpredictable ways. Youn (2013, 2015) was the only researcher who tried to address this problem, by providing both the interlocutor and the test taker with an outline of the conversation. This makes the interaction somewhat more predictable and allows better comparison between different test takers, but it arguably distorts the construct since real‐world interactions are not usually scripted.
A significant amount of research is still necessary to understand how generalizable specific instances of role play performances in testing situations are across all possible performances, and to what extent they can be extrapolated to real‐world performances (Kane, 2006; Chapelle, Enright, & Jamieson, 2010). Findings like Ikeda's (2017) about the large degree of overlap between dialogic role play performances and monologue tasks are promising, and so is Okada's (2010) argument that abilities elicited through role plays are also relevant in real‐world interaction (though see Ewald, 2012, and Stokoe, 2013, for differences between role plays and real‐world talk). Still, comprehensive measurement of a complex construct such as interactional competence is one of the big challenges facing testing of L2 pragmatics.
From a test design perspective, it is also important to know what makes items difficult so they can be targeted at test takers at different ability levels. This is a challenge for many pragmatics tests, which tend not to have sufficient numbers of difficult items, and it holds both for tests in the speech act tradition and for tests assessing interactional competence. For example, Roever et al.'s (2014) battery was overall easy for test takers, and so were Youn's (2013, 2015) and Ikeda's (2017) instruments. We know relatively little about what makes items or tasks difficult, though Roever (2004) put forward some suggestions for pragmalinguistically oriented tests. For measures of interactional competence, it might be worth trying interactional tasks that require orientation to conflicting social norms, for example, managing status‐incongruent talk as a student interacting with a professor under institutional expectations of initiative (Bardovi‐Harlig & Hartford, 1993), or in a workplace situation persuading one's boss to remove his son from one's project team (Ross, 2017). However, much more research is needed here as well.
A challenge specific to tests using sociopragmatic judgment is establishing a baseline. Put simply, testers need a reliable way to determine correct and incorrect test taker responses. The usual way to do so is to use a native‐speaker standard and this has been shown to work well for binary judgments of correct/incorrect, appropriate/inappropriate, and so on (Bardovi‐Harlig & Dörnyei, 1998; Schauer, 2006). However, native‐speaker benchmarking is much more problematic when it comes to preference judgments. For example, in Matsumura's (2001) benchmarking of his multiple‐choice items on the appropriateness of advice, there was not a single item where 70% of a native‐speaker benchmarking group (N = 71) agreed on the correct response, and only 2 items (out of a pretest and posttest total of 24) where more than 60% of native speakers agreed. On 10