Introduction to Corpus Linguistics. Sandrine Zufferey

Publisher: John Wiley & Sons Limited

Genre: Educational literature

ISBN: 9781119779704

between two variables, such as being stressed and producing more errors. Corpus studies do not make it possible to draw this type of conclusion. Second, while an experimental paradigm can be developed to test almost any kind of phenomenon, some rare linguistic phenomena may be absent from a corpus, or too sparsely represented in it, to be examined in this way. For example, if we want to determine through a corpus study whether learners have mastered French idioms such as “mettre le feu aux poudres” (to stir up a hornet’s nest) or “avoir un poil dans la main” (to be extremely lazy), we will have to look for them in a corpus of learners’ productions. It is quite possible that these expressions are never found there, but this does not necessarily mean that the learners do not know how to use them. It only means that they did not have an opportunity to produce them in the corpus. Using experimental methodology, we can test directly whether learners have mastered these expressions. For instance, we can ask them to read the expressions and then choose, from among several definitions, the one corresponding to their meaning. Finally, experimental linguistics makes it possible to study the linguistic competence of speakers through different language comprehension tasks, which can be more or less explicit or implicit, such as the conscious evaluation of sentences, their intuitive reading, etc. Corpora can only reflect the linguistic productions of speakers.
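
The idiom search described above can be illustrated with a minimal Python sketch. The three learner sentences are invented for illustration and do not come from any real corpus; the function name count_occurrences is likewise hypothetical. Because idioms are inflected in use ("a mis le feu aux poudres"), the sketch searches only for the invariant part of each expression, which is a simplifying assumption.

```python
import re

# Hypothetical toy learner corpus: invented sentences, for illustration only.
learner_corpus = [
    "Hier, je suis allé au marché avec ma sœur.",
    "Il a mis le feu aux poudres pendant la réunion.",
    "Nous avons mangé une pizza et regardé un film.",
]

# Invariant part of each idiom (the verb is left out because it is inflected).
idiom_cores = {
    "mettre le feu aux poudres": "le feu aux poudres",
    "avoir un poil dans la main": "un poil dans la main",
}


def count_occurrences(texts, phrase):
    """Count case-insensitive occurrences of a literal phrase in a list of texts."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


for idiom, core in idiom_cores.items():
    hits = count_occurrences(learner_corpus, core)
    print(f"{idiom}: {hits} occurrence(s)")
```

A count of zero for the second idiom only shows that it was not produced in this sample. It says nothing about whether the learners know the expression, which is the limitation the paragraph above points out.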

      As we will see in the following chapters, corpora represent linguistic samples of a very varied nature, and it is precisely this variety that makes it possible to answer diverse research questions in all fields of linguistics. In this last section, we will introduce a first classification of the types of existing corpora, in order to be able to refer back to it in the following chapters.

      The first distinction we can make among existing corpora is the one between sample corpora and monitor corpora. Sample corpora are those in which data have been collected once and for all, and which no longer evolve thereafter. For this reason, they are also known as closed corpora in the specialized literature. Monitor corpora, by contrast, remain open and continue to be enriched with new data over time. The advantage of sample corpora is that they have been designed to contain a set of texts representative of the language, or of the part of the language to be studied, with a balanced representation of the different text genres, for example. Thus, these corpora make it possible to draw conclusions which can be generalized. On the other hand, their main drawback is that they age quickly and do not follow changes in the language. Therefore, sample corpora need to be recollected at regular intervals.

      The second major distinction to be made among existing corpora differentiates general language corpora from specialized language corpora. General language corpora aim to offer a panorama of the whole of a language at a given time. It is evidently impossible to collect a sample of the whole language, but in the same way that a general language dictionary aims to describe the common lexicon of a language, the general corpus seeks to offer a global image, including the main textual genres found in the language. These corpora are particularly valuable when it comes to studying a language as a whole, but they cannot offer precise answers on linguistic phenomena specific to certain means of communication, such as mobile texting, social media, medical reports, etc.

      In order to study one of these areas specifically, it is preferable to resort to a specialized corpus. Indeed, there are corpora especially devoted to texting, social media, etc. In addition, general corpora include productions by adults who are native speakers of the language represented. Other corpora specialize in representing other population categories, such as monolingual children in the process of acquiring their mother tongue, bilingual children, foreign-language learners, or children with neuro-developmental disorders influencing language acquisition, such as autism and specific language impairment. Finally, by default, a general corpus includes examples of the variety considered as the standard of the language, or one of its main varieties. For French, this generally means the French of France and, more precisely, of the Paris region. For English, general corpora may represent British English or American English. Conversely, some corpora specialize in the productions of speakers of a particular language variety, such as the French of French-speaking Switzerland, Belgium, Canada, etc.

      Another distinction that can be made regarding the types of existing corpora relates to the type of processing carried out on the linguistic data of the corpus. On the one hand, raw corpora contain nothing but language samples. This is the case for the majority of French corpora. On the other hand, annotated corpora contain specific linguistic information in addition to the language samples. The most common type of annotation is the assignment of a grammatical category to each word in the corpus, as we have already mentioned. More rarely, certain corpora contain a syntactic analysis of all of their sentences, as well as other types of information, such as an annotation of the discourse relations (cause, condition, etc.) which connect the sentences within the texts of the corpus. Finally, certain spoken corpora, compiled with the aim of studying phonological phenomena, may be transcribed using the International Phonetic Alphabet.
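
The difference between a raw corpus and a corpus annotated with grammatical categories can be illustrated with a minimal Python sketch. The sentence, the hand-assigned tags and the simplified tagset (DET, NOUN, VERB, PREP) are assumptions made for illustration; real annotated corpora use their own tagsets and file formats, and the function names here are hypothetical.

```python
from collections import Counter

# Raw corpus sample: nothing but the language data itself.
raw_sample = "The cat sleeps on the mat"

# Annotated version of the same sample: each word is paired with a
# grammatical category. Tags are hand-assigned and the tagset is a
# simplified, hypothetical one.
annotated_sample = [
    ("The", "DET"),
    ("cat", "NOUN"),
    ("sleeps", "VERB"),
    ("on", "PREP"),
    ("the", "DET"),
    ("mat", "NOUN"),
]


def tag_frequencies(tagged_tokens):
    """Count how often each grammatical category occurs."""
    return Counter(tag for _, tag in tagged_tokens)


def words_with_tag(tagged_tokens, tag):
    """Return the words carrying a given grammatical category, in order."""
    return [word for word, word_tag in tagged_tokens if word_tag == tag]


print(tag_frequencies(annotated_sample))
print(words_with_tag(annotated_sample, "NOUN"))
```

Queries such as "all nouns" are only possible on the annotated version; on the raw sample, a search is limited to word forms.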

      Finally, many corpora are drawn from contemporary written or spoken data. However, there