Title: Introduction to Corpus Linguistics
Author: Sandrine Zufferey
Publisher: John Wiley & Sons Limited
Genre: Educational literature
ISBN: 9781119779704
The problem of manually tracking and counting occurrences is all the more acute because corpus linguistics often relies on large amounts of data drawn from more than a single book, with a view to observing many occurrences of a given linguistic phenomenon and thus grasping its specificities. For example, let us suppose that we wish to know whether Flaubert talks about love in his work. In this case, focusing solely on Madame Bovary would induce a bias, because this novel is not representative of the whole of his work. To answer this question, it is necessary to go through all of his novels, which makes the task even harder to perform manually. Let us now imagine that we want to know whether all French authors of the 19th Century deal with the question of love as much as Flaubert does. In this case, it would be impossible for us to look up every occurrence of terms related to love in all of the novels written by French authors in the 19th Century. To get around this problem, it would be necessary to collect a sample of texts that is representative of the works of this period. We will discuss this topic in Chapter 6, which is devoted to the methodological principles underlying the construction of a corpus. For the moment, the important point to bear in mind is that corpus linguistics often resorts to a quantitative methodology (see section 1.5) so as to be able to generalize the conclusions drawn from a linguistic sample to the language as a whole, or to a particular language register.
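The counting task described above is easy to automate once the texts are available in electronic form. The following Python code is a minimal sketch under stated assumptions: the two text snippets are invented, and the names `CORPUS`, `LOVE_TERMS` and `relative_frequency` are illustrative choices, not part of any existing tool or real corpus. It counts love-related terms in each text and normalizes the count by text length, so that texts of different sizes can be compared.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: invented snippets standing in for full novels.
CORPUS = {
    "text_a": "She dreamed of love, and love alone filled her days.",
    "text_b": "The merchant counted his money and thought of nothing else.",
}

# Illustrative search terms; a real study would have to justify this list.
LOVE_TERMS = {"love", "loved", "lover", "passion"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency(text, terms):
    """Return (raw count, occurrences per 1,000 tokens) for the given terms."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    raw = sum(counts[term] for term in terms)
    per_thousand = 1000 * raw / len(tokens) if tokens else 0.0
    return raw, per_thousand


# One (raw, normalized) pair per text in the sample.
results = {
    title: relative_frequency(text, LOVE_TERMS)
    for title, text in CORPUS.items()
}
```

Normalizing per 1,000 tokens is only one possible convention; the point is that a raw count alone says little when the texts being compared differ in length.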
As we will see in the following chapters, corpus linguistics may be of use in all areas of linguistics, both fundamental (see Chapter 2) and applied (see Chapter 3). For example, it is crucial in lexicography, since it makes it possible to draw up an exhaustive inventory of a language’s lexicon. It also makes it easy to find examples of use in different types of sources (literary, journalistic and others), while bringing to light the expressions in which a word is frequently used. In other words, it makes it possible to establish phraseological information that is very useful for dictionaries. For example, it is useful to know what the word “knowledge” means, but it is just as important to know that this word is frequently used in phrases such as “acquire knowledge” or “have good knowledge of”. Corpus linguistics is a particularly effective method for establishing the frequent contexts in which a word or an expression is used. Corpus linguistics is also used for research in fundamental areas of linguistics such as syntax, since it makes it possible to identify the types of syntactic structures used in different languages. For example, a corpus study can determine in which textual genres the passive voice is most commonly used. Finally, thanks to the existence of corpora of oral data, corpus linguistics also makes it possible to answer questions related to phonology and sociolinguistics. For instance, it makes it possible to establish the geographical distribution of certain pronunciation traits, such as the distinction between the short /a/ in the French word “patte” (paw) and the long /ɑ/ in the word “pâte” (pastry). Answering these different questions requires the use of different types of corpora, as well as information about their contents.
For example, in order to determine the geographical area of diffusion of a certain pronunciation trait, it is necessary to know where each speaker having contributed to the corpus came from. This type of information is called corpus metadata. We will review the main types of existing corpora at the end of this chapter, and discuss the issue of metadata in Chapter 6.
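To make the role of metadata concrete, here is a minimal Python sketch. Everything in it is assumed for illustration: the `Recording` class, its field names, the region labels and the four records are invented and do not come from any real corpus. It shows how a speaker's region of origin, stored as metadata alongside each recording, allows a pronunciation trait to be summarized by region.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Recording:
    """One hypothetical corpus entry: an annotation plus its metadata."""
    speaker_id: str
    region: str              # metadata: where the speaker comes from
    distinguishes_a: bool    # annotation: /a/ and /ɑ/ pronounced differently?


# Invented records, for illustration only.
RECORDINGS = [
    Recording("s1", "region_north", True),
    Recording("s2", "region_north", True),
    Recording("s3", "region_south", False),
    Recording("s4", "region_south", True),
]


def share_by_region(recordings):
    """Return, for each region, the share of speakers showing the trait."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for rec in recordings:
        totals[rec.region] += 1
        if rec.distinguishes_a:
            positives[rec.region] += 1
    return {region: positives[region] / totals[region] for region in totals}


shares = share_by_region(RECORDINGS)
```

Without the `region` field, the same annotations could still be counted, but they could not be mapped onto a geographical area.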
To sum up, in this section, we have defined corpus linguistics as an empirical discipline, which observes and analyzes quantitative language samples gathered in a computerized format. In the following sections, we will discuss in depth the different central points of the definition, indicated in bold, in order to better understand the theoretical and methodological anchoring of corpus linguistics.
1.2. Empiricism versus rationalism in linguistics
Corpus linguistics is an empirical discipline, which means that it uses data produced by speakers in order to study language. This methodology is opposed to the rationalist method, which seeks answers in one’s own linguistic knowledge rather than in external data. Let us take an example. In order to determine whether the sentence “When do you think he will prepare which cake?” is grammatically correct or not, an empirical approach would search large corpora to find out whether this syntactic structure is used by English speakers.
If sentences with this syntactic structure never or almost never appear in the corpus, linguists might conclude that the structure is rarely used in English. A rationalist methodology, by contrast, would address the same question by relying on the intuitions of linguists. In this particular case, they would ask themselves whether they could produce such a sentence and whether it seems correct or incorrect given their knowledge of the language, and would infer a grammaticality judgment from this. Grammaticality judgments are often classified into three types: correct, incorrect, or marked when a sentence seems possible but sounds unnatural.
This example illustrates a fundamental difference between empirical and rationalist methodology. While the rationalist methodology leads to the formulation of categorical judgments, the empirical methodology provides a more refined answer to such questions, since the observation of corpus data offers a precise indication of frequency, rather than a result in terms of absence or presence. This is one of the reasons why many linguists currently consider that the empirical methodology corresponds better to a scientific approach (in the sense of confrontation with the facts) than a purely rationalist method for studying language.
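The contrast between a categorical answer and a frequency-based one can be illustrated with a short Python sketch. It rests on loud assumptions: the four sentences are invented, the names `SENTENCES`, `PASSIVE_PATTERN` and `describe_structure` are illustrative, and the regular expression is a deliberately crude stand-in for real syntactic annotation (it only catches passives of the form "was/were + one word + by").

```python
import re

# Invented sample sentences, for illustration only.
SENTENCES = [
    "The cake was eaten by the children.",
    "The children ate the cake.",
    "The letter was written by her aunt.",
    "He will prepare the cake tomorrow.",
]

# Crude approximation of an agentive passive; not a real parser.
PASSIVE_PATTERN = re.compile(r"\b(?:was|were)\s+\w+\s+by\b", re.IGNORECASE)


def describe_structure(sentences, pattern):
    """Summarize a pattern both categorically and quantitatively."""
    hits = sum(1 for sentence in sentences if pattern.search(sentence))
    share = hits / len(sentences) if sentences else 0.0
    return {
        "attested": hits > 0,  # presence or absence only
        "hits": hits,          # raw number of matching sentences
        "share": share,        # proportion of sentences that match
    }


summary = describe_structure(SENTENCES, PASSIVE_PATTERN)
```

The `attested` field corresponds to a yes-or-no answer, whereas `hits` and `share` carry the graded information that only corpus observation provides.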
The choice between empirical and rationalist methods is not, however, limited to the field of linguistics. Certain scientific branches such as physics and chemistry, as well as sociology and history, are essentially empirical disciplines. Both physicists and historians base their insights on external data, which they collect in the world, in order to build a theory, test it and draw conclusions from it. On the other hand, disciplines such as mathematics or philosophy are traditionally based on a rationalist approach, since mathematicians and philosophers use their own reasoning to build theories and to draw conclusions, rather than relying on the collection and observation of external data. Philosophers often resort to thought experiments, but these are not experiments in the empirical sense of the term, because they are based on the reflective abilities of researchers.
1.3. Chomsky’s arguments against empiricism in linguistics
Although corpus linguistics has experienced strong growth over the past 20 years, the empirical grounding of linguistics is not new. Linguists have long used observational data. In the 19th Century, for example, linguists worked on the comparison of Indo-European languages in an attempt to reconstruct their common origin. This research was based on existing data about the languages spoken in Europe such as German, French and English. Similarly, in the first half of the 20th Century in the United States, the so-called distributionist approach to syntax focused on the study of sentence formation in syntactic structures as they appeared in text corpora, and from there, tried to infer the general functioning of language. Around the late 1950s, the use of corpora in linguistics was almost completely interrupted in certain fields such as syntax, following the works of the American linguist Noam Chomsky. Chomsky defended a strictly rationalist methodological approach to linguistics, and fiercely opposed any use of external data. He raised numerous objections against the use of external data in linguistics. We will briefly review them, to show in what ways most of them have lost their raison d’être in the context of current research.