Sunday, June 17, 2012

Corpus Linguistics Introduction: COCA and The Babel English-Chinese Parallel Concordancer

Corpus linguistics uses computer software (concordancers) to look at very large samples of real language. These samples are called corpora (singular = corpus). A corpus is a collection of texts. Some corpora contain only one genre, such as spoken English, newspaper English, or scientific English. Other corpora try to include samples from many different types of language use. Some corpora are bilingual: they show two languages side by side.
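
To give a rough idea of what a concordancer does, here is a minimal sketch in Python. It is not how COCA or the Babel concordancer is implemented; the function name `kwic`, the `width` parameter, and the sample text are all made up for illustration. It finds every occurrence of a search word and prints it with some context on either side (often called "keyword in context").

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line for every match of keyword in text."""
    lines = []
    # \b marks a word boundary, so "speak" does not match inside "speaks".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

# Hypothetical sample text; a real corpus contains millions of words.
sample_text = (
    "She likes to speak slowly. He speaks quickly. "
    "They speak every day, and nobody speaks louder than he does."
)

for line in kwic(sample_text, "speak"):
    print(line)
```

Searching for "speak" here finds two matches; the two occurrences of "speaks" are skipped because the search looks for that exact word form only.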

(Image: a balanced corpus of American English)

A balanced corpus contains texts from many different genres. A good example of a balanced corpus is COCA, the Corpus of Contemporary American English.



The Babel English-Chinese Parallel Concordancer gives access to a bilingual corpus. More Chinese corpora are available here


Word vs. Lemma

Computers are very fast, but also very stupid. For people, "speak" and "speaks" feel like the same word. For a computer, however, a word is a group of letters with a space on either side, so "speak" and "speaks" are different "words" (different groups of letters). In corpus linguistics, we often talk about lemmas. A lemma represents all of the different inflectional forms of a word: the tenses of a verb, the singular and plural of a noun, the comparative and superlative of an adjective, etc. We can use a pipe symbol to separate the different forms in a search.
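
The short Python sketch below shows this "stupid" behaviour. It splits a made-up sentence on spaces, which is an assumption about how a very simple program might count words, and it shows that each form of "speak" gets its own separate count.

```python
from collections import Counter

# Hypothetical example sentence, not taken from any real corpus.
sentence = "I speak English and she speaks English but he spoke French"

# Split on spaces: every different group of letters is a different "word".
tokens = sentence.lower().split()
counts = Counter(tokens)

# The three forms of the same lemma are counted as three unrelated words.
for form in ["speak", "speaks", "spoke"]:
    print(form, counts[form])
```

Each of the three forms is counted once, and nothing tells the computer that they belong together.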

"|", the pipe symbol, is on the right-hand side of the Q row (above the Enter key) on a standard US keyboard. Hold Shift to type it: QWERTYUIOP{}|

Nouns: box|boxes; knife|knives
     man: man|men
     (NOTICE: manly, mailman, manned are not part of the lemma. These are different words)

Verbs: speak|speaks|speaking|spoke|spoken
     write: write|writes|writing|wrote|written
     (NOTICE: writer, underwrite, writ are not part of the lemma. These are different words)

Adjectives: big|bigger|biggest
     bad: bad|worse|worst
     (NOTICE: worsen and worsening are not part of the lemma. These are different words)
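
The pipe searches above can be imitated with a few lines of Python. This is only a sketch under assumptions: the helper names `lemma_pattern` and `count_lemma` and the sample text are invented here, and a real concordancer may handle pipe queries differently. The idea is that each form between the pipes is treated as a whole word, so related but different words are not counted.

```python
import re

def lemma_pattern(query):
    """Turn a pipe-separated query such as 'man|men' into a whole-word pattern."""
    forms = [re.escape(f.strip()) for f in query.split("|") if f.strip()]
    # \b on both sides means each form must match a complete word.
    return re.compile(r"\b(?:" + "|".join(forms) + r")\b", re.IGNORECASE)

def count_lemma(text, query):
    """Count how many times any form listed in the query appears in text."""
    return len(lemma_pattern(query).findall(text))

# Hypothetical sample text.
text = "The man met two men. The mailman was manly. Men talk."

print(count_lemma(text, "man|men"))
```

This search counts "man", "men" and "Men", but not "mailman" or "manly", which matches the NOTICE lines above: those are different words, not forms of the lemma.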