A balanced corpus of American English
A balanced corpus contains texts from many different genres. A good example of a balanced corpus is COCA, the Corpus of Contemporary American English.
The Babel English-Chinese Parallel Concordancer is a bilingual corpus. More Chinese corpora are available here.
Word vs. Lemma
Computers are very fast, but also very stupid. For people, "speak" and "speaks" feel like the same word. For a computer, however, a word is just a group of letters with a space on either side, so "speak" and "speaks" are different "words" (different groups of letters). In corpus linguistics, we often talk about lemmas. A lemma represents all of the different inflectional forms of a word: tense, singular/plural, comparative, etc. We can use a pipe symbol to separate the different forms in a search.
"|", the pipe symbol, is on the right-hand side of the Q row (above the Enter key) on a keyboard: QWERTYUIOP{}|
Nouns: box|boxes; knife|knives
man: man|men
(NOTICE: manly, mailman, manned are not part of the lemma. These are different words)
Verbs: speak|speaks|speaking|spoke|spoken
write: write|writes|writing|wrote|written
(NOTICE: writer, underwrite, writ are not part of the lemma. These are different words)
Adjectives: big|bigger|biggest
bad: bad|worse|worst
(NOTICE: worsen and worsening are not part of the lemma. They are different words)
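The difference between counting a word and counting a lemma can be sketched in a few lines of Python. This is only an illustration, not how COCA or any other corpus tool works internally: the sample sentence and the helper name lemma_count are made up for this example, and "a word" is simplified to any run of letters.

```python
import re
from collections import Counter

# Made-up sample text (an assumption for illustration, not real corpus data).
text = "She speaks French. They speak Spanish. He spoke to the speaker while speaking."

# The computer's view: a "word" is just a run of letters.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

def lemma_count(pipe_query, counts):
    """Add up the counts of every form listed in a pipe-separated query.

    Hypothetical helper: it only matches the exact forms you list,
    so "speaker" is not counted under speak|speaks|speaking|spoke|spoken.
    """
    forms = pipe_query.split("|")
    return sum(counts[form] for form in forms)

# Searching for the single word "speak" finds only that exact form.
print(counts["speak"])

# Searching with the pipe symbol finds every listed form of the lemma.
print(lemma_count("speak|speaks|speaking|spoke|spoken", counts))
```

With this sample sentence, the first search finds 1 hit and the second finds 4. "speaker" is skipped because it is a different word, not a form of the lemma.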