intouea.com

Corpora

Introduction

A corpus is a large collection of texts which is used for studying a language.

For example, Google has scanned English books published between 1500 and 2019. The data from these books can tell us how common certain words and phrases are now, and how common they were in the past.

You can see this for yourself at Google Ngram Viewer. Perhaps you are not sure if the usual plural of corpus is corpora or corpuses. Enter corpora,corpuses in the search box and you will see which word is more common:

Google Ngram Viewer graph showing ‘corpora’ as far more frequent than ‘corpuses’

Corpora can also tell us which words often go together (collocations). For example, which of these four adjectives is best to use with problem: big, large, major or serious? A comparison suggests major or serious. Big seems acceptable, but large is so rare that it is best avoided:

Google Ngram Viewer graph showing adjectives with ‘problem’

Another way is to ask for the most common adjectives used with problem:

Google Ngram Viewer graph showing the most common adjectives with ‘problem’

See the sections on Google Ngram Viewer, SKELL and English-Corpora.org for instructions on using these online tools.

Video

Exploring corpora with Google Ngram Viewer, SKELL and English-Corpora.org

Background music is ‘Effects of Elevation’ from the album Effects of Elevation by Revolution Void, licensed under an Attribution Licence.

Google Ngram Viewer

Google Ngram Viewer is free to use, with no registration or login. There is a detailed help page.

To compare the frequency of words or phrases, enter them in the search box, separated by commas.
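
If you want to open a comparison from a script rather than typing it in, the sketch below builds a Google Ngram Viewer address from a list of words or phrases. The parameter names (content, year_start, year_end, smoothing) are assumptions based on what the site shows in the browser address bar, not a documented interface, and the function name ngram_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumed address of the graph page, as seen in the browser address bar.
NGRAM_BASE = "https://books.google.com/ngrams/graph"


def ngram_url(phrases, year_start=1500, year_end=2019, smoothing=3):
    """Build a Google Ngram Viewer address for a list of words or phrases.

    The parameter names are assumptions, not a documented API.
    """
    query = {
        "content": ",".join(phrases),  # comma-separated, as in the search box
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": smoothing,
    }
    return NGRAM_BASE + "?" + urlencode(query)


# Hypothetical usage: compare two words from the year 2000 onwards.
url = ngram_url(["undeniable", "indubitable"], year_start=2000)
print(url)
```

Paste the printed address into a browser to see the graph. If the site changes its address format, typing the words into the search box still works.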

Comparing words

Which of these words is more common: undeniable or indubitable? (We have set the years to 2000-2019.)

Google Ngram Viewer graph comparing ‘undeniable’ with ‘indubitable’

Clearly undeniable is more common.

Comparing phrases

Which of these phrases is most common: Third World, developing country, least developed country, Global South?

Google Ngram Viewer graph comparing 4 phrases

Third World is still the most common, though less so than before. Global South has overtaken developing country.

Comparing collocations: 1

Which of these three adjectives is most common with outcome: good, great or positive?

Google Ngram Viewer graph showing adjectives with ‘outcome’

Most common is positive outcome, followed by good outcome. Great outcome is rare.

Comparing collocations: 2

We are not sure whether bored with or bored of is correct, so we compare the two:

Google Ngram Viewer graph comparing ‘bored with’ and ‘bored of’

Bored with is still the winner, but bored of is catching up.

Finding the most common collocations

To find the top ten collocations, use * as a wildcard. For example, which words commonly follow positive? Enter positive * in the search box:

Google Ngram Viewer graph showing words following ‘positive’

We see from this that the most common words after positive are ‘and’ and ‘or’.
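
To see why small grammar words come out on top, it can help to count following words yourself. The sketch below does this for a tiny made-up sample text, which is for illustration only and is not real corpus data. It is not how Google Ngram Viewer works internally, just the same idea on a small scale.

```python
from collections import Counter
import re

# A tiny made-up sample, for illustration only (not real corpus data).
SAMPLE = (
    "The test was positive and the team was pleased. "
    "Results may be positive or negative. "
    "She had a positive attitude and a positive outlook. "
    "Feedback was positive and detailed."
)


def following_words(text, target, top=10):
    """Count the words that come directly after `target` in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = zip(words, words[1:])  # each word paired with the word after it
    counts = Counter(nxt for word, nxt in pairs if word == target)
    return counts.most_common(top)


top_after_positive = following_words(SAMPLE, "positive")
print(top_after_positive)
```

Even in this small sample, ‘and’ follows positive more often than any noun does, which is why it is useful to restrict the search to one part of speech.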

Restricting collocations to nouns, verbs, adjectives, etc.

Let’s try to limit the words after positive to nouns. We do this by adding a part-of-speech tag to the wildcard (in this case _NOUN), so the search becomes positive *_NOUN:

Google Ngram Viewer graph showing nouns following ‘positive’

For more information on part-of-speech tags, see the help page.
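
The idea behind a part-of-speech filter can be sketched in a few lines. The example below uses a short hand-tagged word list, made up for illustration, and keeps only the nouns that follow positive. The tag names imitate the style of the Ngram Viewer tags, but the data and the function name are hypothetical.

```python
from collections import Counter

# Hand-tagged toy data: (word, part-of-speech tag). Made up for illustration.
TAGGED = [
    ("positive", "ADJ"), ("and", "CONJ"), ("negative", "ADJ"),
    ("positive", "ADJ"), ("attitude", "NOUN"),
    ("positive", "ADJ"), ("effect", "NOUN"),
    ("positive", "ADJ"), ("or", "CONJ"), ("neutral", "ADJ"),
    ("positive", "ADJ"), ("effect", "NOUN"),
]


def following_with_tag(tagged, target, tag):
    """Count words directly after `target`, keeping only those tagged `tag`."""
    counts = Counter()
    for (word, _), (nxt, nxt_tag) in zip(tagged, tagged[1:]):
        if word == target and nxt_tag == tag:
            counts[nxt] += 1
    return counts.most_common()


nouns_after_positive = following_with_tag(TAGGED, "positive", "NOUN")
print(nouns_after_positive)
```

Changing "NOUN" to "CONJ" in the last call would bring back ‘and’ and ‘or’, the words the filter was added to exclude.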

Finding examples

Under the graph you may see the heading Search in Google Books and some date ranges. Click on one to see examples of your word in sentences and book titles. (You will probably find better examples through SKELL or English-Corpora.org.)

Google Ngram Viewer: Search in Google Books

SKELL

SKELL is free to use, with no registration or login.

Enter a word or phrase in the search box.

Examples

Enter evaluate in the search box and under the Examples tab you will see sentences that contain evaluate, evaluates, evaluated or evaluating.

SKELL showing examples of sentences that contain ‘evaluate’

Word sketch

Switch to the Word sketch tab and you will see examples of collocations under various headings, such as:

  • subject of evaluate
  • object of evaluate
  • adjectives with evaluate

SKELL showing collocations of ‘evaluate’

Similar words

Switch to the Similar words tab and you will see words with similar meanings to evaluate.

SKELL showing words similar to ‘evaluate’

‘The word cloud shows how similar each word is. The words in the centre are the most similar. The size indicates how frequent the word is.’

English-Corpora.org

English-Corpora.org is free to use with a UEA login. You will need to register.

The website has several corpora. Probably the most useful for academic English is the Corpus of Contemporary American English (COCA).

Click on Word and enter your word in the search box:

COCA search box

You will then see detailed information on this word:

COCA: detailed information on the word ‘community’

Note: Topics are words found in the same texts as your word. For example, if you search for vampire, topics include demon, creepy, corpse, curse, werewolf, zombie and twilight.