Using Corpora

Tuesday, August 2, 2011


How to

How corpora are used

Benefits to us

What can we use it for?

Getting students involved

Getting teachers involved

Making your own corpus

Some drawbacks

By Jim Carroll and Dale Coulter


A corpus is any collection of texts (written or spoken) gathered from natural sources. In the old days people used to collect and analyse these texts manually (indeed, Samuel Johnson used his own corpora to help prepare his dictionary), but mercifully these days they can be stored and analysed electronically. Probably the best-known corpus is COBUILD, which started in the 1980s and now has over 400 million words. A corpus can then be examined to spot linguistic trends or patterns.



How to

Below we’ve prepared a downloadable step-by-step guide to signing up for the BNC and making your first search.

How to use the corpus



How corpora are used

Well, you might have been using corpora more than you think. Or, to put it more accurately, the materials you’ve been using have almost certainly been based on corpus analysis:

  • Course books
  • Reference books
  • Teacher and learner grammars
  • Learner dictionaries

Longman have a corpus of learner English that tracks the most common errors made by students of English. These are then analysed so they can target the most commonplace mistakes that learners at certain levels might make.

Benefits to us

Fair enough, but that doesn’t mean I need to use it…

“I’ve had students in the past with a low tolerance to ambiguity…”

Nothing was more certain to make them roll their eyes and look at me suspiciously than when I tried to assure them that ‘these two words go together, it’s a collocation’.

“There’s no reason for it, it just is!”

This doesn’t sit well with a certain type of student who likes to compartmentalise language learning into rules and maxims. But the corpus can act as validation. Next time you’re trying to assure a student about a certain preposition, whack it into the corpus on the IWB and you have instant real-life evidence that, in the real world, native English speakers ALWAYS say, for example, ‘married to’, not ‘married with’.


  • It can really help you with your hunches and intuitions.

I was sure that people say ‘talk to’ much more often than ‘talk with’. After a student asked me in class, I whacked it into the corpus and was able to work out that ‘talk to’ refers to the action, whereas ‘talk with’ emphasises long conversations. Not only that, but I had 50 ready-made authentic examples to share with the class.

  • You can also tailor the language introduced to the students.

Teaching a class who are going to use the language predominantly in an international context as a lingua franca? It will soon be possible to use a non-native speaker corpus, so that you are teaching language specifically applicable to conversations between non-native speakers.

  • It’s all authentic

The texts are, by definition, authentic examples of language used in natural contexts. What is more, vocabulary acquisition can be aided by meeting a lexical item in a number of different contexts. This can be motivating for students who see approximating native speakers as a main goal in language acquisition.

What can we use it for?

  • Word frequency

Are ‘polemics’ more common than ‘controversies’?

  • Collocations

Why is a tree tall and not high?

  • Common chunks of language
  • Commonly used lexical items with functions in spoken and written discourse

Shifting topic in conversation: “oh guess what!”

Written and spoken lexical phrases

  • Idioms


  • Syntactic restrictions

Are there any dependent prepositions?

Dependent prepositions

  • Semantic/stylistic restrictions

Is this adjective only ever used when talking about people, or maybe only in a formal register?

  • Prosody of a word

What sort of ‘environment’ does a word usually appear in? For example, the phrasal verb ‘set in’ is used almost exclusively with negative collocates: bad weather, panic, disease. On the other hand, the verb ‘provide’ is used almost exclusively in positive situations, e.g. care, food, help, jobs, relief and support.


  • The grammar of words

Does this word have a particular preference for a grammar structure?

Word grammar

  • Words with different meanings

For example, take the word ‘bow’. You can analyse all the different meanings of a word, and which meaning is most prevalent in everyday use.

  • Written language

There are big differences between the grammars of speech and writing, such as the use of long noun phrases in written language. Looking at these is especially helpful for IELTS or EAP classes.

  • Spoken language

For spoken language, corpora are really good for looking at particular facets of speech, especially discourse markers, vague language, ellipsis and hedges.
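If you are curious about what the corpus is doing behind the scenes when it answers questions like the ones above, here is a rough sketch in Python. It is not how the BNC site itself works, just an illustration of the idea: it builds keyword-in-context lines (a concordance) for one word and counts which word follows it. The sample sentences are made up for the example, and the function names are my own.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return keyword-in-context lines: `width` words either side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def right_collocates(tokens, node):
    """Count the word that immediately follows each occurrence of `node`."""
    return Counter(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == node
    )


# Invented sample text; a real corpus would hold millions of words.
sample = (
    "She is married to a doctor. He got married to his neighbour last year. "
    "They were married in June. I talked to her and she is happily married to him."
)
tokens = tokenize(sample)

for line in concordance(tokens, "married"):
    print(line)

print(right_collocates(tokens, "married").most_common(2))
```

Even on four sentences you can see ‘married to’ outnumbering everything else, which is the same kind of evidence the full corpus gives you on a much bigger scale.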



Getting students involved

“The goal is to excite and empower the student to become an investigator of language in their own right, independently and in their own time.”


Just as you might encourage students to use their dictionary effectively with exercises, you can also explain how to use a corpus effectively.

  • The how-to activity can be used with students as well and could be graded for lower levels.
  • Organise a session in the SAC to introduce the students to corpora.
  • Organise a Task Based Learning session to show them what is possible.

TBL Corpora

Some ideas:

1. If there’s a word a student is struggling with, they can check the definition in the dictionary, and then find a ready-made authentic example in the corpus, as well as seeing any collocations that might occur.

2. Why not make an error-correction session into a corpus lesson? Learners are given errors from a previous class or written homework and are encouraged to use corpus information to correct them. This activity would work particularly well with collocation errors.



Getting teachers involved


Maybe you want to find out what kind of words collocate with ‘problem’, or you want to know why a celebration doesn’t ‘set in’. It’s a great place to find these answers.

  • Great for brushing up on your language awareness
  • Really good to consult while correcting students’ homework
  • Good for checking American idioms that students bring to class, which can be looked up in the COCA corpus on the BNC website.


We could pool our resources of lexical gurus and grammar-savvy teachers in the staff room to create some language-awareness resources. Below is one I wrote on direct and indirect objects with certain verbs. I used ideas found in Explaining English Grammar, by G. Yule (1998).

Direct and indirect objects


Have you ever heard of a dictionary race? You give students a word and they battle against the clock to find the meaning/collocations of the word. We could do the same with a corpus race!


Making your own corpus


Yes, Andy has some great ideas about using TextSTAT, a DIY corpus tool. A spoken corpus might be a bit time-consuming. For a written corpus, commercially available software like WordSmith Tools and MonoConc Pro exists for exactly this purpose.

Well, that’s all fine and dandy, but what’s in it for me if I make my own corpus?

Perhaps you are teaching ESP, or a lesson on a certain genre, style or convention, and you want to know which phraseologies to teach, or which vocabulary is particularly prevalent in that style of writing. Making your own corpus allows you to do just that.
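For anyone who fancies trying this without special software, here is a minimal sketch in Python of how you might pull recurring phrases out of a small home-made corpus. It simply counts every run of three consecutive words. The three ‘business email’ texts are invented stand-ins for whatever genre you teach, and the function names are my own rather than anything from TextSTAT or the other tools mentioned.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(texts, n=2):
    """Count every run of n consecutive words across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented stand-ins for texts from the genre you teach (here, business emails).
genre_texts = [
    "Please find attached the report. I look forward to hearing from you.",
    "Please find attached my invoice. I look forward to your reply.",
    "Thank you for your email. I look forward to hearing from you soon.",
]

trigrams = ngram_counts(genre_texts, n=3)
for phrase, freq in trigrams.most_common(5):
    print(freq, phrase)
```

Because punctuation is thrown away, some of the counted runs will straddle sentence boundaries, so treat the list as a set of candidates to look through rather than a finished phrase list.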

Or maybe you want to check the vocabulary acquisition of your class. If you asked them to submit all their homework electronically and entered the data into the corpus over the month, you would actually be able to track how many of the students have been using the target language you’ve been discussing in class, and how regularly!
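As a rough idea of what that tracking could look like, here is a small Python sketch. It assumes you have each student’s homework as plain text, and it only matches exact single-word forms (so ‘reluctantly’ would not count as ‘reluctant’). The student labels, homework texts and target words are all invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def track_target_words(submissions, target_words):
    """For each target word, return (number of students using it, total uses).

    `submissions` maps a student name to a list of their homework texts.
    """
    targets = {w.lower() for w in target_words}
    students_using = Counter()
    total_uses = Counter()
    for student, texts in submissions.items():
        counts = Counter(
            tok for text in texts for tok in tokenize(text) if tok in targets
        )
        for word, n in counts.items():
            students_using[word] += 1
            total_uses[word] += n
    return {w: (students_using[w], total_uses[w]) for w in sorted(targets)}


# Invented homework, one list of texts per student.
submissions = {
    "Student A": [
        "I was reluctant to go, but the view was stunning.",
        "A stunning goal!",
    ],
    "Student B": ["She seemed reluctant at first."],
    "Student C": ["We had a nice day."],
}

report = track_target_words(submissions, ["reluctant", "stunning", "thrilled"])
for word, (students, uses) in report.items():
    print(f"{word}: used by {students} student(s), {uses} time(s) in total")
```

A word that nobody has used shows up with zeros, which is probably the most useful line in the report when you are deciding what to recycle in class.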



Some drawbacks

I’ve read a story about a non-native English speaker who was trying to work out which nouns collocated with the verb ‘thaw’ and was completely baffled when the top 20 results he got back were about some bloke called ‘John Thaw’. Because they are real texts, some of the examples might require a lot of cultural background knowledge. But as with any authentic material, as a teacher you need to mediate the text or grade the task so that it is palatable for a student and not too culturally opaque.

The other thing to be wary of, of course, is becoming a slave to frequency. Just because a word is relatively infrequent doesn’t mean it’s not useful. For example, a person who asks for salt for their meal will need to know ‘salt’ to convey their meaning and intent far more than the more frequent ‘could’, ‘you’, ‘pass’, ‘me’, ‘the’.

Ultimately, it’s a resource to be tapped into, to help validate or deny the hunches you have from the huge corpus you have in your own brain!

Also, the free BNC corpus is quickly becoming outdated. Because the collection only runs until 1993, many of the new words included in modern dictionaries are not part of the corpus. Membership of a more up-to-date corpus would be of great benefit to teachers and students alike.


  1. Justin Vollmer says:

    Hi Jim
    Love the work you have done on the corpora. Here’s something I wondered if Andy or you had done or considered: if you put all your students’ work into a class corpus without correction, could you work out the most common mistakes by looking at various words?

    1. Jim Carroll says:

      Hi Justin,

      I think it would be feasible to measure the word frequencies, and by looking at those lists you could check spelling errors, concordances and so on to find the most common mistakes. I’m not sure how you might go about specifically searching for errors though, although Andy might have an idea.

  2. Melissa says:

    Want to see how to do a search? Check out this Jing that Jo made….
