Making Friends with Corpora

We teach what is in the course book, or we teach what we know. We draw upon our own highly personal experience and assume that we are teaching English as it is used every day. What if it isn’t? What if we are teaching grammar structures or lexical items which are not actually used in the way we tell students they are used? What if what we are teaching doesn’t accurately reflect the English of the ‘real world’?

I recently wrote about ‘The Latest Trends in ELT’ and ‘The Future of English-language Teaching and Learning’ and, among the hyperbolic and prophetic statements, I mentioned the use of language corpora in aiding English-language teaching.

The problem, I realised, is that very few of us really understand the possible impact language corpora and concordancers might have on ELT and, additionally, most of us are unable to use them!

A Corpus

A language corpus (plural: corpora) is a collection of samples of real-world language. Corpora usually contain millions of examples and may be drawn from spoken or written sources, or a combination of both. The corpus serves as the basis for analysing how language is used.

A Concordancer

A concordancer is a computer program, often one that can be accessed online. It is used in conjunction with a language corpus to analyse the language examples. Essentially, the corpus is the data (language samples) and the concordancer is the tool used to analyse that data.
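For anyone curious about what a concordancer does under the hood, here is a minimal sketch in Python. It is an illustration only: the function name `concordance`, the `width` parameter, and the sample sentences are all invented for this post, and a real concordancer working on a real corpus does far more than this.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    # Very rough tokenisation: lowercase words and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sentences standing in for a real corpus (assumption for illustration).
sample = "I will make a deal with you. Did they make a decision? We make time for tea."
for line in concordance(sample, "make"):
    print(line)
```

Each printed line shows the search word in brackets with its neighbours on either side, which is roughly the view an online concordancer gives you.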

International House Journal has a good article here on how teachers can use corpora and concordancers, and the British Council has a good post here on concordancers.

Corpus Linguistics

Khojasteh and Shokrpour (2014) wrote about the wider implications for teaching and learning from Corpus Linguistics, the academic field that uses language corpora for research.

The literature review threw up some startling insights. For example, although we often teach that ‘any’ is usually used in negative statements or in questions, it was found that in 42–51% of cases it is actually used in positive structures.

The paper offered several other examples of instances when what appears in text books and what we teach are not faithful descriptions of how language is really used.

What We Do Now

Corpus Linguistics is concerned with the inconsistencies between how language is used and what we teach.

According to the research paper, traditional course books are ‘often largely based on the personal judgements of the materials writers’, while Scott Thornbury said that many course book creators ‘still largely base content selection on intuition and they neglect the important and frequent features of the language spoken or written by real language users’ (Thornbury, 2004).

Although materials creators are usually skilled, experienced, and dedicated to what they do, they are not supercomputers or all-knowing beings, and so their ‘personal judgements’ are not going to accurately reflect authentic English usage. Because of this, many researchers have recommended ‘the use of corpus-based findings to inform material writers’.

It makes sense that the people who create books should base the content on something other than their intuition.

Would we be so forgiving if our doctor, dentist, or electrician relied entirely on his or her intuition when carrying out his or her daily tasks?

Science-free Zone

When we consider that over 1,000,000,000 people are currently learning English it is quite shocking to think we are basing the education of these 1 billion people on what a handful of predominantly Western men and women think English is.

It has been accepted that course creators and textbook writers use their intuition rather than actual data when deciding what content to create because previously there was nothing else to go on.

The Four Insights

There are four areas in which corpora can help us [to become better teachers] – Khojasteh and Shokrpour, 2014


Frequency

Using a corpus we can see how often words occur in relation to others. Research suggests (for example Kennedy, 1998; Conrad, 2000) that teaching words which occur more frequently is advantageous for the language learner. It allows the learner to learn first the words which are central to the language, and helps the teacher decide which words to put more emphasis on when teaching.

For example, when you teach modal verbs, which do you teach first? Which do you put most emphasis on? According to four English language corpora the most commonly occurring modal verbs in descending order are:

will, would, can, and could

Will knowing this drastically change the way you approach teaching modals in the future? Probably not. But for the people who write course books it can be a valuable insight which can help them to decide which modal verbs to give greater importance to when creating the content.
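To make the idea of frequency concrete, here is a rough Python sketch of how such a ranking could be produced. The list `MODALS`, the function `modal_frequencies`, and the sample text are my own inventions for illustration, and the counts below say nothing about real usage. The sketch also assumes every ‘will’ or ‘can’ is a modal verb, which a real corpus tool would check using part-of-speech tags.

```python
import re
from collections import Counter

# Hypothetical list of modal verbs to look for (assumption for illustration).
MODALS = {"will", "would", "can", "could", "may", "might", "shall", "should", "must"}

def modal_frequencies(text):
    """Count how often each modal verb appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in MODALS)

# Invented sample text, not real corpus data.
sample = "I will call you if I can. She would help, and he will too. Could you wait? We will see."
for modal, count in modal_frequencies(sample).most_common():
    print(modal, count)
```

Run over millions of words rather than four sentences, the same kind of count is what lets researchers rank the modals by frequency.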

Register Variation

In linguistics the type of English which is used in any given situation is called the register. As English teachers, we often draw our students’ attention to the register in terms of formal or informal, or speaking or writing. These factors influence the grammatical structures employed, the vocabulary, and even the pronunciation. These are, no doubt, all things we are aware of as teachers.

By using a corpus it is possible to see accurately what type of English is used in a given situation. It can tell us, for example, how often the connectives ‘however’ and ‘therefore’ are used in academic writing compared to ‘but’ and ‘so’, and shed light on the appropriacy of certain words and grammatical constructions in certain instances. This could make English for Specific Purposes and Academic English teaching much more effective.
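A comparison like this usually relies on relative frequency, because two collections of text are rarely the same size. Here is a small Python sketch of the idea, counting occurrences per 1,000 words. The function `per_thousand` and both sample texts are invented for illustration, so the numbers it produces prove nothing about real academic or conversational English.

```python
import re

def per_thousand(text, words):
    """Return occurrences per 1,000 tokens for each word in `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {w: 0.0 for w in words}
    return {w: 1000 * tokens.count(w) / len(tokens) for w in words}

# Two tiny invented samples standing in for an academic and a spoken sub-corpus.
academic = "The results were significant. However, the sample was small. Therefore, caution is needed."
conversation = "I was tired but I went anyway, so we met up, but it was late."

connectives = ["however", "therefore", "but", "so"]
print("academic:", per_thousand(academic, connectives))
print("conversation:", per_thousand(conversation, connectives))
```

Normalising to a rate per 1,000 words is what makes the two registers comparable, even when one sample is much longer than the other.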

Reliability and Scope

Corpora can be used to help us see how generalisable a word is and how much scope it has. In this sense, scope is ‘the amount of times a rule is applied’.

An example of this is our use of ‘s’ to form a plural. It is a very reliable rule and has a large scope for application.

This information can be useful because we can prioritise the learning of rules which have a broad scope. This will enable our learners to gather momentum faster and will get them interacting in English quicker. Of course there is a time for rules which throw up a lot of irregularities, but these can be presented after those with strong reliability and broad scope.


Collocations

To my mind, this is where corpora and concordancers come into their own and can offer a practical use for learners and teachers. Collocations are words which are frequently used together.

If you see ____ a deal, you will probably mentally fill in the blank with make. This is a strong collocation. Language corpora can help us identify some of the most regular collocations and can also serve as a reliable source when a student (or even another teacher!) asks a question such as: Do you have or take a shower?

Of course we can use either of those but with a corpus to hand we can give a definitive answer as to which is the more commonly used.
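As a rough illustration of how a corpus can settle the question, the Python sketch below counts which word comes immediately before a phrase such as ‘a shower’. The function `words_before` and the sample sentences are made up for this post, so the result here is not evidence about real usage. The sketch also treats ‘have’, ‘had’, and ‘having’ as different words, whereas a proper corpus tool would group them together.

```python
import re
from collections import Counter

def words_before(text, phrase):
    """Count the word immediately preceding each occurrence of `phrase`."""
    pattern = re.compile(r"\b([a-z']+)\s+" + re.escape(phrase.lower()) + r"\b")
    return Counter(pattern.findall(text.lower()))

# Invented sentences, not real corpus data.
sample = (
    "I usually have a shower before work. "
    "She had to take a shower after the run. "
    "Do you have a shower or a bath?"
)
print(words_before(sample, "a shower").most_common())
```

With a large corpus behind it, the same count would show which verb speakers really do prefer, and it could be split by variety of English if the corpus records that information.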


I might sound like I am selling English-language corpora for a living. In fact, I find them difficult to use, and I have doubts about whether we should be teaching just the highest frequency words because then words become self-perpetuating and fads become mainstream for longer.

Imagine a slang word that is widely used and is therefore taught to students more often. Instead of being a funny word which passes through the language after a year, it gets taught, memorised, used, and passed on, inadvertently cementing the funny slang word in the language for years to come. Sick.

In conclusion, corpus linguistics and the tools it uses are a way we can refine our teaching to better reflect the way in which English is used. It can provide us with a more realistic and reliable source of how language is used. It can provide shortcuts for learners and teachers. But, for this to happen, language corpora and concordancers need to be adapted so they can be easily accessed and effectively employed.

If, as they are for me, these ideas and the tools associated with corpus linguistics are tricky, that should not stop us trying to understand their relevance and application in language teaching. This is a relatively new area of linguistics and an unfolding aspect of teaching. But to start, materials creators have a responsibility to consult language corpora before designing courses and text books for our learners, who ultimately deserve to be taught with considered, effective, and authentic teaching resources. For our learners we need to make friends with the corpora.

The full research paper which I drew the majority of this information from is freely available here.



thanks for this post it’s a nice overview of reasons to use corpora;
your readers may be interested in the G+ Corpus Linguistics community –

Hi Mura,

Thank you for your comment and for the link to your group.

We, as teachers, rely a lot on the work you are doing, so thank you. It is an interesting and important field.