Corpus analysis techniques

As I mentioned in a blog post earlier this year, one of my projects for 2016 is to develop my skill set in corpus analysis. I intend to use it to improve my translations, to build terminology bases, and to identify the grammatical characteristics of the language used in my specialist areas.

In this blog I want to go into more detail about different analyses that can be performed using corpus tools and what they can show us. For this post I used a corpus that I built for a recent translation assignment, using the WebBootCat feature, which I described in a previous post.

Today I will introduce another corpus analysis tool, AntConc, developed by Laurence Anthony. It is freeware and can be downloaded at no cost, along with a number of other related tools.

Building the corpus

As I explained in my earlier post, I used the WebBootCat function to create this ad hoc corpus. To do this you need to access SketchEngine. This is the process I use:

  • Select seed words using terms that are characteristic of the target subject area (in this case: subsidies, FIT, premiums, installed, capacity, margin, power, etc.).
  • WebBootCat trawls the internet and produces a list of URLs that match the search criteria.
  • Review the returned sources and remove any that may not be reliable.

If you do not have a subscription to SketchEngine, you can create your own corpus using documents you have selected yourself. To use these in AntConc, they must all be plain text (.txt) files encoded in UTF-8 (check out the AntFileConverter to convert other formats).
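If your source material is already plain text and only the encoding needs changing, this can be scripted. Below is a minimal Python sketch (the function name and the assumed `latin-1` source encoding are hypothetical; adjust to your files). For PDF or Word files you still need a real converter such as AntFileConverter.

```python
from pathlib import Path

def to_utf8_txt(src: Path, dst_dir: Path, src_encoding: str = "latin-1") -> Path:
    """Re-save a plain-text file as a UTF-8 .txt that AntConc can read.

    `src_encoding` is an assumption about the original file; change it to
    whatever encoding your source files actually use.
    """
    text = src.read_text(encoding=src_encoding)
    out = dst_dir / (src.stem + ".txt")
    out.write_text(text, encoding="utf-8")
    return out
```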

Below are the basic types of analysis that you can perform using AntConc (and corpus tools in general). For more information on how to use these features in the AntConc tool, please refer to Laurence Anthony’s website, where there are a number of tutorials available.

Word lists

This function produces a list of all the words in the corpus, ordered by frequency. While this can be useful in itself, it is more often used as the basis for other analyses. You will find when you create word lists that prepositions and articles often come at the top of the list, well before any nouns, adjectives and verbs.
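The idea is simple enough to sketch in a few lines of Python (the sample sentence is invented for illustration):

```python
import re
from collections import Counter

def word_list(text):
    """Frequency-ordered word list, the same idea as AntConc's Word List tool."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

# Invented sample sentence for illustration.
sample = ("The capacity margin is the difference between capacity and peak "
          "load. The margin is expressed as a percentage of capacity.")
```

Run on the sample, ‘the’ comes out on top, just as articles and prepositions top real word lists.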

Keyword lists

Here you load a reference word list of your choice (in this case the British National Corpus word list). The function then creates a list of keywords that are comparatively more frequent in the corpus being analysed than in the reference corpus. This is also useful if you want to compare the vocabulary used in two different genres, or in different registers within a genre.
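AntConc offers proper keyness statistics such as log-likelihood; as a rough illustration of the underlying idea only, here is a minimal sketch that ranks words by their normalized-frequency ratio against a reference list (all counts are invented toy figures):

```python
from collections import Counter

def keywords(target_counts, ref_counts, min_freq=2):
    """Rank words that are comparatively more frequent in the target corpus
    than in a reference corpus, using a simple normalized-frequency ratio.
    (AntConc itself uses real keyness statistics such as log-likelihood.)"""
    t_total = sum(target_counts.values())
    r_total = sum(ref_counts.values())
    scores = {}
    for word, freq in target_counts.items():
        if freq < min_freq:
            continue
        t_norm = freq / t_total
        r_norm = (ref_counts.get(word, 0) + 1) / (r_total + 1)  # smooth unseen words
        scores[word] = t_norm / r_norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy counts: a small energy corpus vs a general reference list.
energy = Counter({"margin": 10, "capacity": 12, "the": 50})
reference = Counter({"the": 5000, "of": 3000, "margin": 2, "capacity": 4})
```

Even though ‘the’ is the most frequent word in the toy corpus, the subject-specific words rise to the top of the keyword ranking because they are rare in the reference list.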

In my case, I created an ad hoc corpus from seed words, so the results are somewhat biased towards those words. Since I was looking for the usage of these specific terms for the translation I was doing, this was not a problem, but it is worth bearing in mind if you are building a corpus for other research purposes.

As you can see, some of the seed words rank among the most comparatively frequent words, but other words are also unusually frequent in the corpus. These can give us insight into the vocabulary of a particular field and point to collocations and clusters worth investigating.

Collocations, clusters and N-grams


N-grams show the frequency of two-, three- or four-word clusters in a text. This can help to identify possible multiword expressions (MWEs), as well as common grammatical formations. In translation, for example, if you are looking for a possible term in a target language but are not sure of the correct rendering, this is a good place to look. Unlike collocations, n-grams are shown without context, but with their frequency given as a number. Once you have identified a possible term, you may then want to use the collocation function to look at it in context.
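As a rough illustration of what the tool computes, here is a minimal n-gram counter in Python (the sample sentence is invented):

```python
import re
from collections import Counter

def ngrams(text, n):
    """Count n-word clusters, as in AntConc's Clusters/N-Grams tool."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample for illustration.
sample = "The capacity margin is low. A tight capacity margin means risk."
```

Counting bigrams over the sample finds ‘capacity margin’ twice, flagging it as a candidate multiword expression.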


The collocation feature looks at the usage of a specific word in context. It can be used to identify common collocations, whether multiword expressions or grammatical collocations such as verb-noun, adjective-noun and verb-preposition combinations.
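The underlying KWIC (Key Word In Context) display can be sketched as follows (a simplified illustration, not AntConc's actual implementation; the sample sentence is invented):

```python
import re

def concordance(text, node, width=4):
    """Key Word In Context: each hit of `node` with `width` tokens per side."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample sentence.
sample = "Plants must provide a capacity margin to meet peak demand."
```

Searching the sample for ‘margin’ with a window of two returns the hit with ‘a capacity’ on the left and ‘to meet’ on the right, which is exactly the left/right window logic discussed below.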

Example of how these analyses work

For the purposes of this post I am going to look at the use of the word ‘margin’. When you search for collocations, you can search to the right or the left of the node word, up to three places on each side. With a noun such as ‘margin’, if you are looking for common noun collocations, it is likely a good idea to search left; if you want to see verb-use patterns, then search right.

Margin – 481 hits

  • Common collocations

Capacity margin

Definition: the capacity margin is the difference between capacity and peak load, expressed as a percentage of capacity (rather than peak load).

This was a term that formed part of the seed words for compiling the glossary, but the frequency and also spread of its use added to its viability. A number of variations of this term came up, but also different terms, such as:

Reserve margin

Definition: the reserve margin is the difference between generating capacity and peak load, expressed as a percentage of peak load.
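To make the difference between the two definitions concrete, here is a small worked example with hypothetical figures (120 GW of installed capacity, 100 GW of peak load):

```python
def capacity_margin(capacity, peak_load):
    """Headroom as a percentage of capacity."""
    return 100 * (capacity - peak_load) / capacity

def reserve_margin(capacity, peak_load):
    """The same headroom, as a percentage of peak load."""
    return 100 * (capacity - peak_load) / peak_load
```

For this hypothetical system, the same 20 GW of headroom gives a capacity margin of about 16.7% but a reserve margin of 20%: one quantity, two baselines.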

As you can see, the collocation tool not only lets you identify terms and see the contexts in which they are used, but also helps you spot other terms and determine whether they are restricted to specific companies or specific contexts. I had not included ‘reserve margin’ in my seed words, as it had not come up in my translation; however, it did appear in the corpus. When I first saw it I was unsure whether it was a synonym of ‘capacity margin’, given the contexts in which I found the two terms used. Further research showed that they are two ways of referring to the same thing, expressed against different baselines (as can be seen in the definitions).

Another use of the collocation tool is to see which verbs commonly accompany the terms you are searching for: as you can see in the screenshot, the verbs ‘provide’, ‘meet’ and ‘retain’ seem to be common collocates of ‘capacity margin’. This is useful when translating, as the verb used in the source language does not always correspond directly to the one used in the target language. The tool can also be used to see which tenses are typical in certain contexts, another area in which source and target texts often differ.

Concordance plotter

Concordance plotters show where in the corpus a term appears. I decided to contrast the use of ‘reserve margin’ with that of ‘capacity margin’. This works best if each file is kept separate, as you can then see in which files a term appears; even with a single file, it will give you an idea of whether a term is specific to one source or used throughout.
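The data behind such a plot is just the relative position of each hit within each file. A minimal sketch (file names and texts are invented):

```python
import re

def dispersion(files, phrase):
    """Relative position (0.0-1.0) of each hit of `phrase` per file: the raw
    data behind a concordance/dispersion plot."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return {name: [m.start() / max(1, len(text)) for m in pattern.finditer(text)]
            for name, text in files.items()}

# Invented toy corpus of two files.
corpus = {
    "report_a.txt": "The reserve margin fell. Later the reserve margin rose.",
    "report_b.txt": "This file never mentions the term.",
}
```

A file with an empty list of positions is one where the term never occurs, which is exactly what tells you whether a term is general or specific to one source.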

“Reserve margin”

“Capacity margin”

I hope this brief introduction to different analytical features will have given you some insight into the different ways in which corpus tools can help you in your translations and other language work.

Using corpora in translation

by Sandra Young

With the beginning of a new year come new ideas, challenges and resolutions. For the first blog of 2016 I wanted to invite you to explore what I consider to be an invaluable tool for our work as translators, particularly when working in technical fields with very specific terminology. One of my professional resolutions for the year is to succeed in fully harnessing the benefits of corpora for my work.

Corpus: “A collection of written or spoken material in machine-readable form, assembled for the purpose of linguistic research.” (Oxford English Dictionary)

I first came across corpora in a professional sense when working on a dictionary project with the Oxford University Press (OUP). The examples for each sense (the different meanings of a single word in specific contexts) in the dictionary entries (the collection of these senses under one headword) had been extracted from a European and Brazilian Portuguese corpus, purpose-created by the OUP. To search this corpus, the translation team had access to an online corpus building and mining tool called Sketch Engine. We used this tool to find entry words and phrases in context, to search for additional or more appropriate examples for senses of words, and to suggest further meanings, all of which was essential to producing appropriate translations. Words without context have no meaning at all; any translation chosen without it would be arbitrary.

On the target language side, we could also use the British National Corpus (BNC) to search for examples of our suggested translations in context and to cross-check against contexts and usage in the original language, in this case Portuguese. This made us confident that our choice of translation was fit for purpose.

Throughout the two-year dictionary project I found working with corpora not only useful, but fascinating. With very little effort you can produce lists of in-context words or collocations that appear in your collection of texts (around 100 million words in the case of the BNC), facilitating the quick analysis of information. For the dictionary project I used corpora to check the usage of specific words in context so as to make informed decisions on their correct translation, their most common grammatical forms and their common collocations; however, corpora can be used for many other purposes too.

When the dictionary project drew to a close, I continued to dabble with corpora in my work, but for some time I failed to follow a clear path. I started a MOOC on corpus linguistics but, as with many free courses, I found it difficult to juggle work and study, and work won out. The course, run by Lancaster University, is aimed particularly at researchers, so some elements may not be directly applicable to our day-to-day work as translators.

However, at last year's MedTranslate Conference in Freiburg, I attended Anne Murray's talk on corpus building and mining, in which she took us through the steps of building our own corpora within Sketch Engine. Sketch Engine is a subscription-based tool costing £78 a year, with a discount for MET members. It allows you to search existing official corpora, from Arabic to Yoruba, as well as to build your own corpora up to a total capacity of one million words.

There are two main ways to build your own corpora within Sketch Engine. The first is WebBootCat, in which you input specific search terms that the program uses to dredge the internet for matching websites and files. The other option is to upload specific documents you have found (and vetted for reliability) and compile a corpus from them. The main tendencies of each are outlined below.

WebBootCat:

  • Quick to build
  • Less reliable content
  • Reliant on the use of appropriate and thorough search criteria

File-based corpus:

  • Slow to build
  • More reliable content
  • Assumes that, with hand-picked documents, you have had more time to refine your search criteria and collate a sound base of information

As WebBootCat dredges the internet automatically, you gain quick access to a lot of information but have less control over the content, so it can be assumed to be less reliable on the whole, as it is more difficult to check the quality of the information. You can vet the websites included in the final corpus and exclude any outliers, but this will not ensure the same quality as hand-picked material.

If you work from a file-based corpus, it will be considerably more time-consuming as you will have to search for and check each and every document for reliability and appropriateness before compiling (e.g. native author, correct spelling variation if required, correct subject matter and register). However, once you have built the corpus, you can be confident that the information within it is reliable.

Despite this, Sketch Engine always lets you go back to the original text of each entry, which helps you judge the reliability of the results, whether you are using WebBootCat or your own file-based corpus. Both approaches are viable in different situations: we do not always have the time to produce a specific, well-researched corpus for every single job.

How do I use corpora now?

I usually use corpora to analyse how terms are used in the target language and to find correct translations of unfamiliar terms. Corpora are also very useful for familiarising yourself with a specific style of writing, or with common collocations in a specific subject area.

I often use WebBootCat for efficiency, but recently I had 35,000 words of pharmaceutical regulatory reports to translate. It was a sizeable job, so I decided to compile my own file-based corpus on the subject. Given the subject matter, it was relatively easy to find official, reliable documents, as the FDA publishes a great deal of food and drug product guidance, compliance and regulatory information. I selected documents and compiled a corpus in Sketch Engine.

Working from the corpus, I was confident in my choice of vocabulary, as I could see clear evidence of how terminology and collocations were used in verifiable English texts, and how sentences were structured around these terms, allowing me to mimic the style of the official texts. Also, if the client were ever to query my use of certain terms, I would be able to use results from the corpus to support my choices.

There are many other corpus building and analysis tools out there. I use Sketch Engine for its ease of use (you can upload documents in a variety of formats, the interface is very user-friendly, and I already knew how to use the tool), but you do have to pay for it. In a later post I will go into detail about AntConc, Laurence Anthony's free corpus tool. This is an incredibly powerful and useful tool which I aim to master this year as I further develop my corpus techniques. I attended his workshop at the MET Conference in Coimbra at the end of last year; in addition to the corpus analysis tool, he has developed a number of other tools that may be of use to translators. For those of you who are interested, the FutureLearn corpus linguistics course uses AntConc, so you could learn the tool that way.

Do you use corpora? If so, what do you use them for? What are the advantages and disadvantages of corpora?

Thanks for reading and happy 2016! I wish you all a great year.