Corpus analysis techniques

As I mentioned in a blog post earlier this year, one of my projects for 2016 is to develop my skills in corpus analysis. I intend to use these to improve my translation work, to build terminology bases and to identify the grammatical characteristics of the language used in my specialist areas.

In this post I want to go into more detail about the different analyses that can be performed using corpus tools and what they can show us. I used a corpus that I built for a recent translation assignment with the WebBootCat feature, which I described in a previous post.

Today I will introduce another corpus analysis tool, AntConc, developed by Laurence Anthony. It is freeware and can be downloaded free of charge, along with other related tools.

Building the corpus

As I explained in my earlier post, I used the WebBootCat function to create this ad hoc corpus. To do this you need to access SketchEngine. This is the process I use:

  • Select seed words using terms/words that are used in the target subject area (for example, in this case: subsidies, FIT, premiums, installed, capacity, margin, power, etc.).
  • WebBootCat trawls the internet and produces a list of different URLs that match the search criteria.
  • Check the data that came through to remove any sources that may not be reliable.

If you do not have a subscription to SketchEngine, you can create your own corpus using documents you have selected yourself. To use these in AntConc, they must all be plain text (.txt) files encoded in UTF-8 (check out AntFileConverter to convert other formats).
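If your documents are already plain text but saved in an older encoding, a few lines of Python can re-encode them as UTF-8. This is only a sketch: the function names and the default source encoding (cp1252, common for files saved on Windows) are assumptions, so check what your files actually use.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from source_encoding (an assumption) and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_folder(src_dir: str, dst_dir: str, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of every .txt file in src_dir into dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.txt")):
        converted = to_utf8(path.read_bytes(), source_encoding)
        (out / path.name).write_bytes(converted)
```

Calling convert_folder("my_corpus", "my_corpus_utf8") (both folder names are hypothetical) leaves the originals untouched and writes the converted copies to a new folder, which can then be loaded into AntConc.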

Below are the basic types of analysis that you can perform using AntConc (and corpus tools in general). For more information on how to use these features in the AntConc tool, please refer to Laurence Anthony’s website, where there are a number of tutorials available.

Word lists

This function produces a list of all the words in the corpus, ordered by frequency. While this can be useful in itself, it is more often used as a basis for other analyses. When you create word lists you will find that prepositions and articles usually come at the top of the list, ahead of any nouns, adjectives and verbs.
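To make the idea concrete, here is a minimal sketch of a word list in Python. It is not how AntConc works internally; the tokenisation rule (lower-cased runs of letters, so hyphenated words are split) is a simplifying assumption, and the sample sentence is made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lower-cased runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_list(text: str) -> list[tuple[str, int]]:
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


sample = ("The capacity margin is the difference between "
          "the capacity and the peak load.")
for word, freq in word_list(sample)[:3]:
    print(freq, word)
```

Even in this one-sentence sample, the article 'the' comes out on top, which is the pattern described above.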

Keyword lists

Here you have to load a reference word list of your choice (in this case the British National Corpus word list). The function then creates a list of keywords, that is, words that are comparatively more frequent in the corpus being analysed than in the reference list. This can also be useful if you want to compare the vocabulary used in two different genres, or in different registers within a genre.

In my case, I created an ad hoc corpus from seed words, so there is some bias towards these words. As I was looking for the usage of these specific terms for the translation I was doing, this is not a problem, but it is worth being aware of if you are building a corpus for other research purposes.

As you can see, some of the seed words are among the comparatively most frequent words, but other words are also unusually frequent in the corpus. These can give us insight into the vocabulary used in a certain area, and can point to collocations and clusters worth looking at.
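One common way to score "comparatively more frequent" is the log-likelihood statistic. The sketch below assumes that measure; whether it matches the settings used for the list in this post is an assumption, and the tiny frequency tables are invented purely for illustration.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Keyness of a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study: Counter, reference: Counter) -> list[tuple[str, float]]:
    """Words relatively more frequent in the study corpus, highest keyness first."""
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only positively key words
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical frequency tables, for illustration only.
study = Counter({"the": 50, "capacity": 30, "margin": 20})
reference = Counter({"the": 600, "of": 399, "margin": 1})
for word, score in keywords(study, reference):
    print(f"{score:8.2f}  {word}")
```

Note that 'the' drops out: it is frequent in both tables, so it is not characteristic of the study corpus, which is exactly why keyword lists are more revealing than raw word lists.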

Collocations, clusters and N-grams


N-grams show the frequency of two-, three- or four-word clusters in a text. This can help to identify possible multiword expressions (MWEs), as well as common grammatical formations. In translation, for example, if you are looking for a possible term in the target language but are not sure of the correct translation, this might be a good place to look. Unlike collocations, n-grams are shown without context, but give frequency as a number (see second column below). Once you have identified a possible term, you may then want to use the collocation function to look at it in context.
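Counting n-grams is simple enough to sketch in a few lines of Python. The tokenisation (lower-cased runs of letters) and the sample text are assumptions made for illustration; a real corpus tool will offer more options.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lower-cased runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


text = "Capacity margin and reserve margin and capacity margin"
bigrams = ngram_counts(tokenize(text), 2)
for gram, freq in bigrams.most_common(3):
    print(freq, gram)
```

As in the n-gram list described above, the output is just a cluster and a number, with no surrounding context.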


The collocation feature looks at the usage of a specific word in context. It can be used to identify common collocations, whether multiword expressions or grammatical collocations such as verb-noun, adjective-noun or verb-preposition collocations.

Example of how these analyses work

For the purposes of this post I am going to look at the use of the word ‘margin’. When you search for collocations, you can search aligning to the right or the left, up to three places each side. With a noun such as ‘margin’, if you are looking for common noun collocations, it is likely a good idea to search left – if you want to see verb-use patterns, then search right.
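The idea of searching a window to the left or right of a word can be sketched as follows. This is an illustration rather than AntConc's own method: the function name, the window parameters and the sample sentence are all assumptions.

```python
from collections import Counter


def collocates(tokens: list[str], node: str, left: int = 3, right: int = 3) -> Counter:
    """Count the words found within `left` places before and `right`
    places after each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        counts.update(tokens[max(0, i - left):i])   # window to the left
        counts.update(tokens[i + 1:i + 1 + right])  # window to the right
    return counts


sentence = ("the capacity margin must meet demand and "
            "the reserve margin must meet demand")
tokens = sentence.split()

print(collocates(tokens, "margin", left=1, right=0))  # words just before 'margin'
print(collocates(tokens, "margin", left=0, right=2))  # words just after 'margin'
```

Setting one side of the window to zero corresponds to searching only left or only right, as described above.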

Margin – 481 hits

  • Common collocations

Capacity margin

Definition: The capacity margin is the difference between capacity and peak load, expressed as a percentage of capacity (instead of peak load).

This term was one of the seed words for compiling the glossary, but the frequency and spread of its use in the corpus added to its viability. A number of variations of this term came up, as well as different terms, such as:

Reserve margin

Definition: The reserve margin is the difference between generating capacity and peak load, expressed as a percentage of peak load.

As you can see, the collocation tool allows you not only to see the context in which certain phrases or terms are used, but also to identify other potential terms and to determine whether they are used by specific companies or in specific contexts. I had not included ‘reserve margin’ in my seed words, as it had not come up in my translation, but it did come up in the corpus. When I first saw this term I was unsure whether it was a synonym of ‘capacity margin’, given the context in which I found both terms used. Further research showed that they are two ways of referring to the same thing, expressed using different criteria (as can be seen in the definitions).
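The two definitions above can be written out as a short calculation, which shows how the same gap between capacity and peak load gives two different percentages. The figures used (120 MW of capacity, 100 MW of peak load) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def capacity_margin(capacity: float, peak_load: float) -> float:
    """Difference between capacity and peak load, as a fraction of capacity."""
    return (capacity - peak_load) / capacity


def reserve_margin(capacity: float, peak_load: float) -> float:
    """Difference between capacity and peak load, as a fraction of peak load."""
    return (capacity - peak_load) / peak_load


# Hypothetical figures, for illustration only.
capacity, peak_load = 120.0, 100.0
cm = capacity_margin(capacity, peak_load)
rm = reserve_margin(capacity, peak_load)
print(f"capacity margin: {cm:.1%}")
print(f"reserve margin:  {rm:.1%}")
```

Because only the denominator differs, one figure can be converted into the other: the reserve margin equals the capacity margin divided by one minus the capacity margin.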

Another use of the collocation tool is to see which verbs are commonly used with the terms you are searching for – as you can see in the screenshot, the verbs ‘provide’, ‘meet’ and ‘retain’ seem to be common collocations with the term ‘capacity margin’. This can be useful when translating as the verb used in the source language does not always directly correspond with the use in the target language. This tool can also be used to see typical tenses used in certain contexts, which is another area in which there are often differences between source and target texts.

Concordance plotter

A concordance plot shows where in the corpus a term appears. I decided to contrast the use of ‘reserve margin’ with ‘capacity margin’. This works better if each source is kept as a separate file, as you can then see in which files the term appears, but even so it will give you an idea of whether a term is specific to one file or is used generally.
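A rough text-only version of such a plot can be sketched in Python: find where a phrase occurs in a file, express each hit as a position between 0 and 1, and mark it on a bar. The function names, the bar width and the sample tokens are assumptions made for illustration.

```python
def hit_positions(tokens: list[str], phrase: list[str]) -> list[float]:
    """Relative position (0 = start of file, towards 1 = end) of each phrase hit."""
    n = len(phrase)
    return [i / len(tokens)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def plot_bar(positions: list[float], width: int = 40) -> str:
    """Draw one file as a row of dots, with '|' marking each hit."""
    bar = ["."] * width
    for p in positions:
        bar[min(int(p * width), width - 1)] = "|"
    return "".join(bar)


# Hypothetical tokens standing in for one file of the corpus.
tokens = "a reserve margin b c reserve margin d".split()
print(plot_bar(hit_positions(tokens, ["reserve", "margin"]), width=8))
```

Running this once per file and printing the bars under each other gives a quick impression of whether a term is spread through the corpus or clustered in one place.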

“Reserve margin”

“Capacity margin”

I hope this brief introduction to different analytical features has given you some insight into the ways in which corpus tools can help you in your translations and other language work.

