Corpus analysis techniques

As I mentioned in a blog post earlier this year, one of my projects for 2016 is to develop my skills in corpus analysis. I intend to use them to improve my translation work, to build terminology bases and to identify the grammatical characteristics of the language used in my specialist areas.

In this post I want to go into more detail about the different analyses that can be performed using corpus tools and what they can show us. I used a corpus that I built for a recent translation assignment using the WebBootCat feature, which I described in a previous post.

Today I will introduce another corpus analysis tool, AntConc, developed by Laurence Anthony. It is freeware and can be downloaded free of charge, along with other related tools.

Building the corpus

As I explained in my earlier post, I used the WebBootCat function to create this ad hoc corpus. To do this you need to access SketchEngine. This is the process I use:

  • Select seed words using terms/words that are used in the target subject area (for example, in this case: subsidies, FIT, premiums, installed, capacity, margin, power, etc.).
  • WebBootCat trawls the internet and produces a list of different URLs that match the search criteria.
  • Check the data that came through to remove any sources that may not be reliable.

If you do not have a subscription to SketchEngine, you can create your own corpus using documents you have selected yourself. To use these in AntConc, they must all be plain text (.txt) files encoded in UTF-8 (check out AntFileConverter to convert them).
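If your own documents are already plain text but saved in another encoding, re-saving them as UTF-8 can also be scripted. The following is a minimal Python sketch, not a replacement for AntFileConverter: it does not extract text from PDF or Word files, and the source encoding (`cp1252` here) is an assumption you would need to adjust to match your files.

```python
from pathlib import Path


def reencode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_folder(src_dir: str, dst_dir: str, source_encoding: str = "cp1252") -> int:
    """Re-save every .txt file in src_dir as UTF-8 in dst_dir; return the file count."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.txt")):
        converted = reencode_to_utf8(path.read_bytes(), source_encoding)
        (out / path.name).write_bytes(converted)
        count += 1
    return count
```

A call such as `convert_folder("my_documents", "my_corpus_utf8")` would then fill the second folder with UTF-8 copies; both folder names are hypothetical.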

Below are the basic types of analysis that you can perform using AntConc (and corpus tools in general). For more information on how to use these features in the AntConc tool, please refer to Laurence Anthony’s website, where there are a number of tutorials available.

Word lists

This function produces a list of all the words in the corpus, ordered by frequency. While this can be useful in itself, it is often used as a basis for other analyses. When you create word lists, you will find that prepositions and articles often come at the top of the list, before any nouns, adjectives or verbs.
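To make the idea concrete, here is a minimal Python sketch of a frequency-ordered word list. It is not how AntConc works internally; the tokenisation rule (lowercase everything, keep runs of letters) and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters (accented ones included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_list(text: str) -> list[tuple[str, int]]:
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


# Invented sample sentence, for illustration only.
sample = "The capacity margin is the difference between the capacity and the peak load."
for word, freq in word_list(sample)[:3]:
    print(word, freq)
```

Even in this one-sentence sample the article 'the' comes out on top, ahead of the nouns.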

Keyword lists

Here you have to load a reference word list of your choice (in this case the British National Corpus word list). The function then creates a list of keywords that are comparatively more frequent in the corpus being analysed than in the reference list. This can also be useful if you want to compare the vocabulary used in two different genres, or in different registers within a genre.
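The comparison behind a keyword list can be sketched in a few lines of Python. Log-likelihood is one statistic commonly used to measure this kind of 'keyness'; which statistic and thresholds a given tool applies depends on its settings, so treat this as an illustration of the principle and not a reproduction of AntConc's output. The word counts below are invented.

```python
import math


def log_likelihood(freq_study: int, size_study: int, freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word, given its frequency in each corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    if freq_study:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(study: dict[str, int], reference: dict[str, int]) -> list[tuple[str, float]]:
    """Words relatively more frequent in the study corpus, highest keyness first."""
    size_study, size_ref = sum(study.values()), sum(reference.values())
    scored = []
    for word, freq in study.items():
        ref_freq = reference.get(word, 0)
        # Keep only words over-represented in the study corpus (integer comparison).
        if freq * size_ref > ref_freq * size_study:
            scored.append((word, log_likelihood(freq, size_study, ref_freq, size_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented counts, for illustration only.
study_counts = {"the": 60, "margin": 50, "capacity": 40, "of": 850}
reference_counts = {"the": 600, "margin": 5, "capacity": 20, "of": 9375}
print([word for word, _ in keywords(study_counts, reference_counts)])
```

With these made-up figures, 'the' is equally frequent in both lists in relative terms and so drops out, while 'margin' and 'capacity' are kept as keywords.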

In my case, I created an ad hoc corpus from seed words, so there is some bias towards these words. As I was looking at the usage of these specific terms for the translation I was doing, this is not a problem, but it is worth being aware of if you are building a corpus for other research purposes.

As you can see, some of the seed words are among the most comparatively frequent words, but there are also other words that are unusually frequent in the corpus. These can give us insight into the vocabulary used in a certain area and indicate collocations and clusters to look at.

Collocations, clusters and N-grams


N-grams show the frequency of two-, three- or four-word clusters in a text. This can help to identify possible multiword expressions (MWEs), as well as common grammatical patterns. In translation, for example, if you are looking for a term in the target language but are not sure of the correct translation, this might be a good place to look. Unlike collocations, n-grams are shown without context, but their frequency is given as a number (see second column below). Once you have identified a possible term, you may then want to use the collocation function to look at it in context.
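As a rough illustration of what an n-gram count involves, the Python sketch below slides a window of n words over a list of tokens and counts each cluster. It is a simplification: it ignores sentence boundaries and punctuation, and the sample tokens are invented.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens, joined into one string."""
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Invented sample, for illustration only.
tokens = "the capacity margin fell while the capacity margin target rose".split()
for cluster, freq in ngram_counts(tokens, 2).most_common(2):
    print(cluster, freq)
```

Calling `ngram_counts(tokens, 3)` gives three-word clusters in the same way.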


The collocation feature looks at the usage of a specific word in context. It can be used to identify common collocations, whether multiword expressions or grammatical collocations such as verb-noun, adjective-noun or verb-preposition collocations.

Example of how these analyses work

For the purposes of this post I am going to look at the use of the word ‘margin’. When you search for collocations, you can search aligning to the right or the left, up to three places each side. With a noun such as ‘margin’, if you are looking for common noun collocations, it is likely a good idea to search left – if you want to see verb-use patterns, then search right.
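The left/right search described above can be sketched in Python as follows. This only counts the raw words found within a window either side of the search word; it does not rank them with any statistical measure, and the function name, default window of three and sample tokens are assumptions for illustration.

```python
from collections import Counter


def collocates(tokens: list[str], node: str, left: int = 3, right: int = 3) -> tuple[Counter, Counter]:
    """Count the words found up to `left` places before and `right` places after `node`."""
    left_counts, right_counts = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left_counts.update(tokens[max(0, i - left):i])
        right_counts.update(tokens[i + 1:i + 1 + right])
    return left_counts, right_counts


# Invented sample, for illustration only.
tokens = "the capacity margin fell while the reserve margin rose".split()
left_words, right_words = collocates(tokens, "margin", left=1, right=1)
print("left:", dict(left_words))
print("right:", dict(right_words))
```

With a window of one word each side, the left search picks up the noun modifiers ('capacity', 'reserve') and the right search picks up the verbs ('fell', 'rose').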

Margin – 481 hits

  • Common collocations

Capacity margin

Definition (The capacity margin is the difference between capacity and peak load, expressed as a percentage of capacity instead of peak load).

This term was one of the seed words used for compiling the glossary, but the frequency and spread of its use added to its viability. A number of variations of this term came up, as well as different terms, such as:

Reserve margin

Definition (The reserve margin is the difference between generating capacity and peak load expressed as a percentage of peak load).

As you can see, the collocation tool allows you not only to see the context in which certain phrases and terms are used, but also to identify other potential terms and to determine whether they are used by specific companies or in specific contexts. I had not included ‘reserve margin’ in my seed words, as it had not come up in my translation, but it did come up in the corpus. When I first saw this term I was unsure whether it was a synonym of ‘capacity margin’, given the context in which I found both terms used. Further research showed that they are two ways of expressing the same thing using different criteria (as can be seen in the definitions).
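The difference between the two definitions is easier to see with numbers. The short Python sketch below applies both formulas to the same hypothetical system, with 120 MW of capacity and a peak load of 100 MW; the figures and function names are invented for illustration.

```python
def capacity_margin(capacity_mw: float, peak_load_mw: float) -> float:
    """Spare capacity expressed as a percentage of capacity."""
    return (capacity_mw - peak_load_mw) / capacity_mw * 100


def reserve_margin(capacity_mw: float, peak_load_mw: float) -> float:
    """Spare capacity expressed as a percentage of peak load."""
    return (capacity_mw - peak_load_mw) / peak_load_mw * 100


# Hypothetical system: 120 MW of capacity, 100 MW peak load.
print(f"Capacity margin: {capacity_margin(120, 100):.1f}%")  # Capacity margin: 16.7%
print(f"Reserve margin: {reserve_margin(120, 100):.1f}%")    # Reserve margin: 20.0%
```

The same 20 MW of spare capacity therefore gives two different percentages, which is why the two terms cannot simply be swapped in a translation.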

Another use of the collocation tool is to see which verbs are commonly used with the terms you are searching for – as you can see in the screenshot, the verbs ‘provide’, ‘meet’ and ‘retain’ seem to be common collocations with the term ‘capacity margin’. This can be useful when translating as the verb used in the source language does not always directly correspond with the use in the target language. This tool can also be used to see typical tenses used in certain contexts, which is another area in which there are often differences between source and target texts.

Concordance plotter

Concordance plots show where in the corpus terms appear. I decided to contrast the use of ‘reserve margin’ with ‘capacity margin’. This works better if each source is a separate file, as you can see in which files the term appears, but even so it will give you an idea of whether a term is specific to one file or is used generally.
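The principle of a concordance plot can be sketched in Python as a line of text in which each hit is marked at its relative position in the file. This is only a rough text-based stand-in for the graphical plot, and the function names, plot width and sample tokens are assumptions for illustration.

```python
def hit_indices(tokens: list[str], phrase: str) -> list[int]:
    """Return the token positions at which the phrase starts."""
    words = phrase.split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def plot_line(tokens: list[str], phrase: str, width: int = 40) -> str:
    """Draw a bar of `width` cells with a '|' wherever the phrase occurs."""
    cells = ["-"] * width
    for i in hit_indices(tokens, phrase):
        # Scale the token position to a cell in the bar (integer arithmetic).
        cells[i * width // len(tokens)] = "|"
    return "".join(cells)


# Invented ten-token 'file', for illustration only.
tokens = "x capacity margin x x x x x reserve margin".split()
print(plot_line(tokens, "capacity margin", width=10))  # -|--------
print(plot_line(tokens, "reserve margin", width=10))   # --------|-
```

Running the same function over each file in a corpus would show at a glance whether a term clusters in one file or is spread across many.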

“Reserve margin”

“Capacity margin”

I hope this brief introduction to these analytical features has given you some insight into the different ways in which corpus tools can help you in your translations and other language work.

Séminaire d’Anglais Médical 2016: a review

By Claire Harmer

This March I attended my first Séminaire d’Anglais Médical (SAM), held in the beautiful city of Lyon. The event is organised by the Société Française des Traducteurs (SFT) every two years, and this was the 11th time it had been held. The séminaire, which I’ll call a conference for the sake of convenience although it was more of a week-long workshop programme, is aimed at medical translators working from and into French. There were 49 attendees, the perfect size for a specialised conference: not so big that it was overwhelming, but big enough to have lots of different people to talk to.

It took place in the Faculté de Médecine Lyon Est in a self-contained Médiathèque building and most of the sessions were held in a raked lecture theatre within the building. The university was in the 8th arrondissement, so not particularly central, but it was only 15-20 minutes away by tram/metro if you were staying in the centre. With fairly packed days at the conference I didn’t get to explore the city as much as I would have liked, but I’m hoping to go back for a trip later this year.

The days were well structured, with half-hour coffee breaks in the morning and afternoon (which proved to be good networking opportunities) and a one-and-a-half-hour lunch break in the middle. At first I thought the lunch break was unnecessarily long, but while I was there I realised you needed that time to disconnect and have a rest! Sitting and listening to lectures for five days straight made me realise that I am out of the habit of absorbing information for long periods of time, as we did at university, so having those breaks was crucial. This was even more true given that most of the workshops were given in French, so I had to concentrate even harder to process the information.

The programme was a mix of lectures, terminology sessions and travaux dirigés, all of which I’m going to give a bit more information on below – I hope this gives readers an insight in case anyone is interested in attending SAM 2018!


We were fortunate to have a wide variety of speakers present at the conference, from medical translators to doctors, medical researchers and founders of companies within the medical and pharmaceutical sectors.

Below are a few of the highlights from the conference:

  • Amy Whereat’s presentation on writing practices in the field of cosmetic dermatology
  • Dr David Cox’s presentation on the medical epidemiology of breast cancer
  • Sylvie Chabaud’s talk on the statistical aspects of a clinical trial
  • Dr Bernard Croisile’s presentation on Alzheimer’s disease.

Another firm favourite was Pippa Sandford’s presentation on cross-cultural differences and pitfalls in medical translation. I’m hoping to do a blog post on Pippa’s talk at some point soon, as I found it really useful and think other medical translators will too.

Terminology sessions

We had four terminology sessions where medical translator and terminologist Nathalie Renevier went over terms that had come up in the workshops. These were great for exploring tricky terms and their corresponding equivalents in the other language. It also meant we revisited topics spoken about earlier in the day or week, which served as a reminder of what we had learnt.

Travaux dirigés

The source texts for the travaux dirigés were sent out by email in advance for those who had time to read them. On Monday we were split up into groups of five to seven people, each of which was given one source text. We had two sessions on Monday to work on the text as a group and type up our final translation, which we presented to the rest of the attendees later in the week. The texts included a study on patients with hormone receptor-positive breast cancer, a fact sheet on Alzheimer’s disease for the general public, an article on premenstrual flares in adult women, as well as texts on chronic lymphocytic leukaemia, H5N1 influenza virus and the digestive system.

When the final translations were presented, a supervisor who had given a presentation on the same or a similar topic during the week gave suggestions and advice to the translation team where needed. To be honest, I think the travaux dirigés were the only part of the conference where I felt I missed out a little by being an English native speaker. Of the 49 attendees only seven were English native speakers, with almost all of the remaining attendees being French native speakers, which was only to be expected as the course was held in France! This meant that only one of the seven translations presented was a FR>EN translation (the one presented by our group). It was still useful to see how the English texts had been rendered in French, but obviously I didn’t take as much away from them as I did from the FR>EN translation.


To end the conference with a bit of fun, Stephen Schwanbeck organised a translation duel, which proved to be very entertaining! Two people volunteered to translate each text in advance (one text was FR>EN and the other EN>FR). The translators then presented their versions in turn, a couple of sentences at a time. The rest of the attendees joined in with suggestions on how to improve the translations, as well as highlighting what they liked about each of them.

Both pieces were satirical, so they were quite a departure from the texts we had been working on during the week. They were full of cultural references, plays on words and tricky phrasing. The English text for translation into French, entitled ‘Doctors say average heart attack victim doesn’t clutch at chest nearly dramatically enough’, can be found here. It’s well worth watching the video as well as reading the article! The French text for translation into English, ‘La téléphonie mobile, nouveau vecteur de la démocratisation du cancer’, can be found here.

In addition to the 9am – 5pm programme, the organisers also arranged a pre-conference meet-up on the Sunday evening, a tour of Lyon on the Monday night and a three-course meal at a lovely restaurant during the week, all of which were thoroughly enjoyed.

In conclusion, I learnt a great deal about a wide range of medical and pharmaceutical subjects at SAM and met lots of interesting people. I also learnt about others’ experiences of translating for the medical and pharmaceutical sectors and of working with agencies and direct clients (a conversation that seemed to come up a lot!), and about how to cope with various terminological issues that often come up in medical and pharmaceutical translation.

The conference was a huge success and I’ll definitely be going back in 2018, if not before, as I’d like to visit Lyon again! A huge thank you to all the organisers!


Lyon at night!


The FR>EN team presenting their translation