Using corpora in translation

by Sandra Young

With the beginning of a new year come new ideas, challenges and resolutions. For the first blog of 2016 I wanted to invite you to explore what I consider to be an invaluable tool for our work as translators, particularly when working in technical fields with very specific terminology. One of my professional resolutions for the year is to succeed in fully harnessing the benefits of corpora for my work.

Corpus: “A collection of written or spoken material in machine-readable form, assembled for the purpose of linguistic research.” (Oxford English Dictionary)

I first came across corpora in a professional sense when working on a dictionary project with the Oxford University Press (OUP). The examples for each sense (the different meanings of a single word in specific contexts) in the dictionary entries (the collection of these senses under one headword) had been extracted from a European and Brazilian Portuguese corpus, purpose-created by the OUP. To search this corpus the translation team had access to an online corpus building and mining tool called Sketch Engine. We used this tool to find entry words and phrases in context, search for additional or more appropriate examples for senses of words and suggest further meanings, which was essential to producing appropriate translations. Words without context have no meaning at all; any choice of translation made without it would be arbitrary.

On the target language side, we could also use the British National Corpus (BNC) to search for examples of our suggested translations in context and to cross-check against contexts and usage in the original language, in this case Portuguese. This made us confident that our choice of translation was fit for purpose.

Throughout the two-year dictionary project I found working with corpora not only useful, but fascinating. With very little effort you can produce lists of in-context words or collocations that appear in your conglomeration of text (which is about 100 million words in the case of the BNC), facilitating the quick analysis of information. For the dictionary project I used corpora to check the usage of specific words in context, so that I could make informed decisions on the correct translation of those words, their most common grammatical forms and common collocations. However, corpora can be used for many other purposes too.
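For readers curious about what a list of in-context words looks like under the bonnet, here is a minimal Python sketch of a keyword-in-context (KWIC) search. It is only an illustration of the general idea, not how Sketch Engine or the BNC interface works internally; the function name kwic, the width parameter and the sample sentences are all invented for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    # Match the keyword as a whole word or phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on either side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Made-up sample text, standing in for a real corpus.
sample = (
    "The adverse event was reported to the agency. "
    "Each adverse event must be documented by the sponsor."
)
for line in kwic(sample, "adverse event"):
    print(line)
```

Each printed line shows the search term in square brackets with a little context on either side, which is essentially what a concordancer displays, only on a far larger scale.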

When the dictionary project drew to a close, I continued to dabble with corpora in my work, but for some time I failed to follow a clear path. I started a MOOC on Corpus Linguistics but, as with many free courses, I found it difficult to juggle both work and study, and work won out. This course, run by Lancaster University, is of particular use to researchers, so there are elements that may not be directly applicable to our day-to-day work as translators.

However, last year at the MedTranslate Conference in Freiburg, I attended Anne Murray’s talk on corpus building and mining. In the talk, Anne took us through the steps to building our own corpora within Sketch Engine. It is a subscription-based tool costing £78/year, with a discount for MET members. The tool allows you to search existing official corpora, from Arabic to Yoruba, as well as to build your own corpora up to a total capacity of one million words.

There are two main ways to build your own corpora within Sketch Engine. The first is WebBootCat, in which you input specific search terms that the program uses to dredge the internet for matching websites and files. The other option is to upload specific documents you have found (and vetted for reliability) and compile a corpus from them. The table below outlines the main tendencies of each.

WebBootCat | File-based corpus
Quick to build | Slow to build
Less reliable content | More reliable content
Reliant on usage of appropriate and thorough search criteria | Based on the assumption that with hand-picked documents you will have had more time to refine the search criteria and collate a sound base of information

As WebBootCat automatically dredges the internet, you gain quick access to a lot of information but you have less control over the content, so it can be assumed to be less reliable on the whole, as it is more difficult to check the quality of the information. You can vet the websites included in the final corpus to exclude any outliers, but this will not ensure the same quality as hand-picked material.

If you work from a file-based corpus, it will be considerably more time-consuming as you will have to search for and check each and every document for reliability and appropriateness before compiling (e.g. native author, correct spelling variation if required, correct subject matter and register). However, once you have built the corpus, you can be confident that the information within it is reliable.
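To make the file-based route a little more concrete, here is a rough Python sketch of gathering hand-picked documents into one body of text and checking its size against the one-million-word capacity mentioned above. It assumes the vetted documents have already been saved as plain .txt files in a single folder; the names compile_corpus, word_count and fits_capacity are made up for illustration, and none of this reflects how Sketch Engine itself ingests files.

```python
import re
from pathlib import Path

WORD_LIMIT = 1_000_000  # capacity figure quoted in this post


def word_count(text):
    """Count word-like tokens in a string (a deliberately simple measure)."""
    return len(re.findall(r"\w+", text))


def compile_corpus(folder):
    """Join every .txt file in `folder` and return (corpus_text, total_words)."""
    texts = []
    # Sort the paths so the corpus is built in a predictable order.
    for path in sorted(Path(folder).glob("*.txt")):
        texts.append(path.read_text(encoding="utf-8"))
    corpus = "\n".join(texts)
    return corpus, word_count(corpus)


def fits_capacity(total_words, limit=WORD_LIMIT):
    """Return True if the corpus stays within the assumed word limit."""
    return total_words <= limit


# Hypothetical usage, where "vetted_docs" is a folder you have prepared:
# corpus, total = compile_corpus("vetted_docs")
# print(total, fits_capacity(total))
```

The vetting itself (native author, spelling variant, subject matter and register) remains a manual job; the sketch only covers the mechanical step of pulling the chosen files together.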

Despite this, with Sketch Engine you should always be able to go back to the original text of each entry, which can help you to make a judgement on the reliability of the results produced whether using WebBootCat or your own file-based corpus. Also, as you can see, both styles offer viable options for different situations. Often we do not have the time to produce a specific, well-researched corpus for every single job we have.

How do I use corpora now?

I usually use corpora to analyse how terms are used in the target language, so that I can translate unfamiliar terms correctly. Corpora are also very useful for familiarising yourself with a specific style of writing, or with common collocations in a specific subject area. In case you miss these on our Twitter feed, here are some other blogs on corpora that you may find useful:

I often use WebBootCat for efficiency, but recently I had 35 thousand words of pharmaceutical regulatory reports to translate. It was a sizeable job, so I decided to compile my own file-based corpus on this subject. Given the subject matter, it was relatively easy to find official, reliable documents as the FDA publishes a great deal of food and drug product guidance, compliance and regulatory information. I selected documents and compiled a corpus in Sketch Engine.

As a result of the corpus, I was confident in my choice of vocabulary as I could see clear evidence of how terminology and collocations were used in verifiable English texts, and I could see how sentences were structured around these terms to mimic the style of the official texts. Also, if the client were ever to query my use of certain terms, I would be able to use results from the corpus to provide evidence to support my choices.
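As an illustration of the kind of evidence a corpus can give about collocations, the following Python sketch counts which words appear next to a chosen term. It is a simplified stand-in for the collocation features of a real corpus tool: the function name collocates, the window parameter and the sample sentence are invented, and the tokeniser only handles unaccented English words.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words found within `window` tokens either side of `node`."""
    # Lowercase the text and keep simple (optionally hyphenated) words.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Neighbours before and after the node word, excluding the node.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


# Made-up sample sentence, standing in for a real corpus.
sample = "Clinical trial data and clinical trial results were submitted."
for word, freq in collocates(sample, "trial", window=1).most_common():
    print(word, freq)
```

A frequency list like this is the sort of result you could show a client to back up a choice of term, although a real tool would also let you go back to the source text of each hit.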

There are many other corpus building and analysis tools out there. I use Sketch Engine for its ease of use (you can upload documents in a variety of formats, the interface is very user-friendly, I already knew how to use the tool, etc.), but you do have to pay for it. In a later post I will go into detail about AntConc, Laurence Anthony’s free corpus tool. This is an incredibly powerful and useful tool which I aim to master this year as I further develop my corpus techniques. I attended his workshop at the MET Conference in Coimbra at the end of last year, and in addition to the corpus analysis tool he has developed a number of other interesting tools that may be of use to translators. For those of you who are interested, the corpus linguistics course by FutureLearn uses AntConc, so you could learn to use the tool that way.

Do you use corpora? If so, what do you use them for? What are the advantages and disadvantages of corpora?

Thanks for reading and happy 2016! I wish you all a great year.



5 thoughts on “Using corpora in translation”

  1. Jonathan Beagley says:

    Corpora and parallel texts are just so useful. I first discovered corpora when I was studying French at Michigan State University, but I finally learnt a bit about AntConc and using concordancers when I was studying linguistics in Bordeaux. It made sense to me to use parallel texts and corpora, but I don’t know why I never thought of using AntConc or other software to analyse the data for my translation practice, not just research. Fascinating read, thank you so much for this post!


    • Sandra Young says:

      Hi Jonathan! Firstly, sorry for taking so long to reply! I read this message while I was travelling and obviously forgot to get back to you. As I said in the post, I discovered corpora through the dictionary work; we never touched on them at all during either of my university courses. I am not sure if that has now changed, but it seems like such a waste not to make the most of this incredible resource. I’m glad you enjoyed the post. Just to let you know, I will be posting next week about some more corpus-related issues!

