How To Learn About Fish


State-of-the-art language technology can give lexicography a boost by going far beyond simple data on collocation and into the dramatically grammatical world of the ‘word sketch’.

by Adam Kilgarriff (Senior Research Fellow at the Information Technology Research Institute, University of Brighton)

Two of the great things about working with English are how much of it there is around, and how many other people are doing the same. This means there are lots of resources available, and the latest research ideas are generally applied to English before any other language.

Two areas where this is particularly relevant are dictionary-making, or lexicography, and Language Technology (LT). Language Technologies include any computerized systems that process human languages, such as the grammar checker on your word-processing software, automatic translation systems, and programs that allow you to speak to your computer instead of typing.

Both LT and lexicography work from large computerized collections of language (corpora), and since the 1980s English has led the world in corpus development. The Cobuild corpus (7 million words in the early 1980s) changed the face of dictionary-making, and was soon followed by the British National Corpus (BNC, 100 million words), and now by the World Wide Web, which is probably around 10,000 times bigger than the BNC – and over half of it is in English.

LT and lexicography also benefit from each other, and we are seeing more and more cooperation between these two fields. LT benefits from lexicography because LT systems need to know facts about words – from simple ones like how they are spelt and how frequent they are, to more complex ones like how they are pronounced and how they combine with other words (both syntactically and in terms of chunks of various kinds). The obvious place to find this sort of data is in a dictionary.

Using LT to write dictionaries

Lexicography benefits from LT, because LT provides tools for analyzing language. In recent years, all ambitious dictionary projects have been corpus-based, and none more so than those dictionaries created for learners of English. The basic tool for analyzing a corpus is a concordancer – a computer program that shows the lexicographer the data for a given word, displaying the word in the middle of a line of context from the corpus.
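As a rough illustration of what a concordancer does, here is a minimal keyword-in-context sketch in Python. The function name `kwic`, the `width` parameter and the two-sentence sample text are all invented for this example; real concordancers work over indexed corpora of many millions of words.

```python
def kwic(tokens, keyword, width=3):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Each line shows up to `width` tokens on either side of the hit.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>25}  {tok}  {right}")
    return lines


# Invented sample text, split on whitespace as a stand-in for real tokenization.
text = "We had fish and chips by the sea . The fish in the aquarium need fresh water ."
tokens = text.split()

for line in kwic(tokens, "fish"):
    print(line)
```

With a corpus of hundreds of millions of words, the same loop would return thousands of lines for a common word, which is the overload problem described next.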

Dictionary-makers have been using this technology for around 20 years, but recent developments in LT mean that we can now do even better. The huge increases in available data – corpora have grown from a few million words to hundreds of millions – have led to a situation where lexicographers are presented with hundreds or even thousands of concordance lines for a single word. There is just too much information to process, so some kind of intelligent summary is required. And this is where LT comes to the rescue.

The first move in this direction was the so-called ‘Mutual Information’ (MI) score, a statistical measure that shows how closely one word is associated with others. Thus ‘fish’ has a high MI score for ‘chips’ but not for ‘daffodil’. MI programs are available that can show lexicographers all the words that seem to occur most regularly close to a particular search word. For ‘fish’, the words with the highest MI score are: chips, slab, aquarium, water, sea, kettle, feed, reborn, meat, drink, sauce, and finger.
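The article does not spell out how the score is calculated. One common formulation, assumed here, compares how often two words occur near each other with how often they would do so by chance: log2 of P(x, y) / (P(x) × P(y)). The sketch below applies that to a tiny invented corpus; the function name `mi_scores`, the window size and the sentences are illustrative assumptions, not the software actually used by lexicographers.

```python
import math
from collections import Counter


def mi_scores(sentences, node, window=3):
    """Score each collocate w of `node` as log2(P(node, w) / (P(node) * P(w))).

    Co-occurrence means appearing within `window` tokens of `node`
    inside the same sentence.
    """
    word_freq = Counter()
    pair_freq = Counter()
    total = 0
    for sent in sentences:
        tokens = sent.lower().split()
        total += len(tokens)
        word_freq.update(tokens)
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pair_freq[tokens[j]] += 1
    # joint/total divided by (f_node/total * f_w/total) simplifies to
    # joint * total / (f_node * f_w).
    return {
        w: math.log2(joint * total / (word_freq[node] * word_freq[w]))
        for w, joint in pair_freq.items()
    }


# Invented toy corpus; a real calculation needs millions of words.
sentences = [
    "fish and chips for tea",
    "we ate fish and chips",
    "the daffodil grew by the wall",
    "the fish swam in the sea",
    "the daffodil and the tulip",
]

scores = mi_scores(sentences, "fish")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.2f}")
```

In this toy data ‘chips’ scores higher than ‘the’, and ‘daffodil’ gets no score at all because it never appears near ‘fish’. Words seen only once can also score highly under this formulation, so practical tools usually apply a minimum frequency threshold.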

But the trouble with lists like this is that they lump together all sorts of words standing in different relations to ‘fish’. They are grammatically blind. Wouldn’t it be nice if, instead of one list, there were separate lists for different grammatical relations? Then all the verbs that ‘fish’ was the subject of would be in one list, the verbs it was the object of in another, adjectives describing fish in a third, and so on.
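The idea of one list per grammatical relation can be sketched in a few lines of Python. The example assumes the hard part, parsing the corpus into (relation, headword, collocate) triples, has already been done; the triples, the relation names and the function `group_by_relation` are all hypothetical and hand-written for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical parser output: (relation, headword, collocate).
# In practice these would be extracted automatically from a large corpus.
triples = [
    ("object_of", "fish", "catch"),
    ("object_of", "fish", "eat"),
    ("object_of", "fish", "catch"),
    ("subject_of", "fish", "swim"),
    ("adj_modifier", "fish", "fresh"),
    ("adj_modifier", "fish", "tropical"),
    ("adj_modifier", "fish", "fresh"),
    ("object_of", "bread", "eat"),
]


def group_by_relation(triples, headword):
    """Collect the collocates of `headword` into one ranked list per relation."""
    lists = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            lists[relation][collocate] += 1
    # Rank by raw frequency here; a statistical score could be used instead.
    return {rel: counts.most_common() for rel, counts in lists.items()}


for relation, collocates in group_by_relation(triples, "fish").items():
    print(relation, collocates)
```

Ranking by raw frequency keeps the example short. A score such as MI, computed separately within each relation, would be one way to bring the most characteristic collocates to the top of each list.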

Achieving something as sophisticated as this would require state-of-the-art LT techniques, and this is exactly what has been done by the ITRI research group at the University of Brighton.

‘Word Sketches’ – a new kind of data

Using a large corpus and some very smart software, researchers have produced ‘word sketches’ for nearly every noun, verb and adjective in English. A word sketch is an automated summary showing how a word combines with other words, with the various combinations grouped into grammatical categories. This gives us an immediate – and extremely revealing – snapshot of how the word behaves. This is the word sketch for fish.

A recent collaboration

In a collaboration which brought together leading LT research and practical lexicography, the word sketches were used by editors working on the new Macmillan English Dictionary for Advanced Learners. Editor-in-chief Michael Rundell said: ‘Using the word sketches was a revelation. They provided my team with tremendously detailed collocational data, and we have passed this on to users of the dictionary with a more comprehensive account of word combinations than anything seen before.’


This is an edited version of an article first published in EL Gazette, August 2002.