How words are represented - feature-sets

After that general introduction, I'm now reverting to the original text of my write-up. This is a description of how I modified the Alvey Morphological Analyser so it could also do morphological generation. Will it mean anything to you? Perhaps, or perhaps not. Anyway, it may give some impression of what Natural Language Processing (NLP) is about.

The Assistant stores syntactic information about words - their parts of speech, and so on - but no information about their meaning. It represents this syntactic information as feature-sets, which are, essentially, sets of attribute-value pairs. In this section, I shall briefly describe how these are used, and how their representation differs between the Assistant and the MA: a difference which complicates the task of interfacing the two. If you are familiar with this, then skip to the next section.

There are various ways to map syntactic information onto features and values. One intuitive way is to have a main feature corresponding to the part of speech (noun, verb, and so on) and subsidiary features corresponding to the number of a noun, the number, person, and tense of a verb, and so on. Using such a mapping, and writing our feature-value pairs inside square brackets, the feature-sets for mice and moved look like this:

[ Root MOUSE, PartOfSpeech Noun, Number Plural ]
[ Root MOVE,  PartOfSpeech Verb, Tense Past ]
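
As a rough illustration only (this is not code from the Assistant or the MA), feature-sets of this simple kind can be modelled as mappings from feature names to values. The Python sketch below reuses the feature names from the two examples above; the variable names and the helper `has_features` are hypothetical.

```python
# Illustrative only: feature-sets as plain attribute-value mappings.
mouse_plural = {"Root": "MOUSE", "PartOfSpeech": "Noun", "Number": "Plural"}
move_past = {"Root": "MOVE", "PartOfSpeech": "Verb", "Tense": "Past"}


def has_features(feature_set, required):
    """Return True if every pair in `required` also appears in `feature_set`."""
    return all(feature_set.get(name) == value for name, value in required.items())


# A plural noun matches a request for "any plural noun"; a past-tense verb does not.
print(has_features(mouse_plural, {"PartOfSpeech": "Noun", "Number": "Plural"}))
print(has_features(move_past, {"PartOfSpeech": "Noun", "Number": "Plural"}))
```

Because a feature-set is just a set of pairs, the order in which the features are written carries no meaning.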

In fact, there are many different linguistic theories about how to represent syntactic information, and many corresponding ways of mapping such information into feature-sets. The Assistant uses a representation designed for use with its PATR grammar. This representation, described in EA report?, is designed to convey explicitly the information needed when detecting and correcting errors, and is also intended for efficient parsing. The example below shows how the Assistant represents the noun abacus:

[ CATEGORY N, SUBCAT NULL, CONJ NULL, NFORM NORM,
  PER 3, PLU -, COUNT +, PN -, POSS -, NUM -, PRO -
]

This is actually stored as a structure with one field for each feature. The CATEGORY feature gives the main word class: noun, verb, and so on. This feature is special: it is not stored in a field, but is given by the type (in the Lisp sense, type-of) of the structure.
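
The Assistant itself is written in Lisp, and its actual structure definitions are not shown here. Purely as an analogy, the Python sketch below mimics the idea with a dataclass: one field per feature, with the category read off the type of the object rather than stored in a field. The class `N` and the helpers `category` and `as_feature_set` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class N:
    """Hypothetical stand-in for a noun structure: one field per feature."""
    SUBCAT: str
    CONJ: str
    NFORM: str
    PER: str
    PLU: str
    COUNT: str
    PN: str
    POSS: str
    NUM: str
    PRO: str


def category(word):
    # CATEGORY is not a field: it is taken from the type of the structure.
    return type(word).__name__


def as_feature_set(word):
    # Rebuild the full attribute-value view, putting CATEGORY back in.
    pairs = {"CATEGORY": category(word)}
    pairs.update({f.name: getattr(word, f.name) for f in fields(word)})
    return pairs


abacus = N(SUBCAT="NULL", CONJ="NULL", NFORM="NORM", PER="3", PLU="-",
           COUNT="+", PN="-", POSS="-", NUM="-", PRO="-")
print(as_feature_set(abacus))
```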

The MA comes with a pre-supplied lexicon. Rather than build our own from scratch, we have decided to use that, changing it as little as possible. It uses a different set of features and values, derived from Generalised Phrase Structure Grammar (GPSG). Whereas the Assistant has its CATEGORY feature for the main word class, the MA lexicon represents word classes as clusters of the N, V, and BAR features. N +, V -, BAR 0 indicates a lexical noun; N -, V +, BAR 2 a verb phrase; and so on. From now on, when I talk about the "MA", I shall take this to include the pre-supplied lexicon. Here is how it would represent abacus:

[ BAR 0, N +, V -, FIX NOT, SUBCAT NULL, INFL +,
  POSS -, PRO -, PN -, PLU -, NFORM NORM, PER 3,
  NUM -, CONJ NULL, AT +, LAT +, COMPOUND NOT,
  COUNT +
]

This is stored as a list of lists, each inner list being a pair (<feature> <value>).
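
To make the list-of-pairs layout concrete, here is a small Python sketch (an analogy, not the MA's own Lisp code) that writes the abacus entry as a list of two-element lists and reads the N, V, BAR cluster back out of it. The helpers `lookup` and `word_class_cluster` are hypothetical.

```python
# The MA-style entry for "abacus", as a list of [feature, value] pairs.
abacus_ma = [
    ["BAR", "0"], ["N", "+"], ["V", "-"], ["FIX", "NOT"], ["SUBCAT", "NULL"],
    ["INFL", "+"], ["POSS", "-"], ["PRO", "-"], ["PN", "-"], ["PLU", "-"],
    ["NFORM", "NORM"], ["PER", "3"], ["NUM", "-"], ["CONJ", "NULL"],
    ["AT", "+"], ["LAT", "+"], ["COMPOUND", "NOT"], ["COUNT", "+"],
]


def lookup(feature_set, feature):
    """Return the value paired with `feature`, or None if it is absent."""
    for name, value in feature_set:
        if name == feature:
            return value
    return None


def word_class_cluster(feature_set):
    """Return the (N, V, BAR) values that together encode the word class."""
    return tuple(lookup(feature_set, name) for name in ("N", "V", "BAR"))


# For abacus this is the cluster N +, V -, BAR 0, i.e. a lexical noun.
print(word_class_cluster(abacus_ma))
```

Unlike a structure with named fields, finding a feature in this layout means searching the list, which is one of the practical differences between the two representations.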

In principle, the lexicon is independent of the analyser, and we could write a different one which uses the same mapping as the Assistant. In practice, doing so would be a vast amount of work, and it is easier to translate the representations dynamically, as discussed in EA report?. This means that we are stuck with the job of translating between the MA's and the Assistant's feature-sets, which affects how we interface morphological generation to the Assistant.
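
The following Python sketch suggests what one direction of such a translation might look like. It is not the translation actually used. It rests on two loudly stated assumptions: that the only category handled is the noun, whose cluster (N +, V -, BAR 0) is the one case given in the text; and that features appearing in both abacus examples carry over with unchanged names and values. Features found only in the MA entry (FIX, INFL, AT, LAT, COMPOUND) are simply left unspecified. The names `CATEGORY_TO_CLUSTER` and `assistant_to_ma` are hypothetical.

```python
# Assumption: only the noun case is grounded in the text above.
CATEGORY_TO_CLUSTER = {
    "N": [("N", "+"), ("V", "-"), ("BAR", "0")],
}


def assistant_to_ma(assistant_features):
    """Turn an Assistant-style mapping into an MA-style list of (feature, value) pairs."""
    category = assistant_features["CATEGORY"]
    if category not in CATEGORY_TO_CLUSTER:
        raise ValueError(f"no N/V/BAR cluster known for category {category!r}")
    # Replace the single CATEGORY feature by its N/V/BAR cluster.
    ma_features = list(CATEGORY_TO_CLUSTER[category])
    for name, value in assistant_features.items():
        if name != "CATEGORY":
            # Assumption: shared features keep the same name and value.
            ma_features.append((name, value))
    return ma_features


abacus_assistant = {
    "CATEGORY": "N", "SUBCAT": "NULL", "CONJ": "NULL", "NFORM": "NORM",
    "PER": "3", "PLU": "-", "COUNT": "+", "PN": "-", "POSS": "-",
    "NUM": "-", "PRO": "-",
}
print(assistant_to_ma(abacus_assistant))
```

A real translator would also need a rule for each remaining category and for any feature whose name or values differ between the two systems.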

In the next section, I shall describe how I modified the MA to do morphological generation. I have also had to implement some routines which translate the Assistant's feature-sets into those used by the lexicon: these are described in a separate section.

Jocelyn Ireson-Paine
Wed Feb 14 17:12:29 GMT 1996