Introduction

In this document, which I am writing as an internal report of the Department of Informatics, University of Minho, I want to explore an intuition that, by using category theory, generalisation can be formalised as an adjunction. I believe that this can be made to apply to all forms of generalisation, including logical induction, statistical curve-fitting, and learning in neural nets. But it remains to be seen whether this is true, and if true, whether it is useful. If it is, the result could be very interesting indeed, because it would provide a way to unify different kinds of machine learning.

I'll write the main line of this report assuming a knowledge of category theory. But if I find time, which I don't have much of at the moment, I'll hyperlink to explanations of some of the concepts.

Informal explanation

What is generalisation?

By generalisation, I mean taking a set of specimen cases and constructing a rule which is sufficient to "explain" all these, and to interpolate, extrapolate, or otherwise predict new cases which fit the rule. (I say "specimen" rather than "example", because I also need to talk about examples of the formalisation described here, and want to avoid ambiguity.) Instances of this include:

• The specimens are two-dimensional points (members of R2). Generalisation is least-squares fitting. The rule is a line giving the best least-squares fit to the points.

• The specimens are logical sentences, classified as positive or negative. Generalisation is logical induction. The rule is a new sentence from which we can infer as many of the positive specimens as possible (hopefully all) and as few of the negative specimens as possible (hopefully none).

• The specimens are pairs <v,w> associating an element of the vector space V with an element of the vector space W. Generalisation is training a linear-associator neural net. The rule is a linear transformation which, if possible, maps each v to its corresponding w, and if that's not possible, does the best it can by finding an optimum transform which minimises interference between the different <v,w> associations.

• The specimens are pairs <p,l> where p is a point in Rn and l is a member of some set L of symbolic labels. For instance, the p might be elements of some space whose dimensions measure a voter's preference for various policies; the l could then be the possible parties, {green,labour,libdem,tory}. Generalisation is nearest-neighbour prediction. The rule is a function which extends the set of pairs <p,l>, mapping every point in Rn to a label. It can therefore be used to predict the preferred party of other voters. I've taken this example from Truth from Trash by Chris Thornton.

• The specimens are the instances of generalisation given in this report. Generalisation is fitting each to a notion in category theory. The rule is a categorical construct...!
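Two of the instances above can be made concrete in a few lines. Here is a minimal Python sketch of least-squares line fitting and nearest-neighbour labelling; the helper names (`fit_line`, `nearest_neighbour`) are mine, chosen for illustration.

```python
# Two of the generalisation instances above, sketched in plain Python.
# Helper names (fit_line, nearest_neighbour) are illustrative, not standard.

def fit_line(points):
    """Least-squares fit: specimens are points in R^2, the rule is a line y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def nearest_neighbour(pairs, p):
    """Specimens are <point, label> pairs; the rule extends them to every point."""
    def dist2(q):
        return sum((qi - pi) ** 2 for qi, pi in zip(q, p))
    return min(pairs, key=lambda pl: dist2(pl[0]))[1]

a, b = fit_line([(0, 1), (1, 3), (2, 5)])   # exactly collinear: y = 2x + 1
party = nearest_neighbour(
    [((0.9, 0.1), "green"), ((0.1, 0.9), "tory")], (0.8, 0.2))
```

In the voter example, `nearest_neighbour` is the rule: it maps every point of the space to a label, extending the original finite set of pairs.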

Generalisation and limits

My intuition is that the generalisation of a set of specimens is, in some sense, a limiting case or least upper bound of them. Ideally, it should contain just enough information to reconstruct them and the other cases to be predicted, but no more. Category theory provides several notions which can be used to formalise this, of which the simplest is that of limit. (Actually, I suppose product is even simpler.)

Example: some very simple logical induction, or conjunction as limit

Here is one of the examples that inspired this intuition. Consider the category Ex whose objects are the atomic propositions P, Q, R, ... and their conjunctions. Let there be an arrow from P to P' if P implies P'. This makes the category into a partial ordering defined by implication.

It seems reasonable to say that forming the conjunction of a set of propositions is one possible (and very crude) way of generalising from them. And informally speaking, the conjunction contains just enough information to imply them all, but none of the others in the category (unless they were implied by the originals anyway). Now, we also know that in this category, their conjunction is their limit. (More correctly, it's a limit of the diagram containing the propositions and the implication arrows between them.) This formalises the notion expressed in the "just enough information" sentence, because of the universal mapping property of the limit. (That is, any other proposition which implies the originals also implies their conjunction.)
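This example can be checked mechanically. In the sketch below (the set-of-atoms representation of propositions is mine) a proposition is a finite set of atoms, conjunction is union, and P implies P' exactly when P's atoms include all of P''s; the universal property of the limit then becomes a subset check.

```python
# Propositions as finite sets of atoms: the conjunction P & Q is the union
# of atom-sets, and P implies P' iff P's atoms include all of P''s.
# (This representation is mine, for illustration only.)

def implies(p, q):
    return q <= p          # p implies q iff atoms(q) is a subset of atoms(p)

def conjunction(props):
    return frozenset().union(*props)

P, Q = frozenset({"P"}), frozenset({"Q"})
limit = conjunction([P, Q])                 # P & Q

# The limit implies every proposition in the diagram...
assert all(implies(limit, p) for p in [P, Q])

# ...and the universal mapping property holds: anything implying all the
# originals also implies (factors through) the conjunction.
R = frozenset({"P", "Q", "R"})
assert implies(R, P) and implies(R, Q) and implies(R, limit)
```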

The problem of multiple languages

Though this was one of the examples behind the original intuition, it is very simple, and we need more mathematical equipment to deal with other kinds of generalisation. One problem is that the language in which the generalisations are expressed will usually be different from that of the specimens.

As an aside (I'm not sure how relevant this is to the main point of the report), this isn't unreasonable. It's reasonable to argue that generalisation, in order to be useful, must achieve some compression. The generalisation formed from a set of specimens should take up less space, consume less computing time, or otherwise be easier to handle than the original data; otherwise there's no point in having it. (Compression won't always be possible: see David Marr's distinction between type-1 and type-2 theories. Also, it's interesting to note that Nick Chater has argued that the saving of cognitive resources can be seen as a reason for the evolution of generalisation in animals.)

Anyway, one way to force compression is by having the language for generalisations more compact than that of the specimens. But this will probably be at a cost: the generalisation language won't be able to express all the information in the specimens, so some detail will be lost.

Functors and multiple categories

I'm going to try formulating this using more than one category, and functors. Let's now assume there are two categories. Ex is the category whose objects are sets of specimens. Gen is the category whose objects are generalisations. Then I claim that generalisation can be modelled as a functor G: Ex -> Gen. The functor will have special properties, which I'll describe later.

Example: more logical induction

Consider the category Ex whose objects are the non-empty sets of sentences e(I) where I is an integer. So one object would be { e(1), e(2) }. Interpret e(I) as meaning "the integer I is a specimen". Interpret the arrows in Ex as set inclusion.

Let Gen be the category whose objects are sentences: either the atomic sentences e(I) or the universally-quantified sentence A X: e(X) (read "for all X, e(X)"). (Unlike the category of sentences in the earlier example, this category does not contain conjunctions.) Interpret the arrows as implication.

Now introduce the functor G: Ex -> Gen, defined as follows. G maps each singleton { e(I) } to the sentence e(I). It maps sets with more than one element to the universally-quantified sentence. It also reverses arrows, mapping set inclusion to reverse implication. (Strictly speaking, this makes G contravariant: a functor from Ex to the opposite of Gen.)

We could say that G implements a simple form of logical induction, rather more interesting than the earlier one. And there are two languages, that of Gen being more restricted than that of Ex, because Gen cannot express conjunctions, and so has to approximate them by quantifying over all possible cases. The functor G is "doing the best it can" in these circumstances.
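The action of G on objects and arrows can be sketched over a toy encoding (the encoding is mine): objects of Ex are non-empty frozensets of integers I, standing for { e(I), ... }, and objects of Gen are either a pair ("e", I) or the quantified sentence.

```python
# The functor G from the example above, over a toy representation.
# Ex-objects: non-empty frozensets of integers. Gen-objects: ("e", I) or ALL.

ALL = "A X: e(X)"

def G(s):
    """Map a set of specimens to a sentence of Gen."""
    if len(s) == 1:
        (i,) = s
        return ("e", i)
    return ALL

def gen_implies(g, h):
    """Implication in Gen: ALL implies every e(I); each sentence implies itself."""
    return g == h or g == ALL

# G reverses arrows: an inclusion S <= S' in Ex becomes G(S') => G(S) in Gen.
S, S2 = frozenset({1}), frozenset({1, 2})
assert S <= S2
assert gen_implies(G(S2), G(S))
```

Because Gen has no conjunctions, every set with more than one element is collapsed onto the single quantified sentence: this is G "doing the best it can".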

Example: least-squares fitting

Another example with a similar structure follows. Let Ex be the category whose objects are the non-empty sets of collinear elements of R2. Once again, let the arrows be set inclusion.

Let Gen be the category whose objects are either the singletons { x in R2 } or infinite lines in R2. Let the arrows be set inclusion.

Then let the functor G map each singleton to itself, and map each set S with more than one element to the line going through all S's points. G maps inclusions to inclusions. As with the previous instance, G flattens most of Ex into Gen.
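Here is a sketch of this functor (the encoding of points and lines is mine, and for simplicity it ignores vertical lines, representing a line as the pair (a, b) in y = a*x + b). The check at the end is that G maps an inclusion in Ex to an inclusion in Gen: the image of a singleton lies on the image of any larger collinear set containing it.

```python
# The collinear-fitting functor G: singletons map to themselves, larger
# collinear sets map to the (non-vertical, for simplicity) line y = a*x + b
# through all their points.

def line_through(points):
    (x0, y0), (x1, y1) = points[0], points[1]
    a = (y1 - y0) / (x1 - x0)
    return ("line", a, y0 - a * x0)

def G(s):
    pts = sorted(s)
    if len(pts) == 1:
        return ("point", pts[0])
    return line_through(pts)   # any two points determine the line, by collinearity

def gen_included(g, h):
    """Inclusion in Gen: equality, or a point lying on a line."""
    if g == h:
        return True
    if g[0] == "point" and h[0] == "line":
        (x, y), a, b = g[1], h[1], h[2]
        return abs(y - (a * x + b)) < 1e-9
    return False

S = frozenset({(0.0, 1.0)})
S2 = frozenset({(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)})   # collinear: y = 2x + 1
assert S <= S2
assert gen_included(G(S), G(S2))   # inclusion in Ex goes to inclusion in Gen
```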

All the instances above can be formalised as adjunctions. G then becomes a functor which maps a set of specimens to an object which is in some sense the "completion" of that set. It acquires a right adjoint E which maps this object back to the set of all possible specimens derivable from this completion object.

The best way to think about what this means in each example may be to think in terms of the unit and counit: an adjunction from Ex to Gen is determined by the two functors G and E and by two natural transformations, the unit i: IEx => G;E and the counit j: E;G => IGen (writing ";" for composition in diagram order, so that (G;E)(x) = E(G(x))). Given any object g in Gen, there is a morphism taking each G(E(g)) to g. Since E maps g to the set of all possible specimens, and G should map that set back to the original generalisation, this morphism is the identity. Hence we get one of the natural transformations.

In the other direction, given any object x in Ex, the composite G;E will map it to the set of all possible specimens of which x is a part. There is an inclusion from x to E(G(x)), and since this respects the other inclusions in Ex, once again we get a natural transformation: the unit i: IEx => G;E.
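Both transformations can be checked in the e(I) example, if we replace "all possible specimens" by a finite universe U of integers (a finite stand-in, mine, so that E's values are computable).

```python
# Unit and counit of the adjunction in the e(I) example, over a finite
# universe U standing in for "all possible specimens".

U = frozenset(range(5))
ALL = "A X: e(X)"

def G(s):
    """Generalise: singleton to atomic sentence, anything larger to ALL."""
    if len(s) == 1:
        (i,) = s
        return ("e", i)
    return ALL

def E(g):
    """Right adjoint: map a sentence to the set of specimens it accounts for."""
    return frozenset({g[1]}) if g != ALL else U

# Unit i: IEx => G;E — every set of specimens includes into E(G(-)) of itself.
for x in [frozenset({2}), frozenset({1, 3}), U]:
    assert x <= E(G(x))

# Counit j: E;G => IGen — here it is the identity: G(E(g)) == g.
for g in [("e", 0), ("e", 4), ALL]:
    assert G(E(g)) == g
```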

(I need to relate this to the notion of limiting amount of information, showing how that arises from the adjunction.)

The problem of noise

In the least-squares example, I stipulated that the sets of points in Ex must be collinear. This isn't very realistic: in real life, most sets of specimens will not exactly fit a line.

In general, given any useful generalisation method, some sets of specimens will not have an exact generalisation. You can say that this is due to experimental error, or to an inadequate statistical model, or to an insufficiently powerful description language, or to restricted storage capacity, or whatever.

In the line-fitting example, we can try fixing this by extending the generalisation functor G so that it maps each set of non-collinear points to the line minimising the least-squares distance to them. [What if, in this or other examples, there is no unique best generalisation?] The problem with this is that G then no longer respects the arrows between the sets which are the objects in Ex. Originally, if S was a subset of S', then G(S) was a subset of G(S'). But this is no longer necessarily true: if the points in S are collinear, and we make S' from S by adding a point that isn't on the line, then the best-fit line to S' will be different from that to S, so the inclusion won't hold.
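The failure is easy to exhibit numerically. In the sketch below, `fit_line` is an ordinary closed-form least-squares fit (the name is mine): adding one point off the line changes the best-fit line, so the old line no longer includes into the new one.

```python
# Extending G to noisy data breaks functoriality: adding a point off the
# line changes the best-fit line, so G(S) is no longer "inside" G(S').

def fit_line(points):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

S = [(0, 0), (1, 1), (2, 2)]        # collinear: best fit is y = x
S2 = S + [(3, 10)]                   # S' = S plus a point well off the line

assert fit_line(S) == (1.0, 0.0)
assert fit_line(S2) != fit_line(S)   # the lines differ, so inclusion fails
```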

One might argue that the way to fix this is not to decide a priori that the morphisms in Ex should be inclusions, but to let them be determined by the morphisms in Gen. Philosophically speaking, perhaps this is reasonable - we don't perceive the raw data directly, but always fit it to a preconceived model.

Conclusion

This report is incomplete in several ways, bibliography and proofs amongst them. I would also like to add explanations for those who don't know category theory. Apart from that, this is still very early in the research. I need to deal with noise, and impose sensible structures on the categories. It will then be necessary to show that enough instances of generalisation can be formalised as adjunctions.

I also need to demonstrate that the formalisation is useful. One way to try using it would be to compare different generalisation methods. If we're using category theory correctly, it should make this quite easy. An interesting possibility would be to take non-symbolic learning methods such as neural nets, and to try laying them alongside some derived symbolic system, showing how symbols can emerge. Good examples to try this with might be those used by Stephen Harnad in his work on categorical perception.

Another possibility is to try deriving learning algorithms from the definition. Algorithms have been derived from categorical definitions in other areas of computer science, for example by Rod Burstall ...

I also need to restrict the definition. I do not claim that all adjunctions correspond to generalisations - it's hard to see how the functor mapping sets to their free groups would, for instance. So one thing we need to do is to restrict the definition so as to capture the essential properties of generalisation. Part of this may be that all elements of an object in Gen are "on the same level" in the sense that the elements of a free group or monoid are not. From the set {a,b}, we can generate the free monoid 0 (the empty word), a, b, aa, ab, ba, bb, aaa, aab, ...: some of its elements are generated from others in a way that those in the objects of Gen are not. But perhaps this just reflects a lack of interesting structure in Gen.
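The "different levels" in the free monoid are visible if we enumerate it by word length, as in this small sketch (the function name is mine):

```python
# Enumerate the free monoid on a set of generators, by word length.
# Longer words are generated from shorter ones - the "levels" mentioned
# above - unlike the elements of an object of Gen.

from itertools import product

def free_monoid(gens, max_len):
    words = [""]                     # "" plays the role of 0, the empty word
    for n in range(1, max_len + 1):
        words += ["".join(w) for w in product(gens, repeat=n)]
    return words

words = free_monoid(["a", "b"], 3)
# words begins: '', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', 'aab', ...
```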

It would be interesting to connect this research with that described in Goguen's paper What is Unification?. There, he formalises unification in terms of equalisers and limits. The least-general generalisation of two terms, as used in logical induction by inverse resolution, is in some sense dual to unification. So we could try dualising his formalisation to model "anti-unification", and if this works, try broadening it to other kinds of generalisation. (Perhaps doing this will end up with a categorical formulation equivalent to the one I'm working on here.)

Acknowledgements

To RENFE and to Bom Jesus.

25th June 2000.