[ Jocelyn Ireson-Paine's Home Page | Publications | Dobbs Code Talk Index | Dobbs Blog Version ]

Arbno, the Cursor, and Snobol Patterns in Prolog

Can any Prolog or SWI-Prolog expert tell me how to implement an equivalent of Snobol's "arbno" pattern for use in Definite Clause Grammars, efficiently and elegantly? This is the pattern that matches an arbitrary number of repetitions of an argument pattern. And how would you define Snobol's "$" and "@"?

In my daily string-hacking, I still use Snobol, despite its age. And I use SWI-Prolog, with Definite Clause Grammars (DCGs) to specify the syntax of strings. Because compared with Snobol patterns and DCGs, which other executable syntax notations stand a snowball's chance in Hell?

But DCGs are only a notation for writing grammars and turning these into parsers. One still has to code the syntactic primitives that the grammar rules will call.

For example, there is a Snobol pattern function named "span" which returns patterns that match a non-empty sequence of characters. The pattern:

span( "abc" )
would match "a", and "ab", and "cb", and "acbbcabb": non-empty sequence of "a"'s, "b"'s, and "c"'s. I can implement this easily enough in DCG rules, but it's nice not to have to bother.

Another example: the "rem" pattern matches the remainder of a string. The "$" operator assigns the string matched by the pattern on its left to the variable on its right. And the "break" pattern function returns patterns that match everything up to a specified character. So the pattern:

break( "XYZ" ) $ Pre rem $ Post
would match "Xabc" and assign the null string to Pre and "Xabc" to Post. It would match "abcXdef" and assign "abc" to Pre and "Xdef" to Post.

(If the above paragraphs don't mean much, you may like to read my explanations of Snobol in Programs That Transform Their Own Source Code; or: the Snobol Foot Joke and More Technonecrophilia with Snobol One-Liners. I also recommend Mark Emmer's very useful A Snobol4 Tutorial, sections from which I've linked to for further reading in these essays.)

Yet another example is the "arb" pattern, which matches an arbitrary string. So the pattern:

arb $ Pre span( "abc" ) $ Mid rem $ Post
would match "XabccbaY", assigning "X" to Pre, "abccba" to Mid, and "Y" to Post.

Which brings me to "arbno". This is a pattern function that implements a kind of higher-order "arb". As already mentioned, it returns patterns that match an arbitrary number of repetitions of the argument. So the pattern:

arbno( span("abc") "X" )
would match "abcXbcXcbaaX". And
arbno( span("abc") "X" ) $ Matched
would match "abcXbcXcbaaX" and assign it to Matched.

That's Snobol pattern matching for you. The spans and the breaks and the arbs and the rems are nice to have around, and I've written SWI-Prolog equivalents which I can call from DCG rules. But I don't like my code for arbno.

Before I show you it, I'll demonstrate my coding style by showing you "span". First, — and assuming you are comfortable with DCGs and the "phrase" predicate, as well as Prolog in general — here is me calling "span" from the SWI-Prolog top level:

2 ?- phrase( span("ab"), "aXXXX", Out ).
Out = [88, 88, 88, 88].

3 ?- phrase( span("ab"), "abbaXXXX", Out ).
Out = [88, 88, 88, 88].

4 ?- phrase( span("ab"), "XXXX", Out ).
false.

5 ?- phrase( span("ab",Matched), "abbaXXXX", Out ).
Matched = [97, 98, 98, 97],
Out = [88, 88, 88, 88].
In this output, "Out" is the third argument of "phrase". As customary, it holds the string remaining after the match. In three of the matches, this is four 88s: that is, four "X"'s. I shouldn't need to remind Prolog programmers that double-quoted strings are merely a handy notation for lists of character codes. The [97, 98, 98, 97] are "abba".

Now I'll define "span":

span( Chars ) -->
  span( Chars, _Matched ).


span( Chars, [ C | RestMatched ] ) -->
  any( Chars, [C] ),
  span_tail( Chars, RestMatched ).


span_tail( Chars, Matched ) -->
  span( Chars, Matched ), !.

span_tail( _Chars, [] ) -->
  [].

I used something else to build "span". It's an uncomplicated pattern function named "any", which returns a pattern that match any of the characters in its argument. Thus, the pattern:

any( "abc" )
matches "a", "b", or "c".

This is "any" in Prolog:

any( Chars ) -->
  any( Chars, _Matched ).


any( Chars, [C] ) -->
  [ C ],
  { memberchk( C, Chars ) }, !.

By the way, you will notice that I have two arities of pattern. The one-argument variant is directly analogous to the Snobol equivalent. Its argument takes the set of characters making up the substring to be matched. (In "span" and "any", at least. In other patterns, this argument may be different. In "len", for instance, it's a number of characters.)

The extra argument of the two-argument variant returns the substring matched. You can see this in my example of "span". I coded this because I found it often useful. Snobol has other ways to get this matched substring, namely the "." and "$" operators. But as I'll explain later on, I can't decide how to implement these efficiently.

Now that I've demonstrated my coding conventions, I'll return to "arbno". Here is a call which matches repetitions of span("abc") followed by "X". It shows the alternative solutions, invoked in SWI-Prolog by typing a semicolon:

6 ?- phrase( arbno( (span("abc"),"X") ), "abcXbcXcbaaX", Out ).
Out = [97, 98, 99, 88, 98, 99, 88, 99, 98|...] ;
Out = [98, 99, 88, 99, 98, 97, 97, 88] ;
Out = [99, 98, 97, 97, 88] ;
Out = [] ;
false.
As before, Out is the string left unmatched by "phrase"'s first argument; 97 is the character code for "a"; and 88 is the character code for "X".

This is how I define arbno:

arbno( Pat ) -->
  arbno( Pat, _Matched ).


arbno( Pat, Matched ) -->
  arbno( Pat, [], Matched ).


arbno( _Pat, MatchedSoFar, MatchedSoFar ) -->
  [].

arbno( Pat, MatchedSoFar, Matched ) -->
  { copy_term( Pat, PatCopy ) },
  $( PatCopy, MatchedHere ),
  { append( MatchedSoFar, MatchedHere, MatchedSoFarAndHere ) },
  arbno( Pat, MatchedSoFarAndHere, Matched ).


$( Pat, Matched, In, Out ) :-
  phrase( Pat, In, Out ),
  append( Matched, Out, In ).

My first clause has the same purpose as the first clauses for "span" and "any". It defines the variant that doesn't return the substring matched. You can see that it ignores the _Matched variable in its tail.

The variant of the predicate that does return the substring matched is next. This is "arbno"'s core, if patterns can have cores. Unlike "span" and "any", it uses an accumulator argument to do so. Its first clause can be read declaratively as:

An arbitrary number of repetitions of Pat can match the empty list. The substring matched before this call followed by the substring matched by it is the same as the substring matched before this call.

The second clause can be read as:

An arbitrary number of repetitions of Pat can match Pat followed by an arbitrary number of repetitions of Pat. The substring matched before this call followed by the substring matched by it is the same as the substring matched before this call, followed by the substring matched by Pat, followed by the substring matched by the remaining repetitions of Pat.

That declarative reading doesn't explain my call to copy_term. This is a standard predicate, which here copies Pat, replacing variables that occur in it by new variables in the copy. Why do I use it? Had I not, the first time Pat was called, it would probably have bound the variables in its arguments. (Whether it would in fact have is up to its author, but there's no point in having variables if you're not going to put something into them.)

But then consider the recursive call to arbno, and hence to Pat. This will need to bind those variables too. But the values it needs to bind them to almost certainly won't be the same as the values the first call bound them to; and so the call will fail. Copying the variables avoids this.

The next tail goal calls my analogue of Snobol's immediate assignment operator. I introduced this earlier, when I said that

break( "XYZ" ) $ Pre rem $ Post
would match "Xabc" and assign the null string to Pre and "Xabc" to Post; or would match "abcXdef" and assign "abc" to Pre and "Xdef" to Post. Here's an example call to my analogue. I've made it a binary operator, but I shan't bother to show the call to "op":
7 ?- phrase( ( break("XYZ")$Pre, rem$Post ), "abcXdef", Out ).
Pre = [97, 98, 99],
Post = [88, 100, 101, 102],
Out = [] ;
false.

In arbno, I am using "$" to grab the substring matched by Pat, which the "append" then appends to make MatchedSoFarAndHere.

Here, again, is my definition of "$":

$( Pat, Matched, In, Out ) :-
  phrase( Pat, In, Out ),
  append( Matched, Out, In ).

But there is a problem. In "$" and in the second clause of arbno, I seem to use too many "append"'s, but I can't think of a more efficient method of getting the substring matched by arbno. Can you? The "append" in

$( Pat, Matched )
seems a particularly egregious way to capture the substring that Pat matches. In effect, it subtracts two strings that I can get easily, namely that remaining before matching Pat, and that remaining after doing so.

Perhaps, though, my arbno isn't the most useful analogue of Snobol's anyway. Because as I explained earlier, my patterns have an optional extra argument that returns the substring they match. You might want to pass these to arbno. Or you might want to call your own predicates that pass information back in variables. But because of my copy_term call, these variables will never get bound.

So here's a different try at an arbno which does capture bindings made by the pattern it calls. It resembles setof/3, in that: it calls a goal (the pattern); and takes a "template" argument which is a term containing the variables whose bindings we want to get back; and returns a list of terms containing these bindings. So by analogy with setof, I'll name it seqof.

Here, I show a call that matches "abcXbcXcbaaX" against an arbitrary number of span("abc") followed by "X"'s. The output shows alternative solutions, with their values for List and Out. List is the list of terms returned by seqof. This contains values for Matched, which holds the substring matched by "span". Out is the substring remaining after the entire match: returned, as before, in "phrase"'s third argument:

8 ?- phrase( seqof(Matched,(span("abc",Matched),"X"),List), "abcXbcXcbaaX", Out ).
List = [],
Out = [97, 98, 99, 88, 98, 99, 88, 99, 98|...] ;
List = [[97, 98, 99]],
Out = [98, 99, 88, 99, 98, 97, 97, 88] ;
List = [[97, 98, 99], [98, 99]],
Out = [99, 98, 97, 97, 88] ;
List = [[97, 98, 99], [98, 99], [99, 98, 97, 97]],
Out = [] .

Next, my definition of seqof. Like arbno, it calls copy_term. This copies Pat, and also the template, which I've named Vars. Copying them both in the same call means the copied variables in Vars will be shared properly with their occurrences in the copied pattern. Calling the copied pattern will bind them (if its author has written it in a sensible way), and they will be captured in the first element of seqof's third argument, the list:

seqof( _Vars, _Pat, [] ) -->
  [].

seqof( Vars, Pat, [ VarsCopy | List ] ) -->
  { copy_term( Pat-Vars, PatCopy-VarsCopy ) },
  PatCopy,
  seqof( Vars, Pat, List ).

To conclude, I've shown you how one can implement equivalents of some Snobol patterns in Prolog DCGs, and thereby facilitate string-processing. Other programmers know so too: I'm sure I saw Richard O'Keefe post about "span" in the SWI-Prolog mailing list, though I can't now find the post. Patterns via DCGs are hardly rocket science, assuming that rocket science means something impossible to comprehend and implement. I suspect, though, that we can't implement all the Snobol patterns as DCGs without forcing the user to write extra arguments to their DCG rules, to pass global pattern-matcher state from pattern to pattern.

I say this because there are Snobol patterns that seem to demand such state. Such as:

  &anchor = 1
  "abcd" arb len(1) $ output @output fail
end

This matches the string "abcd" against an arbitrary substring followed by a single character, the latter matched by len(1). The Snobol "fail" is like a Prolog "fail": it makes the matcher backtrack and try alternative solutions. In this case, solutions for "arb", which will match successively longer substrings.

The "len(1) $ output" assigns len's single character to "output", a variable that Snobol associates with the standard output stream. This displays that character on a line by itself. And the "@output" calls the "@" operator, which assigns the pattern matcher's cursor to "output". The cursor is an integer giving the current position in the string being matched. So the program displays a sequence of characters and character positions:

C:\dobbs>\snobol4\SNOBOL4.EXE at.sno

SNOBOL4+ Version 2.24. 8087 present.
(c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo. All Rights Reserved.

No errors

a
1
b
2
c
3
d
4

But I can't see how to define a pattern predicate that gets the cursor's value, unless one passes it in an extra DCG argument. Which would inconvenience the person calling the pattern predicates, because they'd have to pass this argument into every goal in their rules and and then out again. Although we could, I suppose, write a pre-processor that translates DCG rules into Prolog such that they all have these extra arguments. Except for these extra arguments, it would be the standard DCG translator.