[ Jocelyn Ireson-Paine's Home Page | Publications | Dobbs Code Talk Index | Dobbs Blog Version | Snobol-style pattern-matching in R ]

More Technonecrophilia with Snobol One-Liners

I don't think I want to imply that Snobol is dead: to me, it has more vitality than Perl. But I thought I'd let my title thank Mike Swaine for his blog posting Adventures in Technonecrophilia about my Programs That Transform Their Own Source Code; or: the Snobol Foot Joke. Mike concludes that figuring out how to solve a problem in Snobol is like doing a puzzle, and discovering the power hidden in this ancient language is like winning a prize. So I'll pose you two more puzzles. Here is the first, a one-liner that sorts a string:

loop█s█arb█$█pre█len(1)█$█c█len(1)█$█d█*lgt(c,d)█rem█$█post█=█pre█d█c█post█:s(loop)

If you don't see coloured boxes in this one-liner, you're probably using Internet Explorer. I don't know why Microsoft's browser can't depict a Unicode 2588 full block character whereas Firefox can. Perhaps I have an old version. But I'm going to use the coloured boxes later on to explain Snobol syntax, so I'll ask you to switch to Firefox, or some other browser that will show you coloured boxes above. As a test, the thing between these dashes — █ — should be a chartreuse box.

With that out of the way, I'll show you the one-liner again. It's actually a seven-liner, but only one line does the sorting. The others: initialise two global system variables; read the string to be sorted; display it and the result of sorting; and end the program. Here they all are:

  &anchor = 1
  &fullscan = 1
  s = trim(input)
  output = "Original: '" s "'"
loop s arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post = pre d c post :s(loop)
  output = "Sorted : '" s "'"
end

I'll now explain what each statement does. I'm going to refer back to Programs That Transform Their Own Source Code; or: the Snobol Foot Joke, to save re-explaining things that I explained there. I'm also going to link to various sections in Mark Emmer's very useful A Snobol4 Tutorial.

The first statement assigns to a global system variable named "&anchor". Snobol calls these variables keywords. This one affects pattern-matching, as described in Emmer's 4.11 ANCHORED AND UNANCHORED MATCHING. It makes the matcher start every match at the beginning of the string being matched, rather than allowing the pattern to slide along this string until the match succeeds. The string being matched, by the way, is often called the subject string.

The next statement also sets a keyword. This one puts the matcher into fullscan mode instead of quickscan. As Emmer notes in 7.3 QUICKSCAN AND FULLSCAN, quickscan mode uses heuristics to eliminate some possible matches. This speeds up matching, and doesn't usually eliminate matches that would actually be useful, at least in sensible programs acting on realistic data. But it will hurt "pathological" programs designed to test the matcher, for example by deliberately enumerating all possible matches. Because of this, and because I prefer not to have to remember what the heuristics do — and anyway, computers are much faster than when Snobol was designed — I always run in fullscan mode.

These are the next two statements again:

s = trim(input)
output = "Original: '" s "'"

These read the string to be sorted, and display it. In Programs That Transform Their Own Source Code, I said that "output" is an output-associated variable that displays anything you assign to it. Similarly, "input" is an input-associated variable. It is connected to standard input, unless you reassociate it with the built-in "input" function. When you use its value, Snobol reads a line from it, as explained in INPUT/OUTPUT AND KEYWORDS.

Now we come to the heart of my program. Here it is again:

loop s arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post = pre d c post :s(loop)

This is a pattern match statement. It's one I invented after seeing a similar program on the Web, then never being able to find it again. But perhaps I should clarify the syntax. If you've ever tried learning Chinese, you'll enjoy — while wincing at shared pain — David Moser's amusing paper Why Chinese Is So Damn Hard. He mentions the obvious reasons: the characters; the tones; lack of shared word roots; lack of shared culture. But another is that Chinese does not separate words by spaces. So you see "漢語語調稀奇古怪". Do you read it as "漢語語調稀奇古怪" or as "漢語語調稀奇古怪" or as "漢語語調稀奇古怪" or as something else? The Snobol tyro faces an opposite yet analogous problem. Snobol uses spaces too often with too many meanings: and because most programming languages accustom us to ignore spaces, we ignore them in Snobol too. Hence Snobol becomes a programmer's Chinese, its token-separators vanished from conscious view. But behold! By using colour, I restore them to view:

loop█s█arb█$█pre█len(1)█$█c█len(1)█$█d█*lgt(c,d)█rem█$█post█=█pre█d█c█post█:s(loop)

This is what my colours mean:

█ Punctuation. For example, separates a label from the rest of the statement, or a jump from the part of the statement before it.

█ Separates the subject of a pattern match from the pattern following it.

█ Separates the arguments of a binary infix operator from the operator. Unary operators, such as the star in *lgt, don't take spaces.

█ Concatenation, usually of strings or patterns. This is actually a binary operator, albeit one with no visible name.

█ These spaces surround the equals sign that separates the pattern of a pattern match from the replacement string.

With this parsing in mind, we can summarise the statement as:

loop s PATTERN = REPLACEMENT :s(loop)

This matches the string in variable "s" against PATTERN. It then replaces the matched part by REPLACEMENT. REPLACEMENT is "pre d c post", which I'll now make explicit, rewriting the statement as:

loop s PATTERN = pre d c post :s(loop)

The next step in understanding is to realise that "pre", "d", "c", and "post" are string variables, which get set during the pattern match by the "$" operator. In Programs That Transform Their Own Source Code, I explained the "." or conditional assignment operator. I said that "A . B" tells the matcher that if the entire match of which pattern A is a part succeeds, then the substring matched by A must be assigned to variable B. The "$" operator is similar, but assigns to B immediately, without waiting for the match to succeed. The assigned substring can then be used later on in the same pattern. We call "$" the immediate assignment operator. There is more about these two in 4.7 CAPTURING MATCH RESULTS and 5.3 IMMEDIATE ASSIGNMENT.

Now we'll look at the pattern. Here it is again:

arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post

The way to read these patterns is not to worry about how pattern matching is implemented. Instead, read them declaratively, as string specifications built from pattern primitives such as "arb" and "len".

So I shall now note that "arb" is a pattern primitive that matches any string. See 4.8.1 Primitive Patterns in 4.8 UNKNOWNS .

And "len" is a pattern-valued function: len(N) matches N characters. See 4.8.3 Integer Pattern Functions in 4.8 UNKNOWNS.

The next component, "*lgt(c,d)", is explained in 5.2 UNEVALUATED EXPRESSIONS . The function "lgt" is lexicographic string comparison: "lgt(S1,S2)" succeeds if S1 is lexicographically greater than S2. See 2.3.1 Conditional Functions in 2.3 BUILT-IN FUNCTIONS.

The star operator is what makes "lgt(c,d)" into an unevaluated expression. It creates a pattern that calls "lgt(c,d)" and succeeds or fails accordingly. So the subpattern

len(1) $ c len(1) $ d *lgt(c,d)

extracts character c, then character d, then succeeds if c is lexicographically greater than d. We can read it declaratively as:

the one-character substring c followed by the one-character substring d such that c is lexicographically greater than d.

You might like to compare this declarative reading with the ones I gave for Prolog in my essay The Prolog Lightbulb Joke.

The final component, "rem", is the simplest of all. It merely matches the remainder of the string, whatever it is. See 4.8.1 Primitive Patterns in 4.8 UNKNOWNS.

Now knowing these components, we can read the entire pattern declaratively. Here it is yet once more:

arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post

and here is a declarative reading:

any substring pre, followed by the one-character substring c followed by the one-character substring d such that c is lexicographically greater than d, followed by the remainder of the string post.

In other words, any pair of characters c, d such that c is greater than d, anywhere in the string being matched.

Now look at the entire statement again:

loop s arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post = pre d c post :s(loop)

The pattern matches the entire subject string, with pre and post "taking up the slack" either side of c and d. And the "=", explained in 4.9 PATTERN MATCHING WITH REPLACEMENT, tells the matcher to replace the matched portion of the subject string — namely all of it — by an equivalent string but with c and d transposed.

So the pattern match finds a pair of adjacent characters and swaps them. If the match finds such a pair, it will succeed; and if it succeeds, the ":s(loop)" will jump back to the beginning of the statement for another try. If the match doesn't find an out-of-order pair, it will fail, and control will drop through to the next statement. So the statement keeps on inverting out-of-order pairs until none remain.

Here's another one-liner, inspired by a 1998 posting to org.perl.perl5-porters by Mark-Jason Dominus. In Pattern matching in SNOBOL4, Dominus starts by saying:

This note started out as an analysis of SNOBOL4's `FAIL' pattern, and turned into a huge ramble about SNOBOL in general. If it has a point, the point is only that SNOBOL's pattern matching is *still* a lot better than Perl's, and that it is worth studying, because we could learn a lot from it.

Dominus goes on to say that he got sidetracked into thinking about how to transform

"abc"

into

('abc', 'ab', 'a', 'bc', 'b', 'c', '')

How about, he wonders, something related to m//g in Perl? He continues:

m/.*/g wouldn't do it, of course. But this reminded me of a feature in SNOBOL that was useful for similar purposes, and I dug out my SNOBOL book and got sucked in, as I always do, and I came to the same conclusion that I always come to, which is that SNOBOL4 was a remarkably usable language, especially for 1971, and that people should pay more attention to it. It had associative arrays, recursive functions with locally-scoped variables, pattern matching that was better than then Perl's is now. On the other hand, SNOBOL's control flow and syntax are hopelessly 1971.

He then talks about how Snobol's primitive "fail" pattern causes the pattern matcher to backtrack. There's a little example under the entry for FAIL in 7.4 OTHER PRIMITIVE PATTERNS. This demonstrates a pattern that, in unanchored mode, backtracks over the string being matched, displaying its characters one by one.

So how could one use backtracking to enumerate all substrings of a string? Here's my answer. Actually, it doesn't generate the null substring, because a backtracking match that doesn't was easier to code:

  define( "write(x)" ) :(write_end)
write
  output = x
  :(return)
write_end

  &fullscan = 1
  &anchor = 1

  "abc" arb $ pre (len(1) arb) $ x arb $ post rpos(0) *write("'" x "'") fail

end

When I run this, I get the following output:

C:\dobbs>\snobol4\SNOBOL4.EXE substr.sno

SNOBOL4+ Version 2.24. 8087 present.
(c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo. All Rights Reserved.

No errors

'a'
'ab'
'abc'
'b'
'bc'
'c'

And here is a variant that stores the substrings end-to-end in a comma-separated string:

  store_substrings = ""

  define( "store(x)" ) :(store_end)
store
  store_substrings = differ(store_substrings) store_substrings ","
  store_substrings = store_substrings x
  :(return)
store_end

  &fullscan = 1
  &anchor = 1

  "abc" arb $ pre (len(1) arb) $ x arb $ post rpos(0) *store(x) fail

  output = store_substrings

end

To understand this, the only extra thing you need to know about Snobol is the built-in "differ" function, and that's only for the trivial purpose of avoiding a trailing comma in the concatenated substrings. It is described in 2.3.1 Conditional Functions of 2.3 BUILT-IN FUNCTIONS.

Here is the program's output. From what I've written this week and last, and from the other links I've referenced, can you see how it and the previous program work?

C:\dobbs>\snobol4\SNOBOL4.EXE substr2.sno

SNOBOL4+ Version 2.24. 8087 present. (c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo. All Rights Reserved.

No errors

a,ab,abc,b,bc,c

█	Punctuation. For example, separates a label from the rest of the statement, or a jump from the part of the statement before it.
█	Separates the subject of a pattern match from the pattern following it.
█	Separates the arguments of a binary infix operator from the operator. Unary operators, such as the star in *lgt, don't take spaces.
█	Concatenation, usually of strings or patterns. This is actually a binary operator, albeit one with no visible name.
█	These spaces surround the equals sign that separates the pattern of a pattern match from the replacement string.