Little Language

One spin-off of the Unix Design Philosophy was the realization that it is easier to implement a task-specific language optimized for that task than it is to implement a general-purpose language optimized for all possible uses. The Little Language title was given by Jon Bentley in the Communications of the ACM [Jon Bentley, "Little languages", Communications of the ACM, 29(8):711-21, August 1986.].

What Bell Labs did was to make separate languages for the tasks they found, and optimized them for those tasks. These tasks include:

Pattern matching (and thus Regular Expressions were born)

text line editing (ed/sed)

language grammar (lex/yacc)

Regex record data manipulation (AWK [Awk Language])

shell services (sh)

text formatting (troff/nroff); spawned multiple interacting little languages:

picture drawing (pic, IDEAL)

graph drawing (grap, dot, Gnu Plot)

typesetting mathematics (eqn)

typesetting tables (tbl)

CHEM (chemical structures)

bib, refer (bibliographic notations)

dformat -- data structure/record format drawings

ptx (create permuted KWIC (keyword in context) index)

macro packages like mm, me, ms

incremental rebuilds (make)

preprocessing (cpp, m4)

arithmetic (dc [Dee Cee], bc)

find (non-power users may not have noticed that it supports full boolean expressions over file attributes)

After UNIX left Bell Labs, more other little languages got added to it, including:

shell services (csh, ash, ksh, zsh, bash, tcsh, etc.)

file editing (vi) (vim and several other clones are now big)

sendmail CF configuration language (Turing Complete, I think, but obfuscated)

I know there are more. Could someone fill these in? Thanks.

Little Languages have these characteristics:

may or may not be Turing Complete (see table below)

very efficient to execute (for example, sed scripts don't need to be tokenized; they already are.) both in terms of speed and memory (most cases) in their intended scope.

reasonably simple to express (Regular Expressions are simple enough to be concisely described on a single sheet 8 1/2"x11" paper, single spaced, 10 point font.)

very powerful, within their scope, frequently to the logical extreme of their scope.

pathologically cryptic, requiring users to become experts before using them

Some little languages are not designed with efficiency in mind at all. Instead, it's intended to save the developer time and effort by allowing her to express her intentions at a much higher level of abstraction. This allows the developer's programs to be much shorter than equivalent programs in other languages.

The "little" part primarily refers to the scope of what the language tackles.

(For example, TeX is purely focused on typesetting, so it qualifies as a little language no matter how big the implementation).

Little Languages are related to:

Domain Specific Languages: a little language is a kind of domain specific language in a very narrow domain.

Minimalist Language, a language whose interpreter/compiler can be run on systems with very little RAM. Unfortunately, this typically requires user programs to be much longer than equivalent programs in other languages, as the user manually implements stuff the interpreter left out.

Below is a long discussion about the Turing Completeness of the languages above. Let's summarize here what we know about the Turing Completeness of the above:

Regular Expressions: not TC, equivalent to Finite Automata, but is TC if match and replace and also repeats are allowed

ed/sed: TC (but data may be tricky to represent)

lex/yacc: not TC per se, equivalent to Push Down Automata (Context Free Grammars)

AWK: TC

vi/ex: TC

sh: almost TC, only needs some way to manipulate data structures (dd is enough; dd/sh is even used to distribute CLC-INTERCAL)

troff/nroff: ?

TeX: TC

make: not TC, lacks means for representing data. Needs a shell to do anything.

cpp: some implementations are TC (but loops have to be written by #include'ing the current file)

m4: TC

dc/bc: TC (but data may be tricky to represent)

IDEAL: not TC


Other little languages:

"dc is the oldest language on Unix; it was written on the PDP-7 and ported to the PDP-11 before Unix [itself] was ported." (Ken Thompson)

Tcl "Tool Command Language" ( see "little language" -- wiki.tcl.tk )

"SQL used to be a little language"

the Berkeley Packet Filter (BPF)

control languages for the BPF, such as OpenBSD's pf (www.openbsd.org )

Would Jakarta Struts application configuration (struts-config.xml) be considered a little language? Only if you consider every XML DTD to be a little language.

Rebol Language ??? REBOL isn't a little language itself, but it's designed for creating them. --Gregg Irwin

LOGO ??? Logo is a dialect of Lisp, but it's not a little language. The little language inside Logo is the one related to Turtle Graphics. --Gregg Irwin


Some Little Languages are Turing Complete. For example, I see no reason why somebody couldn't write a Universal Turing Machine interpreter in AWK to prove it. (If that sounds like too much work, you could simulate a URM machine - I would have thought the AWK associative arrays would be great for that.) how about a sed Universal Turing Machine? <queen.rett.polimi.it > -- (That's down, try sed.sourceforge.net )

How about Lex/Yacc, TeX, and AWK?

A related sort of little language is the little language that isn't designed with efficiency in mind at all. Instead, it's intended to save the developer time and effort by allowing her to express her intentions at a much higher level of abstraction.

SQL used to be one of these, I think. Prolog and the various expert system shells also. The idea seems sadly quiescent these days; instead we're all standing around marvelling at bytecode, container libraries, and garbage collection as if these things somehow represented advances in the state of the art.

Lex/Yacc, sure. TeX? Maybe at one time. It no longer seems very little to me anymore. I guess it still could fit, due to ancestry, but I personally would never list it. AWK? probably, though it's also growing. Part of my argument against AWK would be that most things people do in perl could also be done in AWK, and I would certainly not call perl little. AWK is admittedly smaller, and it was originally little (the version on the AIX box at work is easily a Little Language. The version on my Linux box at home probably isn't. For the record: on my Linux box, sed is 35k, mawk is 88k, and gawk is 255k.)

The size of the implementation code is not the point, especially given the Moore's Law side effect that most programs keep growing larger over time, whether they need to or not. Someone might implement any of these things (bc, ed, lex, tic tac toe) in a hundred billion line C++ program that requires a petabyte of RAM sometime in the future, but that wouldn't necessarily mean that it's no longer a "little language" (although it would certainly make the term ironic, and one would have to wonder if bc really needed a real-time ray tracer etc to draw the numerals in a pretty way)

The "little" part primarily refers to the scope of what the language tackles. TeX is purely focused on typesetting, so it qualifies as a little language no matter how big the implementation.

First of all, the Turing Test refers to an AI problem rather than computability.

Turing Complete is framed in terms of mathematical operations on the integers {0,1,2,...} rather than in terms of machine words or whatever other data type you use in practice. (So long as your data types are finite, you could code them up as numbers and compute over them with functions.)

The fact that you can't write a device driver in AWK, or interface with a graphics system is irrelevant from the computability theory perspective. The argument would run like this:

Allowing for finite-space limitations, the computational problems solved by the device driver or graphics system are effectively computable. 2. Unix and its device drivers and graphics systems are effectively computable - so it is possible to write a universal Turing Machine program which emulates the Unix system and its device drivers.(I said it is possible, not that it's fun!)

Write a universal Turing Machine simulator in AWK.

Run the Turing Machine simulator in AWK, and load in the Unix emulator as data.

You end up having solved the graphics, operating systems and device driver computations solely using AWK. The fact that it's not practically feasible doesn't matter.

(An alternative strategy would be to produce a universal Turing Machine to AWK compiler rather than write the emulator.) Then write a C compiler to target the UTM.

Turing Complete is about being able to solve any effectively computable mathematical problem in a language - it's not about practical software architecture issues such as interfacing with device drivers. -- some guy

Greg Mc Farlane wrote a set of macros to demonstrate that vi is Turing Complete; they're distributed as part of the standard vim distribution at www.vim.org . AWK should be no problem; egrep probably isn't Turing Complete; Sed neither. Csh?

First, having been corrected on my misunderstanding of Turing Complete, I believe I've corrected the introductory description.

I'd say shells in general have problems - basically, shells are big Escape Hatches; most of the shell's abilities are to run other programs, and they have very limited abilities on their own.

That being said, I would propose that ksh, bash, and zsh are probably Turing Complete, and Bourne Shell is definitely not Turing Complete.

Why do you think Bourne Shell is not Turing Complete?

Sed? Around 1992 or so, Kevin Braunsdorf told me about a 100-line sed script written to do arbitrary-precision integer mathematics. It wasn't pretty, it apparently got sed banned from all future obfuscated code contests, but it apparently worked. Beyond putting fear into me (I'm not sure if it was because someone could do that in sed, or that someone did do that in sed, that scared me more) it made me decide to not doubt sed's computational abilities.

Here's a C++ program you can't write in sed:

#include <iostream> main() { for(;;) { cout << "foo" << endl; } return 0; }

Er, yes it can:

:loop s/.*/foo/p b loop

I found a proof of sed's turing completeness: robertkotcher.com

Look it's even shorter in sed! >:)


A Little Language is designed for a specific task, and not for throwing at any computational task that comes along. As such, it usually has no need to be Turing Complete - at least, that is what is expected. But along comes a task that lies just on the wrong side of the border of its abilities, and the temptation to extend it a little bit to handle the edge case can be irresistible. If such a Little Language ends up being Turing Complete then it's pretty much by accident, and the result is virtually guaranteed to be a particularly uncomfortable Turing Tarpit.

If a Little Language is designed from the outset to be Turing Complete then those borders don't exist and the pressure from that direction to extend the language is eliminated; it has a better chance of retaining what conceptual purity (coherent semantic model, robustness of implementation, etc.) it started out with, even if it is subsequently extended for other reasons. It may still be a Turing Tarpit in the general case, but if you're working anywhere in the vicinity of its bailiwick, at least you won't suddenly run into an impossibility or gotcha.

For an example of an "accidentally TC" Little Language I nominate sendmail; and for a deliberately TC Little Language, Post Script.


It bothered me that a page on Little Languages should not contain a reference to Jon Bentley or Programming Pearls, so here they are. (Actually, Bentley's column on Little Languages appeared in the second collection, More Programming Pearls.)

Also, I have the feeling that the idea of Language As Interface is relevant (or that, if it isn't, it ought to be).


"sed scripts don't need to be tokenized; they already are"

I've often read this but never known what it means. Would someone explain? -- Anonymous Donor

All sed commands are one byte long, not including arguments. -- Ed Grimm

More languages than just sed have this property or a similar one. The qmail security guarantee talks about using this (and null-terminated lines) under the subject of parsing. dc is similar; there are only one and two character commands, some of which have a mandatory one character argument, and no two-character command overshadows a one character command except !.


If a little language is Turing Complete, many properties of program written in that language become undecidable: Such languages suffer from the Halting Problem.

SQL is another well known little language that has no way to express loops nor recursion, and which is not Turing Complete.

But queries are a form of implicit looping over all records...

In fact, I have written a SQL statement that did run forever reading a table with one row. - Joshua

Yes, there is queries looping over records (you can even make it access multiple tables and do a bunch of other stuff), and you can make it to do things with it by creating a trigger and then inserting into a read-only view that has that trigger attached. They can even be made recursive. You can even do conditionals (with a WHERE clause), and a lot of other stuff. Numeric loops are missing, although a virtual table provides that, so you can use that.

Keeping a little language declarative does have some advantages such as making it possible to use the scripts for multiple purposes (code generation, sanity checks, documentation, ...)

For those interested - see a paper I wrote in 1997 titled Little Languages: Little Maintenance? (it's on my homepage), which also discusses whether a little language should be Turing Complete. -- Arie Van Deursen

Also, the Berkeley Packet Filter (BPF), can be considered as a "little language". It is also decidedly turing-incomplete.

The basic Rebol Language book describes it as a Minimalist Language. Implementations have a very small footprint. Is it appropriate to think of Little Language as a logical categorization of a language while Minimalist Language emphasizes the physical size of the implementation? -- David Ness

Olin Shivers has a nice take on little languages at: www.cc.gatech.edu . He argues, that instead of little languages, we should use a single extensible language plus a lot of libraries. Why redesign things like variables and functions over and over? Wasn't that the idea behind Tool Command Language?

Regular Expressions are also Turing Complete, if they get applied in turn until none of them matches. Actually, plain text substitutions suffice - they are an instance of Semi Thue Grammars. -- Panu Kalliokoski

No...as phrased this isn't right. Plain old regular expressions, as you know, are not TC. Applying them in turn doesn't help, that's just (RE)*, another RE, and strictly speaking, it doesn't even matter if they're NDFA or DFA. What you must have meant to say is that regular expression match and replace is equivalent to a Semi Thue Grammar, since then you can have arbitrary RHS and LHS on each grammar rule. -- Doug Merritt

By which point regular expressions have become overkill, since search-and-replace on plain substrings is sufficient to implement a Semi-Thue grammar.


So why are Little Languages seen to be a good idea but Fourth Generation Languages seen to be a bad idea? The 4GLs I have used seemed to be collections of related Little Languages.

4GL began as a technical term but was emptied of all meaning by marketroids back in the 1980s, so it currently tends to be considered to be marketing fluff unless proven otherwise. Nor is there a "fifth generation language" by which to contrast, despite the passage of time. Terminology trends change. "fourth generation" and "generation" in general used to be commonly applied to many things, not just languages, and now they are (mostly) not. The last usage I personally saw was Bentley calling awk a 4GL, and meaning that in a positive sense (which he described in detail).

Wasn't Prolog given the fifth generation tag by Japanese marketroids? - Scott Elliott

I forgot about that one. But I meant used meaningfully; the Japanese 5th generation label was empty in general, and there was no reason to apply that label to Prolog (except that it was the official language of their 5th Gen. project). "Fourth Generation" was once in a while used in a non-empty way.


One interesting thing about little languages is that they must interface to the world. When they do, it will usually display some awkwardness. Otherwise little languages are really nice.

Take a rather libre view of the classification of little languages, C is good at managing the memory, but some part of the world is better interfaced using OO, there comes the awkwardness of writing GUI applications in C.

Awk and sed are good at parsing flat files. When you want to parse EBNF using awk and sed, you would hate yourself. Yacc is for that.

A few more examples would be good.

When we talked about little languages, there must be mentioned that a good method for these little languages to interface with to the world is just equally important. In UNIX, that is the pipe. With pipe, the little languages can stay little.

The pipe is not perfect though, for it is unable to prevent people from inventing Perl. Or we could say that awk and sed is of no use to yacc and lex. When we want to parsing EBNF with another little language, the pipe can't let us utilize the other previously existing little languages. When we want to use a functional style to parse the EBNF instead of the C used in yacc, again we can not effectively utilize the yacc and lex but have to be forcef to implement our own yet another yacc in our own favorite functional language, ie. haskell etc.. The pipe cannot help us on that.

For a description of a possible class of Little Languages in the Flow Based Programming context, see www.jpaulmorrison.com . Flow Based Programming supports a network of "pipes", and so should be a great environment for hooking together different (types of) Little Languages. -- Paul Morrison


Does Furry Script count? It is domain specific, and I think for mostly a narrow domain; it can be used with other stuff but not very well. What about, Dada Engine and rmutt, which have some similar purposes (even though none of them was based on the other)?



See Minilanguages: Finding a Notation That Sings www.faqs.org


See original on c2.com