Markup Language

A markup language defines certain text elements that contain meta information about a text. How to layout the text, what fonts to use, whether it is important, relations to other texts.

Some examples:

RUNOFF (the grandaddy of them all; ascii-only, presentation level)

troff/nroff (descendent of RUNOFF; phototypesetting presentation)

SGML (Standard Generalized Markup Language; structural level)

HTML (Hyper Text Markup Language; bastardized SGML, back to presentation level)

Curl (Curl Language; from same DARPA at MIT as w3c and Amaya; replaces HTML+CSS+JS with declarative/functional lang)

TeX, LaTeX (La Tex)

MathML (Math Ml) mathematics markup language using XML

Often the plain text once marked up is barely recognizable, buried within the Markup Language. An exception is PXSL which reverses the process and only marks up the data.


Wiki Wiki is different. It doesn't require a Markup Language. We use White Space and established plain text conventions: Empty lines separate paragraphs, asterisks start list items. Simplicity means that many people can contribute without learning a lot of baggage first. Most of the markup happens implicitly. Putting empty lines between paragraphs is simple and most people don't have to think about it. Nobody has to memorize fancy rules. That is why anybody is able to contribute.

Some of the conventions are getting dangerously close to YetAnotherMarkupLanguage: two, three, and up to six apostrophes for emphasis, square brackets for links (on other Wikis). That's not good: The benefit of not having a Markup Language lies in the simplicity. Let's not lose that advantage by adding more and more tags to the Text Formatting Rules.

Contributors: Alex Schroeder.


self-reinforcing text

Simplicity isn't defined by numbers though - more text formatting rules are fine if they are simple and self reinforcing. Putting links inside square brackets is fine, so long as the resulting formatted text preserves those square brackets - it acts as a small lesson in how to write links. That's why bullet lists start with asterisks - the markup language (which isn't) looks like the final result. An editor of printed texts have all manner of symbols for marking up corrections: a non-capitalized letter which should be is enclosed with a large letter C which is a literal clue to Capitalize, inserts are indicated with a wedge shaped caret, and so on.

A good attribute of a markup language is the ability to scrape it directly from an HTML page and have it appear with bold and italics intact. Wiki markup doesn't quite do this. Is it possible?

Many modern email readers partially implement this -- they accept plain ASCII text email, and recognize http: links and format them so they really work, and do other a few other bits of formatting.

The syntax-highlighting editors seem even closer to what you want to do. They only store the raw text, and regenerate all the highlighting (colors, italics, bold, links, etc.) on-the-fly every time they load a text file.

That's what you want, right? Something that's easy to learn, because you can directly see everything on screen. All the "magic" formatting stuff is regenerated from the letters you can see, rather than stored in some kind of (normally) invisible formatting code.

Rather than the act of creation being some sort of mystery (involving memorizing arcane control codes or hunting around on a menu), it becomes transparent -- you can see exactly which keys the author pressed to create that text.


Many people believe XML should be used for describing structured text, like the text used by Wiki Wiki systems. Their reason is that it's easy to use standard tools like XML parsers, which come in different flavors (DOM, SAX, pull parsers), or tools like XSLT.

The problem I find with these markup languages is that it's very hard for people to write them without using programs that keep track for them of what is allowed to enter in a given context. Moreover the tools used to process documents written in such a markup language are very complex not only to use, but also to implement. Just look at [xml.apache.org Xerces] and [xml.apache.org Xalan], the leading open-source Java tools for XML processing: they're full of bugs, incompatibilities and other problems.

The question is: why do we need all this complexity? The simple annotation of the text, as used by Wiki Wiki (see Text Formatting Rules), should be sufficient.

I think it may have to do with our need to solve problems in a different way. But are they necessarily better?


I happen to like Icoruma, a markup language designed for a very specific purpose, which is to type rules for role-playing games. It can then be compiled into HTML or other formats. Formatting codes include: bold, italic, superscript, index entry, numbered list, unnumbered list, footnotes, tables, six levels of headings, and two types of cross-references. It also supports include files, subroutines, and custom fields; It is even TC (although there is only one kind of loop (iteration through records in a list), you can still call subroutines recursively). For an example, see: zzo38computer.cjb.net


I don't know, guys. It seems like a pretty tough act to follow with XML. The content is still fully human readable and can be formatted to fit within a large number of constraints. Not only that, but the content and its representation can be isolated to any level you choose just by smarts applied to the design of the elements and their attributes. XML gets my vote as the #1 choice for cross-platform data transport.


Other references:


Category Language Feature (may not be a very good match, but it belongs somewhere) Category Text Filter

See original on c2.com