Soft Wiki

An experimental all-software Wiki With Programmable Content by Zygo Blaxell - c2.com

# General description

Implementation language is a safe subset of Tcl 8.2 or 8.3 (both seem to work) (see The Tclers Wiki for more information on Tcl) with "Soft Wiki extensions" as described herein.

The content of the Wiki is user-editable Tcl scripts which run inside the Soft Wiki interpreter and produce MIME output, including headers. This allows Soft Wiki to host - or generate on the fly - any possible WWW content-type.

HTTP requests are supported via CGI in the usual way. SMTP requests should be supported via an auto-reply mailbox (Soft Wiki generates the reply email headers and text, but cannot alter the source or destination SMTP envelope address), but this isn't implemented yet.

To respond to a request, Soft Wiki loads a "boot" script with a hardcoded name ("/boot/cgi") (different scripts for email and web access) from the Soft Wiki database into the safe Tcl interpreter and evaluates the script. This "boot" script in turn does appropriate input parsing (i.e. CGI parsing if via HTTP) and pass control to another script inside the Soft Wiki database, based on PATH_INFO, QUERY_STRING, Subject header, etc.

Soft Wiki scripts generally have full access to everything from the client a CGI script would have, including file uploads, access to User-agent and Referrer fields, client IP, ability to generate non-HTML output.

Soft Wiki scripts can themselves create safe interpreters which are even more restricted than themselves, thereby implementing their own secure wrappers for other Soft Wiki scripts in a hierarchical fashion. This can be used to implement access control within Soft Wiki itself.

There exist two units of code that are not, and except in unusual circumstances, should not ever be editable through Soft Wiki itself. Both units of code exist outside of the "normal" Soft Wiki environment:

Bootstrap code to create the safe interpreter, set up access to the Wiki database, set resource limits, load the boot page, evaluate it, and attempt to provide an intelligible error report if a fatal error is encountered early in a Soft Wiki script's lifetime. Think of this as a minimal "kernel" on top of which everything else runs combined with a "watchdog" that prevents it running too long. This kernel is minimalistic, so hopefully there is no reason to want to change it except to set parameters such as resource limits and interpreter safety.

The persistent database engine. Having the ability to edit this unit is like having the ability to edit an OS's filesystem code while the OS is still running - not useful in most cases. There are reasons to want to change this, e.g. to use a different database module, or to enhance performance, or to support "invisible" features such as backups or replication. Another problem is that the bootstrap code must access this database somehow, and how does one use the database-reading code if it is located inside the database itself?

There are two units of code that may or may not be editable by users because of site-specific policy. Both units of code may exist and even be editable inside the Soft Wiki environment if the Wiki Wiki Clone operator chooses to allow it:

"Unsafe" Tcl code, which can open socket connections, load Tcl extensions from shared libraries, execute sub-processes, and so forth. Such code can provide proxy services to "safe" Tcl code which implement some kind of security policy. As an example, the "safe" restriction can be simply removed from the bootstrap code, so that the "boot" script in the Soft Wiki database is not restricted to a subset of Tcl. The "boot" script in turn can create its own "safe" Tcl interpreter after establishing a new security policy, usually one that restricts write access to itself.

A simple web form and email address for executing Tcl code inside the Soft Wiki environment. This hardcoded interface bypasses the "boot" page, and enables examination and repair of the database even if (when) someone accidentally damages the boot script. A "secure Soft Wiki" would have to remove or restrict access to this unit, since it can read or modify any database entry. Think of this as the "debugger".

Another simple web form and email address for accessing the most recent global database transaction log. "Most recent" means whatever the site administrator has left on the machine - they are not part of the database proper, so they can be removed at any time. This form is capable of re-inserting a continuous set of transactions in reverse, effectively acting as a "big global undo button" for reversing fatal changes. Soft Wiki itself can access this log and use it to produce more sophisticated functionality (e.g. different forms of Recent Changes or Quick Changes, or restricting the reported set of transactions to those on particular keys).

Hard resource limits are implemented in Soft Wiki using a loadable C extension to Tcl which sets limits on a Unix process. These limits cannot be overridden inside the safe Tcl interpreter, but of course are trivial to override inside the boot script. They are:

One thread

15 real-time seconds (not including HTTP data transfer time)

5 CPU seconds

8 megabytes rss/virtual/data/stack memory

8 megabytes maximum I/O, including database transactions.

Stock Tcl 8.2 (8.3 in testing) interpreter core with no experimental features (e.g. threads) enabled. For those who don't know Tcl very well, this means:

simple, consistent, yet powerful syntax (Lisp with simpler syntax)

namespaces, global and local (lexically and dynamically scoped) variables

Perl-style regular expressions

Perl-style binary strings (actually Unicode, as I discovered the hard way -- Zygo Blaxell)

lists, strings, integers, doubles, booleans, associative arrays, and procedures all convertible to a canonical string representation

no reference type (but some reference semantics in the implementations of various types)

sub-interpreters, support for multiple execution contexts

procedures are transparently byte-compiled at run-time for performance (important when you only get 5 seconds ;-)

some introspection capability can examine procedures on call stack)

exception handling

Soft Wiki visible features of the database server:

Simple key-value dictionary storage:
"softwiki get key" (returns value) and "softwiki set key value" (replaces the database entry for "key" with "value").

Flat namespace for keys, but arbitrary key structure allowed (so "foo bar" and "/home/myself/scripts/homepage.tcl" and "::foo::bar" are all valid)

Sorted key enumeration. Can fetch a list of keys ranging from X to Y (or "/a/b/c/" to "/a/b/d/")

Lock-free transaction support. A transaction consists of two lists of (key, value) pairs. The first list is a list of assertions, while the second is a list of assignments. A transaction is processed by acquiring locks on all of the keys, then verifying all of the assertions' actual values match the values specified in the transaction, then (if all assertions are verified) assigning all of the assignments' values to their respective keys. Transactions are atomic from Soft Wiki's point of view.

All keys have metadata in addition to their values (actually, deleted keys have metadata too). The metadata contains:

A string of audit data (IP address of CGI client, Received header for email, authenticated user ID if authentication is used). This might be stored as a simple index into an audit data database.

The transaction serial number of the transaction that most recently changed the key's value.

The timestamp of the last modification.

A transaction log is accessible from within Soft Wiki. Each transaction is recorded with a serial number, which increases by one on each new transaction. The transaction log records:

A string of audit data, as in the metadata database

A timestamp, the time the transaction was committed

The previous values of all keys in the assignment list

The transaction serial numbers from any keys that were modified. These can be followed as a linked list to find the history of modifications of a key.

Less-visible features:

Database implementation details are hidden from Soft Wiki. Actually, they're protected in a separate process, since Soft Wiki processes are expected to receive a lot of fatal signals from the hard resource limits...

Flat files in a directory sounds like a good place to start.

Sleepycat's Berkeley DB 2.x inside a dedicated Tcl server shared among all Soft Wiki clients sounds like a good place to go...or perhaps a combination of the above with "small" objects in the DB and "large" ones in files.

Transparent data compression might be nice, especially if it can avoid Berkeley DB overflow pages.

If a resource limit (e.g. CPU time) is encountered by a Soft Wiki script during a transaction, it will not violate the atomicity of a transaction, but it may kill the process that initiated the transaction while the transaction is being completed.

The following are probably good ideas, but there are issues to be worked out first before they become practical:

Access to SMTP, HTTP, FTP, NNTP, ...

There's a major potential for abuse here, e.g. port-scanning, email spamming...

Abuse could be mitigated by providing a white-list of URL's that can be accessed via HTTP

SMTP may only be allowed to people who subscribe (with confirmation) to a mailing list. The mailing list itself would have no traffic, it would only provide a mechanism whereby people can elect to receive mail from Soft Wiki, and Soft Wiki scripts would only be able to send mail to subscribers.

HTTP access is useful for implementing middleware.

SMTP access is useful for implementing email notification of changes to Soft Wiki objects.

Scheduled/deferred execution (similar to at/cron)

LambdaMOO allows this

Could be used for garbage collection, regenerating static pages, etc.

Could be implemented by having someone hit a specific URL at regular intervals, i.e. outside Soft Wiki.

Best implementation is probably another "boot" script which would be a Soft Wiki implementation of 'cron'.

Could support queues of commands for "hourly", "daily", "weekly", "monthly", which allows the site administrator to pick the time while allowing the Soft Wiki contributor to pick the frequency.

Could allow larger resource limits because the request load is better defined (no Slashdot Effect, since there is only one request per hour/day/week/month). This could allow very expensive Soft Wiki scripts such as building an index for a full-text search engine.

Forking (spawning new processes or threads)

LambdaMOO allows this

Vanilla Tcl doesn't (but there is experimental thread support)

would probably have to be "spawn", not "fork" - i.e. evaluate a script in a different process

Creates all-new resource limit and ownership issues - a script has to be tracked to figure out if it was created by the web server, or by another script. Combine with HTTP client access and you have scripts that can get more resources by requesting themselves over network loopback, possibly indirected through another WWW mediator.

Wiki Wiki Essence is probably lost when there are "ownership issues."

Perhaps a background priority queue would be useful. A script could specify its priority in the queue in terms of its desired level of resource usage. A script requesting 1 CPU second runs before one requesting 10, and that one runs before a script requesting 100.

Access controls and authority structure

LambdaMOO has access control per member of objects. You can create an object that you don't entirely own, and you can own parts of objects you didn't create. Workable solution to "ownership issues" problem, and they've been doing this much longer than we have.

Can probably be faked with a lot of safe Tcl interpreters and a secure executive. Unfortunately Tcl does not allow safe interpreters to expose and hide commands, not even the safe ones, but it does allow secure command aliases.

In particular, access control to the database can be implemented inside of the safe Tcl interpreter.

Tcl "language" extensions (distinguished from "capability" extensions in that they don't provide new functionality, only new ways to express existing functionality):

Itcl:
object-oriented Tcl extensions - instances, constructors, and destructors.

Expect:
features signal handling and an interactive debugger, among other things. Might cause more problems than it solves, though: the ability to set up a signal handler would mean that resource limits could be circumvented.

Modify the Tcl core to implement a command count quota similar to LambdaMOO's per-process "tick" limit. "info cmdcount" in Tcl tells you how many Tcl commands have been evaluated in an interpreter. It should be trivial to implement an extension to the "interp" command ("interp quota"?) which allows setting a limiting value for "info cmdcount" inside a slave interpreter. Once the quota has been reached, any further Tcl commands in that interpreter will return errors, which will unwind the Tcl stack back up to the master interpreter's "eval" statement.

Tcl "capability" extensions (distinguished from "language" extensions in that they give you access to things that cannot be represented otherwise (vastly increased efficiency counts as a capability - LZW compression in pure Tcl is amusing, but Soft Wiki imposes hard RAM and CPU limits):

gd:
image generation (generate GIFs on the fly, popular in CGI scripts)

Image Magick:
more image manipulation

diff:
show differences between revisions of a page (use this to build a Quick Diff feature inside Soft Wiki itself)


I've been asked many times "why Tcl?", or more about supporting non-Tcl languages in Soft Wiki (Python, Perl, Java, Guile, Lisp, Scheme...). I'd certainly like to be able to use other languages, but there are some issues to resolve first:

Language implementation must feature a restricted interpreter and must pass a security audit to protect its host machine from accidents or abuse. This virtually always implies an open-source implementation.

Language must be available and working on Linux.

Language should be interpreted (but Java is a good example of a compiled language that would be acceptable).

Language should be easy to manipulate with a web browser (so if a tab is not syntactically equivalent to 8 spaces, the language may have problems; ditto languages that use non-ASCII-text source code).

Language should allow writing simple scripts quickly by someone experienced in the language (otherwise it's not very Wiki Wiki, is it?).

Language should be dynamically loadable from a C program.

Language should be easy to extend (positively, by adding new functionality, and negatively, by removing functionality that exists) in C, or a language accessible from C.

Language must respect hard CPU, memory, and I/O resource limits. Unix makes this easy for small interpreters embedded in a single process per client request; however, if the language is something like LambdaMOO, where a big server process handles all client requests, that server will be responsible for resource managements.

So far, Perl and Guile/Scheme are being considered as Soft Wiki's second languages.

The issue of multiple language support raises some interesting (and as yet not entirely resolved) problems:

How do you know in advance what language a Soft Wiki object (all of which are simple text strings) will be written (or interpreted) in?

Possible answer:
A "loader" script (of course written inside Soft Wiki) determines the language using something analogous to "#!" Unix syntax, except with names meaningful to Soft Wiki

How do interpreters of language X access interpreters of language Y?

The apparent Lingua Franca is a function for each language which interprets a string of source code in that language and returns some result as a string. Passing more complex objects around, even when they are supported by both languages involved, is less well-specified.

One could use a two-step approach:
the first step creates an interpreter in the called language as an object in the calling language. The second step is to use the object in the calling language to interpret source code for the called language in that language's interpreter. This would allow control over initialization overhead.

How often and under what circumstances do languages communicate?

The most common case is likely to be a simple case of language X loading a string in language Y, interpreting that string in language Y, and then exiting. Call this the "bootstrap" model, where Tcl is used to select a different language, and all the interesting stuff happens in that language.

The worst case is two object-oriented languages doing inherited method calls into each other.

How does Soft Wiki internal security work when multiple languages are available? Recall that Tcl can implement a hierarchical access control mechanism using nested interpreters. This access control mechanism apparently falls apart when interpreters of other languages are created, unless these other language interpreters have exactly identical access control mechanisms implemented themselves.

One solution is to replace "/boot/cgi" (the Boot script) with "/boot/cgi.tcl", "/boot/cgi.pl", "/boot/cgi.guile", "/boot/cgi.py", "/boot/cgi.class", i.e. one boot script for each language. That also implies one debugger for each language, and indeed one complete set of Soft Wiki infrastructure per language.

Another solution is to implement access control inside the database itself, outside of any interpreter. Again, what language do you express such controls in...

Perhaps the most elegant solution is to find some way to implement each child interpreter's Soft Wiki interface in terms of its parent's. This may require N^2 different implementations for N languages. Ouch.

Thanks for the comments so far. They're really helping me to better understand the problem. -- Zygo Blaxell


Mike Stump suggests Soft Wiki as a first step toward a truly global source code management system. It's a nice idea, but all I designed Soft Wiki for was to be a maintenance-enhanced Wiki Wiki Web.



I've made a system like this in Perl, based off Use Mod. There is a small bootstrap program that runs a community-editable Use Mod-like script. It's live and is able to accept modifications in its code. See Self Programming Wiki or purl.net .


# See also