Composition of Digital Knowledge

In the pre-digital era, Composition was never much of a problem. A scientist would take a few research articles or monographs describing the various ingredients, and then write down their composition on a fresh sheet of paper. Variations in the notations across different sources would be no more than an inconvenience. Our pre-digital scientist would translate notation into concepts when reading each source, and the concepts into his or her preferred notation when writing down the composition. As long as the concepts match, as they do in any mature field of science, that is routine work.

Composition of digital knowledge is very different. The items to be composed must be matched not only in terms of (human) concepts, but also in terms of the syntax and semantics of a formal language. And that means that all ingredients must be expressed in the same formal language, which is then also the language of the composed assembly.

If we start from ingredients expressed in different languages, we have basically two options: translate everything to a common language, or define a new formal language that is a superset of all the languages used for expressing the various ingredients. We can of course choose a mixture of these two extreme approaches. But both of them imply a lot of overhead and add considerable complexity to the composed assembly. Translation requires either tedious and error-prone manual labor, or writing a program to do the job. Defining a superlanguage requires implementing software tools for processing it.

As an illustration, consider a frequent situation in computational science: a data processing program that reads a specific file format, and a dataset stored in a different format. The translation option means writing a file format converter. The superlanguage option means extending the data processing program to read a second file format. In both cases, the use of multiple formal languages adds complexity to the composition that is unrelated to the real problem to be solved, which is the data analysis. In software engineering, this is known as “accidental complexity”, as opposed to the “essential complexity” inherent in the task.
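The translation option can be made concrete with a toy converter. The sketch below, a minimal illustration and not any particular tool, converts a CSV dataset into JSON using only the Python standard library; the format choice and the function name are assumptions for illustration.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    # Read the CSV text into a list of row dictionaries keyed by the header,
    # then serialize that list as JSON. This whole function is "accidental
    # complexity": it exists only because two formats must be reconciled.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

dataset = "x,y\n1,2\n3,4\n"
print(csv_to_json(dataset))  # [{"x": "1", "y": "2"}, {"x": "3", "y": "4"}]
```

Note that even this trivial converter embodies semantic decisions, such as treating every value as a string, that a real conversion would have to get right for the composed computation to be correct.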

As a second example, consider writing a program that is supposed to call a procedure written in language A and another procedure written in language B. The translation option means writing a compiler from A to B or vice versa. The superlanguage option means writing a compiler or interpreter that accepts both languages A and B. A mixed approach could use two compilers, one for A and one for B, that share a common target language. The latter solution seems easy at first sight, because compilers from A and B to processor instructions probably already exist. However, the target language of a compiler is not “processor instructions” but “the processor instruction set plus specific representations of data structures and conventions for code composition and memory management”. It is unlikely that two unrelated compilers for A and B have the same target language at this level of detail. Practice has shown that combining code written in different programming languages is always a source of trouble and errors, except when using tools that were explicitly designed from the start for implementing the superlanguage.
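The point about data representations can be illustrated without writing a compiler. The sketch below uses Python's `struct` module to compute the size of one and the same logical record, a character followed by a double, under two layout conventions: the platform's natural alignment and a packed layout. On common 64-bit ABIs the two sizes differ, so two compilers that disagree on this single convention already cannot exchange such a record directly. The format strings are standard `struct` syntax; the specific sizes depend on the platform.

```python
import struct

# One logical record: a char followed by a double.
# "@cd": native alignment rules, as a typical C compiler would lay it out
#        (usually 16 bytes on 64-bit platforms, due to padding).
# "=cd": packed layout with no padding (always 1 + 8 = 9 bytes).
natural = struct.calcsize("@cd")
packed = struct.calcsize("=cd")
print(natural, packed)
```

Two toolchains that pick different answers to this one question produce binaries that cannot safely pass this record to each other, which is exactly why "the same processor" is not "the same target language".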

Many of the chores and frustrations in the daily life of a computational scientist are manifestations of the composition problem for digital knowledge. Some examples are

* file format conversion, as explained above
* combining code in different languages, also explained above
* software installation, which is the composition of an operating system with libraries and application-specific software into a functioning whole
* package management, which is an attempt to facilitate software installation that re-creates the problem it tries to solve at another level
* software maintenance, which is the continuous modification of source code to keep it composable with changing computational environments
* I/O code in scientific software, which handles the composition of software and input data into a completely specified computation
* workflow management, which is the composition of datasets with multiple independently written and installed software packages into a single computation
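The workflow-management case can be reduced to a toy sketch. Everything here is hypothetical: `tool_a` stands for an upstream program that emits whitespace-separated values, `tool_b` for a downstream one that expects comma-separated values, and the adapter between them is pure accidental complexity, code that exists only because the two tools were written independently.

```python
def tool_a():
    # Hypothetical upstream tool: produces whitespace-separated values.
    return "1 2 3"

def adapter(text):
    # Accidental complexity: reconciles two formats, contributes nothing
    # to the actual analysis.
    return ",".join(text.split())

def tool_b(csv_text):
    # Hypothetical downstream tool: consumes comma-separated integers
    # and computes their sum (standing in for "the real analysis").
    return sum(int(v) for v in csv_text.split(","))

result = tool_b(adapter(tool_a()))
print(result)  # 6
```

Real workflow managers are largely collections of such adapters plus machinery for scheduling them, which is why they appear in the list above as both a solution and an instance of the composition problem.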

These examples should be sufficient to show that the management of composition must be a high-priority consideration when designing formal languages for digital scientific knowledge.

~

See Konrad Hinsen's Scientific notations for the digital era paper. arxiv

Re: Jon Udell, Federated Wiki for teaching and learning basic Composition. post

And here we come to something that is very hard to get about federated wiki, which Mike Caulfield would love your help explaining. Ward likes to say his interest is in whether you can take a few simple ideas and data structures and make them really generative. And so you have this simple idea of paragraphs as the atomic structure rather than pages, and suddenly the lousy revision histories we have turn into this beautiful, almost poetic view.

Knowledge is a particular property that matter can have in our universe. This way of looking at knowledge breaks with a long-standing tradition that sees it as a mainly anthropomorphic, subjective concept. According to this tradition, knowledge presupposes a sentient being, such as a human: knowledge would exist only in minds, and would therefore be subjective.