We've written a number of one-off content translators, usually by adapting old scripts to new circumstances. In a recent conversion we found it useful to make two sites, where the second site records the difficulties met and the solutions used in producing the first. This replaces console.log with hypertext.
See where I wrote more about this.
We've written many times about various import processes, some real, others imagined, often with shared code.
Import Documents — earliest examples explained
Import Annotations — provenance of imported data
Round Trip Import — using hidden item properties
Import, Quotation and Reference — tracking copyrights
We chose one specific publication as our test case: Programming in the Twenty-First Century, a favorite of Joshua's and a now-retired blog. site
We also chose to work in small increments, creating a new page for each experiment. We are looking for a process similar to test-driven design, where the history we leave behind remains executable as a supplement to the history we leave in source-code control. github
Our source was in html, so we would have to read at least the subset used in the blog, now a static corpus. We began by separating original content from boilerplate expanded from the blog's template. Div c1 and c2 stand for columns one and two.
We wrote a fit-for-purpose html parser that first separated tags from non-tags, then reassembled these into a tree based on distinguishing open from close tags. From this we could search for c1 and then p and pre tags within it.
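The three steps above (split, reassemble, search) can be sketched as follows. This is a minimal illustration, not the parser we wrote: the names HtmlNode, tokenize, parseHtml, findAll and textOf are hypothetical, the list of void tags is a guess, and we assume c1 appears as an id or class attribute on a div. Entities are not decoded and a stray angle bracket inside text would confuse it.

```typescript
// Hypothetical node shape; the original parser's types are not shown here.
interface HtmlNode {
  tag: string;          // element name, or "#text" for text between tags
  attrs: string;        // raw attribute text, left unparsed
  text: string;         // content for "#text" nodes, otherwise ""
  children: HtmlNode[];
}

// Tags that never get a close tag (an assumed, partial list).
const VOID_TAGS = ["br", "hr", "img", "meta", "link", "input"];

// Step 1: separate tags from non-tags. A capturing group keeps the tags.
function tokenize(html: string): string[] {
  return html.split(/(<[^>]*>)/).filter((t) => t.length > 0);
}

// Step 2: rebuild a tree, pushing on open tags and popping on close tags.
function parseHtml(html: string): HtmlNode {
  const root: HtmlNode = { tag: "#root", attrs: "", text: "", children: [] };
  const stack: HtmlNode[] = [root];
  for (const token of tokenize(html)) {
    const top = stack[stack.length - 1];
    const closeTag = token.match(/^<\/\s*([a-zA-Z0-9]+)/);
    const openTag = token.match(/^<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>$/);
    if (closeTag) {
      // Pop back to the nearest matching open tag, if there is one.
      const wanted = closeTag[1].toLowerCase();
      for (let i = stack.length - 1; i > 0; i--) {
        if (stack[i].tag === wanted) { stack.length = i; break; }
      }
    } else if (openTag) {
      const node: HtmlNode = {
        tag: openTag[1].toLowerCase(), attrs: openTag[2], text: "", children: [],
      };
      top.children.push(node);
      const selfClosing = /\/\s*$/.test(openTag[2]) || VOID_TAGS.indexOf(node.tag) >= 0;
      if (!selfClosing) stack.push(node);
    } else if (token.charAt(0) !== "<") {
      top.children.push({ tag: "#text", attrs: "", text: token, children: [] });
    }
    // Anything else (comments, doctype) is skipped.
  }
  return root;
}

// Step 3: depth-first search for nodes that satisfy a predicate.
function findAll(node: HtmlNode, test: (n: HtmlNode) => boolean): HtmlNode[] {
  let found: HtmlNode[] = test(node) ? [node] : [];
  for (const child of node.children) found = found.concat(findAll(child, test));
  return found;
}

// Concatenate the text beneath a node, ignoring inline markup.
function textOf(node: HtmlNode): string {
  return node.tag === "#text" ? node.text : node.children.map(textOf).join("");
}

// Usage: find column one, then the p and pre blocks within it.
const sample =
  '<div id="c1"><p>Hello <i>world</i></p><pre>x = 1</pre></div>' +
  '<div id="c2"><p>menu</p></div>';
const column = findAll(parseHtml(sample), (n) =>
  n.tag === "div" && /\b(id|class)\s*=\s*["']?c1\b/.test(n.attrs))[0];
const blocks = findAll(column, (n) => n.tag === "p" || n.tag === "pre").map(textOf);
console.log(blocks);
```

Popping to the nearest matching open tag, rather than insisting on strict nesting, lets a sketch like this tolerate an unclosed tag without losing the rest of the column.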
We created a second metasite using a top-down approach guided by type declarations for the parts we felt we now understood. These we implemented with proven code borrowed from the first site.
One question arose immediately: when would the transformation happen? We chose early, in the init, creating a brief pause each time the server started.
Almost as an afterthought we added a table of contents. We were converting pages asynchronously and in no particular order. We extended one type with the post number extracted from the source file name to sequence our list of titles.
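Sequencing by post number might look something like this. It is a sketch under assumptions: we suppose file names end in digits followed by .html, and the names ConvertedPost, postNumber and sequence, along with the placeholder titles, are invented for the illustration.

```typescript
// Hypothetical shape: a converted page extended with its post number.
interface ConvertedPost {
  file: string;
  title: string;
  post: number;
}

// Assumes names like "12.html"; returns null when no number is found.
function postNumber(file: string): number | null {
  const digits = file.match(/(\d+)\.html$/);
  return digits ? parseInt(digits[1], 10) : null;
}

// Attach post numbers, drop files without one, and sort numerically.
function sequence(pages: { file: string; title: string }[]): ConvertedPost[] {
  const numbered: ConvertedPost[] = [];
  for (const page of pages) {
    const post = postNumber(page.file);
    if (post !== null) numbered.push({ file: page.file, title: page.title, post });
  }
  return numbered.sort((a, b) => a.post - b.post);
}

// Usage: pages arrive in whatever order conversion happens to finish.
const arrived = [
  { file: "12.html", title: "Post twelve" },
  { file: "2.html", title: "Post two" },
  { file: "9.html", title: "Post nine" },
];
const ordered = sequence(arrived).map((p) => p.title);
console.log(ordered);
```

Comparing numbers rather than file names matters here: sorted as strings, "12.html" would come before "2.html".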
The original author's titles read more like headlines, which made for a hollow list of contents. We enriched this list by picking the first paragraph from each translated page, the synopsis in wiki lingo, and adding up to 160 characters of it to each page title. Now we had a concise blog history.
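The enrichment step is small enough to sketch in a few lines. Only the 160-character limit comes from our notes above; the function names, the whitespace flattening and the ": " separator between title and synopsis are assumptions made for the example.

```typescript
// The limit described above.
const SYNOPSIS_LIMIT = 160;

// Take the first paragraph, flatten its whitespace, and cut it at the limit.
function synopsis(paragraphs: string[], limit: number = SYNOPSIS_LIMIT): string {
  const first = paragraphs.length > 0 ? paragraphs[0] : "";
  const flat = first.replace(/\s+/g, " ").trim();
  return flat.length <= limit ? flat : flat.slice(0, limit);
}

// Join title and synopsis; the separator is a hypothetical choice.
function contentsLine(title: string, paragraphs: string[]): string {
  const text = synopsis(paragraphs);
  return text ? title + ": " + text : title;
}

// Usage with placeholder content.
const line = contentsLine("A Headline", ["  First paragraph\n of the post. ", "Second."]);
console.log(line);
```

A hard cut can land in the middle of a word; trimming back to the last space would be an easy refinement if the list reads badly.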