Extract to Files

Sources vary widely in how much data is available and how quickly it can be retrieved. Cache each extract as flat, raw JSON files that can be examined and processed repeatedly.

Script the extraction so that it can be repeated daily to get fresh data. Design scripts to run both locally for testing and in batch under automatic control. A common debugging scenario is to rerun the extract script locally for a source whose extract failed, adding print statements as necessary.

./data/<source>/raw.json
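
A minimal sketch of such an extract script, written for a POSIX shell. The endpoint URL is a stand-in for whatever the real source offers; only curl and standard tools are assumed.

#!/bin/sh
# extract.sh — fetch one source and cache it as raw json
# usage: ./extract.sh <source>   (by hand while debugging, or daily from cron)
set -e
source=$1
mkdir -p "./data/$source"
# stand-in URL; substitute the real API or download location for this source
curl -s "https://data.example.com/api/$source" > "./data/$source/raw.json"

Running the same script from cron covers the daily batch case; running it by hand with extra print statements covers the debugging case.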

Choose the simplest extraction methods that will yield identifiable nodes first. Prefer raw data that is organized as an array of objects, each destined to become a node.
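
For example, a raw.json in the preferred shape might look like the following, where the field names are purely illustrative:

[
  {"id": "alpha", "title": "First Item"},
  {"id": "beta", "title": "Second Item"}
]

Each element of the array maps to one node.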

Format raw JSON files to be easily read with command-line tools and text editors. Use tools and editors that can read large files without choking. Expect to text-search these files to find missing or malformed records.
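
Assuming jq is available, a few commands of this sort, run from inside ./data/<source>/, cover both the formatting and the searching; the "id" field is again only illustrative:

# pretty-print so every record spans predictable, greppable lines
jq . raw.json > pretty.json && mv pretty.json raw.json

# count records, then hunt for missing or malformed ones
jq length raw.json
grep -n '"id": null' raw.json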

With each extraction we create an additional file of source details that documents when and how the extraction took place.

cat <<EOF >explain.yml
date: `date`
description: "blah blah blah"
links:
- http://github.com ...
- http://data.example.com ...
EOF