Three times now I have set out to document the federation search data flows, and each time I make a new site. This is usually motivated by some bug. It seems I never get the documentation finished after the bug has been found and fixed.
Our new work mimics the freeform data entry used in the example from SigMod Example Unbound.
# Sitemap
The scrape runs every six hours on a schedule that shifts with daylight saving time. It is built from scripts that manipulate files in directories. Some files are rolled up from similarly named files in subdirectories.
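The roll-up step can be pictured with a minimal sketch. It assumes one subdirectory per site, each holding a line-oriented text file of the same name; the function name `roll_up`, the file name, and the choice to deduplicate and sort are illustrative assumptions, not a description of the actual (mostly Ruby) scripts.

```python
from pathlib import Path


def roll_up(root: Path, name: str) -> Path:
    """Merge every <root>/<subdir>/<name> into a single <root>/<name>.

    Hypothetical layout: one subdirectory per site, one item per line.
    Blank lines are dropped and the merged lines are deduplicated and sorted.
    """
    lines = set()
    for child in sorted(root.iterdir()):
        part = child / name
        # Only subdirectories contribute; the rolled-up file itself is skipped.
        if child.is_dir() and part.is_file():
            for line in part.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    lines.add(line.strip())
    target = root / name
    body = "\n".join(sorted(lines))
    target.write_text(body + ("\n" if lines else ""), encoding="utf-8")
    return target
```

Because the merged file is rebuilt from scratch each time, rerunning the roll-up after a partial scrape leaves it consistent with whatever the subdirectories currently hold.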
Our federation-wide search runs a scrape four times a day to update flat-file indices that are searched on demand with a plugin and several related tools.
We return again to the collection of mostly Ruby scripts that implement federation search, motivated by the need to fend off slow decay as the federation itself grows and evolves.
# Applications
A good way to understand the federation is to write a sitemap scraper.
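As a starting point, here is a minimal sketch of such a scraper in Python. It assumes each site publishes its sitemap as a JSON array at `/system/sitemap.json` with `slug` and `title` fields per page; the function names and the use of plain `http` are assumptions made for illustration, and the real scrape may work quite differently.

```python
import json
from urllib.request import urlopen


def sitemap_url(site: str) -> str:
    """Build the assumed sitemap address for a site such as 'example.wiki'."""
    return f"http://{site}/system/sitemap.json"


def parse_sitemap(text: str) -> list:
    """Return sorted (slug, title) pairs from a sitemap JSON document.

    Entries without a slug are skipped; a missing title falls back to the slug.
    """
    entries = json.loads(text)
    return sorted(
        (entry["slug"], entry.get("title", entry["slug"]))
        for entry in entries
        if "slug" in entry
    )


def scrape(site: str, timeout: float = 10.0) -> list:
    """Fetch one site's sitemap over the network and parse it."""
    with urlopen(sitemap_url(site), timeout=timeout) as response:
        return parse_sitemap(response.read().decode("utf-8"))
```

Keeping `parse_sitemap` separate from the network call makes it easy to try against a saved sitemap file before pointing `scrape` at live sites.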
We add restrictions to Scrape Pages so that it finds more relevant content.
This page displays the titles reachable by following links forward or backward for up to two hops.
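One way to compute that set is a breadth-first search that stops at a fixed depth. The sketch below assumes the links are available as a mapping from each title to the titles it links to, and that a path may mix forward and backward hops; both the data shape and the name `reachable` are assumptions for illustration.

```python
from collections import deque


def reachable(links: dict, start: str, hops: int = 2) -> dict:
    """Map each title within `hops` links of `start` to its hop distance.

    `links` maps a title to the titles it links to. Each link is treated as
    traversable in both directions (forward and backward).
    """
    neighbors = {}
    for source, targets in links.items():
        for target in targets:
            neighbors.setdefault(source, set()).add(target)
            neighbors.setdefault(target, set()).add(source)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        title = queue.popleft()
        if distance[title] == hops:
            continue  # do not expand beyond the hop limit
        for other in sorted(neighbors.get(title, ())):
            if other not in distance:
                distance[other] = distance[title] + 1
                queue.append(other)

    del distance[start]  # the starting title is not reported as reachable
    return distance
```

For example, with `{"A": ["B"], "B": ["C"], "C": ["D"], "E": ["A"]}` and a start of `"A"`, the result contains `B` and `E` at one hop and `C` at two hops, while `D` is left out because it is three hops away.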
# Resources
We collect various counts while scraping and report them as a text file.
We'll mine the search index logs for insight into what is happening in the federation.
All sites found, organized by domain name, excluding sites with fewer than ten pages.
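A listing like this can be sketched as a filter followed by a grouping. The example assumes page counts are already known per site and takes the last two labels of the host name as the domain, which is a deliberate simplification (it mishandles suffixes such as `co.uk`); the function name and input shape are hypothetical.

```python
from collections import defaultdict


def sites_by_domain(page_counts: dict, min_pages: int = 10) -> dict:
    """Group sites by domain, dropping sites with fewer than `min_pages` pages.

    `page_counts` maps a site name (optionally with a port) to its page count.
    The domain is approximated as the last two labels of the host name.
    """
    groups = defaultdict(list)
    for site, pages in page_counts.items():
        if pages < min_pages:
            continue
        host = site.split(":")[0]  # strip an optional port
        domain = ".".join(host.split(".")[-2:])
        groups[domain].append(site)
    return {domain: sorted(sites) for domain, sites in sorted(groups.items())}
```

A site with exactly ten pages is kept, since the exclusion applies only below the threshold.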