Scrape Pages

A good way to understand the federation is to write a sitemap scraper.

I once wrote a scraper that was a lot of fun to watch: it fit in a single html script tag and animated the scrape as it ran. I never distributed it because no learning comes from running it, just a lot of fetch load on the federation. I would share it with anyone who promised to run it once and then destroy it.

Tip to self: it is called scrape-pages.html. Try sys/find.

I found a copy on a remote disk. Opening the html file in Chrome launched the scrape, and within seconds it had filled my screen with information gleaned from sitemaps.

Now, after 35 minutes of scraping, with a few stops along the way, we have found a lot more work still to do.

This script fetches every reachable page. My four-times-daily scrape that feeds search fetches only pages that have changed in the last six hours, which is much less abusive of sites. See How Scrape Works
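To make that window concrete, here is a minimal sketch of the cutoff such a scrape might compute. The variable names are mine, and I am assuming sitemap dates are epoch milliseconds, as the snippets below also seem to assume.

    // A page counts as fresh when entry.date > oldest;
    // six hours matches the four-times-daily cycle.
    var six_hours = 6 * 60 * 60 * 1000
    var oldest = Date.now() - six_hours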

I make scrape data available in several formats and update this four times a day. See Search Index Download
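If you just want the data, pulling these published files is far kinder to the federation than scraping it yourself. A minimal sketch; the endpoint here is a made-up placeholder, and the real URLs and formats are listed on Search Index Download.

    // Hypothetical URL; see Search Index Download for the real ones.
    fetch('http://search.example.com/sites.json')
      .then(function (res) { return res.json() })
      .then(function (sites) { console.log(sites.length, 'sites in the index') })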

# Code

Here are a few JavaScript snippets to give you an idea of how sitemaps guide a scrape.

As we visit each site we add its recently changed pages to the pages_todo list.

    function do_site(site, sitemap) {
      for (var i = 0; i < sitemap.length; i++) {
        try {
          var entry = sitemap[i]
          // skip pages unchanged since the cutoff
          if (entry.date > oldest) {
            var site_page = site + '/' + entry.slug
            // queue each page only once
            if (new_page(site_page)) {
              pages_todo.push(site_page)
            }
          }
        } catch (e) {}
      }
      // order the queue by change date
      pages_todo.sort(chronological)
    }
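The snippet leans on bookkeeping it doesn't show: the `oldest` cutoff sketched earlier, `new_page`, the `chronological` comparator, and the `sites_todo` counterparts used in the next snippet. Here is one guess at their shape; the seen-tables and the date side-table are assumptions, not recovered code.

    // Guessed bookkeeping for the two snippets; not the original code.
    var pages_todo = []
    var sites_todo = []
    var pages_seen = {}
    var sites_seen = {}
    var page_dates = {}  // change date recorded when a page is queued

    function new_page(site_page) {
      // true only the first time a site/slug pair shows up
      if (pages_seen[site_page]) return false
      pages_seen[site_page] = true
      return true
    }

    function new_site(site) {
      if (sites_seen[site]) return false
      sites_seen[site] = true
      return true
    }

    function chronological(a, b) {
      // newest first, assuming do_site also records
      // page_dates[site_page] = entry.date when it queues a page
      return (page_dates[b] || 0) - (page_dates[a] || 0)
    }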

As we visit each page we scan its story and journal for sites we haven't seen yet and add them to the sites_todo list.

    function do_page(site, slug, page) {
      try {
        // story items citing remote pages carry a site attribute
        for (var i = 0; i < page.story.length; i++) {
          var item = page.story[i]
          if (item.site && new_site(item.site)) {
            sites_todo.push(item.site)
          }
        }
        // journal actions remember the site an edit came from
        for (var i = 0; i < page.journal.length; i++) {
          var action = page.journal[i]
          if (action.site && new_site(action.site)) {
            sites_todo.push(action.site)
          }
        }
      } catch (e) {}
    }
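To see the two functions cooperate, here is a hedged sketch of an outer loop that drains both todo lists, built on the pieces above. Federated wiki sites serve their sitemap at /system/sitemap.json and each page as its slug with .json appended; the async shape and the silent error handling are my choices, not the original one-script-tag animation.

    // A minimal driver; the original animated each fetch
    // on screen rather than just looping.
    async function scrape(start_site) {
      if (new_site(start_site)) sites_todo.push(start_site)
      while (sites_todo.length) {
        var site = sites_todo.shift()
        try {
          var res = await fetch('http://' + site + '/system/sitemap.json')
          do_site(site, await res.json())
        } catch (e) { continue }
        // do_site filled pages_todo with this site's fresh pages
        while (pages_todo.length) {
          var site_page = pages_todo.shift()
          var slug = site_page.split('/').pop()
          try {
            var page = await fetch('http://' + site_page + '.json')
            do_page(site, slug, await page.json())
          } catch (e) {}
        }
      }
    }

    // e.g. start from one well-connected site
    scrape('fed.wiki.org')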