Search Overview

David Bovill writes: "Trying to figure out how / what your Sinatra server does, particularly with regard to search. I figure it might be interesting for me to implement an equivalent here in London." matrix room

I run search.fed.wiki.org on a laptop in my basement. This was my everyday machine until I damaged the screen. I use this machine for search because it has a large memory and a solid-state disk, both of which make searching faster.

I've devoted a site to my own notes on how this collection of scripts works. See How Search Works

The Sinatra app responds to requests that can be answered by examining the 'sites' directory, which has a small collection of files for every page of every site I've scraped. For example, words.txt has an alphabetized list of every word found on a page, while items.txt has a list of every item id.
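As a rough illustration of the per-page files described above, here is a minimal Ruby sketch that writes words.txt and items.txt for one page. It is not the actual scraper: the page shape (a 'story' of items with 'id' and 'text'), the directory layout under 'sites', the tokenizing rule, and the name index_page are all assumptions made for the example.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical page: a story of items, each with an id and some text.
page = {
  'title' => 'Search Overview',
  'story' => [
    { 'id' => 'a1b2c3', 'type' => 'paragraph', 'text' => 'Flat files make search cheap.' },
    { 'id' => 'd4e5f6', 'type' => 'paragraph', 'text' => 'Cheap search is fast search.' }
  ]
}

# Write words.txt (sorted, unique words) and items.txt (item ids) into dir.
# The tokenizer here (lowercase, runs of letters and digits) is an assumption.
def index_page(page, dir)
  FileUtils.mkdir_p(dir)
  text  = page['story'].map { |item| item['text'].to_s }.join(' ')
  words = text.downcase.scan(/[a-z0-9]+/).uniq.sort
  ids   = page['story'].map { |item| item['id'] }
  File.write(File.join(dir, 'words.txt'), words.join("\n") + "\n")
  File.write(File.join(dir, 'items.txt'), ids.join("\n") + "\n")
end

# Assumed layout: sites/<site>/pages/<slug>/, built in a temp directory.
root = Dir.mktmpdir
dir  = File.join(root, 'sites', 'example.wiki', 'pages', 'search-overview')
index_page(page, dir)
puts File.read(File.join(dir, 'words.txt'))
```

Because each file holds one value per line, ordinary tools such as grep can answer many questions without a search server.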

I have used Elasticsearch to build similar indexes. I understand the process they call "analysis", where one applies the same transformations to both documents and queries so that the two can be matched. When I make words.txt and items.txt I am doing analysis. One can rent time on a search cluster at Amazon, but it is expensive. My directory full of flat files is a cheap alternative.
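The point of analysis is that a query only matches a document if both have passed through the same transformation. A minimal sketch, assuming a made-up analyze function that lowercases text and splits it into runs of letters and digits (the real scripts may normalize differently):

```ruby
# One transformation, used at index time and again at query time.
# The rule itself (lowercase, letters and digits only) is an assumption.
def analyze(text)
  text.downcase.scan(/[a-z0-9]+/).uniq.sort
end

document = 'Federated Wiki pages are stored as JSON.'
query    = 'WIKI json'

indexed = analyze(document)  # what would be written to words.txt
wanted  = analyze(query)     # the query, normalized the same way

# The query matches when no wanted term is missing from the index.
matched = (wanted - indexed).empty?
puts matched
```

Without the shared step, "WIKI" would never equal "wiki" and "JSON." would never equal "json".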

The Sinatra app has functions 'sites' and 'pages' that look through words.txt, items.txt, and a few other attribute files. If I am looking for any of foo, bar, or baz, I use the sites function to find sites that deserve further examination, then examine those with the pages function. github

A third function, 'has', examines each flat file for the presence of all or any of the search terms. github
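The two-stage lookup might be sketched as follows. This is an illustration only, not the code behind the github links: the signatures, the :any / :all mode argument, the site-level roll-up words.txt, and the directory layout are all assumptions.

```ruby
require 'fileutils'
require 'tmpdir'

# True when the flat file at path holds all (or any) of the terms.
def has(path, terms, mode = :any)
  return false unless File.exist?(path)
  words = File.readlines(path, chomp: true)
  mode == :all ? terms.all? { |t| words.include?(t) } : terms.any? { |t| words.include?(t) }
end

# Stage one: sites whose (assumed) roll-up words.txt deserves a closer look.
def sites(root, terms, mode = :any)
  Dir.glob(File.join(root, '*'))
     .select { |site| has(File.join(site, 'words.txt'), terms, mode) }
     .map { |site| File.basename(site) }.sort
end

# Stage two: pages within one site whose own words.txt matches.
def pages(root, site, terms, mode = :any)
  Dir.glob(File.join(root, site, 'pages', '*'))
     .select { |page| has(File.join(page, 'words.txt'), terms, mode) }
     .map { |page| File.basename(page) }.sort
end

# Build a tiny hypothetical 'sites' directory to search.
root = File.join(Dir.mktmpdir, 'sites')
fixture = {
  'a.example' => { 'welcome' => %w[bar hello], 'notes' => %w[foo quux] },
  'b.example' => { 'welcome' => %w[hello world] }
}
fixture.each do |site, site_pages|
  site_pages.each do |slug, words|
    dir = File.join(root, site, 'pages', slug)
    FileUtils.mkdir_p(dir)
    File.write(File.join(dir, 'words.txt'), words.sort.join("\n") + "\n")
  end
  rollup = site_pages.values.flatten.uniq.sort
  File.write(File.join(root, site, 'words.txt'), rollup.join("\n") + "\n")
end

terms = %w[foo bar baz]
hits = sites(root, terms).flat_map do |site|
  pages(root, site, terms).map { |slug| "#{site}/#{slug}" }
end
puts hits
```

Checking a site-level file first means most sites are ruled out after reading a single file, and only the promising ones have their pages opened.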

When I first started this project I met a Google search expert through Nike. I told him how I was going to attach unique ids to every paragraph and how that offers a new search opportunity. His eyes glazed over. I knew then I would have to do it myself.