Neo4j

Neo4j is an open-source graph database implemented in Java and accessible from software written in other languages using the Cypher query language through a transactional HTTP endpoint. wikipedia site

I've sought the kind assistance of my work colleague Erika Arnold, author of wikiGraph, a shortest-path visualizing application for Wikipedia based on Neo4j. page

Here I follow her approach.

# Load

Build node and relation CSV files from Search Index Downloads with a Ruby converter that assigns numeric ids to sites, pages and titles. github

Runtime 3.5 min, 8 MB output. We now repeat this build after every scrape. Look for new data at 1:00 & 7:00, am & pm, Pacific time. nodes.csv rels.csv
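The converter's exact columns aren't shown here, but neo4j-import expects header rows along these lines; a sketch with invented ids and rows, not the converter's actual output.

nodes.csv:
id:ID,title,:LABEL
1,ward.asia.wiki.org,Site
2,welcome-visitors,Page
3,Welcome Visitors,Title

rels.csv:
:START_ID,:END_ID,:TYPE
1,2,HAS
2,3,LINK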

Build a graph database from the CSV files with Neo4j's import command. docs

neo4j-import \
  --into wiki.db \
  --nodes nodes.csv \
  --relationships rels.csv

IMPORT DONE in 10s 298ms.
92237 nodes
321452 relationships
92237 properties

Find where neo4j resources have been installed.

locate neo4j

Move the constructed db into the server's data directory.

cd /var/lib/neo4j/data
sudo mv ~/neo-wiki/wiki.db .
chmod -R a+w wiki.db

Edit the config and restart the server.

cd /var/lib/neo4j/conf
sudo vi neo4j-server.properties
sudo service neo4j-service restart
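The setting that matters, assuming the stock Neo4j 2.x neo4j-server.properties layout, is the database location; point it at the imported db:

org.neo4j.server.database.location=data/wiki.db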

Open an ssh tunnel to the remote server.

ssh -L 7474:localhost:7474 bay.wiki.org

Then view the graph using the built-in browser app. localhost

# Query

It's hard to know what to look for until you have a real need and some experience formulating queries. I read docs and tried things. Some impressed me enough to save the svg.

For fast queries find a good place to start and then traverse from there. I picked .org sites and looked for links to titles about Education. svg

match (s:Site)-[:HAS]->(p:Page)-[:LINK]->(t:Title)
where s.title =~ '.*org' and t.title =~ '.*Education.*'
return s, p, t limit 100
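An aside, not part of the original import: neo4j-import does not create schema indexes, so if the exact-title anchors used in later queries feel slow, indexes on the title properties should help.

create index on :Site(title)
create index on :Title(title)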

I try retrieving nodes by the shape of their relations alone. This is slow. I find sites that have/link the same title. svg

match (a)-->(b)-->(c)<--(d)<--(e) return * limit 300

Top page counts for happening sites.

match (s:Site)-->(p)
where s.title =~ '.*fedwikihappening.net'
with s.title as site, count(p) as pages
where pages >= 100
return pages, site order by pages desc

pages,site
3089,don.ny2.fedwikihappening.net
294,machines.alyson.sf.fedwikihappening.net
225,frances.uk.fedwikihappening.net
220,kate.au.fedwikihappening.net
216,maha.uk.fedwikihappening.net
178,tim.au.fedwikihappening.net
164,chamboonline.sf2.fedwikihappening.net
147,jon.sf.fedwikihappening.net
140,jenny.uk.fedwikihappening.net
134,thoka.uk2.fedwikihappening.net
134,sarah.uk.fedwikihappening.net
133,alyson.sf.fedwikihappening.net
119,audrey.sf.fedwikihappening.net
106,cogdog.sf.fedwikihappening.net

Shortest path between titles with sites that hold the pages along the way. svg

match (here:Title { title:"How Life Works" }), (want:Title { title:"Federated Wiki On Digital Ocean" }), paths = allShortestPaths((here)-[*]-(want)) with nodes(paths) as way match (s:Site)-->(p:Page) where p in way return * limit 40

But wait, this path goes through unrelated 'scratch' pages. It also disregards the relations' direction. We need to constrain the path to sites we know.

# Knows

Revise the batch import to include sites found on each page as KNOWS relations between a Page and its neighborhood Sites. github
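After rebuilding and re-importing, a quick spot check of my own (not part of the import itself) should show the new relations:

match (p:Page)-[:KNOWS]->(s:Site) return p.title, s.title limit 5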

Add a directional HAS|KNOWS pattern to the shortest path to constrain the result to operationally discoverable sites. I add Titles to the path ends with IS relations. This adds a bit of ambiguity about which page we're starting at. svg

match (here:Title { title:"Hacker Beach" })<-[h:IS]-(start), (end)-[w:IS]->(want:Title { title:"Naval Undersea Museum" }), paths = shortestPath((start)-[:HAS|:KNOWS*]->(end) return here, h, want, w, paths limit 40

I've tested the path by clicking through it. It works.

We can find the sites with the most neighbors by counting distinct KNOWS relations.

match (s:Site)-[h:HAS]->(p:Page)-[k:KNOWS]->(n:Site)
with s.title as site, count(DISTINCT n.title) as neighbors
return * order by neighbors desc limit 20

The numbers are much higher than we might expect. This is because we conflate forks, references and rosters while we scrape. For the graph database we should do better.

526 don.ny2.fedwikihappening.net
407 ward.asia.wiki.org
363 search.fed.wiki.org:3030
170 journal.hapgood.net
160 david.viral.academy
157 c0de.academy
131 wiki.viral.academy
128 david.bovill.me
125 forage.ward.fed.wiki.org
115 ward.fed.wiki.org
110 machines.hapgood.net
108 fedwiki.jeffist.com
96 tim.federatedwiki.org
94 chamboonline.sf2.fedwikihappening.net
92 sfw.mcmorgan.org
92 edfedwiki.com
90 sarah.uk.fedwikihappening.net
88 jenny.uk.fedwikihappening.net
87 maha.uk.fedwikihappening.net
86 tim.au.fedwikihappening.net

This agrees with the command-line line count (wc -l) of the site-wide rollup of sites.txt files.

ls | while read i
do
  wc -l $i/sites.txt
done | sort -n

We can write a meaningful who-links-here query that will resolve to the page at least as a twin. svg

match (s0:Site {title:'ward.asia.wiki.org'})-[:HAS]->
      (p0:Page {title: 'federation-search'})-[:IS]->
      (t0:Title)<-[a:LINK]-(p1:Page)<-[b:HAS]-(s1:Site)
where (p1)-[:KNOWS]->(s0)
return a,b

~

Neo4j is a sophisticated graph database that allows fast searches of very large networks. It is designed to deal with networks of nodes and connections, both of which can be further described by assigned properties.
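A toy illustration of that model, using the labels from this import (the titles here are invented, not data from the scrape):

create (s:Site {title:'example.org'})-[:HAS]->(p:Page {title:'welcome-visitors'}),
       (p)-[:LINK]->(t:Title {title:'Welcome Visitors'})
return s, p, t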