Neo4j is an open-source graph database implemented in Java and accessible from software written in other languages using the Cypher query language through a transactional HTTP endpoint. wikipedia site
See Neo4J Resources
See Neo4J Production
I've sought the kind assistance of a work colleague, Erika Arnold, author of wikiGraph, a shortest-path visualizing application for Wikipedia based on Neo4J. page
Here I follow her approach.
# Load
Build node and relation csv files from Search Index Downloads with a ruby converter that assigns numeric ids to sites, pages and titles. github
Runtime 3.5 min, 8 MB output. We now repeat this build after every scrape. Look for new data at 1:00 & 7:00, am & pm, pacific time. nodes.csv rels.csv
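The csv layout isn't shown here, but neo4j's importer reads node ids, labels and relationship types from the header row, so the converter's output presumably looks roughly like this (a sketch with illustrative ids and columns, not the converter's exact format):

nodes.csv
id:ID,title,:LABEL
0,ward.asia.wiki.org,Site
1,federation-search,Page
2,Federation Search,Title

rels.csv
:START_ID,:END_ID,:TYPE
0,1,HAS
1,2,IS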
Build a graph database from csv files with neo4j's import command. docs
neo4j-import \
  --into wiki.db \
  --nodes nodes.csv \
  --relationships rels.csv
IMPORT DONE in 10s 298ms. 92237 nodes 321452 relationships 92237 properties
Find where neo4j resources have been installed.
locate neo4j
Move the constructed db to the server's realm.
cd /var/lib/neo4j/data
sudo mv ~/neo-wiki/wiki.db .
chmod -R a+w wiki.db
Edit the config and restart the server.
cd /var/lib/neo4j/conf
sudo vi neo4j-server.properties
sudo service neo4j-service restart
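The relevant edit points the server at the freshly imported database. In the 2.x server config that is presumably this one line (property name from the stock neo4j-server.properties; adjust to your install):

org.neo4j.server.database.location=data/wiki.db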
Open an ssh tunnel to the remote server.
ssh -L 7474:localhost:7474 bay.wiki.org
Then view the graph using the built-in app. localhost
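Besides the browser app, the transactional HTTP endpoint mentioned at the top will take Cypher directly over the same tunnel. A minimal sketch with curl, assuming the default 2.x endpoint and no auth:

curl -X POST http://localhost:7474/db/data/transaction/commit \
  -H 'Content-Type: application/json' \
  -d '{"statements":[{"statement":"match (n) return count(n)"}]}'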
# Query
It's hard to know what to look for until you have a real need and some experience formulating queries. I read docs and tried things. Some impressed me enough to save the svg.
For fast queries find a good place to start and then traverse from there. I picked .org sites and looked for links to titles about Education. svg
match (s:Site)-[:HAS]->(p:Page)-[:LINK]->(t:Title)
where s.title =~ '.*org' and t.title =~ '.*Education.*'
return s,p,t limit 100
I try retrieving nodes by the shape of their relations alone. This is slow. I find sites that have/link the same title. svg
match (a)-->(b)-->(c)<--(d)<--(e) return * limit 300
Top page counts for happening sites.
match (s:Site)-->(p)
where s.title =~ '.*fedwikihappening.net'
with s.title as site, count(p) as pages
where pages >= 100
return pages, site
order by pages desc
pages,site
3089,don.ny2.fedwikihappening.net
294,machines.alyson.sf.fedwikihappening.net
225,frances.uk.fedwikihappening.net
220,kate.au.fedwikihappening.net
216,maha.uk.fedwikihappening.net
178,tim.au.fedwikihappening.net
164,chamboonline.sf2.fedwikihappening.net
147,jon.sf.fedwikihappening.net
140,jenny.uk.fedwikihappening.net
134,thoka.uk2.fedwikihappening.net
134,sarah.uk.fedwikihappening.net
133,alyson.sf.fedwikihappening.net
119,audrey.sf.fedwikihappening.net
106,cogdog.sf.fedwikihappening.net
Shortest path between titles with sites that hold the pages along the way. svg
match (here:Title { title:"How Life Works" }),
      (want:Title { title:"Federated Wiki On Digital Ocean" }),
      paths = allShortestPaths((here)-[*]-(want))
with nodes(paths) as way
match (s:Site)-->(p:Page)
where p in way
return * limit 40
But wait, this path goes through unrelated 'scratch' pages. It also disregards the relations' direction. We need to constrain the path to sites we know.
# Knows
Revise the batch import to include sites found on each page as KNOWS between a Page and neighborhood Sites. github
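After rebuilding and reimporting, a quick count (my own sanity check, not part of the original notes) confirms the new relations arrived:

match (:Page)-[k:KNOWS]->(:Site) return count(k)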
Add a directional HAS|KNOWS pattern to the shortest path to constrain results to operationally discoverable sites. I add Titles to the path ends with IS relations. This adds a bit of ambiguity as to which page we're starting at. svg
match (here:Title { title:"Hacker Beach" })<-[h:IS]-(start),
      (end)-[w:IS]->(want:Title { title:"Naval Undersea Museum" }),
      paths = shortestPath((start)-[:HAS|:KNOWS*]->(end))
return here, h, want, w, paths limit 40
I've tested the path by clicking through it. It works.
We can find the sites with the most neighbors by counting distinct KNOWS relations.
match (s:Site)-[h:HAS]->(p:Page)-[k:KNOWS]->(n:Site)
with s.title as site, count(DISTINCT n.title) as neighbors
return *
order by neighbors desc limit 20
The numbers are much higher than we might expect. This is because we conflate forks, references and rosters while we scrape. For the graph database we should do better.
526 don.ny2.fedwikihappening.net
407 ward.asia.wiki.org
363 search.fed.wiki.org:3030
170 journal.hapgood.net
160 david.viral.academy
157 c0de.academy
131 wiki.viral.academy
128 david.bovill.me
125 forage.ward.fed.wiki.org
115 ward.fed.wiki.org
110 machines.hapgood.net
108 fedwiki.jeffist.com
96 tim.federatedwiki.org
94 chamboonline.sf2.fedwikihappening.net
92 sfw.mcmorgan.org
92 edfedwiki.com
90 sarah.uk.fedwikihappening.net
88 jenny.uk.fedwikihappening.net
87 maha.uk.fedwikihappening.net
86 tim.au.fedwikihappening.net
This agrees with a command-line count of the lines in each site's sites.txt rollup.
ls | while read i
do
  wc -l $i/sites.txt
done | sort -n
We can write a meaningful who-links-here query that will resolve to the page, at least as a twin. svg
match (s0:Site {title:'ward.asia.wiki.org'})-[:HAS]->
      (p0:Page {title: 'federation-search'})-[:IS]->
      (t0:Title)<-[a:LINK]-(p1:Page)<-[b:HAS]-(s1:Site)
where (p1)-[:KNOWS]->(s0)
return a,b
~
Neo4j is a sophisticated graph database built for fast searches of very large networks. It is designed to deal with networks of nodes and connections, both of which can be further described by assigned properties.