When dealing with hu-Hu-HUGE networks, the data no longer fits into the memory of a single machine. So far, we have usually stored the network in database tables (or similar, but worse: Excel spreadsheets) describing the nodes and edges. Then you have to implement the graph algorithm of your choice on top of this framework, which usually leads to sub-optimal performance (putting it mildly). There are straightforward optimizations: for example, when addressing a single node, the database could already load the adjacent edges into memory (cache), so that the immediate next steps do not require additional access to the disk drive. Also, you might want to distribute parts of the network across several machines. Of course, a carefully handcrafted and optimized object-relational mapping with tuned indices can do little wonders when you get it right, but the nagging thought remains that this can – and has to! – be dealt with in a better way. By now, it is not only bioinformaticians and Google employees who feel the occasional need to crunch BIG GRAPHS.
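To make the "graph in tables plus a cache" idea a bit more concrete, here is a minimal sketch of what I mean. It is a toy, not a recommendation: the `nodes`/`edges` schema, the five-node example graph and the function names `neighbours` and `bfs_distances` are all made up for illustration, SQLite (in memory) stands in for whatever database you actually use, and Python's `functools.lru_cache` plays the role of the adjacency cache.

```python
import sqlite3
from collections import deque
from functools import lru_cache

# Hypothetical schema: one table for the nodes, one for the directed edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    CREATE INDEX idx_edges_src ON edges (src);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")])
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)])


@lru_cache(maxsize=100_000)
def neighbours(node_id):
    """Fetch the outgoing edges of one node; repeat lookups come from the cache."""
    rows = conn.execute(
        "SELECT dst FROM edges WHERE src = ? ORDER BY dst", (node_id,))
    return tuple(dst for (dst,) in rows)


def bfs_distances(start):
    """Breadth-first search that reaches the database only via neighbours()."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


distances = bfs_distances(1)
```

Even in this toy, every newly visited node costs one SQL round trip, which is exactly the part that hurts once the graph lives on disk or on another machine. The cache only helps when the algorithm comes back to a node it has already seen.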
So the idea of Not-Only-SQL databases has definitely been gaining appeal as well as momentum over the last two years, also outside academia. We have known SQL in its various incarnations for decades now, yet the (self-proclaimed) “Ultimate Guide to the Non-Relational Universe” already has over a dozen different entries in the category “Graph Databases” alone.
Even when leaving interesting options from other categories (like Apache’s Hadoop or MongoDB) out of the picture for the moment, getting an overview is daunting. It’s a jungle out there, and you don’t want to get bitten (at least not by samfink dat’s gonna kill ya) … But don’t look to me for definitive advice! I’m just starting my tour and getting my bearings, cautiously dabbling around, and will probably proceed in the direction where the fewest beasts and dead ends seem to be lurking. You know that (like in the classic tale of downgrading from wife1.0 to girlfriend2.0) once you commit to any particular system, it might be a very painful process to change course later. So here are some of the useful pointers I found:
From what I see at the moment, Neo4j supports most of my favourite languages (like Java and Python), and it’s an open source project with “a high-performance graph engine with all the features of a mature and robust database” (they say). It also looks like they have the manpower and the commercial user base to see this through in the long run. They quote Werner Vogels, CTO of Amazon, with “For anything with multiple relationships, multiple connections, Neo4j absolutely ROCKS!”. Seems that’s the beast I am going to check out first, but if you read this and got bitten before – and survived 😉 – let us know!
select fun, profit from real_world where relational=false;