This Bioinformatics applications note is worth checking out: it describes a prototype implementation of a distributed search engine for scientific (biological) data. While not a scientific breakthrough in itself, it demonstrates how distributed networks could handle the data deluge. Modern high-throughput experiments now routinely generate data in the petabyte range, and current approaches to annotation and analysis do not scale with this exponentially increasing influx. The prototype, called “ScienceNet” (based on YaCy (2)), gives a glimpse of where the underlying philosophy of a distributed network might take us in future: rather than transmitting the data to central repositories, it takes the search engine to the data, running on the very machines where the data are produced, stored and maintained.
This could empower the scientific community as a whole, because it rests on the long-term interest of the researchers who produce the data. Unlike dumping somewhat-processed data as a supplementary file to a publication and then forgetting about it, such a dynamic structure would help to continuously improve and add to the data.
“Imagine if, rather than relying on the proprietary software of a large professional search engine operator, your search engine was run by many private computers which aren’t under the control of any one company or individual. Well, that’s what YaCy does! The resulting decentralized web search currently has about 1.4 billion documents in its index (and growing – download and install YaCy to help out!) and more than 600 peer operators contribute each month. About 130,000 search queries are performed with this network each day.” (2).
It is now over two decades since Tim “TimBL” Berners-Lee laid the foundations of the World Wide Web. Could you imagine doing your research without it? If you extrapolate, in probably less than ten years we won’t be able to imagine doing science without Linked Open Data. I had to listen to (3) and read (4) TimBL’s vision of the semantic web, which aims to integrate all available data, a couple of times before it really started to sink in. Standards like RDF (Resource Description Framework), in combination with SPARQL, REST etc., address the question of “how to turn the existing human-visible text and links into machine-readable data without repeating content” (1). In the biological domain, this job has so far been done by centralised databases and curators, usually at central institutions like the NCBI or EMBL-EBI. More recently, search engines like EB-eye have provided an alternative to existing biological search and retrieval engines; they focus on cross-linking centralised databases and offering the community a single point of access. By creating a distributed network of decentralised search engines, we are now moving towards a network of scientific data and knowledge, open to all.
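To make the RDF idea concrete: data becomes a set of subject–predicate–object triples, and a SPARQL query is essentially pattern matching over that set. The toy triples, the `ex:` URIs and the tiny `match` helper below are purely illustrative (not a real RDF library), but they sketch the model that standards like RDF and SPARQL formalise:

```python
# Illustrative sketch only: RDF-style data as (subject, predicate, object)
# triples, queried by pattern matching -- the model SPARQL operates on.
# The "ex:" identifiers and the match() helper are made up for this example.

triples = [
    ("ex:BRCA1", "ex:locatedOn", "ex:Chromosome17"),
    ("ex:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    ("ex:TP53", "ex:locatedOn", "ex:Chromosome17"),
]

def match(pattern, store):
    """Return all triples matching an (s, p, o) pattern; None is a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]

# Rough analogue of: SELECT ?gene WHERE { ?gene ex:locatedOn ex:Chromosome17 }
genes = [s for (s, _, _) in match((None, "ex:locatedOn", "ex:Chromosome17"), triples)]
print(genes)  # ['ex:BRCA1', 'ex:TP53']
```

The appeal of this model for distributed science is that any machine can publish its own triples using shared vocabularies, and queries can then span data sources without the content ever being copied into one central database.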
As we have seen with the World Wide Web, such a grass-roots movement naturally takes some time to take off. Once a critical mass is reached, it will become a pervasive force that is difficult to imagine ahead of time. But the sooner we embrace such developments, the better for future research! To make the semantic web of knowledge work for future scientific discoveries, the main bottleneck to overcome is the barrier in our heads: we must stop hugging/clinging to/suffocating “my data” (think “Lord of the Rings”: Gollum and “my prrrecious”). The result – “our linked open data” – is going to be far more valuable (3).
(1) Ben Adida, Mark Birbeck: RDFa Primer — Bridging the Human and Data Webs
(2) YaCy: a free and fully decentralised search engine network
(3) Tim Berners-Lee’s TED-talk on Linked Open Data
(4) Tim Berners-Lee on Design Issues in Linked Data
Similar post: rethinking biological complexity, google-style.