rethinking biological complexity, google-style.

Under selective (funding) pressure, outsourcing the annotation of biological data seems an inevitable necessity, and not only for economic reasons. The existing pipelines for high-quality annotation simply cannot keep up with the number of papers, let alone the raw data, being produced. Something that started with Rfam has now spread to other databases (well described in “A Warning Sign for Biomedical Databases” by Manuel Corpas). Similarly, crowd-sourced “secondary” databases have begun to add value to established primary databases such as the Protein Data Bank (PDB) – PDBWiki and Proteopedia, for example. So how is this concept going to work in the future, and what are the pitfalls?

Institutes like the NIH/NCBI, EMBL-EBI or SIB have done a great job of coping with the ever-increasing amounts and varieties of biological data. Here’s a quote from the agenda of the NIH that is already ten years old:

“The principal obstacle impeding effective health care is lack of new knowledge, and the principal mission of the NIH is to overcome this obstacle. At this point the impact of computer technology is so extensive it is no longer possible to think about that mission without computers.”

The work of human annotators who read and understand the scientific literature in order to enter its content into central databases is mind-boggling, but it doesn’t scale. Would outsourcing it, Wikipedia-style, have a negative impact on the quality of the data? This feels like restating the old question of whether a project like Wikipedia could ever compete with the Encyclopædia Britannica. I guess that case is closed: it can, and it can do so much more! Under such a model, the annotators releasing biological data would spend more time supervising development, arbitrating conflicts and offering support to contributors. The most important aspect is to keep the biological community on board by developing methods that help data to be stored in the most productive way – productive rather than re-productive, in the sense that contributors don’t have to replicate data that has already been produced. The current guidelines – e.g. that sequences, structures and interaction data must be deposited in a public database prior to journal publication – were already good and necessary steps in that direction. By looking into disputed issues and safeguarding the integrity of the data, such a system would multiply the curators’ expertise through the wisdom of the (scientific) crowd. At the same time, the role of developers would be to provide seamless submission and analysis tools for the data.

The one essential principle is to engage the research community: in my experience, most scientists take an active, long-term interest in their results and data. Even years after they did the research and experiments, developed the algorithms and programs, set up the websites and databases – whatever it was – they want it to be read, useful and kept up to date. And all they usually want in return is some due credit for it.
Now that is a crucial point, and there is no use pretending there is an easy solution to it all. If funding bodies stopped dancing madly around the holy golden cow of the “impact factor” and started to acknowledge other, equally significant contributions to the scientific enterprise, it might have a more positive impact on science as a whole than a dozen officially certified world-class geniuses put together. Nevertheless, researchers all over the world happily donate their time and expertise to contribute data to the subjects they care about.
I don’t know whether something like Proteopedia’s point-awarding system or a Flattr-style mechanism is the answer to the “credit” question, let alone to sustainable sources of funding. But it can’t be worse than the currently dominant impact-factor cargo cult.

As a thought experiment, try to imagine the aforementioned institutions as something like a Googleplex of bioinformatics: rather than storing or producing the primary content themselves, their goal would be to enable access to the vast network of biomolecular information.

Although I agree with the principle, I am not sure that wikimedia-based crowd-sourced annotation is THE answer, simply because I can’t quite see how complex relationships between different datasets would be handled efficiently. But with these concepts in mind, listening a second time to Tim “TimBL” Berners-Lee’s call of “RAW DATA NOW!” starts to make more sense. In his TED talk “The next Web of open, linked data”, the father of the web describes a vision of where we should be taking this, and biomolecular data, I think, is a prime area in which to put these ideas into practice. So let’s stop “hugging our databases” and take some inspiration from the guy “who invented the internet”.
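To make the linked-data idea a bit more concrete: its core data model reduces every statement to a subject–predicate–object triple, and “linking” datasets means nothing more than following shared identifiers across triples. Here is a minimal, self-contained sketch in Python (the predicate names and the way the cross-references are spelled are made up for the example; real linked data would use RDF with full URIs):

```python
# Toy triple store illustrating the linked-data model: every fact is a
# (subject, predicate, object) triple, and datasets are "linked" simply
# by sharing identifiers. The identifier spellings and predicate names
# here are illustrative, not any real database's schema.

triples = {
    ("uniprot:P69905", "encodes", "pdb:1HHO"),
    ("uniprot:P69905", "label", "Hemoglobin subunit alpha"),
    ("pdb:1HHO", "annotatedIn", "wiki:Proteopedia/1hho"),
    ("rfam:RF00005", "label", "tRNA"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Follow a link across datasets: which wiki page annotates the
# structure encoded by the UniProt entry P69905?
for _, _, structure in query(s="uniprot:P69905", p="encodes"):
    for _, _, page in query(s=structure, p="annotatedIn"):
        print(page)  # wiki:Proteopedia/1hho
```

The point of the exercise: once primary databases (UniProt, PDB) and crowd-sourced secondary ones (Proteopedia, PDBWiki) expose their records as triples with shared identifiers, the “complex relationships between different datasets” become plain graph traversals rather than bespoke integration pipelines.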


  1. #1 by cistronic on 2011/06/23 - 22:21

Thanks to Henning Stehr for the hint about Manuel Corpas’ blog!

  2. #2 by cistronic on 2011/06/24 - 14:56

Come to think of it: the development of the web was catalyzed by “Mosaic” – are there any developments towards a (graphical) browser for linked data? Something between how Hans Rosling presents data visualisation and Firefox …

