Coping with the onslaught of biomedical literature (not to mention data) has been likened to satisfying your thirst by drinking from a firehose at full pressure. In short: it hurts and you get blown away and soaked, but not refreshed or energized to take your next steps up the mountains of scholarly insight.
It’s clear you can’t read everything that is somewhat related to your special field of interest. So how does it help that there is now an openly available Medline dataset containing about 19 million(!) entries?
The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. The dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. There are approximately 19 million publication records.
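To get a feel for files of this size, it helps to stream them rather than load them whole. The following is a minimal sketch, assuming each record is a `MedlineCitation` element with a `PMID` child (as in the Medline XML format); the sample document and the function name `iter_pmids` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_pmids(xml_source):
    """Yield the PMID of every MedlineCitation record, one at a time."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "MedlineCitation":
            yield elem.findtext("PMID")
            elem.clear()  # drop the finished record's children to save memory

# Tiny hand-written sample standing in for one of the real XML files.
SAMPLE = b"""<MedlineCitationSet>
  <MedlineCitation><PMID>1</PMID><Article><ArticleTitle>First</ArticleTitle></Article></MedlineCitation>
  <MedlineCitation><PMID>2</PMID><Article><ArticleTitle>Second</ArticleTitle></Article></MedlineCitation>
</MedlineCitationSet>"""

pmids = list(iter_pmids(io.BytesIO(SAMPLE)))
print(len(pmids), "records:", pmids)
```

For a real file you would pass its path (or an open file object) instead of the in-memory sample.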
An infamous saying among graduate students goes “(reading) A paper a day keeps the scooper away”. If you follow that advice (no holidays!), it will take about 52,000 years to get through the content of this Medline dataset, so you should have started catching up on your reading in the middle of the last ice age, just shortly after our ancestors left Africa.
Don’t get me wrong: of course I am glad about the release of such a vast dataset. Obviously we need to explore different ways to handle all that information, and the efforts of the Open Bibliographic Data Working Group of the Open Knowledge Foundation are supporting that. Above is a snapshot of an example which maps a subset of the data to geospatial locations (using WebGL; it worked in Chrome, but not in Firefox 4 for me). It looks great, but I think the most useful structure of the data is not contained in the geographical information. (Although there is evidence for the role of physical proximity as a predictor of the impact of collaborations.)
The key is (similar to Google’s PageRank algorithm) to identify the most informative nodes in the network of articles, rather than looking at the content of each paper as such. Forget about impact factor. It’s about the papers relevant for you, which is probably best defined by the collection of papers (PDFs) you have already gathered.
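To make the PageRank analogy concrete, here is a minimal sketch of the textbook power-iteration form applied to a toy citation graph. It is not the method used for the figures in this post; the damping value, the iteration count, and the four-paper example are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank papers in a citation graph; links maps paper -> papers it cites."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this paper's rank on to the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A paper citing nothing spreads its rank over all papers.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Toy example: A and B both cite C, and C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(citations)
for paper in sorted(ranks, key=ranks.get, reverse=True):
    print(paper, round(ranks[paper], 3))
```

A personalised variant, where the random jump lands only on papers already in your own collection, would be closer to the "relevant for you" idea, but that is beyond this sketch.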
On the left is a layout of a couple of hundred articles we collected in the lab, together with data on co-citation (if two papers cite the same article, they probably talk about similar things; the more co-citations, the more similar). I implemented a simple 2-dimensional embedding of the edges (denoting similarity): the nodes (articles) sit on the center line, while the z-axis indicates the degree of similarity.
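The similarity measure described above (two papers citing the same article) boils down to counting shared references for every pair of papers. A minimal sketch, assuming the reference lists are already available as sets; the paper names and the function `shared_reference_edges` are hypothetical.

```python
from itertools import combinations

def shared_reference_edges(references):
    """Map each pair of papers to the number of references they share.

    references maps paper -> set of works it cites. Pairs with nothing
    in common are left out, so the result is a sparse weighted edge list.
    """
    edges = {}
    for p, q in combinations(sorted(references), 2):
        shared = len(references[p] & references[q])
        if shared:
            edges[(p, q)] = shared
    return edges

refs = {
    "paper1": {"x", "y", "z"},
    "paper2": {"y", "z"},
    "paper3": {"q"},
}
edges = shared_reference_edges(refs)
print(edges)
```

Comparing all pairs is fine for a few hundred articles. For tens of thousands of papers it means hundreds of millions of comparisons, so an index from each cited work to the papers citing it would probably be the better starting point.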
After several hours of number-crunching on a single machine (we are talking about 25000 nodes and 5.5 million edges here) the “blocky” structure becomes apparent. These blocks correspond to areas of high local similarity, that is, subtopics where most articles are highly related. What I’d like to do now is to dive deeper into the datasets, explore the sub-topics and find representative and relevant articles worth reading. The prototype visualisation was done using grapher, and I’d like to get an interactive version going in Processing. There is an old Processing discourse on graph-layout, which is quite informative on a number of practical issues in handling larger networks. The great thing is that the data reveals some underlying logical structure: you are not imposing a pre-conceived mental model on the structure of the problem.
I came to think of it as one-dimensional graph layout, in the following manner: like a librarian, we are given thousands of books and a single shelf, which happens to be very, very long. The task is to arrange all the books on that single shelf such that works on a similar topic are close to each other. Then you can decide which is the best book for you on a given subtopic by just comparing it to its local neighbours.
The resulting algorithm is a version of a force-directed layout and similar in spirit to a Hyperassociative Map.
For a one-dimensional layout, it’s sufficient to assign a “slot” (integer 1..n) to each of the n nodes, which already takes care of the “repulsion” term you need in 2D or 3D. The scoring function measures the length of the similarity edges in the 1D space induced by the assignment, and the algorithm moves the nodes along the axis in order to minimize this cost function. That’s all the magic required! Yours truly will keep you updated on the progress of making this approach interactive.
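The slot assignment and cost function described above can be sketched in a few lines. This is a minimal illustration and not the actual implementation: the cost used here (similarity weight times slot distance, summed over all edges) and the random pairwise-swap search are assumptions, and the names `layout_cost` and `one_dimensional_layout` are made up.

```python
import random

def layout_cost(order, edges):
    """Sum of similarity weight times slot distance over all edges."""
    slot = {node: i for i, node in enumerate(order, start=1)}  # slots 1..n
    return sum(w * abs(slot[u] - slot[v]) for (u, v), w in edges.items())

def one_dimensional_layout(nodes, edges, sweeps=200, seed=0):
    """Shuffle the nodes onto slots, then keep only swaps that lower the cost."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    best = layout_cost(order, edges)
    for _ in range(sweeps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        cost = layout_cost(order, edges)
        if cost < best:
            best = cost  # keep the improving swap
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best

# Two tight pairs of "books" with one weak link between them.
nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): 5, ("c", "d"): 5, ("b", "c"): 1, ("d", "e"): 2}
order, cost = one_dimensional_layout(nodes, edges)
print(order, cost)
```

Recomputing the full cost after every swap is wasteful; with millions of edges you would only re-score the edges touching the two swapped nodes. A greedy search like this can also get stuck in a local minimum, so treat the result as one reasonable shelf arrangement rather than the best possible one.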