Using the fact that sometimes scientists look at the pictures first

I was happy to see that the authors published this article in PLoS ONE. I was following their work a while ago but had lost track (plus, when asked, the last author implied that they had moved on to new projects). So here's the citation, and then I'll summarize and comment.

Divoli, A., Wooldridge, M., & Hearst, M. (2010). Full Text and Figure Display Improves Bioscience Literature Search. PLoS ONE, 5 (4). DOI: 10.1371/journal.pone.0009619

The authors created a prototype information system that uses Lucene to index open access biomed articles: the metadata, the full text, and the captions for images and tables. The interface gives you one search box plus radio buttons to choose among full text and abstracts, figure captions, or tables. In the first view, the results look much like the standard metadata-and-abstract display, with keyword-in-context excerpts and extracted images added. For figure captions, you can have either a grid of figures or a list. For tables, you get a citation, the table caption, and the table itself. The article spends a good deal of time discussing design decisions, which effectively makes it a tutorial for creating your own.
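To make "keyword-in-context excerpts" concrete, here is a minimal sketch in Python. It is not the authors' code: the function name `kwic` and the `width` parameter are my own, and the prototype itself relies on Lucene rather than anything this simple.

```python
import re


def kwic(text, keyword, width=30):
    """Return one excerpt per match: the keyword plus up to `width`
    characters of context on each side (hypothetical helper)."""
    excerpts = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Mark excerpts that were cut out of a longer passage.
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        excerpts.append(prefix + text[start:end] + suffix)
    return excerpts


if __name__ == "__main__":
    sentence = "Figure 2 shows kinase activity in yeast cells after heat shock."
    for excerpt in kwic(sentence, "kinase", width=10):
        print(excerpt)
```

A real system would probably cut at word or sentence boundaries and highlight the match, but the idea is the same: show the searcher why the article matched.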

To build the prototype, they got the XML from PubMed Central and pulled out authors, images, captions, abstracts, and so on. They made different sizes of the images for quick retrieval later. They then indexed these as separate fields, weighted differently depending on which kind of search you select. Next they recruited a group of biologists (n=20, although the number isn't really important for qualitative studies) and ran them through a study. The participants provided their own query and looked at the results in each view, thinking aloud about their reactions and steps. They were then asked a few questions about each interface.
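Here is a rough sketch of that pipeline: pull fields out of article XML, then weight them differently per search mode. Everything specific in it is an assumption for illustration. The sample XML is a made-up, heavily simplified stand-in for a PubMed Central record, the weights are invented, and the scoring is a plain weighted term count rather than Lucene's actual ranking.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified article record (not a real PMC file).
SAMPLE_XML = """
<article>
  <front>
    <article-title>Heat shock response in yeast</article-title>
    <abstract><p>We measure kinase activity after heat shock.</p></abstract>
  </front>
  <body>
    <p>Cells were grown overnight.</p>
    <fig id="f1"><caption><p>Kinase activity over time.</p></caption></fig>
  </body>
</article>
"""

# Hypothetical per-mode field weights; the paper's real values may differ.
WEIGHTS = {
    "full_text": {"title": 3.0, "abstract": 2.0, "captions": 1.0},
    "captions": {"title": 0.5, "abstract": 0.0, "captions": 4.0},
}


def text_of(element):
    """Flatten an element's nested text and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_fields(xml_string):
    """Pull title, abstract, and figure captions into separate fields."""
    root = ET.fromstring(xml_string.strip())
    title = root.find(".//article-title")
    abstract = root.find(".//abstract")
    captions = [text_of(c) for c in root.iterfind(".//fig/caption")]
    return {
        "title": text_of(title) if title is not None else "",
        "abstract": text_of(abstract) if abstract is not None else "",
        "captions": " ".join(captions),
    }


def score(fields, query, mode):
    """Weighted count of case-insensitive substring matches per field."""
    term = query.lower()
    return sum(
        weight * fields[name].lower().count(term)
        for name, weight in WEIGHTS[mode].items()
    )


if __name__ == "__main__":
    fields = extract_fields(SAMPLE_XML)
    for mode in WEIGHTS:
        print(mode, score(fields, "kinase", mode))
```

The point of keeping fields separate is that the same index can serve all three radio buttons; only the weights change with the selected mode.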

The majority of the participants would choose to use this type of interface for at least some of their searching. It seems they got the full text search but were not quite as sure about the table search: some thought it would be useful for getting right to the results, but several didn't think they would use it.

Now for some commentary...

I was somewhat critical in my post I linked to above, but I really think this is promising stuff. The authors point out that this is very dependent on access to the full text and also won't be universally useful. There are plenty of search situations in which the images wouldn't be used, but they should be an option. Since my earlier post, CSA has added "deep indexing" to more of their files. It's not the same as their dedicated Illustrata product, which is more like Biotext.

Publishers have the full text, so some of them are also making the images and tables available outside of the article. For example, both ACS and RSC have added images to their RSS feeds. ScienceDirect has a tables and images tab on their articles, which is nice for scanning to see if the article is relevant. PLoS ONE lets you look through a list of the tables and images and download a PowerPoint slide or a high-quality image.

Springer Images also lets you search the tables and captions to get pictures. It also indexes the context of the reference to the image in the text. You also get a link to the article and excerpts like on Google Books. My colleague at work pointed out that it is useful for finding phase diagrams.


But more than all of that, there's been a lot of talk recently about disaggregating the journal article or even doing away with the whole and just using the pieces. If so, maybe this is an intermediate step.

Good question. Certainly Google could index all of the content Biotext covers (as can anyone), but precisely what Google indexes and when is not provided. This is a sore point with librarians. Clearly CSA and Springer index things that are covered with a standard copyright, but both provide data to Google. One would assume that they don't "give away the store" to Google, but I don't know what that boundary is. I also do not know how the Google team weights metadata from various parts of an article. Do they take the XML feed from PMC and treat the abstract differently from the full text?

A lot of what this team (and the Tenopir & Sandusky team) did was to understand *how* to make the images and tables available. It would be interesting to compare the Google image search with the Biotext image search for a query that was very clearly within the biomed domain. The authors also point out that snapshots of the whole page in search results typically aren't that useful. Does Google pull out images in Scholar? I haven't seen it if they do.

By red pepper (not verified) on 27 Apr 2010