Making the Data Public: Interview With Xan Gregg

Xan Gregg has also attended both the first Science Blogging Conference and the second one in January, where he co-moderated a session on Public Scientific Data. He blogs on FORTH GO.

Welcome to A Blog Around The Clock. Would you, please, tell my readers a little bit more about yourself? Who are you? What is your scientific background? What is your Real Life job?

I'm a software engineer working at SAS Institute on a desktop "statistical discovery" application called JMP. (Yes, we have a blog, and I sometimes post to it.) My primary interest is data visualization, and in 2006 I won a data visualization competition judged by author Stephen Few. My background is in math and computer science, and I use both fields as a team member at Project Euler, which is a site full of challenging math problems that usually require writing programs to solve.

When and how did you discover science blogs? What are some of your favourites? Have you discovered any new cool science blogs while at the Conference?

It wasn't until I attended the first Science Blogging Conference that I knew about so much science blogging going on. Now I have trouble keeping up. I can hardly read as fast as you can blog! I like those blogs that provide good summaries of recent research, such as Cognitive Daily, Statistical Modeling, and one I discovered at the conference, ThankYouBrain by attendee Bill Klemm.

How did you get interested in public data?

Having a focus on data visualization, I'm always analyzing graphs and trying to think of ways to make them better. To really make a point, I need to actually produce a better visualization from the same data, and I have been disappointed to find that the data is not often readily available. I can sometimes resort to programs like GraphClick that can scrape data from standard graphs, but even that doesn't work for summary graphs where the real data is invisible.

Why should scientists make their raw data public? What are the pros and cons?

The more I researched the subject, the more I found a disconnect between what scientists say and what they do. Almost every authority extols the principles of public data, but few scientists practice it openly. I've found it to be primarily a question of when. Full open science labs like Jean-Claude Bradley's UsefulChem publish data as it's generated, but that model isn't for everyone. I'd be happy to see data published with papers, as with the policy of the American Economic Review, but the usual answer to the question of when is "when somebody asks for it nicely enough."

The pros and cons depend on your goals. If you're trying to further public knowledge, then sharing data supports that goal. If you're in a competitive situation, then sharing data could weaken your position. I guess that's a philosophical issue on the nature of scientific research and the public good. In practical terms, publishing data encourages better review and new derivative research, and the only con is with confidential data that can't be effectively anonymized.

Are there disciplinary differences?

The main disciplinary difference I've seen regards the quantity of data. Fields like astronomy and genetics have tons of data, which encourages central data repositories for archiving data.

How would you go about persuading a scientist to make his/her data public?

The idea is there already, so I'd focus on showing how easy it is to share data in a minimal way. Of course, most scientists take their cues from journals and funders, and we need more of them to require data. Some governments, including the US government, are moving in that direction for publicly funded research. It'd be nice to see PLoS adopt something like the data policy of American Economic Review. I'd be happy to work with someone on setting up a data repository site.

How should the raw data be presented online?

Any way you can. Just a CSV (comma-separated values) file sitting on a web server is fine. Better is an independent site, such as Swivel or Google Docs. The important thing is to remember to include a description of the data fields and sources. Then use the URL of your data as a citation point.
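
As a rough illustration of that minimal approach (not something from the interview itself), here is a small Python sketch that produces a CSV together with a plain-text description of the fields and the source. The function name `write_dataset`, the field names, the sample values, and the source string are all hypothetical placeholders.

```python
import csv
import io


def write_dataset(rows, fields, descriptions, source):
    """Return (csv_text, readme_text) for a small dataset.

    rows         -- list of dicts, one per observation
    fields       -- column names, in output order
    descriptions -- dict mapping each column name to a short description
    source       -- free-text note on where the data came from
    """
    # Build the CSV in memory; in practice this text would be saved to a
    # file and placed on a web server.
    csv_buf = io.StringIO()
    writer = csv.DictWriter(csv_buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

    # Build a companion description so readers know what each field means.
    readme_lines = [f"Source: {source}", "", "Fields:"]
    for name in fields:
        readme_lines.append(f"  {name}: {descriptions[name]}")
    readme_text = "\n".join(readme_lines) + "\n"

    return csv_buf.getvalue(), readme_text


# Hypothetical example data, for illustration only.
fields = ["subject_id", "reaction_ms"]
rows = [
    {"subject_id": 1, "reaction_ms": 312},
    {"subject_id": 2, "reaction_ms": 287},
]
descriptions = {
    "subject_id": "Anonymized participant number",
    "reaction_ms": "Reaction time in milliseconds",
}
csv_text, readme_text = write_dataset(
    rows, fields, descriptions, source="Hypothetical lab experiment"
)
print(csv_text)
print(readme_text)
```

Saving the two pieces of text side by side, for example as `data.csv` and `README.txt` (names assumed here), gives a single URL location that a paper can cite.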

Is there anything that happened at this Conference - a session, something someone said or did or wrote - that will change the way you think about science communication, or something that you will take with you to your job, blog-reading and blog-writing?

The whole conference makes me temporarily depressed. I find out that for every good idea I've had, not only has someone else already had it, but three sites are already implementing it!

It was so nice to see you again and thank you for the interview.

Thank you, Bora. Keep on tickin'.

============================

Check out all the interviews in this series.
