Metagenomics, biomes, and dirt: separating good data from bad

The simple fact is this: some DNA sequences are more believable than others.

The problem is that many students and researchers never see any of the metrics that we use to judge whether a sequence is "good" or "bad."

All they see are the base calls, something like ATAGATAGACGAGTAG, with no supporting information to help them evaluate whether the sequence is correct. If DNA sequencing and personalized genetic testing are to become commonplace, the practice of ignoring data quality is (in my opinion) simply unacceptable.

So, for a while anyway, I'm making a bunch of this data available online, and I'll describe how to work with it and what it means.

To see some DNA sequence data with quality values:
1. Go to http://classroom1.bio-rad.ifinch.com
2. Log in with the user name: BR_guest
3. Enter the password: guest

[Image: step1_folders.png]

When you get there, click the link to see the folders that I've set up.

[Image: step2.png]

This link takes you to a folder with student data from 2005. (Learn more about the project) Then, click the link to see a summary of information about the chromatograms.

 

[Image: chromat_table.png]

When you get to the chromatogram table, you can see some information about the quality of each chromatogram. You can take a closer look at the data by clicking the FinchTV link, which opens the chromatogram in FinchTV. (FinchTV is freely available from Geospiza.)

 

Which values do you think correspond to good data?

Which values are associated with poor quality data?

Feel free to sort the data and play with it a bit. What fraction of the sequences would you say are "good"?
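
If you'd like to put a number on that last question, here is a minimal sketch in Python. It assumes the quality values are Phred-style scores, where Q = -10 * log10(estimated error probability), so Q20 means roughly a 1-in-100 chance that the base call is wrong. That assumption, the read names, the quality lists, and both cutoffs are mine and made up for illustration; none of them come from the chromatogram table itself.

```python
def error_probability(q):
    """Convert a Phred-style quality value into an estimated error probability."""
    return 10 ** (-q / 10)


def count_high_quality_bases(quals, cutoff=20):
    """Count the bases whose quality value is at or above the cutoff."""
    return sum(1 for q in quals if q >= cutoff)


def fraction_good(reads, min_high_quality_bases=3, cutoff=20):
    """Fraction of reads with at least `min_high_quality_bases` bases at or above `cutoff`.

    Both thresholds are arbitrary illustrative choices, not a standard.
    """
    if not reads:
        return 0.0
    good = sum(
        1
        for quals in reads.values()
        if count_high_quality_bases(quals, cutoff) >= min_high_quality_bases
    )
    return good / len(reads)


# Hypothetical per-base quality values for four very short reads.
reads = {
    "read_A": [35, 40, 38, 42, 30],
    "read_B": [8, 12, 9, 15, 11],
    "read_C": [22, 25, 10, 31, 19],
    "read_D": [5, 7, 21, 6, 4],
}

for name, quals in reads.items():
    print(name, count_high_quality_bases(quals))

print("fraction good:", fraction_good(reads))
```

Real reads are hundreds of bases long, so a sensible minimum would be much larger than 3. The point is only that "good" needs a cutoff, and changing the cutoff changes the answer.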
