Metagenomics, biomes, and dirt: separating good data from bad

The simple fact is this: some DNA sequences are more believable than others.

The problem is that many students and researchers never see any of the metrics that we use to judge whether a sequence is "good" or "bad."

All they see are the base calls and sequences: ATAGATAGACGAGTAG, without any supporting information to help them evaluate if the sequence is correct. If DNA sequencing and personalized genetic testing are to become commonplace, the practice of ignoring data quality is (in my opinion) simply unacceptable.
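To make "supporting information" a little more concrete, here is a minimal Python sketch of base calls shown next to per-base quality values. It assumes the quality values are Phred-style scores, where Q = -10 * log10(P) and P is the estimated chance that the base call is wrong; this post does not say which scale the data set uses. The quality numbers below are invented for illustration and do not come from the student data, and the Q20 cutoff is just one common convention, not a rule.

```python
# Base calls from the example above, paired with MADE-UP quality values.
# Assumption: the values are Phred-style scores (Q = -10 * log10(P_error)).
bases = "ATAGATAGACGAGTAG"
quals = [8, 12, 25, 34, 40, 42, 45, 44, 41, 39, 38, 36, 30, 22, 11, 7]


def error_probability(q):
    """Convert a Phred-style quality value Q to an error probability."""
    return 10 ** (-q / 10)


def fraction_at_least(quals, threshold=20):
    """Return the fraction of bases whose quality meets the threshold."""
    return sum(q >= threshold for q in quals) / len(quals)


# Print each base with its quality value and estimated error probability.
for base, q in zip(bases, quals):
    print(f"{base}  Q{q:<3} error probability ~ {error_probability(q):.4f}")

# Summarize the read with a single number (Q20 is an assumed cutoff).
print(f"Fraction of bases at Q20 or better: {fraction_at_least(quals):.2f}")
```

In this invented example, the low values at both ends of the read are the bases I would trust least, even though every letter looks equally confident in the plain sequence.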

So, for a while anyway, I'm making a bunch of this data available online, and I'll describe how to work with it and what it means.

To see some DNA sequence data, with quality values:
1. go to
2. log in with the user name: BR_guest
3. and the password: guest


When you get there, click the link to see the folders that I've set up.

This link takes you to a folder with student data from 2005. (Learn more about the project.) Then, click the link to see a summary of information about the chromatograms.



When you get to the chromatogram table, you can see some information about the quality of each chromatogram. You can take a closer look at the data by clicking the FinchTV link, which opens the chromatogram in FinchTV. (FinchTV is freely available from Geospiza.)


Which values do you think correspond to good data?

Which values are associated with poor quality data?

Feel free to sort the data and play with it a bit. What fraction of the sequences would you say are "good"?
