Discovering Biology in a Digital World

My thoughts on biology, teaching, life, and exploring the living world via the digital one. Only my opinions are represented by these postings; they do not represent the viewpoints of any funding agency or Geospiza, Inc.

Profile

Sandra Porter: I am a digital biologist, teacher, and entrepreneur. My passion is developing instructional materials for 21st century biology (Digital World Biology).


Digging up the dirt on campus bacteria: how do we know if we have good data?

Category: Bioinformatics, Genomics, Microbiology, Science education, classroom activities, environmental education, sequence analysis, teaching, web resources
Posted on: October 28, 2007 2:24 PM, by Sandra Porter

Metagenomics is a field where people interrogate the living world by isolating and sequencing nucleic acids. Since all living things have DNA, and viruses have either DNA or RNA, we can identify who's around by looking at bits of their genome.

Researchers are using this approach to find the culprit that's killing the honeybees. We're also trying to find out who else shares our bodies and lives on our skin, in our stomachs, and in other places where the sun doesn't shine. Craig Venter used metagenomics when he sailed around the world and sequenced DNA samples from the Sargasso Sea.

In this article and some related posts (here, here, here, and here), I'm writing about the students at JHU and their ongoing research project to look at the bacteria that live on their campus.

I've also made one of our data sets (from 2005) available online. You, too, can log in, take a look at the data, and run an SQL query. If you wish, you can also download the data set and identify the sequences.

Enough preliminaries, let's move on.

In the chromatogram report below, we can see some mysterious values that describe the data quality. (Some of the column headings have been shortened to make everything fit on the page.)

[Figure: chromat_table.png, the chromatogram report table]


What do the numbers mean?

  • Len refers to the length of the read. The read is the sequence of bases obtained from the chromatogram.
  • Trim refers to the number of bases that are left after the poor quality bases have been trimmed from the 5' and 3' ends of the sequence. If we have a trim length of zero, it means that we didn't have any high quality bases left after trimming. A chromatogram with a trim length of zero would NOT be considered good data.
  • Q20 is defined here as the number of bases with a quality value of 20 or greater (a one in 100 chance of an error). You can learn more about quality values and what they mean here. That post refers to phred quality values. Quality values from the ABI KB base caller are an equivalent measurement.
  • Q40 refers to the number of bases with a quality value of at least 40 (a one in 10,000 chance of an error). If you are using DNA sequencing as a diagnostic test or working to discover medically important sequences, you want most of your data to have these kinds of quality values.
  • Q20/len and Q40/len refer to values that we get when we divide the number of acceptable or high quality bases by the length of the sequence. For example, if every base had an acceptable quality (Q20 or greater), then Q20/len would equal 1.
  • Signal strengths. The last four columns show the signals from the different fluorescent-labeled bases. These values can help us diagnose problems with our sequencing process. If the strengths are low, for example, we can check the template concentration or the quality of the other reagents. If the strengths are too high, template concentration might also be problematic.
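To make the Len, Q20, Q40, and ratio columns concrete, here is a minimal Python sketch that computes them from a list of per-base quality values. The function names and the example quality values are made up for illustration; they are not taken from the data set or from the reporting software. The error probability uses the standard phred relationship, Q = -10 * log10(P). Trim is left out because this post doesn't describe the rule used to trim the ends.

```python
def error_probability(q):
    """Convert a phred-style quality value to an error probability.

    Uses the phred definition Q = -10 * log10(P), so P = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


def summarize_read(quals):
    """Compute Len, Q20, Q40, and the two ratios for one read.

    `quals` is a list of per-base quality values (hypothetical input format).
    """
    length = len(quals)
    q20 = sum(1 for q in quals if q >= 20)  # bases with quality >= 20
    q40 = sum(1 for q in quals if q >= 40)  # bases with quality >= 40
    return {
        "len": length,
        "q20": q20,
        "q40": q40,
        # Guard against empty reads so we never divide by zero.
        "q20_per_len": q20 / length if length else 0.0,
        "q40_per_len": q40 / length if length else 0.0,
    }


# Made-up quality values for a very short read, for illustration only.
example_quals = [8, 15, 22, 35, 41, 45, 50, 38, 19, 9]
summary = summarize_read(example_quals)
print(summary)
print(error_probability(20), error_probability(40))
```

In this toy read, six of the ten bases reach Q20 and three reach Q40, so Q20/len is 0.6 and Q40/len is 0.3. Real reads are hundreds of bases long, but the arithmetic is the same.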

You can sort through the data by clicking the column headings and count data that match specific criteria by using the finder.

What fraction of the data would you consider to be "good"?
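One way to approach that question is to pick your own criteria and count the reads that pass. The sketch below is only an illustration: the four reads are invented, and the cutoff (a nonzero trim length and a Q20/len of at least 0.5) is an assumption chosen for the example, not a rule from the class or the reporting software. Try changing the threshold and see how the answer moves.

```python
# Invented reads using the column names from the report (Len, Trim, Q20).
reads = [
    {"name": "read_A", "len": 900, "trim": 700, "q20": 720},
    {"name": "read_B", "len": 850, "trim": 0, "q20": 40},
    {"name": "read_C", "len": 600, "trim": 350, "q20": 330},
    {"name": "read_D", "len": 750, "trim": 120, "q20": 150},
]


def is_good(read, min_q20_fraction=0.5):
    """Return True if a read passes the (hypothetical) quality criteria."""
    # A trim length of zero means no high quality bases survived trimming.
    if read["trim"] == 0 or read["len"] == 0:
        return False
    return read["q20"] / read["len"] >= min_q20_fraction


good = [r["name"] for r in reads if is_good(r)]
fraction_good = len(good) / len(reads)
print(good, fraction_good)
```

With these made-up numbers, read_A and read_C pass, so half of the reads count as "good" under this particular definition.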
