DNA sequencing and bioinformatics, part III: a case study from the classroom

This is the third part of a case study where we see what happens when high school students clone and sequence genomic plant DNA. In this last part, we use the results from an automated comparison program to determine whether the students cloned any genes at all and, if so, which genes were cloned. (You can also read part I and part II.)

Did they clone or not clone? That is the question.

But first, we have to answer a different question about which parts of their reads are usable and which parts are not. (A read is the sequence of bases obtained from a chromatogram file.)

How does our data get processed? Let me count the ways
Once iFinch has counted the number of bases with quality scores of 20 or greater, other algorithms look at the bases at each end of the read. Bases at the ends are typically more difficult to identify and consequently have lower quality. Sometimes people want to trim those bases off the final DNA sequence, so iFinch marks the trimming points, letting you see which bases will be removed if you choose to download the trimmed sequence.
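To make the idea concrete, here is a minimal sketch of counting Q20 bases and picking trim points. It is not iFinch's actual algorithm, which isn't described here; the sliding-window rule, the window size of 10, and the function names are all assumptions chosen for illustration.

```python
def count_q20(quals, threshold=20):
    """Count bases whose quality score is at or above the threshold."""
    return sum(1 for q in quals if q >= threshold)


def trim_points(quals, threshold=20, window=10):
    """Return (start, end) so that quals[start:end] is the region to keep.

    Assumed rule (not iFinch's): slide a fixed-size window in from each end
    and stop at the first window whose mean quality reaches the threshold.
    Returns (0, 0) if no window qualifies.
    """
    n = len(quals)
    if n < window:
        return (0, 0)

    # Scan from the left for the first acceptable window.
    start = None
    for i in range(n - window + 1):
        if sum(quals[i:i + window]) / window >= threshold:
            start = i
            break
    if start is None:
        return (0, 0)

    # Scan from the right for the last acceptable window.
    end = n
    for j in range(n, window - 1, -1):
        if sum(quals[j - window:j]) / window >= threshold:
            end = j
            break
    return (start, end)


# Toy read: poor quality at both ends, good quality in the middle.
example_quals = [5] * 8 + [40] * 30 + [5] * 8
print(count_q20(example_quals))    # 30
print(trim_points(example_quals))  # (3, 43)
```

Note that a windowed rule like this one keeps a few low-quality bases at each boundary; a stricter rule would trim more.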

If we open a file from iFinch in FinchTV, the bases that would get trimmed appear in a shaded region.

[Image: finchtv_shading.gif]

Identifying biological issues

Screening the results
There are three things that we're likely to sequence in a genomic cloning experiment: vector DNA, E. coli DNA, and, we hope, DNA from the region we want. We can save time by using some of iFinch's features to identify these sequences automatically.

One of the steps in our bioinformatics pipeline is to compare the DNA sequences from the chromatograms with sequence collections that we put into iFinch ahead of time. This allows us to quickly identify the kinds of things that we sequenced in our experiment. We call these steps "screening."

One of those collections contains standard vector sequences. If part of a sequence matches a vector sequence, it gets colored blue in the chromatogram read report. These same sequences are highlighted in pink when we view the trace in FinchTV (above). We can also hide (mask) the vector sequences from other programs when we download FASTA sequences from iFinch.
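Masking can be pictured as replacing the vector bases with a placeholder character before writing the FASTA record, so downstream programs ignore them. The sketch below assumes the vector regions are already known as (start, end) coordinates; the function names, the coordinates, and the choice of "X" as the mask character are illustrative assumptions, not iFinch's documented behavior.

```python
def mask_regions(seq, regions, mask_char="X"):
    """Replace bases inside each half-open (start, end) region with mask_char."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)


# Hypothetical read with vector sequence at both ends.
read = "ACGTACGTAC"
masked = mask_regions(read, [(0, 3), (8, 20)])
print(to_fasta("example_read", masked))
```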

[Image: read_color.gif]

The other sequence collection that we use for screening is usually the E. coli genome. In this case, we decided to use a different set of sequences that better suits our experiment.

The gene that students are trying to clone, GAPDH, belongs to a multi-gene family. Multi-gene families arise when genes are duplicated during evolution. Arabidopsis thaliana, our favorite weedy plant, has seven members that belong to the GAPDH family, located on two different chromosomes. Knowing that we had a strong chance of cloning multiple family members, I added a sequence file to iFinch that contained all the Arabidopsis GAPDH genes. iFinch uses a sensitive algorithm to compare all of the sequences from our experiment with each of the seven sequences from Arabidopsis and stores the results in the iFinch database. We can see the results of the sequence screen whenever we look at a chromatogram report. If a sequence matches one of the Arabidopsis GAP genes, then it's shown in red and the sequence that it matches is shown below the read. A sequence must be at least 80% identical to the Arabidopsis sequence to be identified by our algorithm. If a sequence isn't identified in our screen, it might still be a snapdragon GAPDH gene, but it's not similar enough to Arabidopsis to be detected by our screen.
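The 80% cutoff can be illustrated with a simple percent-identity calculation over a pair of already-aligned sequences. This is only a sketch: how iFinch aligns reads and how it counts gaps isn't stated here, so the definition below (matches divided by alignment columns, gaps counted as mismatches) and the function names are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two aligned sequences of equal length.

    Gaps are written as "-". Columns where both sequences have a gap are
    ignored; a gap opposite a base counts as a mismatch (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def passes_screen(aligned_a, aligned_b, min_identity=80.0):
    """True if the pair meets the identity threshold used for the screen."""
    return percent_identity(aligned_a, aligned_b) >= min_identity


# Toy alignments: 8 of 10 columns match, then 7 of 10.
print(passes_screen("ACGTACGTAC", "ACGTACGTTT"))  # True
print(passes_screen("ACGT-CGTAC", "ACGTACGTTT"))  # False
```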

Okay, okay, what did we clone?
iFinch presents most of the information that we want in tables and reports, but sometimes it's nice to dig a little deeper into the iFinch database. We can do this with SQL. In setting up this iFinch system, I wrote some SQL statements to get some of this information. These SQL statements have been stored in the iFinch database and made public so anyone can run them and either view the results or download them as an Excel file.

What more can we learn from the sequence screen?
Do any of our clones consist completely of vector sequences? Two of the SQL statements check that possibility. The answer is "no." Although many of our clones contain some vector sequences, none of our clones are completely vector. This is good news!
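The stored iFinch queries and the real database schema aren't shown here, so the following is only a sketch of what such a check might look like. It uses Python's built-in sqlite3 module with a made-up `reads` table; the table name, column names, and the toy rows are all hypothetical.

```python
import sqlite3


def fully_vector_reads(conn):
    """Return names of reads whose vector-matching bases cover the whole read."""
    rows = conn.execute(
        "SELECT read_name FROM reads "
        "WHERE length > 0 AND vector_bases >= length "
        "ORDER BY read_name"
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and toy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reads (read_name TEXT, length INTEGER, vector_bases INTEGER)"
)
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [("read_a", 700, 65), ("read_b", 650, 0), ("read_c", 480, 120)],
)

# Some reads contain vector bases, but none is vector from end to end.
print(fully_vector_reads(conn))  # []
```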

Which GAP genes did we clone? When I run the query by clicking the button, I find that the class cloned four different GAPC genes. About two thirds of our clones are from the gene that we are trying to clone (GAPC). Two of the other sequences, GAPCP-1 and GAPCP-2, code for proteins that get transported into the chloroplast.

[Image: cloned_genes.gif]

Next, we want to make sure that the students don't carry out assemblies with sequences from multiple genes. To do this, we want to confirm that all the sequences in each folder are from the same gene. We use a query that shows us all the matches, where the matches occur, the percent match, and the folders where those sequences are stored.

[Image: folder_multigenes.gif]

It turns out that there is one folder that contains sequences from multiple genes. It would be best for this group of students to move the GAPC-2 sequence to the DISCARD folder, since including it in the assembly could produce misleading results.
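A folder check of this kind boils down to grouping the screen matches by folder and flagging any folder with more than one distinct gene. The sketch below shows that idea with Python's sqlite3 module. The `screen_hits` table, its columns, the folder names, and the percent values are invented for illustration and are not the real iFinch schema or the class's data.

```python
import sqlite3


def folders_with_multiple_genes(conn):
    """Return (folder, number_of_distinct_genes) for folders matching >1 gene."""
    return conn.execute(
        """
        SELECT folder, COUNT(DISTINCT gene) AS n_genes
        FROM screen_hits
        GROUP BY folder
        HAVING COUNT(DISTINCT gene) > 1
        ORDER BY folder
        """
    ).fetchall()


# Hypothetical schema and toy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE screen_hits "
    "(folder TEXT, read_name TEXT, gene TEXT, percent_match REAL)"
)
conn.executemany(
    "INSERT INTO screen_hits VALUES (?, ?, ?, ?)",
    [
        ("group_1", "read_1", "GAPC", 95.0),
        ("group_1", "read_2", "GAPC", 93.5),
        ("group_2", "read_3", "GAPC", 94.0),
        ("group_2", "read_4", "GAPC-2", 88.0),
    ],
)

print(folders_with_multiple_genes(conn))  # [('group_2', 2)]
```

Any folder this returns is a candidate for cleanup before assembly.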

Why are we spending so much time worrying about whether the sequencing worked or not?

There is a saying that I heard somewhere that if we measure things, we can improve them.

Lots of things can go wrong in DNA sequencing and we won't know whether anything went wrong or what went wrong unless we measure what's been done. From our investigations above, we found that only half of our samples gave us usable data. This information is helpful if we ever plan to sequence DNA again. These kinds of data help us to troubleshoot our experiments and improve our lab technique.

And then, there's being practical. Our classroom time is always too short and we want students to get the most they can out of doing the experiment. Quickly sorting good data from bad helps students avoid wasting time trying to analyze crappy data.

Naturally, the students are doing much more than this. They are going further and finding out more about the gene structure and the protein sequence. They may even end up submitting their work to the NCBI. But this is where my part of the case study ends. The thrill of cloning a new gene and the excitement of discovery belong to the students.

If you want to give iFinch a try, we have a guest system set up and stocked with real data from one of our beta testers. Go to http://classroom1.bio-rad.ifinch.com and log in with the user name: BR_guest and the password: guest.
