sequence analysis

Is it real or is it April Fools? The March 21st issue of Science has an interesting news article by Elizabeth Pennisi and a letter to the editor about a proposal to wikify GenBank. Currently, the NCBI holds the original authors responsible for editing or correcting entries and this does cause problems when those authors fail to return to the scene and fix what they've submitted. Some researchers are suggesting that third parties be allowed to fix some of those mistakes or at least add comments to records, to warn the unwary. There are some good arguments on both sides and it's certainly…
One of my colleagues has a two-part series on FinchTalk (starting today) that discusses uncertainty in measurement and what that uncertainty means for the present and Next Generation DNA sequencing technologies. I've been running into this uncertainty myself lately. I have always known that DNA sequencing errors occur. This is why people build tools for measuring the error rate and why quality measurements are so useful for determining which data to use and which data to believe. But some of the downstream consequences didn't really hit home for me until a recent project. This project…
I read about this in Bio-IT World and had to go check it out. It's called the Genome Projector and it has to be the coolest genome browser I've ever seen. They have 320 bacterial genomes to play with. Naturally, I chose our friend E. coli. The little red pins in the picture below mark the positions of ribosomal RNA genes. (It's not perfect; at least one of these genes is a ribosomal RNA methyltransferase and not a 16S ribosomal RNA.) I'm not entirely happy about finding it now, after I've already written and posted all the assignments for my class, but still, I'll post a link for my…
Here's a fun puzzler for you to figure out. The blast graph is here: The table with scores is here, click the table to see a bigger image: And here is the puzzling part: Why is the total score so high? If you want to repeat this for yourself, go here. You can use this sequence as a query (it's the same one that I used). >301.ab1 CTAGCTCTTGGGTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCCGATGGAG GGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGTGGGGGA CCTTCGGGCCTCACACCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGG CTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGA…
Yep, I've become a videoblogger, at least sometimes. See the first video below. Be kind in the comments; this is a new thing for me. This video introduces the different blast programs and covers word size, how blastn works, the blastn score, and the E value. The treatment is light and not too in-depth, but as I said, it's an introduction. A quick introduction to BLAST from Sandra Porter on Vimeo.
The New England Journal of Medicine published a study yesterday showing that small changes in the DNA in the long arm of chromosome 16 are associated with autism. I met a teenager with autism last summer, when I attended family night at the Seattle Park and Rec summer camp program for kids with special needs. It's a fantastic program. The kids spend a week or more at summer camp and the parents get some much needed time off. I sat down on a log and my daughter (one of the counselors) introduced me to a boy in her cabin and told me that he's interested in trains. Since I rode a train for…
In which we're reminded that database searches are experiments, too. One of the trickiest things with bioinformatics experiments is repeating them. This challenge isn't related to the validity of the original results; the challenge is that, unless you made your own database and kept it in the same state, the database that you'll be using at a later time, sometimes even a day later, is a different database. And, if you query a different database, you may get a different result. The series that I'm currently posting is one that I started working on a couple of years ago. Originally, I was…
Last week I posted an image with two molecules (below the fold), one protein and one nucleic acid, and asked you about the probability of finding similar molecules in different species. You gave me some interesting answers. DAG made me clarify my question by asking what I meant by "similarity." I was wondering whether I would be likely to find a statistically relevant match by doing a BLAST search and I hadn't really thought about the cutoff values. I decided to guess and say that the protein would be about 30% similar and the nucleic acid about 60%. Paul gave me some answers…
If you like ham and bacon, you might be interested in this. GenomeWeb reports that researchers at the University of Barcelona have developed an assay that tests 46 SNPs and can be used to trace the origin of your pork dinner. According to GenomeWeb, the test identifies both the breed and origin of the animal. The university and the company said meat traceability is necessary to ensure consumer safety, particularly in cases of infectious disease outbreaks or accidental feed contamination. No more doubts about the home of your Jamón.
'Tis the holiday season and, according to ancient lore, the time when miraculous events are most likely to take place. One of those well-known and miraculous events of ancient days was the birth of a son to a young girl who, although she was married (okay, I'm not sure about this part of the story), was said to be a virgin, and the birth was said to be a miracle. Hmmm. How do you think the news would be received if that sort of thing happened today? Certainly, if the young girl were to produce a grilled cheese sandwich with a burn spot that vaguely resembled a woman in a robe, someone might be…
GenomeWeb reports that Rite Aid drug stores on the West Coast are now selling kits for doing paternity tests. The kits are made by Sorenson Genomics in Utah. Sorenson Genomics calls it the "peace-mind-test." Really! Each kit contains a swab for collecting cheek cells from the inside of your mouth and a container for mailing the sample to the lab. As far as I can tell, you buy the kit for $29.99, take a sample, fill out the consent forms, and mail the sample to Sorenson along with the $119 lab fee. Maybe I'm too imaginative, but I'm a little puzzled by some of the information that wasn'…
Which read(s): 1. contain either a SNP (a single nucleotide polymorphism) or a position where different members of a multi-gene family have a different base? C 2. doesn't have any DNA? B 3. is a PCR product? A, B, and C. All three reads were obtained by sequencing PCR products, generated with the same set of primers. The quality plots that I refer to are here.
Since DNA diagnostics companies seem to be sprouting like mushrooms after the rain, it seemed like a good time to talk about how DNA testing companies decipher meaning from the tests they perform. Last week, I wrote about interpreting DNA sequence traces and the kind of work that a data analyst or bioinformatics technician does in a DNA diagnostics company. As you might imagine, looking at every single DNA sample by eye gets rather tiring. One of the things that informatics companies (like ours) do is try to help people analyze several samples at once so that they can scan fewer…
As many of you know, I'm a big fan of do-it-yourself biology. Digital biology, the field that I write about, is particularly well-suited to this kind of fun and exploration. Last week, I wrote some instructions for making a phylogenetic tree from mitochondrial genomes. This week, we'll continue our analysis. I wrote this activity, in part, because of this awful handout that my oldest daughter brought home last year. She presented me with an overly photocopied paper that showed several protein sequences from cytochrome C in several creatures. She said she was supposed to count the…
DNA sequence traces are often used in cases where: We want to identify the source of the nucleic acid. We want to detect drug-resistant variants of human immunodeficiency virus. We want to know which base is located at which position, especially where we might be able to diagnose a human disease or determine the best dose of a therapeutic drug. In the future, these assays will likely rely more on automation. Currently (at least outside of genome centers), many of these results are assessed by human technicians in clinical research labs, or DNA testing companies, who review these data by…
Students at Soldan International High School are participating in an amazing experiment and breaking ground that most science teachers fear to tread. Soldan students, along with hundreds of thousands of other people, are participating in National Geographic's Genographic Project. Through this project, students send in cheek swabs, DNA is isolated from the cheek cells, and genetic markers are used to look at ancestry. Genetic markers in the mitochondrial DNA are used to trace ancestry through the maternal line and markers on the Y chromosome can be used to learn about one's father.…
Last year I wrote about an experiment where I compared a human mitochondrial DNA sequence to primate sequences in GenBank. Since I wanted to know about the differences between humans, gorillas, and chimps, I used the Entrez query 'Great Apes' to limit my search to a set of sequences in the PopSet database that contained gorilla, bonobo, chimp, and human DNA. A week ago, I tried to repeat this experiment and... It didn't work. All I saw were human mitochondrial sequences. I know the other sequences match, but I didn't see them since there are so many human sequences that match…
Metagenomics is a field where people interrogate the living world by isolating and sequencing nucleic acids. Since all living things have DNA, and viruses have either DNA or RNA, we can identify who's around by looking at bits of their genome. Researchers are using this approach to find the culprit that's killing the honeybees. We're also trying to find out who else shares our bodies, and lives in our skin, in our stomachs, and other places where the sun doesn't shine. Craig Venter used metagenomics when he sailed around the world and sequenced DNA samples from the Sargasso Sea. In this…
The simple fact is this: some DNA sequences are more believable than others. The problem is that many students and researchers never see any of the metrics that we use for evaluating whether a sequence is "good" and whether a sequence is "bad." All they see are the base calls and sequences: ATAGATAGACGAGTAG, without any supporting information to help them evaluate if the sequence is correct. If DNA sequencing and personalized genetic testing are to become commonplace, the practice of ignoring data quality is (in my opinion) simply unacceptable. So, for a while anyway, I'm making a…
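To make the idea of a sequence quality metric concrete, here is a minimal sketch of one widely used measure, the Phred-style quality score, where Q = -10 * log10(P) and P is the estimated probability that a base call is wrong. The excerpt above doesn't say which metrics the post goes on to cover, so treat this choice, and the function names, as assumptions made for illustration.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred-style score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-style quality score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Sum of per-base error probabilities: the expected number of wrong calls in a read."""
    return sum(phred_to_error_probability(q) for q in qualities)
```

Under this definition, a score of 20 corresponds to an estimated 1-in-100 chance that the base call is wrong, and a score of 30 to 1 in 1,000. A bare string like ATAGATAGACGAGTAG carries none of that information.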
We have lots of DNA samples from bacteria that were isolated from dirt. Now it's time to do our own metagenomics project and figure out what they are. Our class project is on a much smaller scale than the honeybee metagenomics project that I wrote about yesterday, but we're using many of the same principles. The general process is this: 1. We sort the chromatogram data to identify good data and separate it from bad data. Informatics can help you determine if data is good, and measure how good it is, but it cannot turn bad data into good data. And, there's no point in wasting time with…
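As a rough sketch of what step 1 might look like in code, the snippet below sorts reads into "good" and "bad" piles by counting how many bases in each read have a quality score at or above a cutoff. The cutoffs (a score of 20, and at least 100 such bases), the function names, and the input layout (a dictionary mapping read names to lists of per-base quality scores) are all assumptions for illustration, not the procedure used in the class.

```python
def count_high_quality_bases(qualities, min_quality=20):
    """Count the bases whose quality score is at or above min_quality."""
    return sum(1 for q in qualities if q >= min_quality)


def sort_reads(reads, min_quality=20, min_good_bases=100):
    """Split reads into (good, bad) lists of read names.

    reads maps a read name to its list of per-base quality scores.
    A read is 'good' if it has at least min_good_bases bases at or
    above min_quality; both thresholds are hypothetical defaults.
    """
    good, bad = [], []
    for name, qualities in reads.items():
        n_good = count_high_quality_bases(qualities, min_quality)
        if n_good >= min_good_bases:
            good.append(name)
        else:
            bad.append(name)
    return good, bad
```

Note that the function only labels reads. It never alters the quality scores or the base calls, in keeping with the point that informatics can measure how good the data is but cannot turn bad data into good data.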