What is Sequence Data?

3 points by krtab


krtab

This may be bordering on off-topic, but the PhD thesis as a whole is about bioinformatics, and I felt it made more sense to post the first chapter as a self-contained piece than to link to the thesis as a whole.

madhadron

This thesis seems to miss the point of what it calls homopolymer compression (I had never heard the name, but like many others I independently developed the technique). Homopolymer compression is a terrible idea. The reason we all did it is that our sequence alignment algorithms were designed around the error model of Sanger sequencing. In Sanger sequencing, the errors are primarily misread bases; missing a base or inserting an extra one happens occasionally, but much more rarely. In next-gen sequencers such as those from 454 or Nanopore, the main error mode is miscounting how many times a particular base appears in a run. If you have one A, you will most likely read one, somewhat less likely two, and less likely still three. If the actual sequence is AAA, the number you get in a read follows a distribution centered around 3.

The right way to handle this is to use an alignment algorithm adapted to this error mode. Unfortunately, alignment algorithms are distributed as black-box command-line tools that are painful to mess with, so, to get work done right now, you do "homopolymer compression", which brings the error model much closer to that of Sanger sequencing at the cost of throwing away information. If I were still in the field, what I would want is not slightly better ways to throw away data, but an alignment algorithm adapted to the different error regime.
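For anyone who has not seen the trick, here is a minimal Python sketch of what homopolymer compression does, as I understand it from the description above. The function names and the example sequences are made up for illustration; they do not come from the thesis or from any particular tool.

```python
from itertools import groupby


def hp_compress(seq):
    """Collapse every run of identical bases down to a single base."""
    return "".join(base for base, _ in groupby(seq))


def run_lengths(seq):
    """Return (base, run length) pairs: the information compression discards."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


# Hypothetical example: the read miscounts two homopolymer runs
# (TTT read as TT, AAA read as AAAA) but has no substitution errors.
true_seq = "GATTTACAAAG"
read = "GATTACAAAAG"

# After compression the miscounts vanish and the two strings are identical,
# so an aligner built for substitution-dominated errors sees a perfect match.
print(hp_compress(true_seq))
print(hp_compress(read))

# The run lengths are what got thrown away.
print(run_lengths(true_seq))
print(run_lengths(read))
```

The compressed strings match, which is exactly why the trick works with off-the-shelf aligners, and the run lengths differ, which is exactly the data you lose. An aligner adapted to this error regime would presumably score the run-length pairs directly instead of discarding them.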