What is Sequence Data?
3 points by krtab
This may be bordering on off-topic, but the PhD thesis as a whole is about bioinformatics, and I felt it made sense to submit the first chapter as a self-sufficient post rather than linking to the PhD as a whole.
This thesis seems to miss the point of what it calls homopolymer compression (I had never heard the name, but I independently developed the technique, like many others). Homopolymer compression is a terrible idea. The reason we all did it is that our sequence alignment algorithms were all designed around the error model of Sanger sequencing. In Sanger sequencing, your errors are primarily misread bases; occasionally a base is missed or an extra one is inserted, but those errors are much rarer.
In next-gen sequencers such as those from 454 or Nanopore, the main error mode is miscounting how many times a particular base appears in a row. If the true sequence has one A, you will most likely read one, somewhat less likely two, and less likely still three. If the actual sequence is AAA, the number of As you get in a read follows a distribution centered around 3.
The right way to handle this is to use an alignment algorithm that is adapted to this error mode. Unfortunately, alignment algorithms are distributed as black-box command-line tools that are painful to mess with, so, to get work done right now, you do "homopolymer compression", which brings the error model much closer to that of Sanger sequencing at the cost of throwing away information. If I were still in the field, what I would want is not slightly better ways to throw away data, but an alignment algorithm that is adapted to the different error regime.
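For readers who have not met the technique, here is a minimal Python sketch of what homopolymer compression does and what it throws away. The function names and the two example sequences are made up for illustration; this is not code from the thesis or from any aligner.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases down to a single base."""
    return "".join(base for base, _run in groupby(seq))


def run_lengths(seq: str) -> list[tuple[str, int]]:
    """Return (base, run length) pairs: the information compression discards."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


# Hypothetical example: the read undercounts the T run and overcounts the A run.
truth = "GATTTACAAAG"
read = "GATTACAAAAG"

# After compression the miscounted runs no longer cause any mismatch.
print(homopolymer_compress(truth))
print(homopolymer_compress(read))

# The run lengths, which differ between the two, are exactly what was dropped.
print(run_lengths(truth))
print(run_lengths(read))
```

The two compressed strings come out identical, which is why an aligner built for substitution-dominated errors copes better afterwards. The run lengths are gone, though, so a true AAA can no longer be told apart from a true A.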
Hi there, author of the linked thesis here. I think you're coming from the right place, but "the right way to handle this" depends on many factors. In the particular contexts where homopolymer compression (HPC) is typically applied (mapping and assembly), speed is a big issue. You are aligning millions of pairs of sequences, so any heuristic that can speed this up is worth taking. Empirically, HPC has been shown to improve and speed up these algorithms in practice, so I'm not sure I would discard it just because it's not the "right way to handle this".
Secondly, you might have noticed that, in the thesis, HPC is just a starting point, and what we're actually doing is exploring a sequence to sequence transformation (SSR) space. In fact HPC is not even in this space. The SSRs that we defined don't necessarily discard information either; they do so only if some dinucleotides are mapped to the empty output.
Overall, this work is pretty exploratory, and we're not claiming that our approach is the only correct way of doing mapping. That being said, if people want to develop mapping software as fast as minimap that deals with homopolymer indels at the pairwise alignment level, that'd be great, but in the meantime HPC it is.
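To make the point about dinucleotides and the empty output concrete, here is a rough Python sketch of a table-driven transformation over adjacent base pairs. It is only an illustration under my own assumptions and is not the thesis's definition of an SSR: the function name, the choice to emit the first base unchanged, the default of emitting the second base of each pair, and the example table are all hypothetical.

```python
def apply_pair_table(seq: str, table: dict[str, str]) -> str:
    """Transform seq by looking up each overlapping pair of bases in table.

    Assumptions (illustrative only): the first base is emitted unchanged,
    and a pair missing from the table emits its second base.
    """
    out = [seq[:1]]
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        out.append(table.get(pair, pair[1]))
    return "".join(out)


# With an empty table nothing maps to the empty output: the input is unchanged.
print(apply_pair_table("GAAAT", {}))

# Hypothetical table that maps the pair "AA" to the empty output.
DROP_AA = {"AA": ""}

# Runs of A of different lengths now collapse to the same output.
print(apply_pair_table("GAAAT", DROP_AA))
print(apply_pair_table("GAAAAT", DROP_AA))
```

In this sketch, information is lost only where a pair maps to the empty string. With the empty table the transformation is the identity, while with `DROP_AA` two different inputs produce the same output.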
in the thesis, HPC is just a starting point, and what we're actually doing is exploring a sequence to sequence transformation (SSR) space.
That wasn't clear to me. Apparently I should have read more closely. I stand corrected.
That being said, if people want to develop mapping software as fast as minimap that deals with homopolymer indels at the pairwise alignment level, that'd be great, but in the meantime HPC it is.
Sadly true.