About
=====

This page goes a little more in depth on the software and its goals.

Motivations
-----------

Several different factors motivated dammit's development. The first of these was the sea lamprey transcriptome project, which had annotation as a primary goal. Many of dammit's core features were already implemented there, and it seemed a shame not to share that work with others in a usable format. Related to this was a lack of workable and easy-to-use existing solutions; in particular, most are meant to be used as protocols and haven't been packaged in an automated format. Licensing was also a big concern -- software used for science should be open source, easily accessible, remixable, and free.

Implicit to these motivations is some idea of what a good annotator *should* look like, in the author's opinion:

1. It should be easy to install and upgrade
2. It should only use Free software
3. It should make use of standard databases
4. It should output in reasonable formats
5. It should be relatively fast
6. It should try to be correct, insofar as any computational approach can be "correct"
7. It should give the user some measure of confidence for its results.

The Obligatory Flowchart
~~~~~~~~~~~~~~~~~~~~~~~~

.. figure:: _static/workflow.svg
   :alt: The Workflow

   The Workflow

Software Used
-------------

- TransDecoder
- BUSCO
- HMMER
- Infernal
- LAST
- crb-blast (for now)
- pydoit (under the hood)

All of these are Free Software, as in freedom and beer.

Databases
---------

- Pfam-A
- Rfam
- OrthoDB
- BUSCO databases
- UniRef90
- User-supplied protein databases

The last one is important, and sometimes ignored.

Conditional Reciprocal Best LAST
--------------------------------

Building off Richard and co's work on Conditional Reciprocal Best BLAST, I've implemented a new version with Python and LAST -- CRBL. The original lives here: https://github.com/cboursnell/crb-blast

Why??


- BLAST is too slooooooow
- Ruby is yet another dependency to have users install
- With Python and scikit-learn, I have freedom to toy with models (and learn stuff)

And, of course, some of these databases are BIG. Doing ``blastx`` and ``tblastn`` between a reasonably sized transcriptome and UniRef90 is not an experience you want to have.

i.e., practical concerns.

A brief intro to CRBB
~~~~~~~~~~~~~~~~~~~~~

- Reciprocal Best Hits (RBH) is a standard method for ortholog detection
- Transcriptomes have multiple transcript isoforms, which confounds RBH
- CRBB uses machine learning to get at this problem

.. figure:: _static/RBH.svg

CRBB attempts to associate those isoforms with appropriate annotations by learning an appropriate e-value cutoff for different transcript lengths.

.. figure:: _static/CRBB_decision.png
   :alt: CRBB

   CRBB

   *from http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5*

CRBL
~~~~

For CRBL, instead of fitting a linear model, we train a classifier:

- SVM
- Naive Bayes

One limitation is that LAST has no equivalent to ``tblastn``. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the translated transcriptome versus database hits.
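
To make the RBH step concrete, here is a minimal sketch in plain Python. It assumes the alignment hits have already been parsed into ``(query, subject, evalue)`` tuples; that layout, the function names ``best_hits`` and ``reciprocal_best_hits``, and the toy data are all hypothetical and are not taken from dammit's code or from LAST's output format.

```python
def best_hits(hits):
    """Map each query to its lowest-e-value (subject, evalue) pair.

    `hits` is an iterable of (query, subject, evalue) tuples. This tuple
    layout is an assumption made for the example only.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def reciprocal_best_hits(forward, reverse):
    """Return {query: subject} for pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {
        q: s
        for q, (s, _) in fwd.items()
        if s in rev and rev[s][0] == q
    }


# Toy data (made up): ORFs searched against a protein database, and back.
forward = [
    ("orf1", "protA", 1e-50),
    ("orf1", "protB", 1e-10),
    ("orf2", "protA", 1e-20),  # a second isoform hitting the same protein
    ("orf3", "protC", 1e-5),
]
reverse = [
    ("protA", "orf1", 1e-48),
    ("protA", "orf2", 1e-18),
    ("protB", "orf1", 1e-9),
    ("protC", "orf4", 1e-30),
    ("protC", "orf3", 1e-4),
]

rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```

With this toy data only ``orf1`` and ``protA`` pair up. ``orf2`` also has ``protA`` as its best hit, but ``protA`` prefers ``orf1``, so plain RBH drops it. That is the isoform problem described above, and those dropped hits are the ones a learned cutoff or classifier would then get a chance to rescue.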