16 research outputs found

    Domain adaptation algorithms for biological sequence classification

    Doctor of Philosophy. Department of Computing and Information Sciences. Doina Caragea.
    The large volume of data generated in recent years has created opportunities for discovery in many fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides in a DNA or RNA fragment faster and more cheaply than earlier methods. This volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means of annotating data, but they require some of the data to be labeled manually by human experts, a process that is costly and time-consuming. One alternative to labeling data is to use existing labeled data from a related domain (the source domain), if any is available, to train a classifier for the domain of interest (the target domain). However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. Another alternative is to label a small amount of data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. This work considers a third alternative, domain adaptation, in which the goal is to train an accurate classifier for a target domain that has limited labeled data and abundant unlabeled data by leveraging labeled data from a related source domain. Several domain adaptation classifiers are proposed, derived from either a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how the unlabeled data are incorporated, and how all available data are combined. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.

    A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction


    Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

    Genome assemblies come in all qualities. Most are essentially drafts of the genome, and even the most heavily curated assemblies contain mis-assemblies, truncations, or gaps in repetitive regions. The 7x draft assembly of the Tribolium genome is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Kb BAC clones. The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of the repetitive DNA. Superscaffolds, or chromosome builds (ChLG 2-10 and X), were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring more than 90% of the assembled sequence [1] (Fig. 1).

    To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Ultra-long DNA molecules (megabase scale) were nicked on one strand with Nt.BspQI and Nt.BbvCI, labeled with fluorescent nucleotides, and repaired. Individual molecules were imaged on a massively parallel scale in nanochannels etched on silicon chips. Consensus maps assembled de novo from the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress in using these comparisons to validate the assembly in regions where the maps agree and to reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, the order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions. In the figures below, BNG consensus maps are blue and the in silico genome assembly maps are green.
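The in silico maps mentioned above are, in essence, the positions at which the nicking enzymes would label the assembly sequence. The following is a minimal Python sketch of that idea, not the Irys software itself. It assumes the recognition motifs GCTCTTC for Nt.BspQI and CCTCAGC for Nt.BbvCI (taken from general enzyme documentation rather than from this abstract; check them against the supplier's data sheet), and the function names are hypothetical.

```python
# Assumed recognition motifs (not stated in the abstract; verify before use).
ASSUMED_MOTIFS = {"Nt.BspQI": "GCTCTTC", "Nt.BbvCI": "CCTCAGC"}

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def in_silico_label_sites(sequence, motifs):
    """Return sorted 0-based start positions of every motif on either strand."""
    sequence = sequence.upper()
    sites = set()
    for motif in motifs:
        # A site on the reverse strand appears as the reverse complement
        # when scanning the forward strand.
        for pattern in {motif, reverse_complement(motif)}:
            start = sequence.find(pattern)
            while start != -1:
                sites.add(start)
                start = sequence.find(pattern, start + 1)
    return sorted(sites)


def label_intervals(sites):
    """Distances between consecutive label sites, the pattern a map compares."""
    return [b - a for a, b in zip(sites, sites[1:])]


if __name__ == "__main__":
    scaffold = "AAGCTCTTCAAAACCTCAGCTTTGAAGAGCAA"  # toy sequence, not real data
    sites = in_silico_label_sites(scaffold, ASSUMED_MOTIFS.values())
    print(sites, label_intervals(sites))
```

Comparing the interval pattern from the assembly with the interval pattern of a consensus map is what lets agreeing regions be validated and disagreeing regions be flagged for reanalysis; real optical-map aligners also model sizing error and missing or extra labels, which this sketch omits.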

    Multi-k-Mer de novo Transcriptome Assembly, Validation, and Count Summarizing for Four Plant Taxa

    Large genomes, polyploidy, and repetitive DNA are common obstacles in the assembly of plant genomes, making de novo transcriptomes valuable genomic resources. To generate high-quality de novo transcriptomes, we developed a custom assembly workflow for Illumina and 454 RNA-Seq reads. The primary components of the workflow are stringent pre-cleaning, Oases multi-k-mer assembly for Illumina reads, MIRA assembly for 454 reads, and MIRA clustering of the resulting contigs. We tested the workflow using data from four transcriptomes: two polyploid monocots and two dicots. Assemblies were validated using the number of contigs, the cumulative length of contigs, and N50 metrics. Ortholog hit ratios (OHR = length of alignment : length of protein from a close relative) were calculated to estimate assembly fragmentation. Each clustered assembly had a high N50 (2.6-3.2 kb) and a high percentage of hits with an OHR >= 0.8 (52-65%), suggesting that the workflow produced high-quality assemblies.

    For de novo transcriptomes, there are few standalone programs that summarize aligned read counts for input into edgeR or DESeq. Although popular, HTSeq allows partial-length single-end alignments but does not allow alignment of only one of two mates. We developed the custom script Count_reads_denovo.pl for de novo RNA-Seq projects. Count_reads_denovo.pl uses a model similar to that of featureCounts, a reference-based summarizer, to leverage paired-end data even where only one mate aligns. The script filters read counts by mapping quality (MAPQ). The scripts used in the above workflow, as well as Count_reads_denovo.pl, are available at https://github.com/i5K-KINBRE-script-share/.
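The two validation metrics named above can be illustrated with a short Python sketch. It uses the standard definition of N50 (the contig length at which the running sum of lengths, taken longest first, reaches half the total) and the OHR definition given in the abstract. The function names and the toy numbers are hypothetical and are not taken from the authors' scripts.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def ortholog_hit_ratio(alignment_length, ortholog_length):
    """OHR = aligned length / length of the protein from a close relative."""
    if ortholog_length <= 0:
        raise ValueError("ortholog length must be positive")
    return alignment_length / ortholog_length


def fraction_high_ohr(hits, threshold=0.8):
    """Fraction of (alignment_length, ortholog_length) hits with OHR >= threshold."""
    if not hits:
        raise ValueError("no hits given")
    passing = sum(ortholog_hit_ratio(a, p) >= threshold for a, p in hits)
    return passing / len(hits)


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # toy contig lengths, not real data
    toy_hits = [(80, 100), (50, 100), (95, 100), (100, 100)]
    print(n50(toy_lengths), fraction_high_ohr(toy_hits))
```

An OHR close to 1 means the contig covers nearly the full length of its putative ortholog, so a large fraction of hits above the 0.8 threshold points to low fragmentation.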
