Domain adaptation algorithms for biological sequence classification

Author: Herndon Nic
Publication venue: Kansas State University

Doctor of Philosophy
Department of Computing and Information Sciences
Doina Caragea

The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and more cheaply than before. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data, but they require some of the data to be manually labeled by human experts, a process that is costly and time consuming. One alternative to labeling data is to use existing labeled data from a related domain (the source domain), if any such data is available, to train a classifier for the domain of interest (the target domain). However, classification accuracy on the target domain usually decreases as the distance between the source and target domains increases. A second alternative is to label some data, complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. This work considers a third alternative, domain adaptation, in which the goal is to train an accurate classifier for a target domain with limited labeled data and abundant unlabeled data by leveraging labeled data from a related source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how the unlabeled data is incorporated, and how all available data is combined. The proposed approaches were evaluated on two biological problems: protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.

K-State Research Exchange

Recommended from our members

AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture

Author: Andorf Carson
Arnaud Elizabeth
Berardini Tanya Z.
Birkett Clayton
Campbell Jacqueline
Cannon Ethalinda K. S.
Cannon Steve
Carson James
Condon Bradford
Cooper Laurel
Dunn Nathan
Elsik Christine G.
Farmer Andrew
Ficklin Stephen P.
Grant David
Grau Emily
Harper Lisa
Herndon Nic
Hu Zhi-Liang
Humann Jodi
Jaiswal Pankaj
Jonquet Clement
Jung Sook
Laporte Marie-Angelique
Larmande Pierre
Lazo Gerard
Main Doreen
McCarthy Fiona
Menda Naama
Mungall Christopher J.
Munoz-Torres Monica C.
Naithani Sushma
Nelson Rex
Nesdill Daureen
Park Carissa
Poelchau Monica
Reecy James
Reiser Leonore
Sanderson Lacey-Anne
Sen Taner Z.
Staton Margaret
Subramaniam Sabarinath
Tello-Ruiz Marcela Karey
Unda Victor
Unni Deepak
Walls Ramona
Wang Liya
Ware Doreen
Wegrzyn Jill
Williams Jason
Woodhouse Margaret
Yu Jing

The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.

ScholarsArchive@OSU

A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction

Author: Doina Caragea
Nic Herndon
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Author: Jennifer Shelton (555802)
Nic Herndon (737599)
Susan J. Brown (19392)
Warren Andrews (738016)
Weiping Wang (210831)

Genome assemblies come in all qualities. Most are basically drafts of the genome, but even the most heavily curated assemblies contain mis-assemblies and truncations or gaps in repetitive regions. The 7x draft assembly of the Tribolium genome is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Mb BAC clones. The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of repetitive DNA. Superscaffolds or chromosome builds (ChLG 2-10 and X) were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring greater than 90% of the assembled sequence [1] (Fig. 1). To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Ultra-long molecules (Mb) were nicked on one strand with Nt.BspQI and Nt.BbvCI, labeled with fluorescent nucleotides, and repaired. Individual molecules were imaged on a massively parallel scale in nanochannels etched on silicon chips. Consensus maps assembled de novo from the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress on using these comparisons to validate the assembly in regions where they agree and to reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, the order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions. In the figures below, BNG consensus maps are blue and the in silico genome assembly maps are green.

FigShare
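
The in silico maps mentioned in the abstract above can be pictured with a small sketch. It is not the authors' pipeline or the Irys software: it simply scans an assembly sequence for nickase recognition motifs on both strands and reports the label positions. The motif strings (GCTCTTC for Nt.BspQI, CCTCAGC for Nt.BbvCI) and the function names are assumptions made for illustration and should be checked against the enzyme supplier's documentation.

```python
# Hypothetical sketch: derive in silico label positions from a sequence.
# The recognition motifs below are assumptions, not taken from the abstract.
ASSUMED_MOTIFS = {"Nt.BspQI": "GCTCTTC", "Nt.BbvCI": "CCTCAGC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_labels(sequence, motifs=None):
    """Return sorted 0-based start positions of every motif occurrence.

    Both strands are searched by also matching each motif's reverse
    complement against the forward sequence. Overlapping hits are kept.
    """
    if motifs is None:
        motifs = ASSUMED_MOTIFS.values()
    seq = sequence.upper()
    positions = set()
    for motif in motifs:
        for pattern in (motif, reverse_complement(motif)):
            start = seq.find(pattern)
            while start != -1:
                positions.add(start)
                start = seq.find(pattern, start + 1)
    return sorted(positions)


def label_spacings(positions):
    """Distances between consecutive labels, the quantity a map comparison uses."""
    return [b - a for a, b in zip(positions, positions[1:])]
```

Comparing the spacing pattern from such a map with the spacing pattern of a consensus map built from imaged molecules is what allows agreement or disagreement with the assembly to be detected; the alignment step itself is not sketched here.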

Multi-k-Mer de novo Transcriptome Assembly, Validation, and Count Summarizing for Four Plant Taxa

Author: Jennifer Shelton (555802)
Nic Herndon (737599)
Miranda Gray (737597)
Hanquan Liang (737602)
Timothy Durrett (738017)
Loretta Johnson (738018)
Alina Akhunova (737600)
Susan J Brown (6214)
Publication date: 22/11/2012

Large genomes, polyploidy, and repetitive DNA are common obstacles in the assembly of plant genomes, making de novo transcriptomes valuable genomic resources. To generate high quality de novo transcriptomes, we developed a custom assembly workflow for Illumina and 454 RNA-Seq reads. The primary components of the workflow include stringent pre-cleaning; Oases multi-k-mer assembly for Illumina reads; MIRA assembly for 454 reads; and MIRA to cluster the resulting contigs. We tested the workflow using data from four transcriptomes, two polyploid monocots and two dicots. Assemblies were validated using number of contigs, cumulative length of contigs, and N50 metrics. Ortholog hit ratios (OHR = length of alignment : length of proteins from a close relative) were calculated to estimate assembly fragmentation. Each clustered assembly had a high N50 (2.6-3.2 kb) and a high percentage of hits with an OHR >= 0.8 (52-65%), suggesting the workflow produced high quality assemblies. For de novo transcriptomes, there are few standalone programs that summarize aligned read counts for input into edgeR or DESeq. Although popular, HTSeq allows partial length single-end alignments but does not allow alignment of one out of two mates. We developed the custom script Count_reads_denovo.pl for de novo RNA-Seq projects. Count_reads_denovo.pl uses a model similar to featureCounts, a reference-based summarizer, to leverage paired-end data even where only one mate aligns. The script filters read counts by mapping quality (MAPQ). Scripts used in the above workflow, as well as Count_reads_denovo.pl, are available at https://github.com/i5K-KINBRE-script-share/.

FigShare
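
The two validation metrics named in the abstract above, N50 and the ortholog hit ratio, can be illustrated with a minimal sketch. This is not the authors' workflow code: the function names are hypothetical, N50 follows the usual definition (the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the total), and OHR follows the ratio given in the abstract, with both lengths assumed to be in the same units.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the total
    assembled length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def ortholog_hit_ratio(alignment_length, protein_length):
    """OHR as described in the abstract: alignment length over the length
    of the matching protein from a close relative (same units assumed)."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return alignment_length / protein_length


def fraction_with_high_ohr(ratios, threshold=0.8):
    """Fraction of hits whose OHR meets the threshold (0.8 in the abstract)."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("ratios must not be empty")
    return sum(1 for r in ratios if r >= threshold) / len(ratios)
```

For example, `n50([2, 3, 4, 5, 6])` walks the lengths 6, then 5, at which point the running total of 11 covers at least half of the 20 total bases, so the N50 is 5. A higher fraction of hits at or above the OHR threshold indicates a less fragmented assembly.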