COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge
The advent of next-generation sequencing (NGS) technologies enables
researchers to sequence complex microbial communities directly from
the environment. Since assembly typically produces only genome fragments, also
known as contigs, rather than entire genomes, it is crucial to group them into
operational taxonomic units (OTUs) for further taxonomic profiling and
downstream functional analysis. OTU clustering is also referred to as
binning. We present COCACOLA, a general framework that automatically bins
contigs into OTUs based upon sequence composition and coverage across multiple
samples.
The effectiveness of COCACOLA is demonstrated in both simulated and real
datasets in comparison to state-of-the-art binning approaches such as CONCOCT,
GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two
aspects. One is employing L1 distance instead of Euclidean distance for
better taxonomic identification during initialization. More importantly,
COCACOLA takes advantage of both hard clustering and soft clustering by
sparsity regularization.
In addition, the COCACOLA framework seamlessly embraces customized knowledge
to facilitate binning accuracy. In our study, we have investigated two types of
additional knowledge, the co-alignment to reference genomes and linkage of
contigs provided by paired-end reads, as well as the ensemble of both. We find
that both co-alignment and linkage information further improve binning in the
majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin
and MetaBAT.
The software is available at https://github.com/younglululu/COCACOLA
Comment: 15 pages, 5 figures, 1 table; paper accepted at the RECOMB-Seq 201
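As a rough illustration of the initialization idea only (not the authors' implementation): each contig is reduced to a feature vector of normalized k-mer composition concatenated with per-sample coverage, and clusters are seeded with a Manhattan-distance k-means variant (k-medians), since the L1 minimizer of a cluster is the coordinate-wise median. All data and names below are hypothetical.

```python
# Illustrative sketch of L1-distance clustering of contig feature vectors.
# Each vector = [composition features..., per-sample coverage...]; toy data.

def l1(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def kmedians(points, centers, iters=10):
    """k-means variant for L1: assign by Manhattan distance, update each
    center coordinate-wise with the median (the L1 minimizer)."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: l1(p, centers[j]))
            groups[j].append(p)
        centers = [
            [median([p[d] for p in g]) for d in range(len(c))] if g else c
            for g, c in zip(groups, centers)
        ]
    labels = [min(range(len(centers)), key=lambda j: l1(p, centers[j]))
              for p in points]
    return labels, centers

# Toy example: two "bins" separated in composition+coverage space.
contigs = [[0.1, 0.9, 5.0], [0.2, 0.8, 5.5], [0.8, 0.1, 20.0], [0.9, 0.2, 19.0]]
labels, _ = kmedians(contigs, centers=[contigs[0], contigs[2]])
print(labels)  # → [0, 0, 1, 1]
```

The sketch omits everything that makes COCACOLA distinctive (the soft/hard clustering with sparsity regularization, co-alignment and linkage information); it only shows why an L1 metric is a drop-in replacement for Euclidean distance at initialization.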
Information Extraction from Scientific Literature for Method Recommendation
As a research community grows, more and more papers are published each year.
As a result, there is increasing demand for improved methods for finding
relevant papers, automatically understanding the key ideas and recommending
potential methods for a target problem. Despite advances in search engines, it
is still hard to identify new technologies according to a researcher's need.
Due to the large variety of domains and extremely limited annotated resources,
there has been relatively little work on leveraging natural language processing
in scientific recommendation. In this proposal, we aim at making scientific
recommendations by extracting scientific terms from a large collection of
scientific papers and organizing the terms into a knowledge graph. In
preliminary work, we trained a scientific term extractor using a small amount
of annotated data and obtained state-of-the-art performance by leveraging a
large amount of unannotated papers via multiple semi-supervised
approaches. We propose to construct a knowledge graph in a way that can make
minimal use of hand annotated data, using only the extracted terms,
unsupervised relational signals such as co-occurrence, and structural external
resources such as Wikipedia. Latent relations between scientific terms can be
learned from the graph. Recommendations will be made through graph inference
for both observed and unobserved relational pairs.
Comment: Thesis Proposal. arXiv admin note: text overlap with arXiv:1708.0607
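One of the unsupervised relational signals the proposal mentions, term co-occurrence, can be sketched as follows. Term extraction itself is assumed already done; each paper is represented here by a hypothetical set of extracted terms, and edge weights count shared papers.

```python
# Illustrative sketch: build a term co-occurrence graph from extracted
# scientific terms. Papers and terms below are made-up examples.
from collections import Counter
from itertools import combinations

papers = [
    {"attention", "neural network", "translation"},
    {"attention", "neural network", "image captioning"},
    {"hidden markov model", "sequence labeling"},
]

# Edge weight = number of papers in which two terms co-occur.
edges = Counter()
for terms in papers:
    for a, b in combinations(sorted(terms), 2):
        edges[(a, b)] += 1

# Frequently co-occurring terms become candidate related methods/problems.
print(edges[("attention", "neural network")])  # → 2
```

In the proposed system these raw counts would be only one input to the knowledge graph, alongside structural resources such as Wikipedia; latent relations are then learned over the graph rather than read off the counts directly.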
Machine learning for metagenomics: methods and tools
Owing to the complexity and variability of metagenomic studies, modern
machine learning approaches have seen increased usage to answer a variety of
questions encompassing the full range of metagenomic NGS data analysis. We
review here the contribution of machine learning techniques for the field of
metagenomics, by presenting known successful approaches in a unified framework.
This review focuses on five important metagenomic problems: OTU-clustering,
binning, taxonomic profiling and assignment, comparative metagenomics and gene
prediction. For each of these problems, we identify the most prominent methods,
summarize the machine learning approaches used and put them into perspective of
similar methods. We conclude our review by looking further ahead at the challenge
posed by the analysis of interactions within microbial communities and
different environments, in a field one could call "integrative metagenomics".
A Joint Identification Approach for Argumentative Writing Revisions
Prior work on revision identification typically uses a pipeline method:
revision extraction is first conducted to identify the locations of revisions
and revision classification is then conducted on the identified revisions. Such
a setting propagates the errors of the revision extraction step to the revision
classification step. This paper proposes an approach that identifies the
revision location and the revision type jointly to solve the issue of error
propagation. It utilizes a sequence representation of revisions and conducts
sequence labeling for revision identification. A mutation-based approach is
utilized to update identification sequences. Results demonstrate that our
proposed approach yields better performance on both revision location
extraction and revision type classification compared to a pipeline baseline.
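The mutation-based search over identification sequences can be caricatured like this: a whole labeling (location and type jointly) is mutated at random positions and a candidate is kept when a scoring model prefers it. The scorer below is a stand-in for illustration only; the paper's model is learned from data, and all names here are hypothetical.

```python
# Hypothetical sketch of mutation-based updating of a joint label sequence.
import random

LABELS = ["NoRevision", "Surface", "Claim", "Evidence"]

def score(labeling, gold):
    # Stand-in scorer: agreement with a fixed target labeling. In the paper
    # this would be a learned model scoring the whole sequence.
    return sum(a == b for a, b in zip(labeling, gold))

def mutate(labeling, rng):
    new = labeling[:]
    new[rng.randrange(len(new))] = rng.choice(LABELS)
    return new

def search(n, gold, steps=500, seed=0):
    """Hill-climb over whole labelings: keep a mutant iff it scores no worse."""
    rng = random.Random(seed)
    cur = ["NoRevision"] * n
    for _ in range(steps):
        cand = mutate(cur, rng)
        if score(cand, gold) >= score(cur, gold):
            cur = cand
    return cur

gold = ["Surface", "NoRevision", "Claim", "Claim"]
print(search(4, gold))  # converges toward the target labeling
```

Because location and type live in one sequence, a single accepted mutation can simultaneously move a revision boundary and change its type, which is exactly the error-propagation problem the joint formulation avoids.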
An Attentive Survey of Attention Models
The attention model has now become an important concept in neural networks,
one researched within diverse application domains. This survey provides a
structured and comprehensive overview of the developments in modeling
attention. In particular, we propose a taxonomy which groups existing
techniques into coherent categories. We review salient neural architectures in
which attention has been incorporated, and discuss applications in which
modeling attention has shown a significant impact. We also describe how
attention has been used to improve the interpretability of neural networks.
Finally, we discuss some future research directions in attention. We hope this
survey will provide a succinct introduction to attention models and guide
practitioners while developing approaches for their applications.
Comment: accepted to Transactions on Intelligent Systems and Technology (TIST);
33 pages
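Most of the surveyed architectures share one core computation: scores between a query and a set of keys, a softmax over those scores, and a weighted sum of values. A generic, dependency-free sketch (specific models differ mainly in how the scores are produced):

```python
# Minimal generic attention: softmax(query·keys) weighting of values.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(query, keys, values):
    weights = softmax([dot(query, k) for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

# The query matches the first key, so the output leans toward its value.
out = attend([1.0, 0.0], keys=[[1.0, 0.0], [0.0, 1.0]], values=[[10.0], [0.0]])
print(round(out[0], 2))  # → 7.31
```

The interpretability benefit the survey discusses comes directly from the `weights` vector: it is an explicit, inspectable distribution over which inputs influenced the output.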
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMMs, which can only model very short-range residue
correlations, MRFs can model long-range residue interaction patterns and thus
encode information for the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection shall be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/.
Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology
PopIns: population-scale detection of novel sequence insertions
The detection of genomic structural variation (SV) has advanced tremendously
in recent years due to progress in high-throughput sequencing technologies.
Novel sequence insertions, insertions without similarity to a human reference
genome, have received less attention than other types of SVs due to the
computational challenges in their detection from short read sequencing data,
which inherently involves de novo assembly. De novo assembly is not only
computationally challenging, but also requires high-quality data. While the
reads from a single individual may not always meet this requirement, using
reads from multiple individuals can increase power to detect novel insertions.
We have developed the program PopIns, which can discover and characterize
non-reference insertions of 100 bp or longer on a population scale. In this
paper, we describe the approach we implemented in PopIns. It takes as input a
reads-to-reference alignment, assembles unaligned reads using a standard
assembly tool, merges the contigs of different individuals into high-confidence
sequences, anchors the merged sequences into the reference genome, and finally
genotypes all individuals for the discovered insertions. Our tests on simulated
data indicate that the merging step greatly improves the quality and
reliability of predicted insertions and that PopIns shows significantly better
recall and precision than the recent tool MindTheGap. Preliminary results on a
data set of 305 Icelanders demonstrate the practicality of the new approach.
The source code of PopIns is available from http://github.com/bkehr/popins.
Comment: Presented at RECOMB-SEQ 201
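The merging step can be caricatured as greedy joining of contigs from different individuals that share a long exact overlap; PopIns' actual merging is considerably more careful (and works on real assemblies, not toy strings). The function names and the overlap threshold below are purely illustrative.

```python
# Toy sketch of cross-individual contig merging by exact suffix/prefix overlap.

def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that is a prefix of b,
    if at least min_len long; otherwise 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge_pair(a, b, min_len=4):
    """Merge two contigs into one longer sequence, or None if no overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Two individuals' contigs covering the same novel insertion:
print(merge_pair("ACGTTGCA", "TGCATTAG"))  # → ACGTTGCATTAG
```

The point of merging across individuals, as the abstract explains, is that each single individual's reads may be too sparse for reliable de novo assembly, while the union of supporting contigs yields high-confidence insertion sequences.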
Regulatory motif discovery using a population clustering evolutionary algorithm
This paper describes a novel evolutionary algorithm for regulatory motif
discovery in DNA promoter sequences. The algorithm uses data clustering to
logically distribute the evolving population across the search space. Mating
then takes place within local regions of the population, promoting overall
solution diversity and encouraging discovery of multiple solutions.
Experiments using synthetic data sets have demonstrated the algorithm's
capacity to find position frequency matrix models of known regulatory motifs
in relatively long promoter sequences. These experiments have also shown the
algorithm's ability to maintain diversity during search and discover multiple
motifs within a single population. The utility of the algorithm for
discovering motifs in real biological data is demonstrated by its ability to
find meaningful motifs within muscle-specific regulatory sequences.
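The central mechanism (mating restricted to clusters of the population so that multiple solutions survive) can be shown on a toy problem rather than on motifs: here the "genome" is one real number, fitness has two equal peaks, and clustering is a crude split by sign. Everything below is a hypothetical illustration, not the paper's algorithm.

```python
# Toy niching EA: cluster the population, mate only within clusters,
# and apply elitist selection within each niche so both peaks survive.
import random

def fitness(x):
    return max(1 - abs(x - 2), 1 - abs(x + 2))  # equal optima at x = -2, +2

def evolve(pop, gens=40, seed=1):
    rng = random.Random(seed)
    for _ in range(gens):
        # crude "clustering": split by sign; the paper uses data clustering
        clusters = [[x for x in pop if x < 0], [x for x in pop if x >= 0]]
        next_pop = []
        for c in clusters:
            kids = [(rng.choice(c) + rng.choice(c)) / 2 + rng.gauss(0, 0.2)
                    for _ in range(2 * len(c))]
            # elitist selection within the niche keeps both peaks alive
            next_pop += sorted(c + kids, key=fitness, reverse=True)[:len(c)]
        pop = next_pop
    return pop

final = evolve([-3.0, -1.5, -2.5, 1.0, 3.5, 2.2])
print(sorted(round(x, 1) for x in final))  # members settle near both -2 and +2
```

A panmictic EA on the same problem typically collapses onto one peak; restricting mating to local regions is what lets the paper's algorithm report multiple motifs from a single population.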
Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences
as queries to search for functionally related enzymes in online databases. To
this end, one usually departs from some notion of similarity, comparing two
enzymes by looking for correspondences in their sequences, structures or
surfaces. For a given query, the search operation results in a ranking of the
enzymes in the database, from very similar to dissimilar enzymes, while
information about the biological function of annotated database enzymes is
ignored.
In this work we show that rankings of that kind can be substantially improved
by applying kernel-based learning algorithms. This approach enables the
detection of statistical dependencies between similarities of the active cleft
and the biological function of annotated enzymes. This is in contrast to
search-based approaches, which do not take annotated training data into
account. Similarity measures based on the active cleft are known to outperform
sequence-based or structure-based measures under certain conditions. We
consider the Enzyme Commission (EC) classification hierarchy for obtaining
annotated enzymes during the training phase. The results of a set of sizeable
experiments indicate a consistent and significant improvement for a set of
similarity measures that exploit information about small cavities in the
surface of enzymes.
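The learning-to-rank idea, stripped of the paper's kernel machinery, can be sketched with a linear pairwise ranker: training pairs say which of two database enzymes should rank higher for a query (e.g., the one sharing the query's EC class), and a scorer is adjusted until it respects those preferences. The feature names and data below are invented for illustration.

```python
# Pairwise perceptron ranker: a stand-in for the kernel-based method.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_ranker(pairs, dims, epochs=20, lr=0.1):
    """Each pair = (features of the enzyme that SHOULD rank higher,
    features of the one that should rank lower) for some query."""
    w = [0.0] * dims
    for _ in range(epochs):
        for hi, lo in pairs:
            if dot(w, hi) <= dot(w, lo):  # preference violated: update
                w = [wi + lr * (h - l) for wi, h, l in zip(w, hi, lo)]
    return w

# Hypothetical features: (active-cleft similarity, sequence similarity).
pairs = [([0.9, 0.2], [0.3, 0.8]), ([0.8, 0.1], [0.2, 0.9])]
w = train_ranker(pairs, dims=2)
print(dot(w, [0.9, 0.2]) > dot(w, [0.3, 0.8]))  # → True
```

This captures the abstract's contrast with pure search: the ranker exploits annotated training data (here, the preference pairs derived from EC classes), whereas a raw similarity search would rank by one fixed measure and ignore annotations entirely.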
Recognizing Partial Biometric Patterns
Biometric recognition of partially captured targets is challenging, as only
a few partial observations of an object are available for matching. In this
area, deep learning based methods are widely applied to match partially
captured objects, which arise from occlusions, variations in posture or the
object being partially out of view, in person re-identification and partial
face recognition. However, most current methods are unable to identify an
individual when some parts of the object are unavailable, while the rest are
specialized to certain constrained scenarios. To this end, we propose a robust
general
framework for arbitrary biometric matching scenarios without the limitations of
alignment as well as the size of inputs. We introduce a feature post-processing
step to handle the feature maps from FCN and a dictionary learning based
Spatial Feature Reconstruction (SFR) to match different sized feature maps in
this work. Moreover, the batch hard triplet loss function is applied to
optimize the model. The applicability and effectiveness of the proposed method
are demonstrated by the results from experiments on three person
re-identification datasets (Market1501, CUHK03, DukeMTMC-reID), two partial
person datasets (Partial REID and Partial iLIDS) and two partial face datasets
(CASIA-NIR-Distance and Partial LFW), on which the proposed method achieves
state-of-the-art performance in comparison with several existing approaches.
The code is
released online and can be found on the website:
https://github.com/lingxiao-he/Partial-Person-ReID.
Comment: 13 pages, 11 figures