Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is possible by employing a new
coordinate-descent algorithm coupled with bounding the magnitude of the
gradient for selecting discriminative subsequences fast. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem.
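
The selection step described above (pick the subsequence whose gradient is largest, then update only its weight) can be illustrated with a minimal sketch. It assumes logistic loss, labels in {-1, +1}, a fixed step size, and exhaustive enumeration of short substrings; the gradient bound that the abstract relies on for fast search is not reproduced here, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def kmer_features(seq, max_len=3):
    """Return the set of all substrings of seq with length 1..max_len."""
    return {seq[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(len(seq) - k + 1)}

def train(seqs, labels, iters=20, step=0.5, max_len=3):
    """Greedy coordinate descent on logistic loss (assumed simplification)."""
    feats = [kmer_features(s, max_len) for s in seqs]
    weights = defaultdict(float)
    for _ in range(iters):
        grad = defaultdict(float)
        for f, y in zip(feats, labels):
            # margin = y * (sum of weights of subsequences present in the sequence)
            margin = y * sum(weights[s] for s in f if s in weights)
            # derivative of log(1 + exp(-margin)) w.r.t. each present feature
            g = -y / (1.0 + math.exp(margin))
            for s in f:
                grad[s] += g
        # coordinate with the largest gradient magnitude
        best = max(grad, key=lambda s: abs(grad[s]))
        weights[best] -= step * grad[best]
    return dict(weights)

def score(weights, seq, max_len=3):
    """Sum the weights of the learned subsequences that occur in seq."""
    return sum(weights.get(s, 0.0) for s in kmer_features(seq, max_len))

if __name__ == "__main__":
    model = train(["AAB", "AAC", "BBC", "CBB"], [1, 1, -1, -1])
    for sub, w in sorted(model.items()):
        print(sub, round(w, 3))
```

The returned model is just a dictionary of weighted subsequences, which mirrors the interpretability point made in the abstract.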
Recruiting from the network: discovering Twitter users who can help combat Zika epidemics
Tropical diseases like Chikungunya and Zika have come to
prominence in recent years as the cause of serious, long-lasting,
population-wide health problems. In large countries like Brazil, traditional
disease prevention programs led by health authorities have not been
particularly effective. We explore the hypothesis that monitoring and analysis
of social media content streams may effectively complement such efforts.
Specifically, we aim to identify selected members of the public who are likely
to be sensitive to virus combat initiatives that are organised in local
communities. Focusing on Twitter and on the topic of Zika, our approach
involves (i) training a classifier to select topic-relevant tweets from the
Twitter feed, and (ii) discovering the top users who are actively posting
relevant content about the topic. We may then recommend these users as the
prime candidates for direct engagement within their community. In this short
paper we describe our analytical approach and prototype architecture, discuss
the challenges of dealing with noisy and sparse signal, and present encouraging
preliminary results.
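
The two steps, (i) filtering topic-relevant tweets and (ii) ranking the users who post them, can be sketched as follows. This is only an illustration: the keyword test stands in for the trained classifier mentioned in the abstract, and the keyword set, function names, and sample tweets are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword set; a stand-in for the trained relevance classifier.
KEYWORDS = {"zika", "mosquito", "aedes"}

def is_relevant(text):
    """Crude relevance test: does any token match a topic keyword?"""
    tokens = {t.strip(".,!?#@").lower() for t in text.split()}
    return bool(tokens & KEYWORDS)

def top_users(tweets, n=2):
    """Rank users by how many relevant tweets they posted."""
    counts = Counter(user for user, text in tweets if is_relevant(text))
    return [user for user, _ in counts.most_common(n)]

if __name__ == "__main__":
    sample = [
        ("ana", "Zika cases rising"),
        ("ana", "Remove standing water, stop the mosquito!"),
        ("bob", "Great football match"),
        ("carla", "#zika prevention event today"),
        ("bob", "Lunch time"),
    ]
    print(top_users(sample))
```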
TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these have not so far been linked to general-purpose, signature-file-based
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state-of-the-art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings. From a theoretical perspective, they place the file
signature model in the class of Vector Space retrieval models.
Comment: 12 pages, 8 figures, CIKM 201
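
To make the idea of a fixed-width document signature concrete, here is a minimal sketch of one generic construction: each term maps to a pseudo-random +1/-1 vector, the vectors are summed over the document, and the sign of each component gives one signature bit. This is an assumed illustration of signature-style hashing in general, not the TopSig construction itself, which the abstract does not spell out; the 64-bit width and all names are hypothetical.

```python
import hashlib

BITS = 64  # assumed signature width for this sketch

def term_vector(term):
    """Derive a deterministic pseudo-random +1/-1 vector from a term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()  # 256 bits available
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1 for i in range(BITS)]

def signature(text):
    """Sum the term vectors and keep the sign of each component as one bit."""
    totals = [0] * BITS
    for term in text.lower().split():
        for i, v in enumerate(term_vector(term)):
            totals[i] += v
    sig = 0
    for i, t in enumerate(totals):
        if t > 0:
            sig |= 1 << i
    return sig

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    d1 = signature("file signatures for text retrieval")
    d2 = signature("inverted files for text retrieval")
    print(hamming(d1, d2))
```

Documents are then compared by Hamming distance between signatures rather than by looking up postings in an inverted file.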