877 research outputs found
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is possible by employing a new
coordinate-descent algorithm coupled with bounding the magnitude of the
gradient for selecting discriminative subsequences fast. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem.
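The selection idea in the abstract above can be illustrated with a toy sketch. It assumes binary "substring present" features, logistic loss, and a fixed step size. It evaluates the gradient of every candidate substring by brute force, so it omits the gradient bounding that the abstract credits for fast selection. The names `train`, `predict`, `max_len`, `iters` and `step` are hypothetical and not taken from the paper.

```python
import math

def substrings(seq, max_len):
    """All contiguous substrings of seq with length 1..max_len."""
    return {seq[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(len(seq) - k + 1)}

def train(seqs, labels, max_len=3, iters=30, step=0.5):
    """Greedy coordinate descent on logistic loss; labels are +1 or -1."""
    feats = [substrings(s, max_len) for s in seqs]
    vocab = sorted(set().union(*feats))   # sorted for deterministic tie-breaking
    weights = {}                          # sparse model: substring -> weight
    margins = [0.0] * len(seqs)           # y_i * score_i for each training sequence

    for _ in range(iters):
        # sigma(-margin_i): how strongly each example still pulls on the loss
        pull = [1.0 / (1.0 + math.exp(m)) for m in margins]

        def grad(f):
            # Partial derivative of the total logistic loss w.r.t. the weight of f
            return sum(-y * p for y, p, fs in zip(labels, pull, feats) if f in fs)

        # Brute-force choice of the coordinate with the largest gradient magnitude
        best = max(vocab, key=lambda f: abs(grad(f)))
        g = grad(best)
        weights[best] = weights.get(best, 0.0) - step * g
        for i, (y, fs) in enumerate(zip(labels, feats)):
            if best in fs:
                margins[i] -= step * g * y
    return weights

def predict(weights, seq):
    """Sign of the summed weights of the model substrings found in seq."""
    score = sum(w for f, w in weights.items() if f in seq)
    return 1 if score > 0 else -1

if __name__ == "__main__":
    seqs = ["ACGGT", "TGGA", "ACAT", "TCAA"]
    labels = [1, 1, -1, -1]
    model = train(seqs, labels)
    print(model)  # a short list of weighted substrings
```

The trained model is a dictionary of weighted substrings, which mirrors the interpretability point made in the abstract.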
Cursive script recognition using wildcards and multiple experts
Variability in handwriting styles suggests that many letter recognition engines cannot correctly identify some hand-written letters of poor quality at reasonable computational cost. Methods that are capable of searching the resulting sparse graph of letter candidates are therefore required. The method presented here employs "wildcards" to represent missing letter candidates. Multiple experts are used to represent different aspects of handwriting. Each expert evaluates closeness of match and indicates its confidence. Explanation experts determine the degree to which the word alternative under consideration explains extraneous letter candidates. Schemata for normalisation and combination of scores are investigated and their performance compared. Hill climbing yields near-optimal combination weights that outperform comparable methods on identical dynamic handwriting data.
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
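The z-reverse-safe definition above can be checked by brute force on very small inputs. The sketch below assumes that the "answer" to a pattern matching query is the number of occurrences of the pattern, which the abstract does not specify. It enumerates every text of the same length and counts those that give identical answers for all patterns of length at most d. The function names are hypothetical, and the exponential enumeration is only an illustration of the definition, not the construction algorithm the abstract describes.

```python
from collections import Counter
from itertools import product

def answers(text, d):
    """Occurrence count of every substring of length 1..d (assumed query model)."""
    return Counter(text[i:i + k]
                   for k in range(1, d + 1)
                   for i in range(len(text) - k + 1))

def consistent_texts(text, d):
    """Number of texts of the same length that yield exactly the same answers."""
    alphabet = sorted(set(text))
    target = answers(text, d)
    return sum(1
               for cand in product(alphabet, repeat=len(text))
               if answers("".join(cand), d) == target)

def is_reverse_safe(text, d, z):
    """True if at least z texts are consistent with the stored answers."""
    return consistent_texts(text, d) >= z

if __name__ == "__main__":
    for d in (1, 2):
        print(d, consistent_texts("abab", d))
```

Under this assumption, a larger d makes more queries answerable but leaves fewer consistent texts, which is the trade-off that a maximal d for a given z captures.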
Novel Methodologies for Pattern Recognition of Charged Particle Trajectories in the ATLAS Detector
By 2029, the Large Hadron Collider will enter its High Luminosity phase (HL-LHC) in order to achieve an unprecedented capacity for discovery. As this phase is entered, it is essential for many physics analyses that the efficiency of the reconstruction of charged particle trajectories in the ATLAS detector is maintained. With levels of pile-up expected to reach ⟨μ⟩ = 200, the number of track candidates that must be processed will increase exponentially in the current pattern matching regime. In this thesis, a novel method for charged particle pattern recognition is developed based on the popular computer vision technique known as the Hough Transform (HT). Our method differs from previous attempts to use the HT for tracking in its data-driven choice of track parameterisation using Principal Component Analysis (PCA), and the division of the detector space into very narrow tunnels known as sectors. This results in well-separated Hough images across the layers of the detector and relatively little noise from pile-up. Additionally, we show that the memory requirements for a pattern-based track finding algorithm can be reduced by approximately a factor of 5 through a two-stage compression process, without sacrificing any significant track finding efficiency. The new tracking algorithm is compared with an existing pattern matching algorithm, which consists of matching detector hits to a collection of pre-defined patterns of hits generated from simulated muon tracks. The performance of our algorithm is shown to achieve similar track finding efficiency while reducing the number of track candidates per event.
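For readers unfamiliar with the Hough Transform mentioned above, here is a minimal sketch of the textbook version for straight lines in a plane, using the parameterisation rho = x cos(theta) + y sin(theta). It is not the PCA-based track parameterisation or the sector scheme of the thesis. The function name, the bin sizes and the sample points are all assumptions made for illustration.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Vote in (theta index, rho bin) space for lines through each point."""
    accumulator = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            accumulator[(t, round(rho / rho_step))] += 1
    return accumulator

if __name__ == "__main__":
    # Four hits on the vertical line x = 3, plus two unrelated hits
    hits = [(3, 0), (3, 2), (3, 5), (3, 9), (7, 1), (0, 6)]
    acc = hough_lines(hits)
    (t, rho_bin), votes = acc.most_common(1)[0]
    print("theta index:", t, "rho bin:", rho_bin, "votes:", votes)
```

Each hit votes for every line that could pass through it, so hits lying on a common line pile up in one accumulator bin while unrelated hits spread their votes thinly.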
Functional classification of G-Protein coupled receptors, based on their specific ligand coupling patterns
Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain as orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating ligand specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs to their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross validation test. The presented machine learning approach bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.
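The idea of a fixed-length feature vector based on the existence of patterns can be sketched as follows. The abstract does not say how the ligand-specific patterns are found or encoded, so this example simply assumes a given list of literal motifs and marks each as present (1) or absent (0). The motif list, the sequences and the function name are hypothetical, and the SVM step is left out.

```python
def pattern_features(sequence, patterns):
    """One binary entry per pattern: 1 if it occurs in the sequence, else 0."""
    return [1 if p in sequence else 0 for p in patterns]

if __name__ == "__main__":
    # Hypothetical motifs and toy sequences, not taken from the study
    motifs = ["DRY", "NPLIY", "WLP"]
    for seq in ["MDRYLAVNPLIYT", "MKWLPFA"]:
        print(seq, pattern_features(seq, motifs))
```

Every sequence maps to a vector of the same length regardless of its own length, which is what lets a standard classifier such as an SVM be trained on it.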