
    Various Sequence Classification Mechanisms for Knowledge Discovery

    Sequence classification is an important task in data mining. Knowledge obtained in the training stage is used to assign class labels to new sequences. Relevant patterns can be found with sequential pattern mining, in which values are represented in sequential order. Conventional classification relies on explicit features, but sequences have no such features. Feature selection techniques are sophisticated, yet the dimensionality of the potential feature space may be very high, and it is hard to capture the sequential nature of a feature. Sequence classification is therefore more challenging than feature-vector classification. The problem can be solved with rules built from interesting patterns, which are found in datasets of sequences labeled with classes. The interestingness of a pattern in a given class of sequences is measured by combining its cohesion and its support. Confident classification rules can then be generated from the discovered patterns. Two approaches are used to build a classifier. The first is an advanced form of classification based on association rules. In the second, the value belonging to the new data object is first measured, then the rules are ranked
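
    The abstract does not define support or cohesion. As a rough illustration of the support half only, the sketch below assumes the common definition: the fraction of sequences in a class that contain the pattern as a (not necessarily contiguous) subsequence. The function names and toy data are hypothetical, and cohesion is left out because its formula is not given here.

```python
def contains_subsequence(sequence, pattern):
    """Return True if the items of `pattern` occur in `sequence` in order.

    Gaps between matched items are allowed (assumed definition).
    """
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later pattern items can only match later positions.
    return all(item in remaining for item in pattern)


def support(pattern, class_sequences):
    """Fraction of sequences in one class that contain `pattern`."""
    if not class_sequences:
        return 0.0
    hits = sum(contains_subsequence(s, pattern) for s in class_sequences)
    return hits / len(class_sequences)


# Toy class of three labeled sequences (hypothetical data).
toy_class = ["abcd", "acbd", "bd"]
support_ad = support("ad", toy_class)
support_bd = support("bd", toy_class)
```

    A pattern with high support in one class and low support in the others is a natural candidate for a classification rule, although the source combines support with cohesion before ranking.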

    Protein sequence classification using feature hashing

    Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks
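
    The idea of hashing k-gram counts into a fixed number of buckets can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MD5-based bucket function, the function names, and the example sequence are all assumptions made for the sketch.

```python
import hashlib
from collections import Counter


def kgrams(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Baseline representation: one count per distinct k-gram."""
    return Counter(kgrams(sequence, k))


def bucket_of(feature, n_buckets):
    """Map a k-gram to a bucket index with a deterministic hash.

    MD5 is used only because it is stable across runs (an assumption
    of this sketch); any well-mixed hash function would do.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_features(sequence, k, n_buckets):
    """Fixed-length vector; colliding k-grams have their counts summed."""
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        vector[bucket_of(gram, n_buckets)] += 1
    return vector


# Hypothetical protein fragment, 3-grams hashed into 8 buckets.
example = "MKVLAAGIVGLMKVLA"
vec = hashed_features(example, 3, 8)
```

    The vector length depends only on `n_buckets`, not on k or the alphabet size, which is what keeps the input space small for large k. The price is that distinct k-grams sharing a bucket become indistinguishable.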

    Classification with Single Constraint Progressive Mining of Sequential Patterns

    Classification based on sequential pattern data has become an important topic to explore. One earlier approach is Classify-By-Sequence (CBS). CBS classifies data based on sequential patterns obtained from AprioriLike sequential pattern mining. The patterns obtained are called Classifiable Sequential Patterns (CSP) and are used as classifier rules or features for the classification task. However, the AprioriLike algorithm that CBS uses takes a long time to search for sequential patterns, and not all of the patterns it finds are important to the user. To obtain the right and meaningful features for classification, the user applies a constraint in sequential pattern mining. The constraint is also expected to reduce the number of sequential patterns that are short and less meaningful to the user. We therefore developed CBS_CLASS* with Single Constraint Progressive Mining of Sequential Patterns (Single Constraint PISA, or PISA*). CBS_CLASS* with PISA* classified data faster because it processed fewer sequential patterns while still conforming to the user’s needs. The experimental results showed that, compared to CBS_CLASS, CBS_CLASS* reduced the classification execution time by 89.8% while maintaining classification accuracy.

    Multiple perspectives HMM-based feature engineering for credit card fraud detection

    Machine learning and data mining techniques have been used extensively to detect credit card fraud. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this article, we model a sequence of credit card transactions from three different perspectives, namely (i) Does the sequence contain a fraud? (ii) Is the sequence obtained by fixing the card-holder or the payment terminal? (iii) Is it a sequence of spent amounts or of elapsed times between the current and previous transactions? Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each of these sets is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood with a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. This multiple-perspectives HMM-based approach enables automatic feature engineering that models the sequential properties of the dataset with respect to the classification task. This strategy allows for a 15% increase in the precision-recall AUC compared to the state-of-the-art feature engineering strategy for credit card fraud detection. Comment: Presented as a poster in the conference SAC 2019: 34th ACM/SIGAPP Symposium on Applied Computing in April 201
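
    How an HMM assigns a likelihood to a sequence can be illustrated with the standard forward algorithm. The sketch below assumes a discrete-observation HMM with made-up parameters (two hidden states, two symbols such as "low" and "high" amount); the abstract does not say how the observations are encoded or how the models are trained, so treat every name and number here as hypothetical.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(obs | HMM) for a discrete-observation HMM.

    obs   : list of observation symbol indices
    start : start[i]    = P(first hidden state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][o]  = P(symbol o | state i)
    """
    if not obs:
        return 1.0
    n_states = len(start)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical parameters: 2 hidden states, symbols 0 = "low", 1 = "high".
START = [0.6, 0.4]
TRANS = [[0.7, 0.3],
         [0.4, 0.6]]
EMIT = [[0.9, 0.1],
        [0.2, 0.8]]

# Likelihood of a short window of past transactions under this one model.
# In the multiple-perspectives setup, one such value per HMM would be
# appended to the transaction's feature vector.
window = [0, 0, 1]
feature_value = sequence_likelihood(window, START, TRANS, EMIT)
```

    For long windows the raw probabilities underflow, so a practical version would work in log space or rescale `alpha` at each step.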

    tRNA signatures reveal polyphyletic origins of streamlined SAR11 genomes among the alphaproteobacteria

    Phylogenomic analyses are subject to bias from compositional convergence and noise from horizontal gene transfer (HGT). Compositional convergence is a likely cause of controversy regarding phylogeny of the SAR11 group of Alphaproteobacteria that have extremely streamlined, A+T-biased genomes. While careful modeling can reduce artifacts caused by convergence, the most consistent and robust phylogenetic signal in genomes may lie distributed among encoded functional features that govern macromolecular interactions. Here we develop a novel phyloclassification method based on signatures derived from bioinformatically defined tRNA Class-Informative Features (CIFs). tRNA CIFs are enriched for features that underlie tRNA-protein interactions. Using a simple tRNA-CIF-based phyloclassifier, we obtained results consistent with those of bias-corrected whole proteome phylogenomic studies, rejecting monophyly of SAR11 and affiliating most strains with Rhizobiales with strong statistical support. Yet SAR11 and Rickettsiales tRNA genes share distinct patterns of A+T-richness, as expected from their elevated genomic A+T compositions. Using conventional supermatrix methods on total tRNA sequence data, we could recover the artifactual result of a monophyletic SAR11 grouping with Rickettsiales. Thus tRNA CIF-based phyloclassification is more robust to base content convergence than supermatrix phylogenomics on whole tRNA sequences. Also, given the notoriously promiscuous HGT of aminoacyl-tRNA synthetases, tRNA CIF-based phyloclassification may be relatively robust to HGT of network components. We describe how unique features of tRNA-protein interaction networks facilitate the mining of traits governing macromolecular interactions from genomic data, and discuss why interaction-governing traits may be especially useful to solve difficult problems in microbial classification and phylogeny

    Identification and Classification of Moving Vehicles on Road

    It is important to know road traffic density in real time, especially in cities, for signal control and effective traffic management. In recent years, video monitoring and surveillance systems have been widely used in traffic management, so traffic density estimation and vehicle classification can be achieved with such systems. The image sequences of traffic scenes are recorded by a stationary camera. The method is based on establishing correspondences between regions and vehicles as the vehicles move through the image sequence. Background subtraction is used, which improves the adaptive background mixture model and makes the system learn faster and more accurately, as well as adapt effectively to changing environments. The resulting system robustly identifies vehicles, rejects the background, and tracks vehicles over a specific period of time. Once a vehicle is tracked, its attributes, such as width, length, perimeter, and area, are extracted by image-processing feature extraction techniques. These features are used to classify the vehicle as big or small using a neural network classification technique from data mining. In the proposed system we use LABVIEW and the Vision assistant module for image processing and feature extraction. A feed-forward neural network is trained to classify vehicles using the WEKA data mining toolbox. The system addresses the major problems of human effort and error in traffic monitoring and the time consumed in conducting surveys and analysing data. The project will help reduce the cost of traffic monitoring and fully automate it. Keywords: Image processing, Feature extraction, Segmentation, Threshold, Filter, Morphology, Blob, LABVIEW, NI, VI, Vision assistant, Data mining, Machine learning, Neural network, Back propagation, Multi layer perceptron, Classification, WEK