Various Sequence Classification Mechanisms for Knowledge Discovery

Author: Goutami R. Mane, Suhas B. Bhagate
Publication venue: Auricle Global Society of Education and Research
Publication date: 30/11/2017

Sequence classification is an efficient task in data mining. The knowledge obtained from the training stage can be used for sequence classification, which assigns class labels to new sequences. Relevant patterns can be found by sequential pattern mining, in which the values are represented in a sequential manner. Classification normally relies on explicit features, but such features are not available in sequences. Feature selection techniques are sophisticated, but the dimensionality of the potential features may be very high, and it is hard to capture the sequential nature of the features. Sequence classification is therefore a more challenging task than feature vector classification. The sequence classification problem can be solved by rules that consist of interesting patterns. These patterns are found in datasets of sequences labeled with class labels. The cohesion and support of a pattern are used to define its interestingness: in a given class of sequences, the interestingness of a pattern is measured by combining these two factors. Confident classification rules can be generated from the discovered patterns. Two different approaches are used to build a classifier. The first classifier is an advanced form of a classification method based on association rules. In the second classifier, the value belonging to the new data object is first measured then the rules are ranked
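
The abstract above does not give formulas for support or cohesion, so the following is only a rough sketch under stated assumptions: support is taken as the fraction of a class's sequences that contain the pattern as a subsequence, cohesion as the pattern length divided by the average shortest window containing it, and interestingness as their product. The function names are hypothetical and are not taken from the paper.

```python
def shortest_window(sequence, pattern):
    """Length of the shortest window of `sequence` that contains the
    non-empty `pattern` as a subsequence, or None if it never occurs."""
    best = None
    for start in range(len(sequence)):
        if sequence[start] != pattern[0]:
            continue
        matched = 0
        # Greedy matching from this start gives the earliest possible end.
        for end in range(start, len(sequence)):
            if sequence[end] == pattern[matched]:
                matched += 1
                if matched == len(pattern):
                    length = end - start + 1
                    if best is None or length < best:
                        best = length
                    break
    return best


def interestingness(sequences, pattern):
    """Assumed definition: support * cohesion within one class of sequences."""
    windows = [w for w in (shortest_window(s, pattern) for s in sequences)
               if w is not None]
    if not windows:
        return 0.0
    support = len(windows) / len(sequences)          # share of sequences hit
    cohesion = len(pattern) / (sum(windows) / len(windows))  # 1.0 = contiguous
    return support * cohesion
```

Under these assumptions, a pattern that occurs often and with its items close together scores near 1, while a rare or widely scattered pattern scores near 0.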

International Journal on Future Revolution in Computer Science & Communication Engineering

Protein sequence classification using feature hashing

Author: Caragea Cornelia, Mitra Prasenjit, Silvescu Adrian
Publication venue: BioMed Central
Publication date: 01/01/2012

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks
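
As a minimal sketch of the idea described in this abstract, the snippet below builds a "bag of k-grams" for a sequence and a hashed version in which each k-gram is mapped to one of a fixed number of buckets and colliding counts are summed. The choice of `zlib.crc32` as the hash function, the function names, and the bucket count are assumptions for illustration; the abstract does not say which hash function the authors used.

```python
import zlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping k-grams of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Sparse count of every distinct k-gram (dimension grows with 20**k)."""
    return Counter(kgrams(sequence, k))


def hashed_features(sequence, k, n_buckets):
    """Hash each k-gram into one of n_buckets and aggregate the counts.
    Different k-grams may collide in the same bucket by design."""
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        # crc32 is an assumed stand-in for the hash function.
        bucket = zlib.crc32(gram.encode("ascii")) % n_buckets
        vector[bucket] += 1
    return vector
```

The hashed vector has a fixed length regardless of k, so it can be fed to a standard classifier even when the full k-gram vocabulary would be too large to enumerate.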

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

UNT Digital Library

Classification with Single Constraint Progressive Mining of Sequential Patterns

Author: Saptawati Putri, Sitohang Benhard, Yasmin Regina Yulia
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/08/2017

Classification based on sequential pattern data has become an important topic to explore. One such line of research is Classify-By-Sequence (CBS). CBS classifies data based on sequential patterns obtained from AprioriLike sequential pattern mining. The sequential patterns obtained are called Classifiable Sequential Patterns (CSP) and are used as classifier rules or features for the classification task. However, the AprioriLike algorithm takes a long time to find sequential patterns, and not all of the patterns it finds are important to the user. To obtain the right and meaningful features for classification, the user applies a constraint in sequential pattern mining. The constraint is also expected to reduce the number of sequential patterns that are short and less meaningful to the user. Therefore, we developed CBS_CLASS* with Single Constraint Progressive Mining of Sequential Patterns (Single Constraint PISA, or PISA*). CBS_Class* with PISA* classified data faster because it processed fewer sequential patterns while still conforming to the user’s needs. The experimental results showed that, compared to CBS_CLASS, CBS_Class* reduced the classification execution time by 89.8%, while the accuracy of the classification process was maintained.

IAES journal

ZENODO

Institute of Advanced Engineering and Science

Multiple perspectives HMM-based feature engineering for credit card fraud detection

Author: Caelen Olivier, Calabretto Sylvie, Granitzer Michael, He-Guelton Liyun, Laporte Léa, Lucas Yvan, Portier Pierre-Edouard
Publication date: 08/04/2019

Machine learning and data mining techniques have been used extensively to detect credit card fraud. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this article, we model a sequence of credit card transactions from three different perspectives, namely (i) does the sequence contain a fraud? (ii) Is the sequence obtained by fixing the card-holder or the payment terminal? (iii) Is it a sequence of spent amounts or of elapsed times between the current and previous transactions? Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sets is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood with a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. This multiple perspectives HMM-based approach enables automatic feature engineering that models the sequential properties of the dataset with respect to the classification task. This strategy allows for a 15% increase in the precision-recall AUC compared to the state-of-the-art feature engineering strategy for credit card fraud detection. Comment: Presented as a poster in the conference SAC 2019: 34th ACM/SIGAPP Symposium on Applied Computing in April 201
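
The feature construction described in this abstract can be sketched roughly as follows: enumerate the eight combinations of the three binary perspectives, keep one HMM per combination, and score a sequence of discretised observations under each HMM with the forward algorithm to obtain eight extra features. This is a sketch under stated assumptions, not the authors' implementation: the toy parameters, the discrete emission model, the perspective labels, and all names below are hypothetical.

```python
import math
from itertools import product

# Three binary perspectives give 2 * 2 * 2 = 8 sequence sets (labels assumed).
PERSPECTIVES = list(product(("genuine", "fraud"),
                            ("card-holder", "terminal"),
                            ("amount", "time-delta")))

# Toy 2-state HMM over 2 discrete symbols (made-up parameters).
TOY_START = [0.6, 0.4]
TOY_TRANS = [[0.7, 0.3], [0.2, 0.8]]
TOY_EMIT = [[0.9, 0.1], [0.3, 0.7]]


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs) under a discrete HMM.
    Assumes a non-empty `obs` with non-zero probability."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = sum(alpha)
    log_p = math.log(total)
    alpha = [a / total for a in alpha]
    for symbol in obs[1:]:
        alpha = [emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
        total = sum(alpha)            # rescale to avoid underflow
        log_p += math.log(total)
        alpha = [a / total for a in alpha]
    return log_p


def hmm_features(obs, models):
    """One log-likelihood per perspective; `models` maps each perspective
    tuple to its (start, trans, emit) parameters."""
    return [log_likelihood(obs, *models[p]) for p in PERSPECTIVES]
```

In practice each of the eight models would be fitted on its own set of training sequences, and the resulting eight values would be appended to the transaction's existing feature vector before training the Random Forest.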

arXiv.org e-Print Archive

Crossref

HAL

Hal-Diderot

tRNA signatures reveal polyphyletic origins of streamlined SAR11 genomes among the alphaproteobacteria

Author: Amrine Katherine C. H., Ardell David H., Swingley Wesley D.
Publication venue: Public Library of Science (PLoS)
Publication date: 30/05/2013

Phylogenomic analyses are subject to bias from compositional convergence and noise from horizontal gene transfer (HGT). Compositional convergence is a likely cause of controversy regarding phylogeny of the SAR11 group of Alphaproteobacteria that have extremely streamlined, A+T-biased genomes. While careful modeling can reduce artifacts caused by convergence, the most consistent and robust phylogenetic signal in genomes may lie distributed among encoded functional features that govern macromolecular interactions. Here we develop a novel phyloclassification method based on signatures derived from bioinformatically defined tRNA Class-Informative Features (CIFs). tRNA CIFs are enriched for features that underlie tRNA-protein interactions. Using a simple tRNA-CIF-based phyloclassifier, we obtained results consistent with those of bias-corrected whole proteome phylogenomic studies, rejecting monophyly of SAR11 and affiliating most strains with Rhizobiales with strong statistical support. Yet SAR11 and Rickettsiales tRNA genes share distinct patterns of A+T-richness, as expected from their elevated genomic A+T compositions. Using conventional supermatrix methods on total tRNA sequence data, we could recover the artifactual result of a monophyletic SAR11 grouping with Rickettsiales. Thus tRNA CIF-based phyloclassification is more robust to base content convergence than supermatrix phylogenomics on whole tRNA sequences. Also, given the notoriously promiscuous HGT of aminoacyl-tRNA synthetases, tRNA CIF-based phyloclassification may be relatively robust to HGT of network components. We describe how unique features of tRNA-protein interaction networks facilitate the mining of traits governing macromolecular interactions from genomic data, and discuss why interaction-governing traits may be especially useful to solve difficult problems in microbial classification and phylogeny

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Identification and Classification of Moving Vehicles on Road

Author: Al-Shehri Ahmad Mohammed, Al-Shehri Ali Dhafer Ali, Ashwi Haytham Ibrahim, Badawy Ahmed Said, Changalasetty Suresh Babu, Ghribi Wade, Medisetty Ramakanth, Thota Lalitha Saroja
Publication venue: The International Institute for Science, Technology and Education (IISTE)
Publication date: 31/07/2013

It is important to know the road traffic density in real time, especially in cities, for signal control and effective traffic management. In recent years, video monitoring and surveillance systems have been widely used in traffic management. Hence, traffic density estimation and vehicle classification can be achieved using video monitoring systems. The image sequences for traffic scenes are recorded by a stationary camera. The method is based on establishing correspondences between regions and vehicles as the vehicles move through the image sequence. Background subtraction is used, which improves the adaptive background mixture model and makes the system learn faster and more accurately, as well as adapt effectively to changing environments. The resulting system robustly identifies vehicles, rejects background, and tracks vehicles over a specific period of time. Once the vehicle (object) is tracked, attributes of the vehicle such as width, length, perimeter, and area are extracted by image processing feature extraction techniques. These features are used to classify the vehicle as big or small using a neural network classification technique from data mining. In the proposed system we use LABVIEW and the Vision assistant module for image processing and feature extraction. A feed-forward neural network is trained to classify vehicles using the data mining WEKA toolbox. The system will address the major problems of human effort and error in traffic monitoring and of the time consumed in conducting surveys and analysing data. The project will help to reduce the cost of traffic monitoring systems and to automate them completely. Keywords: Image processing, Feature extraction, Segmentation, Threshold, Filter, Morphology, Blob, LABVIEW, NI, VI, Vision assistant, Data mining, Machine learning, Neural network, Back propagation, Multi layer perception, Classification, WEK

International Institute for Science, Technology and Education (IISTE): E-Journals