16 research outputs found

    Optimal Decision Trees for the Algorithm Selection Problem: Integer Programming Based Approaches

    Even though it is well known that for most relevant computational problems different algorithms may perform better on different classes of problem instances, most researchers still focus on determining a single best algorithmic configuration based on aggregate results such as the average. In this paper, we propose Integer Programming based approaches to build decision trees for the Algorithm Selection Problem. These techniques automate three crucial decisions: (i) discerning the most important problem features to determine problem classes; (ii) grouping the problems into classes; and (iii) selecting the best algorithm configuration for each class. To evaluate this new approach, extensive computational experiments were executed using the linear programming algorithms implemented in the COIN-OR Branch & Cut solver across a comprehensive set of instances, including all MIPLIB benchmark instances. The results exceeded our expectations. While selecting the single best parameter setting across all instances decreased the total running time by 22%, our approach decreased the total running time by 40% on average across 10-fold cross validation experiments. These results indicate that our method generalizes quite well and does not overfit. Comment: International Transactions in Operational Research. 201
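
The three decisions listed in this abstract (choose a feature, group instances into classes, pick the best algorithm per class) can be illustrated with a toy sketch. This is not the authors' Integer Programming formulation: it swaps in an exhaustive search over a single split (a depth-1 tree), and the function names, feature values, and running times are all hypothetical.

```python
def best_single(runtimes):
    """Return (total_time, algorithm_index) of the best single algorithm."""
    n_algs = len(runtimes[0])
    totals = [sum(row[a] for row in runtimes) for a in range(n_algs)]
    best = min(range(n_algs), key=totals.__getitem__)
    return totals[best], best


def best_stump(features, runtimes):
    """Exhaustively search one (feature, threshold) split.

    Instances with feature value <= threshold form one class, the rest
    another; each class is assigned its own best algorithm.
    Returns (total_time, (feature_index, threshold)), or
    (total_time, None) when no split beats the single best algorithm.
    """
    best = (best_single(runtimes)[0], None)
    for f in range(len(features[0])):
        for t in sorted({row[f] for row in features}):
            left = [r for x, r in zip(features, runtimes) if x[f] <= t]
            right = [r for x, r in zip(features, runtimes) if x[f] > t]
            if not left or not right:
                continue  # a split must produce two non-empty classes
            total = best_single(left)[0] + best_single(right)[0]
            if total < best[0]:
                best = (total, (f, t))
    return best


# Hypothetical data: one feature per instance, running times of two algorithms.
features = [[10], [20], [300], [400]]
runtimes = [[1, 5], [2, 6], [9, 3], [8, 2]]

single_total, single_alg = best_single(runtimes)
stump_total, split = best_stump(features, runtimes)
print(single_total, single_alg)  # 16 1
print(stump_total, split)        # 8 (0, 20)
```

On this toy data the single best algorithm needs 16 time units in total, while splitting on the one feature at threshold 20 and choosing an algorithm per side needs 8. Exhaustive enumeration only scales to very shallow trees, which is one motivation for an optimization-based formulation.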

    Information gain feature selection for multi-label classification.

    In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This fact has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, many feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. However, most methods proposed for this task rely on the transformation of the multi-label data set into a single-label one. In this work we have chosen one of the most well-known measures for feature selection, Information Gain, and we have evaluated it along with common transformation techniques for multi-label classification. We have also adapted the information gain feature selection technique to handle multi-label data directly. Our goal is to perform a thorough investigation of the performance of multi-label feature selection techniques using the information gain concept and report how it varies when coupled with different multi-label classifiers and data sets from different domains.
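
As a rough illustration of scoring features by information gain after a problem transformation, the sketch below assumes discrete features, a binary relevance transformation (one binary problem per label), and averaging of the per-label gains. The abstract does not say which transformations or aggregation the authors used, so these choices, the function names, and the toy data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, label):
    """IG = H(label) - H(label | feature) for two discrete columns."""
    n = len(label)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, label) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(label) - conditional


def br_information_gain(X, Y):
    """Score each feature by its mean information gain over all labels.

    X: list of rows of discrete feature values.
    Y: list of rows of 0/1 label indicators (one column per label).
    """
    n_features, n_labels = len(X[0]), len(Y[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        gains = [
            information_gain(column, [row[k] for row in Y])
            for k in range(n_labels)
        ]
        scores.append(sum(gains) / n_labels)
    return scores


# Hypothetical toy data: feature 0 determines both labels, feature 1 is noise.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
Y = [[0, 1], [0, 1], [1, 0], [1, 0]]
scores = br_information_gain(X, Y)
print(scores)
```

Features would then be ranked by score and the top-ranked ones kept. Other transformations, such as treating each distinct label set as a single class, plug into the same `information_gain` helper by changing how the label column is built.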

    The impact of sequence length and number of sequences on promoter prediction performance.

    Background: The rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up the genomic annotation process and reduce its cost. Since promoter identification is an important step in functional genomic annotation, several studies have been undertaken to propose computational approaches for predicting promoters. Different classifiers and characteristics of the promoter sequences have been used to address this prediction problem. However, several works in the literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. Because the sequence length defines the number of dataset attributes, even a limited number of properties used to characterize the sequences yields datasets with a high number of attributes for training classifiers. Since high-dimensional datasets can degrade classifiers' predictive performance or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential to obtaining good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically examined the relation between sequence length and the predictive performance of classifiers. Thus, in this work, we have evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers. Results: We have built sixteen datasets composed of different sized sequences (ranging in length from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of 101 nucleotides. We have also analyzed, using sequences composed of only 41 nucleotides, the impact of increasing the number of sequences in a dataset on the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more. Conclusion: The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, and also required significantly less processing time. Furthermore, increasing the number of sequences in a dataset proved beneficial to the predictive power of the classifiers.

    Probabilistic classification based on pattern analysis (Classificação probabilística baseada em análise de padrões)

    Classification is a data mining task that has been useful in several application areas, particularly in bioinformatics. The genomic revolution has resulted in an explosive growth of biological data generated by the scientific community. Biological databases were created to store all of this biological information. The need for computational tools for analysing biological data has become evident, resulting in the application of data mining methods in this field. The work developed in this thesis concerns the classification task and, initially, its application to bioinformatics. The initial goal is to present a computationally efficient method for protein classification capable of yielding highly accurate results, outperforming the results obtained by previous works. The good results in terms of accuracy and time performance obtained by the proposed method show its potential for the protein classification problem. In addition, aiming to construct a classifier suitable for several kinds of applications, the method proposed for the protein classification problem was extended, becoming appropriate and efficient for several databases associated with different applications.

    Operational mine planning using mathematical models (Planejamento operacional da lavra de mina usando modelos matemáticos)

    This work presents mathematical models for solving operational problems in open-pit mine planning. The models determine the mining rate to be implemented at each mining face, taking into account the ore quality at each face, the desired waste-to-ore ratio, the required production, the characteristics of the loading and haulage equipment, and the operational characteristics of the mine. The models also consider the possibility of static and dynamic truck allocation. Under dynamic allocation, the model determines the production of each face and allocates the loading equipment to the chosen faces. Under static allocation, in addition to allocating the loading equipment, the model also allocates the trucks to the faces.

    HiSP-GC : a classification method based on probabilistic analysis of patterns.

    Classification is one of the most important tasks in data mining and has been applied to solve problems in many different areas, such as administration, finance, education, health and others. Therefore, the construction of precise and computationally efficient classifiers is a relevant challenge in the data mining field. In previous works we presented an efficient method for protein classification, called the HiSP (Highest Subset Probability) classifier, capable of yielding highly accurate results, outperforming the results obtained by other researchers. Aiming to construct a general purpose classifier based on the ideas explored to solve the protein classification problem, the previously proposed method was adapted and extended. Here we present this expanded and general classification method, called HiSP-GC (HiSP General Classifier), and show that it is appropriate and efficient for several kinds of databases associated with different applications.

    An Extended Local Hierarchical Classifier for Prediction of Protein and Gene Functions.

    Gene function prediction and protein function prediction are complex classification problems where the functional classes are structured according to a predefined hierarchy. To solve these problems, we propose an extended local hierarchical Naive Bayes classifier, where a binary classifier is built for each class in the hierarchy. The extension to conventional local approaches is that each classifier considers both the parent and child classes of the current class. We have evaluated the proposed approach on eight protein function and ten gene function hierarchical classification datasets. The proposed approach achieved somewhat better predictive accuracies than a global hierarchical Naive Bayes classifier

    Categorizing feature selection methods for multi-label classification.

    In many important application domains such as text categorization, biomolecular analysis, scene classification and medical diagnosis, examples are naturally associated with more than one class label, giving rise to multi-label classification problems. This fact has led, in recent years, to a substantial amount of research on feature selection methods that allow the identification of relevant and informative features for multi-label classification. However, the methods proposed for this task are scattered in the literature, with no common framework to describe them and to allow an objective comparison. Here, we revisit a categorization of existing multi-label classification methods and, as our main contribution, we provide a comprehensive survey and novel categorization of the feature selection techniques that have been created for the multi-label classification setting. We conclude this work with concrete suggestions for future research in multi-label feature selection which have been derived from our categorization and analysis

    Pollution, bad-mouthing, and local marketing : the underground of location-based social networks.

    Location Based Social Networks (LBSNs) are new Web 2.0 systems that are attracting new users at exponential rates. LBSNs like Foursquare and Yelp allow users to share their geographic location with friends through smartphones equipped with GPS, search for interesting places, and post tips about existing locations. By allowing users to comment on locations, LBSNs increasingly have to deal with new forms of spammers, who aim to advertise unsolicited messages in tips about locations. Spammers may jeopardize users' trust in the system, thus compromising its success in promoting location-based social interactions. In spite of that, the available literature is very limited in providing a deep understanding of this problem. In this paper, we investigated the task of identifying different types of tip spam on a popular Brazilian LBSN system, namely Apontador. Based on a labeled collection of tips provided by Apontador as well as crawled information about users and locations, we identified three types of irregular tips, namely local marketing, pollution, and bad-mouthing. We leveraged our characterization study towards a classification approach able to differentiate these tips with high accuracy.