55 research outputs found

    SVM-based association rules for knowledge discovery and classification

    © 2015 IEEE. Improving the analysis of market basket data requires approaches that lead to recommendation systems tailored specifically to benefit grocery chains. The main purpose is to find relationships among product sales that can help retailers identify new opportunities for cross-selling their products to customers. This paper aims to discover knowledge patterns hidden in large data sets that can give data holders a better understanding of their data and identify new opportunities for critical tasks, including strategic planning and decision making. The paper delivers a strategy for implementing a systematic analysis framework built on established principles of data mining and machine learning. The primary goal is to form the foundation of what we envisage will be a new recommendation system in the market. Uniquely, our strategy seeks to implement data mining tools that allow the analyst to interact with the data and address business questions such as promotional advertising. We employ the Apriori algorithm and a support vector machine to implement our recommendation systems. Experiments are carried out on a real market data set, and the 0.632+ bootstrap method is used to evaluate the framework. The results suggest that the proposed framework can generate benefits for grocery chains using real-world grocery store data.
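
The Apriori step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `baskets`, the `min_support` threshold, and the function names are assumptions made here, and the SVM and 0.632+ bootstrap stages are not covered.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates that occur in enough transactions.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-itemsets.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
butter_to_bread = confidence(frequent, {"butter"}, {"bread"})
```

In this toy data every basket containing butter also contains bread, so the rule butter -> bread is the kind of association a retailer might inspect for cross-selling.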

    Feature selection of imbalanced gene expression microarray data

    Gene expression data sets are highly complex, characterised by a very large number of features but a small number of observations. However, only a few of these features are relevant to an outcome of interest. With this kind of data set, feature selection becomes a real prerequisite. This paper proposes a feature selection methodology for an imbalanced leukaemia gene expression data set based on the random forest algorithm. It presents the importance of feature selection in terms of reducing the number of features, enhancing the quality of machine learning, and giving biologists a better understanding for diagnosis and prediction. Algorithms are presented to show the methodology and strategy for feature selection, taking care to avoid overfitting. Moreover, experiments are carried out on imbalanced leukaemia gene expression data, and special measures are used to evaluate the quality of feature selection and the performance of classification. © 2011 IEEE

    ABC-sampling for balancing imbalanced datasets based on artificial bee colony algorithm

    © 2015 IEEE. Class-imbalanced data is a common problem for predictive modelling in domains such as bioinformatics. It occurs when the distribution of classes is not uniform among samples, and it biases the learned predictions towards the majority classes. In this study, we propose the ABC-Sampling algorithm, based on a swarm optimization method called Artificial Bee Colony, which models the natural foraging behaviour of honeybees. Our algorithm lessens the effects of imbalanced classes by selecting the most informative majority samples using a forward search and storing them in a ranked subset. We then construct a balanced dataset with a planned undersampling strategy that extracts the most frequent majority samples from the top-ranked subset and combines them with all minority samples. Our algorithm is superior to a state-of-the-art method on nine benchmark datasets with various imbalance ratios.
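
The final balancing step described in this abstract (keep the most frequently selected majority samples, then add all minority samples) might look roughly like the sketch below. The Artificial Bee Colony search itself is not implemented: `selected_subsets` is a hypothetical stand-in for the majority-sample indices that such a search would return, and cutting the majority class down to the minority size is an assumption of this sketch.

```python
from collections import Counter


def balanced_undersample(majority, minority, selected_subsets):
    """Combine the most frequently selected majority samples with all minority samples.

    selected_subsets: lists of majority indices, one list per search run
    (assumed to come from some upstream search such as ABC).
    """
    counts = Counter(i for subset in selected_subsets for i in subset)
    # Rank majority indices by how often they were selected;
    # ties are broken by index so the result is deterministic.
    ranked = sorted(range(len(majority)), key=lambda i: (-counts[i], i))
    keep = ranked[:len(minority)]
    return [majority[i] for i in keep] + list(minority)


# Hypothetical toy data, for illustration only.
majority = ["a", "b", "c", "d", "e"]
minority = ["x", "y"]
selected_subsets = [[0, 2], [2, 4], [2, 0, 3]]
balanced = balanced_undersample(majority, minority, selected_subsets)
```

Here sample `c` was selected three times and `a` twice, so those two are kept alongside both minority samples.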

    A framework for high dimensional data reduction in the microarray domain

    Microarray analysis and visualization help biologists and clinicians understand gene expression in cells and facilitate the diagnosis and treatment of patients. However, a typical microarray dataset has thousands of features and a very small number of observations. Such very high dimensional data carries a massive amount of information, which often includes noise, non-useful information, and only a small number of features relevant to a disease or genotype. This paper proposes a framework for very high dimensional data reduction based on three technologies: feature selection, linear dimensionality reduction, and non-linear dimensionality reduction. Feature selection based on mutual information is proposed for filtering features and selecting the most relevant features with the minimum redundancy. A kernel linear dimensionality reduction method is also used to extract the latent variables from a high dimensional data set. In addition, non-linear dimensionality reduction based on local linear embedding is used to reduce the dimension and visualize the data. Experimental results are presented to show the outputs of each step and the efficiency of this framework. © 2010 IEEE
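
As a rough illustration of the mutual-information filter mentioned in this abstract, the sketch below ranks discretised features by their mutual information with the class label. It covers only the relevance part; the minimum-redundancy criterion and both dimensionality reduction stages are left out. The feature names and values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical discretised expression levels, for illustration only.
features = {
    "gene_a": [0, 0, 1, 1],   # tracks the label exactly
    "gene_b": [0, 1, 0, 1],   # independent of the label
}
labels = [0, 0, 1, 1]

scores = {name: mutual_information(values, labels) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

A filter of this kind would keep the top-ranked features and pass them on to the later reduction steps.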

    Case-based retrieval framework for gene expression data

    © the authors, publisher and licensee Libertas Academica Limited. Background: Retrieving similar cases in a case-based reasoning system is a major challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets, and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurement requires machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. Methods: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. Results: The proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children’s Hospital at Westmead, as well as the Colon cancer, National Cancer Institute (NCI), and Prostate cancer data sets. The accuracy of the proposed framework in retrieving patients who are similar to new patients is 96% on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. Conclusion: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
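
The retrieval idea in the Methods sentence, k-nearest neighbours under a feature-weighted similarity, can be sketched as follows. The weighted Euclidean distance, the weights, and the toy cases are all assumptions of this sketch; the abstract does not say how the authors derive their feature weights.

```python
from collections import Counter
from math import sqrt


def weighted_distance(a, b, weights):
    """Euclidean distance in which each feature contributes according to its weight."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def retrieve(cases, query, weights, k=3):
    """Return the k stored (profile, label) cases closest to the query profile."""
    return sorted(cases, key=lambda case: weighted_distance(case[0], query, weights))[:k]


def classify(cases, query, weights, k=3):
    """Majority vote over the labels of the retrieved cases."""
    votes = Counter(label for _, label in retrieve(cases, query, weights, k))
    return votes.most_common(1)[0][0]


# Hypothetical two-gene profiles; the second gene is given zero weight,
# as if an upstream feature-selection step had judged it irrelevant.
cases = [
    ((1.0, 5.0), "A"),
    ((1.2, 0.0), "A"),
    ((3.0, 5.1), "B"),
    ((3.1, 0.2), "B"),
    ((0.9, 2.0), "A"),
]
weights = (1.0, 0.0)
query = (1.1, 5.0)

neighbours = retrieve(cases, query, weights, k=3)
predicted = classify(cases, query, weights, k=3)
```

With the second gene down-weighted, all three retrieved cases come from class `A`, whereas an unweighted distance would also pull in a class `B` case that happens to match on the irrelevant gene.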

    Ensemble feature learning of genomic data using support vector machine

    © 2016 Anaissi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Identifying a subset of genes that captures the information needed to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively for gene selection and classification. A testament to this is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy that underlies the RFE algorithm. The rationale is that building ensemble SVM models from bootstrap samples drawn randomly from the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate features is based on the ranking of multiple SVM models instead of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that ESVM-RFE achieves, on average, 9% better accuracy than SVM-RFE and 5% better than a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters in the selected data.
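
The ensemble-and-eliminate loop described in this abstract can be sketched in outline. This is not ESVM-RFE itself: to stay dependency-free, the per-model feature score below is a simple difference of class means standing in for SVM weight magnitudes, the bootstrap is exactly (not nearly) balanced, and all names and data are hypothetical. Only the structure is illustrated: rank features on several balanced bootstrap samples, aggregate the ranks, drop the worst feature, and repeat.

```python
import random
from statistics import mean


def balanced_bootstrap(X, y, rng):
    """Draw, with replacement, the same number of rows from each class."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    size = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for _ in range(size):
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb


def stand_in_scores(X, y, features):
    """Stand-in for SVM weights: absolute difference of class means (binary labels)."""
    first, second = sorted(set(y))
    a = [row for row, label in zip(X, y) if label == first]
    b = [row for row, label in zip(X, y) if label == second]
    return {f: abs(mean(r[f] for r in a) - mean(r[f] for r in b)) for f in features}


def ensemble_rfe(X, y, n_models=25, seed=0):
    """Return feature indices ordered from most to least important."""
    rng = random.Random(seed)
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        total_rank = {f: 0 for f in remaining}
        for _ in range(n_models):
            Xb, yb = balanced_bootstrap(X, y, rng)
            scores = stand_in_scores(Xb, yb, remaining)
            ordered = sorted(remaining, key=lambda f: scores[f], reverse=True)
            for rank, f in enumerate(ordered):  # rank 0 is the best feature
                total_rank[f] += rank
        # Backward elimination: drop the feature with the worst aggregated rank.
        worst = max(remaining, key=lambda f: total_rank[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining + eliminated[::-1]


# Hypothetical imbalanced toy data: feature 0 separates the classes strongly,
# feature 1 weakly, and feature 2 is constant.
X = [
    [0.0, 1.0, 1.0], [0.1, 1.2, 1.0], [0.2, 1.4, 1.0],
    [0.0, 1.6, 1.0], [0.1, 1.8, 1.0], [0.2, 2.0, 1.0],
    [10.0, 2.0, 1.0], [10.1, 2.4, 1.0],
]
y = [0, 0, 0, 0, 0, 0, 1, 1]
ranking = ensemble_rfe(X, y)
```

Swapping `stand_in_scores` for the squared weights of a linear SVM fitted on each bootstrap sample would bring the sketch closer to the method the abstract describes.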

    Genetic variability in young açaí palm progenies.

    This work studied the genetic variability in young progenies of an açaí palm population. The experiment was set up at the Tomé-Açu research station of Embrapa Amazônia Oriental and involved the analysis of 25 half-sib progenies arranged in a 5 x 5 lattice design. The experiment comprised two replications and five plants per plot. Plant height (AP), stem diameter at collar height (DFC), number of live leaves (NFV), and number of tillers (NP) were measured twelve months after planting. Analysis of variance showed significant differences at the 5% probability level for stem diameter at collar height and number of tillers, but not for plant height or number of live leaves. Values estimated at the upper end of the range of variation for the traits point to individuals that are promising for selection for fruit production via selection for plant diameter, since these traits are positively correlated. The highest genetic parameter estimates were obtained for number of tillers, followed by stem diameter.

    Diet of four lizards from an urban forest in an area of Amazonian biome, eastern Amazon

    This study described the diet and niche overlap of four lizards from an urban forest fragment in Amapá state. Sampling was performed with pitfall traps and active visual search. In the stomach analysis, Formicidae and Coleoptera represented 50.79% of the total items. The highest niche overlap value was between Gonatodes humeralis and Tropidurus hispidus, which was not expected given the difference in their habitat use. The foraging strategies of all the lizards observed have been reported previously by several authors. Several studies describe the diet of lizards as composed basically of invertebrates, with few variations, as was also demonstrated in this study. Asociación Herpetológica Argentin

    Influence of age on estimates of genetic parameters in açaí palm progenies.

    The influence of age on the variation of genetic parameters was studied for four vegetative traits in an açaí palm progeny test, with progenies derived from mother plants selected for high fruit production, presence of tillering, and violet-coloured fruits. Evaluations were carried out at 12 and 24 months after planting for plant height, plant diameter at collar height, number of live leaves, and number of tillers. At 12 months, significant differences at the 5% probability level were found for diameter and number of tillers. At 24 months, no significant difference was detected for any trait. Values obtained at the upper end of the range of variation for the traits pointed to individuals that are promising for selection for fruit production via selection for plant height and diameter, given that these traits are positively correlated. Positive correlation coefficients were estimated among height, plant diameter, and number of live leaves. Low values of the coefficient of genetic variation were observed for plant height and diameter at 24 months after planting, along with a clear tendency for the genetic parameter values to decrease with age.