A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes the
association rule discovery techniques to construct classification systems, also known as
associative classifiers. In the last few years, a number of associative classification algorithms
have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms
employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule
evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative
classification techniques with regard to the above criteria. Finally, future directions in associative
classification, such as incremental learning and mining low-quality data sets, are also
highlighted in this paper.
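The rule discovery, rule ranking and rule prediction steps named in this abstract can be illustrated with a minimal sketch. It is a generic toy associative classifier, not an implementation of CPAR, CMAR, MCAR or MMAC; the function names (`mine_rules`, `predict`), the thresholds and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, labels, min_support=0.3, min_confidence=0.6, max_len=2):
    """Mine class association rules (antecedent -> label) by brute force.

    Hypothetical helper: real algorithms use far more efficient discovery
    and additional pruning steps.
    """
    n = len(rows)
    antecedent_counts = Counter()
    rule_counts = Counter()
    for items, label in zip(rows, labels):
        for size in range(1, max_len + 1):
            for antecedent in combinations(sorted(items), size):
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, label)] += 1

    rules = []
    for (antecedent, label), count in rule_counts.items():
        support = count / n                                  # fraction of all rows
        confidence = count / antecedent_counts[antecedent]   # P(label | antecedent)
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, label, support, confidence))

    # Rule ranking (assumed order): confidence, then support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def predict(rules, items, default):
    """Rule prediction: return the label of the first ranked rule that matches."""
    items = set(items)
    for antecedent, label, _, _ in rules:
        if items.issuperset(antecedent):
            return label
    return default


# Toy data, invented for this example only.
rows = [
    {"outlook=sunny", "windy=no"},
    {"outlook=sunny", "windy=yes"},
    {"outlook=rain", "windy=yes"},
    {"outlook=rain", "windy=no"},
    {"outlook=rain", "windy=yes"},
]
labels = ["play", "play", "stay", "play", "stay"]

rules = mine_rules(rows, labels)
print(predict(rules, {"outlook=rain", "windy=yes"}, default="play"))  # stay
```

Using the first matching rule is only one possible prediction strategy; the algorithms surveyed differ in exactly these choices, and some combine several matching rules instead.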
A survey on utilization of data mining approaches for dermatological (skin) diseases prediction
Due to recent advances in technology, large volumes of medical data are being collected. These data contain valuable information, so data mining techniques can be used to extract useful patterns. This paper introduces data mining and its main techniques and surveys the available literature on medical data mining. We focus mainly on the application of data mining to skin diseases. A categorization is provided based on the different data mining techniques, and the utility of the various methodologies is highlighted. Association mining is generally suitable for extracting rules and has been used especially in cancer diagnosis. Classification is a robust method in medical mining. In this paper, we summarize the different uses of classification in dermatology. It is one of the most important methods for the diagnosis of erythemato-squamous diseases. Methods used for this task include neural networks, genetic algorithms and fuzzy classification. Clustering is a useful method in medical image mining. The purpose of clustering techniques is to find structure in the given data by identifying similarities between data points according to their characteristics. Clustering has some applications in dermatology. Besides introducing different mining methods, we investigate some challenges that exist in mining skin data.
The LSST Data Mining Research Agenda
We describe features of the LSST science database that are amenable to
scientific data mining, object classification, outlier identification, anomaly
detection, image quality assurance, and survey science validation. The data
mining research agenda includes: scalability (at petabyte scales) of existing
machine learning and data mining algorithms; development of grid-enabled
parallel data mining algorithms; designing a robust system for brokering
classifications from the LSST event pipeline (which may produce 10,000 or more
event alerts per night); multi-resolution methods for exploration of petascale
databases; indexing of multi-attribute multi-dimensional astronomical databases
(beyond spatial indexing) for rapid querying of petabyte databases; and more.
Comment: 5 pages, Presented at the "Classification and Discovery in Large
Astronomical Surveys" meeting, Ringberg Castle, 14-17 October, 200
Evaluation and optimization of frequent association rule based classification
Deriving useful and interesting rules from a data mining system is an essential and important task. Problems
such as the discovery of random and coincidental patterns or patterns with no significant values, and the
generation of a large volume of rules from a database commonly occur. Methods for preserving the interestingness
of rules generated by data mining algorithms are under active and ongoing development. In this
paper, we present a systematic way to evaluate the association rules discovered by frequent itemset mining algorithms,
combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed using a number of real-world datasets that represent diverse characteristics of data/items, and a detailed evaluation of the rule sets is provided. Empirical results show that, with a proper combination of data mining and statistical analysis, the framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable high-accuracy, high-coverage rules when used in the classification problem. Moreover, the results reveal important characteristics of mining frequent itemsets, and the impact of the confidence measure on the classification task.
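As a rough illustration of combining data mining and statistical interestingness measures, the sketch below computes support, confidence and lift for a rule A -> C, together with a chi-square statistic on the 2x2 contingency table. The abstract does not name the measures or the order in which the framework applies them, so the choice of measures, the function name `rule_measures` and the example counts are assumptions.

```python
def rule_measures(n, n_a, n_c, n_ac):
    """Interestingness measures for a rule A -> C from raw counts.

    n    : total number of records
    n_a  : records containing the antecedent A
    n_c  : records containing the consequent C
    n_ac : records containing both A and C
    Assumes 0 < n_a < n and 0 < n_c < n so no expected count is zero.
    """
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)  # 1.0 means A and C look independent

    # Chi-square statistic on the 2x2 table (A / not A) x (C / not C).
    observed = [
        [n_ac, n_a - n_ac],
        [n_c - n_ac, n - n_a - n_c + n_ac],
    ]
    row_totals = [n_a, n - n_a]
    col_totals = [n_c, n - n_c]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    return {"support": support, "confidence": confidence, "lift": lift, "chi2": chi2}


# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CHI2_CRITICAL_95 = 3.841

# Invented counts: 100 records, A in 40, C in 50, both in 30.
m = rule_measures(n=100, n_a=40, n_c=50, n_ac=30)
keep_rule = m["lift"] > 1.0 and m["chi2"] > CHI2_CRITICAL_95
print(m, keep_rule)
```

A rule whose lift is close to 1.0 or whose chi-square statistic falls below the critical value would be a candidate for removal as non-significant under this kind of filter.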
Clinical data mining and classification
Dissertation submitted for the degree of Master in Engenharia Informática e de Computadores (Computer Science and Engineering).
Determining which genes contribute to the development of certain diseases, such as cancer, is an important goal at the forefront of today's clinical research.
This can provide insights into how diseases develop and can lead to new treatments and to diagnostic tests that detect diseases earlier in their development, increasing patients' chances of recovery. Today, many gene expression datasets are publicly available. These generally consist of DNA microarray data with information on the activation (or not) of thousands of genes in specific patients who exhibit a certain disease. However, these clinical datasets consist of high-dimensional feature vectors, which raises difficulties for clinical human analysis and interpretability: given the large number of features and the comparatively small number of instances, it is difficult to identify the most relevant genes related to the presence of a particular disease. In this thesis, we explore the use of feature discretization, feature selection, and classification techniques applied to the problem of identifying the most relevant set of features (genes), within DNA microarray datasets, that can predict the presence of a given disease. We propose a machine learning pipeline in which different feature discretization, feature selection, and classification techniques are applied to different datasets, and we compare and interpret the results achieved with each combination of techniques. On most datasets, we were able to obtain lower classification errors by applying either feature discretization or feature selection techniques (but not both). When applying feature selection techniques, we were also able to reduce the number of features fed to each classifier, while maintaining or improving the classification results. These smaller subsets of genes are thus easier for human clinical experts to interpret, improving the explainability of the results.