Practical Aspects of Data Mining Using LISp-Miner
This paper describes some practical aspects of using LISp-Miner for data mining. LISp-Miner is a software tool under development at the University of Economics, Prague. We review the different types of knowledge patterns discovered by the system and discuss their applicability to various data mining tasks. We also compare LISp-Miner 18.16 with Weka 3.6.9 and RapidMiner 5.3.
Topics in Statistical Methods for Human Gene Mapping
Statistical approaches used for gene mapping can be divided into two types: linkage analysis and association analysis. This dissertation addresses statistical methods in both areas. In the area of linkage analysis, I consider the problem of QTL (Quantitative Trait Locus) linkage analysis. Linkage analysis requires family data, and if the families are selected according to phenotype or if the trait of interest has a non-Gaussian distribution, standard analysis methods may be inappropriate. The score statistic, derived by taking the first derivative of the likelihood with respect to the linkage parameter, maintains the power of likelihood-based methods and, with the use of an empirical variance estimator, is robust against non-normal traits and selected samples. I investigate a number of empirical variance estimators that can be used for general pedigrees and evaluate the effects of different variance estimators and trait parameter estimates on the power of the score statistic. In the area of association analysis, I consider which model is best for a simple genome-scan analysis of a case-control study. In a case-control genome-wide association study, hundreds of thousands of SNPs are genotyped, and statistical analysis usually starts with a 1- or 2-df chi-squared test or a logistic regression model. Power comparisons among subsets of these methods have been published, but none has comprehensively addressed which method is best for univariate scanning in a genome scan. I compare different test procedures and regression models for case-control studies, starting from single-locus analysis, followed by scanning with covariates, and then genome-wide analysis. Based on the simulation results, I offer guidelines for choosing robust test procedures or regression models for testing the genetic effect. The methods proposed here can be used to improve the efficiency of gene mapping studies. This will lead to quicker and more reliable discoveries of genetic risk factors for many diseases of great public health importance, which should in turn lead to improved prevention and treatment strategies.
Archetypes for histogram-valued data
The main innovative contribution of this work is to propose an extension of archetypal analysis to histogram-valued data. As for the methodological framework for analysing histogram-valued data, which are complex in nature, the work draws on the insights of Symbolic Data Analysis (SDA) and on the intrinsic relationships between interval-valued and histogram-valued data. After discussing the technique, developed in the Matlab environment, together with its operation and its properties on a toy example, the technique is proposed in the application section as a tool for quantitative benchmarking analysis. Specifically, we present the main results obtained by applying archetypes for histogram-valued data to a case of internal benchmarking of the school system, using data from the INVALSI test for the 2015/2016 school year. In this context the unit of analysis is the individual school, operationally defined through the distributions of its pupils' scores, evaluated jointly as histogram-valued symbolic objects.
Linkage mapping for complex traits : a regression-based approach
Linkage analysis makes use of genetic markers to measure genetic similarity between relatives. By comparing this index of genetic similarity with phenotypic similarity, we can identify chromosomal regions harbouring genes involved in the architecture of a phenotype of interest. Although linkage has been very successful in discovering genes responsible for simple Mendelian diseases, results have often been disappointing in gene mapping for complex traits. This thesis presents some attempts to improve the current design and analysis of linkage studies for complex traits. The statistical methodology adopted is driven by the fact that genes involved in complex traits have small effects; it therefore seems legitimate to use score tests because of their local optimality properties. In addition, score tests often give rise to tractable expressions; in the context of linkage, these can be meaningfully interpreted in terms of regressions and computed quickly, which is a crucial feature in genetics. Funding: Fonds Medische Statistiek; the GenomEUtwin project (European Union Contract No QLG2-CT-2002-01254).
Study of the factors controlling the efficiency of genomic selection in oil palm (Elaeis guineensis Jacq.)
Agricultural production must increase at an unprecedented rate to meet the strong growth expected in food demand. Genomic selection (GS) could contribute to reaching this goal by allowing individuals to be selected on their genotype alone, making breeding more efficient. Breeding for yield in oil palm, the world's leading oil crop, is currently based on hybrid production by reciprocal recurrent selection. Integrating GS into this scheme would have major repercussions. This thesis aims to assess the potential of GS to predict hybrid combining abilities in parental populations (Deli and group B). Data from the last breeding cycle were used to obtain the first empirical estimate of GS accuracy. Despite the small populations available to calibrate the genomic model, the study showed that with candidates related to the training population (sibs, progenies), the accuracy was sufficient to make a pre-selection in group B on some yield components. In addition, simulations over four generations showed that the accuracy of several GS strategies (especially when training the model only in the first generation using hybrid genotypes) was high enough among non-progeny-tested individuals to allow selection among them on their genotype alone. This resulted in an increase of more than 50% in annual genetic gain compared to traditional breeding. A faster increase in inbreeding was also demonstrated, but this could be limited by conventional methods of inbreeding management. Finally, the experimental and simulated data indicated that GS could reduce the average generation interval and increase the selection intensity, considerably speeding up genetic progress for oil palm yield. A reciprocal recurrent genomic selection scheme was suggested for oil palm. Its application requires experimental confirmation of the simulations, by estimating GS accuracy over several generations without retraining the model. 
Future research should use newer, potentially more effective GS models (for example, models that take into account non-additive effects or a priori information on marker effects).
Machine learning methods for quantitative structure-property relationship modeling
Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2014. Due to the high rate at which new compounds are discovered each day and the slowness and cost of experimental measurements, there will always be a significant gap between the number of known chemical compounds and the number of compounds for which experimental properties are available. This research is motivated by the growing need for new methods that predict properties and organize huge collections of molecules, so as to reveal chemical categories and patterns and to select diverse, representative samples for exploratory experiments. This work aims to increase the capability to predict physical, chemical and biological properties, using data mining methods applied to complex non-homogeneous data (chemical structures), for large information repositories. In the first phase of this work, current methodologies in quantitative structure-property modelling were studied. These methodologies attempt to relate a set of selected structure-derived features of a compound to its property using model-based learning. This work focused on solving major issues identified when predicting properties of chemical compounds and on the solutions explored using different molecular representations, feature selection techniques and data mining approaches. In this context, an innovative hybrid approach was proposed in order to improve the predictive power and comprehensibility of QSPR/QSAR models using Random Forests for feature selection. It is acknowledged that, in general, similar molecules tend to have similar properties; therefore, in the second phase of this work, an instance-based machine learning methodology for predicting properties of compounds using the similarity-based molecular space was developed. 
However, this type of methodology requires the quantification of structural similarity between molecules, which is often subjective, ambiguous and reliant upon comparative judgements; consequently, there is currently no absolute standard of molecular similarity. In this context, a new similarity method was developed, Non-Contiguous Atom Matching Structural Similarity (NAMS), based on optimal atom alignment using pairwise matching algorithms that take into account both topological profiles and atom and bond characteristics. The molecular metric space built with NAMS can then be used for property inference using ordinary kriging, a spatial interpolation technique that takes into account the spatial relationship between instances, in order to obtain robust and interpretable predictive results and a better understanding of the underlying structure-property relationship. Funding: Fundação para a Ciência e a Tecnologia (FCT).