8 research outputs found

    Practical Aspects of Data Mining Using LISp-Miner

    This paper describes practical aspects of using LISp-Miner for data mining. LISp-Miner is a software tool under development at the University of Economics, Prague. We review the types of knowledge patterns the system discovers and discuss their applicability to various data mining tasks. We also compare LISp-Miner 18.16 with Weka 3.6.9 and RapidMiner 5.3.

    Topics in Statistical Methods for Human Gene Mapping

    Statistical approaches used for gene mapping can be divided into two types: linkage analysis and association analysis. This dissertation addresses statistical methods in both areas. In the area of linkage analysis, I consider the problem of QTL (Quantitative Trait Locus) linkage analysis. Linkage analysis requires family data, and if the families are selected according to phenotype, or if the trait of interest has a non-Gaussian distribution, standard analysis methods may be inappropriate. The score statistic, derived by taking the first derivative of the likelihood with respect to the linkage parameter, maintains the power of likelihood-based methods and, with the use of an empirical variance estimator, is robust against non-normal traits and selected samples. I investigate a number of empirical variance estimators that can be used for general pedigrees and evaluate the effects of different variance estimators and trait parameter estimates on the power of the score statistic. In the area of association analysis, I consider which model is best for a simple genome-scan analysis of a case-control study. In a case-control genome-wide association study, hundreds of thousands of SNPs are genotyped, and statistical analysis usually starts with a 1 or 2 df chi-squared test or a logistic regression model. Power comparisons among subsets of these methods have been done, but none of these studies has comprehensively tackled the question of which method is best for univariate scanning in a genome scan. I compare different test procedures and regression models for case-control studies, starting from single-locus analysis, followed by scanning with covariates and then genome-wide analysis. Based on the simulation results, I offer guidelines for choosing robust test procedures or regression models for testing the genetic effect. The methods proposed here can be used to improve the efficiency of gene mapping studies. This will lead to quicker and more reliable discoveries of genetic risk factors for many diseases of great public health importance, which should in turn lead to improved prevention and treatment strategies.
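As a rough illustration of the "1 or 2 df chi-squared test" mentioned in this abstract, the sketch below computes a 2-df genotypic test and a 1-df allelic test for a single SNP. It is a minimal sketch under stated assumptions, not the dissertation's procedure: the function names and the genotype counts are hypothetical, and the p-values rely on the closed-form chi-squared tail probabilities for 1 and 2 degrees of freedom.

```python
import math

def chi2_stat(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows.
    Assumes every row and column total is positive."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_pvalue(stat, df):
    """Upper-tail p-value using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")

def genotypic_test(cases, controls):
    """2-df test on genotype counts ordered as (AA, Aa, aa)."""
    stat = chi2_stat([list(cases), list(controls)])
    return stat, chi2_pvalue(stat, 2)

def allelic_test(cases, controls):
    """1-df test on allele counts derived from the genotype counts."""
    def allele_counts(genotypes):
        n_AA, n_Aa, n_aa = genotypes
        return [2 * n_AA + n_Aa, n_Aa + 2 * n_aa]
    stat = chi2_stat([allele_counts(cases), allele_counts(controls)])
    return stat, chi2_pvalue(stat, 1)

# Hypothetical genotype counts (AA, Aa, aa) for one SNP.
cases = (30, 50, 20)
controls = (20, 50, 30)
geno_stat, geno_p = genotypic_test(cases, controls)
allele_stat, allele_p = allelic_test(cases, controls)
print(f"genotypic: chi2={geno_stat:.3f}, p={geno_p:.4f}")
print(f"allelic:   chi2={allele_stat:.3f}, p={allele_p:.4f}")
```

With these made-up counts the two statistics happen to coincide, but the 1-df test yields the smaller p-value, which hints at why the choice of test matters when scanning many SNPs.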

    Archetypes for histogram-valued data

    The main innovative contribution of this work is an extension of archetypal analysis to histogram-valued data. As for the methodological framework for analysing histogram-valued data, which are complex in nature, the work draws on the insights of "Symbolic Data Analysis" (SDA) and on the intrinsic relationships between interval-valued and histogram-valued data. After discussing the technique, developed in the Matlab environment, together with its operation and its properties on a toy example, the technique is proposed in the application section as a tool for quantitative benchmarking analysis. Specifically, we present the main results obtained by applying archetypes for histogram-valued data to a case of internal benchmarking of the school system, using data from the INVALSI test for the 2015/2016 school year. In this context the unit of analysis is the individual school, operationally defined through the distributions of its pupils' scores, evaluated jointly as histogram-valued symbolic objects.

    Linkage mapping for complex traits : a regression-based approach

    Linkage analysis makes use of genetic markers to measure genetic similarity between relatives. By comparing this index of genetic similarity with phenotypic similarity, we can identify chromosomal regions harbouring genes involved in the architecture of a phenotype of interest. Although linkage has been very successful in discovering genes responsible for simple Mendelian diseases, results have often been disappointing in gene mapping for complex traits. This thesis presents some attempts to improve the current design and analysis of linkage studies for complex traits. The statistical methodology adopted is driven by the fact that genes involved in complex traits have small effects; it therefore seems legitimate to use score tests because of their local optimality properties. In addition, score tests often give rise to tractable expressions. In the context of linkage, these can be meaningfully interpreted in terms of regressions and quickly computed, which is a crucial feature in genetics. Fonds Medische Statistiek - The Genomeutwin project (European Union Contract No QLG2-CT-2002-01254) UBL - phd migration 201

    Study of the factors controlling the efficiency of genomic selection in oil palm (Elaeis guineensis Jacq.)

    Agricultural production must increase at an unprecedented rate to meet the strong growth expected in food demand. Genomic selection (GS) could contribute to reaching this goal by allowing individuals to be selected on their genotype alone, making breeding more efficient. Breeding for yield in oil palm, the world's leading oil crop, is currently based on hybrid production by reciprocal recurrent selection. Integrating GS into this scheme would have major repercussions. This thesis aims to assess the potential of GS to predict hybrid combining abilities in the parental populations (Deli and group B). Data from the last breeding cycle were used to obtain the first empirical estimate of GS accuracy. Despite the small populations available to calibrate the genomic model, the study showed that with candidates related to the training population (sibs, progenies), the accuracy was sufficient to make a pre-selection in group B on some yield components. In addition, simulations over four generations showed that the accuracy of several GS strategies (especially when training the model only in the first generation using hybrid genotypes) was high enough among non-progeny-tested individuals to allow selecting among them on their genotype. This resulted in an increase of more than 50% in annual genetic gain compared with traditional breeding. A faster increase in inbreeding was also demonstrated, but this could be limited by conventional methods of inbreeding management. Finally, the experimental and simulated data indicated that GS could reduce the average generation interval and increase the selection intensity, considerably speeding up genetic progress for oil palm yield. A reciprocal recurrent genomic selection scheme is suggested for oil palm. Its application requires experimental confirmation of the simulations, by estimating GS accuracy over several generations without retraining the model. Future research should use new GS models that are potentially more effective (taking into account non-additive effects or a priori information on marker effects, etc.).

    Machine learning methods for quantitative structure-property relationship modeling

    Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2014. Due to the high rate at which new compounds are discovered each day and the slowness and cost of experimental measurements, there will always be a significant gap between the number of known chemical compounds and the number of compounds for which experimental properties are available. This research is motivated by the fact that new methods for predicting properties, for organizing huge collections of molecules to reveal chemical categories and patterns, and for selecting diverse, representative samples for exploratory experiments are becoming essential. This work aims to increase the capability to predict physical, chemical and biological properties, using data mining methods applied to complex non-homogeneous data (chemical structures), for large information repositories. In the first phase of this work, current methodologies in quantitative structure-property modelling were studied. These methodologies attempt to relate a set of selected structure-derived features of a compound to its property using model-based learning. This work focused on solving major issues identified when predicting properties of chemical compounds and on the solutions explored using different molecular representations, feature selection techniques and data mining approaches. In this context, an innovative hybrid approach was proposed to improve the predictive power and comprehensibility of QSPR/QSAR problems using Random Forests for feature selection. It is acknowledged that, in general, similar molecules tend to have similar properties; therefore, in the second phase of this work, an instance-based machine learning methodology was developed for predicting properties of compounds using the similarity-based molecular space. However, this type of methodology requires quantifying the structural similarity between molecules, which is often subjective and ambiguous and relies upon comparative judgements; consequently, there is currently no absolute standard of molecular similarity. In this context, a new similarity method was developed, Non-Contiguous Atom Matching Structural Similarity (NAMS), based on optimal atom alignment using pairwise matching algorithms that take into account both topological profiles and atom and bond characteristics. NAMS can then be used for property inference over the molecular metric space using ordinary kriging, in order to obtain robust and interpretable predictions and a better understanding of the underlying structure-property relationship. Fundação para a Ciência e a Tecnologia (FCT