115 research outputs found

    Mining Quantitative Association Rules in Microarray Data Using Evolutive Algorithms

    The microarray technique can monitor changes in RNA concentration across thousands of genes simultaneously. Interest in this technique has grown exponentially in recent years, as have the difficulties in analyzing data from such experiments, which are characterized by a high number of genes relative to the low number of experiments or samples available. In this paper we show the results of applying a data mining method based on quantitative association rules to microarray data. These rules work with intervals on the attributes, without discretizing the data beforehand. The rules are generated by an evolutionary algorithm. Funding: Ministerio de Ciencia y Tecnología TIN2007-68084-C-00; Junta de Andalucía P07-TIC-0261.
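A minimal sketch of what an interval-based (quantitative) association rule looks like and how its support and confidence can be measured. It illustrates the rule format only, not the evolutionary search that the paper uses to find the intervals. The gene names, expression values, and the helper names `matches` and `support_confidence` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Record = Dict[str, float]


def matches(record: Record, conditions: Dict[str, Interval]) -> bool:
    """True if every attribute in `conditions` lies inside its closed interval."""
    return all(lo <= record[attr] <= hi for attr, (lo, hi) in conditions.items())


def support_confidence(
    records: List[Record],
    antecedent: Dict[str, Interval],
    consequent: Dict[str, Interval],
) -> Tuple[float, float]:
    """Support and confidence of the rule `antecedent => consequent`."""
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical expression levels for two made-up genes across five samples.
samples = [
    {"geneA": 0.9, "geneB": 2.1},
    {"geneA": 1.2, "geneB": 2.4},
    {"geneA": 0.2, "geneB": 0.5},
    {"geneA": 1.0, "geneB": 0.7},
    {"geneA": 0.1, "geneB": 2.2},
]

# Rule: geneA in [0.8, 1.5]  =>  geneB in [2.0, 3.0]
rule_support, rule_confidence = support_confidence(
    samples,
    antecedent={"geneA": (0.8, 1.5)},
    consequent={"geneB": (2.0, 3.0)},
)
```

Because the rule is evaluated directly on the raw values, no prior discretization step is needed. A search procedure such as an evolutionary algorithm would only have to propose the interval bounds.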

    Mining quantitative association rules based on evolutionary computation and its application to atmospheric pollution

    This research presents the mining of quantitative association rules based on evolutionary computation techniques. First, a real-coded genetic algorithm that extends the well-known binary-coded CHC algorithm has been designed to determine the intervals that define the rules without needing to discretize the attributes. The proposed algorithm is evaluated on synthetic datasets under different levels of noise in order to test its performance, and the reported results are then compared with those of a recently published multi-objective differential evolution algorithm. Furthermore, rules have been discovered from real-world time series of temperature, humidity, wind speed and direction, ozone, nitrogen monoxide and sulfur dioxide, with the objective of finding all existing relations between atmospheric pollution and climatological conditions. Funding: Ministerio de Ciencia y Tecnología TIN2007-68084-C-00; Junta de Andalucía P07-TIC-0261.

    Managing Requirement Volatility in an Ontology-Driven Clinical LIMS Using Category Theory. International Journal of Telemedicine and Applications

    Requirement volatility is an issue in software engineering in general, and in Web-based clinical applications in particular, and it often originates from incomplete knowledge of the domain of interest. With advances in the health sciences, many features and functionalities need to be added to, or removed from, existing software applications in the biomedical domain. At the same time, the increasing complexity of biomedical systems makes them more difficult to understand, and consequently it is more difficult to define their requirements, which contributes considerably to their volatility. In this paper, we present a novel agent-based approach for analyzing and managing volatile and dynamic requirements in an ontology-driven laboratory information management system (LIMS) designed for Web-based case reporting in medical mycology. The proposed framework is empowered with ontologies and formalized using category theory to provide a deep and common understanding of the functional and nonfunctional requirement hierarchies and their interrelations, and to trace the effects of a change on the conceptual framework. Comment: 36 pages, 16 figures.

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues in all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Models built on incorrect or incomplete data will be useless. As a consequence, the quality of decisions based on these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. This paper presents a survey of the most popular pre-processing steps required in environmental data analysis, together with a proposal to systematize them. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to non-expert users, who, after reading them, can decide which technique is the most suitable for their problem. Peer reviewed. Postprint (author's final draft).

    Graphical Model approaches for Biclustering

    In many scientific areas, it is crucial to group (cluster) a set of objects based on a set of observed features. This operation is widely known as clustering, and it has been exploited in very different scenarios, ranging from Economics to Biology and Psychology. Going a step further, there are contexts where it is crucial to group objects and simultaneously identify the features that distinguish those objects from the others. In gene expression analysis, for instance, the identification of subsets of genes showing a coherent pattern of expression in subsets of objects/samples can provide crucial information about active biological processes. Such information, which cannot be retrieved by classical clustering approaches, can be extracted with so-called biclustering, a class of approaches that aim to cluster both rows and columns of a given data matrix simultaneously (where each row corresponds to a different object/sample and each column to a different feature). Biclustering, also known as co-clustering, has recently been exploited in a wide range of scenarios such as Bioinformatics, market segmentation, data mining, text analysis and recommender systems. Many approaches have been proposed to address the biclustering problem, each one characterized by different properties such as interpretability, effectiveness or computational complexity. A recent trend involves the exploitation of sophisticated computational models (Graphical Models) to face the intrinsic complexity of biclustering and to retrieve very accurate solutions. Graphical Models represent the decomposition of a global objective function into a set of smaller, local functions, each defined over a subset of variables.
The advantage of using Graphical Models lies in the fact that the graphical representation can highlight useful hidden properties of the objective function considered and, in addition, the smaller local problems can be analyzed with less computational effort. Because a representative and solvable model is difficult to obtain, and because biclustering is a complex and challenging problem, there are few promising approaches in the literature that use Graphical Models for biclustering. This thesis is set in the above-mentioned scenario and investigates the exploitation of Graphical Models to face the biclustering problem. We explored different types of Graphical Models, in particular Factor Graphs and Bayesian Networks. We present three novel algorithms (with extensions) and evaluate them using available benchmark datasets. All the models have been compared with state-of-the-art competitors, and the results show that Factor Graph approaches lead to solid and efficient solutions for datasets of limited size, whereas Bayesian Networks can manage huge datasets, with the drawback that setting the parameters can be nontrivial. As another contribution of the thesis, we widen the range of biclustering applications by studying the suitability of these approaches for some Computer Vision problems where biclustering has never been adopted before. In summary, this thesis provides evidence that Graphical Model techniques can have a significant impact in the biclustering scenario. Moreover, we demonstrate that biclustering techniques are flexible and can produce effective solutions in very different fields of application.
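To make the notion of a bicluster concrete, the following is a minimal sketch that scores a candidate bicluster (a subset of rows together with a subset of columns) for coherence. It uses the mean squared residue, a classical coherence score swapped in here purely for illustration. It is not one of the Factor Graph or Bayesian Network algorithms of the thesis. The data matrix and the function name `mean_squared_residue` are hypothetical.

```python
from typing import List, Sequence


def mean_squared_residue(
    matrix: List[List[float]], rows: Sequence[int], cols: Sequence[int]
) -> float:
    """Mean squared residue of the submatrix selected by `rows` and `cols`.

    A value of 0 means the submatrix follows a perfectly additive pattern
    (every entry equals a row effect plus a column effect).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Hypothetical data matrix: rows are objects/samples, columns are features.
data = [
    [5.0, 1.0, 9.0, 2.0],
    [0.0, 7.0, 3.0, 1.0],
    [2.0, 4.0, 8.0, 5.0],
]

# Rows {0, 2} and columns {1, 3} select [[1, 2], [4, 5]], an additive pattern.
coherent_score = mean_squared_residue(data, rows=[0, 2], cols=[1, 3])

# The full matrix has no such shared pattern, so its score is larger.
full_score = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2, 3])
```

The selected rows and columns are coherent with each other even though neither the rows nor the columns would cluster together over the full matrix, which is the kind of structure classical clustering misses.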

    Evaluating the impact of horizontally acquired genes on the metabolism of nonconventional yeast lineage

    The Wickerhamiella/Starmerella (W/S) clade is a group of non-conventional yeasts with atypical metabolic functions that enable the colonization of specialized niches, such as flowers or the gut of insects that visit flowers. Comparative genomics demonstrates that one of the main forces driving metabolic evolution in the W/S clade is the unusually high frequency of horizontally acquired genes, some of which have been studied and characterized in depth. Yet a high proportion of acquired genes with unknown impact on W/S-clade metabolism remains. This work aimed to advance the current understanding of metabolic evolution in the W/S clade by analyzing the complete transcriptomes of W. versatilis, W. domercqiae and S. bombicola grown in two different conditions. Comparative transcriptomic analyses between native and acquired genes across the three species were performed in order to provide a first high-throughput evaluation of the impact of the acquired genes on host metabolism. Quantitative levels of gene expression and patterns of differential expression were studied and analyzed together with functional annotation and the role of the acquired genes, evaluating, e.g., whether they enabled a function normally absent in yeasts. The results indicate that, depending on the species analyzed, the expression levels of acquired genes can be either similar to or significantly lower than those of native genes. Yet, in all instances, an important proportion of these genes is actively regulated. Expressed acquired genes tend to be fixed by replacing pre-existing genes in the genomes, which were often involved in the assimilation of carbon and nitrogen from minority resources.
Whole-transcriptome analysis is a tool that perfectly complements current knowledge of whole-genome evolution in the W/S clade, especially in understanding the evolutionary impact of horizontal gene transfer events in these yeasts.
Portuguese abstract (translated): The Wickerhamiella/Starmerella (W/S) clade is a group of non-conventional yeasts with atypical metabolic functions that allow the colonization of specialized niches, such as flowers or the gut of flower-visiting insects. Comparative genomics shows that one of the main forces driving metabolic evolution in the W/S clade is the high frequency of horizontally acquired genes, some of which have already been characterized in detail. However, a high proportion of horizontally acquired genes with unknown impact on W/S-clade metabolism remains. This work aimed to advance knowledge of metabolic evolution in the W/S clade by analyzing the complete transcriptomes of W. versatilis, W. domercqiae and S. bombicola grown under two different conditions. Comparative transcriptomic analyses between native and acquired genes in the three species were carried out to provide a first assessment of the impact of the set of acquired genes on host metabolism. Quantitative levels of gene expression and patterns of differential expression were studied and analyzed together with the functional annotation and evolutionary context of the acquired genes. The results indicate that, depending on the species analyzed, the expression levels of acquired genes can be similar to or significantly lower than those of native genes. However, in all cases, an important proportion of these genes is actively regulated. Acquired genes that are expressed often replace pre-existing genes in the genomes, frequently ones related to the assimilation of less common sources of carbon and nitrogen. Whole-transcriptome analysis is a tool that perfectly complements current knowledge of genome evolution in the W/S clade, especially with regard to understanding the evolutionary impact of horizontal gene transfer events in these yeasts.

    Genetic association analysis of complex diseases through information theoretic metrics and linear pleiotropy

    The main goal of this thesis was to help identify the genetic variants responsible for complex traits, combining both linear and nonlinear approaches. First, two one-locus approaches were proposed. The first defined and characterized a novel nonlinear test of genetic association based on the mutual information measure. This test takes into account the genetic structure of the population. It was applied to the GAW17 dataset and compared with the standard linear test of association. Since the solution of the GAW17 simulation model was known, this study served to characterize the performance of the proposed nonlinear method in comparison with the linear one. The proposed nonlinear test recovered the results obtained with linear methods and also detected an additional SNP in a gene related to the phenotype. In addition, the two tests performed similarly in terms of classification accuracy (AUC). The second approach, in contrast, was an exploratory study of the relationship between SNP variability among species and SNP association with disease in different genetic regions. Two sets of SNPs were compared, one containing deleterious SNPs and the other consisting of neutral SNPs. Both sets were stratified by the region where the polymorphisms were located, a feature that may have influenced their conservation across species. It was observed that, for most functional regions, SNPs associated with diseases tend to be significantly less variable across species than neutral SNPs. Second, a novel nonlinear methodology for multi-loci genetic association was proposed with the goal of detecting associations between combinations of SNPs and a phenotype. The proposed method was based on the mutual information of statistical significance, called MISS. This approach was compared with MLR, the standard linear method for genetic association, which is based on multiple linear regression.
Both were applied as the relevance criterion of a new multi-solution floating feature selection algorithm (MSSFFS), proposed in the context of multi-loci genetic association for complex diseases. Both were also compared with MECPM, an algorithm that searches for predictive multi-loci interactions with a maximum entropy criterion. The three methods were tested on the SNPs of the F7 gene and the FVII levels in blood, with data from the GAIT project. The proposed nonlinear method (MISS) improved on the results of traditional genetic association methods, detecting new SNP-SNP interactions. Most of the SNP sets obtained were in concordance with functional results in the literature, where the SNPs have been described as functional elements correlated with the phenotype. Third, a linear methodological framework for the simultaneous study of several phenotypes was proposed. The methodology consisted of building new phenotypic variables, named metaphenotypes, that capture the joint activity of sets of phenotypes involved in a metabolic pathway. These new variables were used in further association tests with the aim of identifying genetic elements related to the underlying biological process as a whole. As a practical implementation, the methodology was applied to the GAIT project dataset with the aim of identifying genetic markers that could be related to the coagulation process as a whole and thus to thrombosis. Three mathematical models were used to define the metaphenotypes, corresponding to one PCA and two ICA models. Using this novel approach, already known associations were retrieved, and new candidates were also proposed as regulatory genes with a global effect on the coagulation pathway.
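The one-locus test in this abstract builds on the mutual information between a genetic marker and a phenotype. The following minimal sketch shows only the plain empirical mutual information for discrete data. It does not include the population-structure correction or the significance assessment that the thesis describes, and it is not the MISS method. The genotype and phenotype vectors and the function name `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # Sum over observed pairs of p(x, y) * log2( p(x, y) / (p(x) * p(y)) ).
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical SNP genotypes (minor-allele counts) and a binary phenotype.
genotypes = [0, 0, 1, 1]
phenotype_dependent = [0, 0, 1, 1]    # fully determined by the genotype
phenotype_independent = [0, 1, 0, 1]  # unrelated to the genotype

mi_dependent = mutual_information(genotypes, phenotype_dependent)
mi_independent = mutual_information(genotypes, phenotype_independent)
```

Unlike a linear regression coefficient, this quantity does not assume that the effect grows proportionally with allele count, which is why it can pick up nonlinear genotype-phenotype relationships. In practice the raw value would still need a significance assessment, for example by permutation, before being read as evidence of association.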

    Integrative Bioinformatics of Functional and Genomic Profiles for Cancer Systems Medicine

    Cancer is a leading cause of death worldwide and a major public health burden. Rapid advances in high-throughput techniques have made it possible to molecularly characterize large numbers of patient tumors, and large-scale genomic and functional profiles are routinely being generated. Such datasets hold immense potential to reveal novel genes driving cancer and biomarkers with prognostic value, and to identify promising targets for drug treatment. But the "big data" nature of these highly complex datasets requires the concurrent development of computational models and data analysis strategies that can mine useful knowledge and unlock the information latent in such datasets. This thesis presents computational and analytical approaches to extract potentially useful information by integrating genomic and functional profiles of cancer cells. Finnish abstract (translated): Cancer is a leading cause of death worldwide and a major public health burden. Thanks to advanced technology, we can now study cancer cells at the molecular level and produce vast amounts of data. Such data hold great potential for discovering new cancer-causing genes and identifying promising targets for cancer treatment. However, the "big data" nature of these highly complex data also requires the development of computational models and data analysis strategies in order to find usable information that could be beneficial in healthcare. This thesis presents computational and analytical ways to find potentially useful information by combining different molecular models of cancer cells, such as their genomic and functional profiles.