512 research outputs found

    Discovering novelty in sequential patterns: application for analysis of microarray data on Alzheimer disease

    Analyzing microarray data remains a great challenge because existing methods produce huge numbers of useless results. We propose a new method called NoDisco for discovering novelties in gene sequences obtained by applying data-mining techniques to microarray data. Method: We identify popular genes, which are often cited in the literature, and innovative genes, which are linked to the popular genes in the sequences but are not mentioned in the literature. We also identify popular and innovative sequences containing these genes. Biologists can thus select interesting sequences from the two sets and obtain the k-best documents. Results: We show the efficiency of this method by applying it to real data used to decipher the mechanisms underlying Alzheimer disease. Conclusion: The first selection of sequences based on popularity and innovation helps experts focus on relevant sequences, while the top-k documents help them understand the sequences
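
    The popular/innovative distinction described in this abstract can be illustrated with a minimal sketch. It is not the NoDisco implementation: the function name, the `popular_min` cutoff, and the toy gene names and citation counts are all assumptions made for illustration, and each sequence is simplified to a flat list of gene names.

```python
def classify_genes(sequences, citation_counts, popular_min=10):
    """Split genes into 'popular' and 'innovative' sets.

    popular_min is an assumed cutoff, not a value taken from the paper.
    """
    all_genes = {gene for seq in sequences for gene in seq}
    # Popular: cited in the literature at least popular_min times.
    popular = {g for g in all_genes if citation_counts.get(g, 0) >= popular_min}
    innovative = set()
    for seq in sequences:
        genes = set(seq)
        # Innovative: never cited, but co-occurring with a popular gene.
        if genes & popular:
            innovative |= {g for g in genes if citation_counts.get(g, 0) == 0}
    return popular, innovative


# Hypothetical toy input: gene names and citation counts are made up.
sequences = [["geneA", "geneX"], ["geneB", "geneY", "geneA"], ["geneZ"]]
citation_counts = {"geneA": 500, "geneB": 120, "geneY": 3}
popular, innovative = classify_genes(sequences, citation_counts)
```

    In this toy input, `geneZ` is uncited but never appears alongside a popular gene, so it is not flagged as innovative.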

    Discovering Novelty in Gene Data : From Sequential Patterns to Visualization

    Data mining techniques allow users to discover novelty in huge amounts of data. Frequent pattern methods have proved to be efficient, but the extracted patterns are often too numerous and thus difficult for end-users to analyse. In this paper, we focus on sequential pattern mining and propose a new visualization system, which aims to help end-users analyse the extracted knowledge and highlight novelty with respect to referenced biological document databases. Our system is based on two visualization techniques: clouds and solar systems. We show that these techniques are very helpful for identifying associations and hierarchical relationships between patterns among related documents. Sequential patterns extracted from gene data using our system were successfully evaluated by two biology laboratories working on Alzheimer's disease and cancer

    Discovering Temporal Associations among Significant Changes in Gene Expression

    One of the most demanding problems in mining temporal data is to identify how multivariate change associations might be discovered and used to better understand data interactions and dependencies. This paper introduces a framework to mine associations among significant changes in multivariate time-series data. Building on statistical methods, we detect significant changes in time-series data and use marginal change rates to qualify the direction of change at significant change points. Furthermore, a propositional confirmation-guided rule discovery method is used to discover associations among these significant changes. We apply our approach to gene expression data measured in yeast cell cycles and demonstrate that our method can learn novel and high-quality significant change associations among different genes. Such associations can be used to cluster genes and build gene interaction networks
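
    As a rough illustration of the first step in this abstract (detecting significant changes and qualifying their direction), the sketch below flags points where the step-to-step difference deviates strongly from the average difference. This is a swapped-in simplification, not the paper's statistical method: the z-score style test, the threshold `k`, the use of the sign of the difference as the direction, and the toy series are all assumptions.

```python
from statistics import mean, pstdev


def significant_changes(series, k=1.5):
    """Return (index, direction) for steps whose change is unusually large.

    k is an assumed threshold in standard deviations of the differences.
    """
    if len(series) < 2:
        return []
    diffs = [b - a for a, b in zip(series, series[1:])]
    mu, sigma = mean(diffs), pstdev(diffs)
    changes = []
    for i, d in enumerate(diffs, start=1):
        # i is the index of the later point of each consecutive pair.
        if sigma > 0 and abs(d - mu) > k * sigma:
            changes.append((i, "up" if d > 0 else "down"))
    return changes


# Hypothetical expression levels for one gene over eight time points.
expression = [1.0, 1.1, 1.0, 1.2, 5.0, 5.1, 5.0, 1.0]
changes = significant_changes(expression)
```

    Running the same function over several genes would give, per gene, a list of directed change points that a rule-discovery step could then relate to one another.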

    Fouille de données de santé (Health Data Mining)

    In the health domain, data analysis techniques are increasingly popular and are even proving indispensable for managing the large volumes of data produced for a patient and by the patient. Two themes are addressed in this HDR (habilitation à diriger des recherches) presentation. The first concerns the definition, formalization, implementation, and validation of analysis methods for describing the content of medical databases. I was particularly interested in sequential data. I extended the classical notion of sequential pattern to incorporate contextual and spatial components, as well as the partial order of the elements that make up the patterns. This new information enriches the initial semantics of these patterns. The second theme focuses on analyzing patients' productions and interactions through social media. I mainly worked on methods for analyzing patients' narrative productions according to their temporalities, their topics, the associated sentiments, or the role and reputation of the speaker in the messages

    Data Mining Using the Crossing Minimization Paradigm

    Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, as noise is introduced from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types, i.e., supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, involving the simultaneous clustering of the records and the attributes. In this dissertation, a novel, fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has been used in graph drawing and VLSI (Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy data, as well as real data from Agriculture, Biology and other domains. Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem of sparse matrices. The proposed CM technique is demonstrated to provide very convincing results while attempting to solve the said problems using real public domain data.
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis

    Functional Analysis of Human Long Non-coding RNAs and Their Associations with Diseases

    Within this study, we sought to leverage knowledge from well-characterized protein-coding genes to characterize the lesser-known long non-coding RNA (lncRNA) genes using computational methods to find functional annotations and disease associations. Functional genome annotation is an essential step to a systems-level view of the human genome. With this knowledge, we can gain a deeper understanding of how humans develop and function, and a better understanding of human disease. LncRNAs are transcripts greater than 200 nucleotides, which do not code for proteins. LncRNAs have been found to regulate development, tissue and cell differentiation, and organ formation. Their dysregulation has been linked to several diseases including autism spectrum disorder (ASD) and cancer. While a great deal of research has been dedicated to protein-coding genes, the relatively recently discovered lncRNA genes have yet to be characterized. LncRNA function is tied closely to when and where they are expressed. Co-expression network analysis offers a means of functional annotation of uncharacterized genes through a guilt-by-association approach. We have constructed two co-expression networks using known disease-associated protein-coding genes and lncRNA genes. Through clustering of the networks, gene set enrichment analysis, and centrality measures, we found enrichment for disease association and functions as well as identified high-confidence lncRNA disease gene targets. We present a novel approach to the identification of disease state associations by demonstrating that genes associated with the same disease states share patterns that can be discerned from transcriptomes of healthy tissues. Using a machine learning algorithm, we built a model to classify ASD versus non-ASD genes using their expression profiles from healthy developing human brain tissues.
Feature selection during the model-building process also identified critical temporospatial points for the determination of ASD genes. We constructed a webserver tool for the prioritization of genes for ASD association. The webserver tool has a database containing prioritization and co-expression information for nearly every gene in the human genome
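
    The guilt-by-association idea in this abstract can be sketched as follows: an uncharacterized gene inherits the annotations of characterized genes whose expression profiles correlate strongly with its own. This is a minimal illustration under stated assumptions, not the study's pipeline; the Pearson cutoff `min_r`, the function names, the gene names, and the annotation labels are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def annotate_by_association(target_profile, known_profiles, known_labels, min_r=0.9):
    """Collect labels of known genes co-expressed with the target (r >= min_r).

    min_r is an assumed cutoff, not a value from the study.
    """
    labels = set()
    for gene, profile in known_profiles.items():
        if pearson(target_profile, profile) >= min_r:
            labels |= known_labels.get(gene, set())
    return labels


# Hypothetical expression profiles across four tissues or time points.
lncrna_profile = [1.0, 2.0, 3.0, 4.0]
known_profiles = {"pc_gene_1": [2.0, 4.0, 6.0, 8.0], "pc_gene_2": [4.0, 3.0, 2.0, 1.0]}
known_labels = {"pc_gene_1": {"function_A"}, "pc_gene_2": {"function_B"}}
inferred = annotate_by_association(lncrna_profile, known_profiles, known_labels)
```

    Here the hypothetical lncRNA tracks `pc_gene_1` and is anti-correlated with `pc_gene_2`, so only `function_A` is transferred.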

    A Review on the Role of Nano-Communication in Future Healthcare Systems: A Big Data Analytics Perspective

    This paper presents a first-time review of the open literature focused on the significance of big data generated within nano-sensors and nano-communication networks intended for future healthcare and biomedical applications. It is aimed towards the development of modern smart healthcare systems enabled with P4, i.e. predictive, preventive, personalized and participatory capabilities to perform diagnostics, monitoring, and treatment. The analytical capabilities that can be produced from the substantial amount of data gathered in such networks will aid in exploiting the practical intelligence and learning capabilities that could be further integrated with conventional medical and health data, leading to more efficient decision making. We also propose a big data analytics framework for gathering, from healthcare big data, the intelligence that futuristic smart healthcare requires to address relevant problems and exploit possible opportunities in future applications. Finally, open challenges and future directions for researchers in the evolving healthcare domain are presented

    Unique networks: a method to identity disease-specific regulatory networks from microarray data

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The survival of any organism is determined by the mechanisms triggered in response to the inputs received. Underlying mechanisms are described by graphical networks that can be inferred from different types of data such as microarrays. Deriving robust and reliable networks can be complicated due to the structure of microarray data, which is characterized by a discrepancy of several orders of magnitude between the number of genes and the number of samples, as well as by bias and noise. Researchers overcome this problem by integrating independent data together and deriving the common mechanisms through consensus network analysis. Different conditions generate different inputs to the organism, which reacts by triggering different mechanisms with similarities and differences. A lot of effort has been spent on identifying the commonalities under different conditions. Highlighting similarities may overshadow the differences, which often identify the main characteristics of the triggered mechanisms. In this thesis we introduce the concept of study-specific mechanism. We develop a pipeline to semiautomatically identify study-specific networks, called unique-networks, through a combination of consensus approach, graphical similarities and network analysis. The main pipeline, called UNIP (Unique Networks Identification Pipeline), takes a set of independent studies, builds gene regulatory networks for each of them, calculates an adaptation of the sensitivity measure based on the networks' graphical similarities, applies clustering to group the studies that generate the most similar networks into study-clusters, and derives the consensus networks. Once each study-cluster is associated with a consensus-network, we identify the links that appear only in the consensus network under consideration but not in the others (unique-connections).
Considering the genes involved in the unique-connections, we build Bayesian networks to derive the unique-networks. Finally, we exploit the inference tool to calculate each gene's prediction accuracy across all studies to further refine the unique-networks. Biological validation through different software and the literature is explored to validate our method. UNIP is first applied to a set of synthetic data perturbed with different levels of noise to study the performance and verify its reliability. Then, wheat under stress conditions and different types of cancer are explored. Finally, we develop a user-friendly interface to combine the set of studies by using AND and NOT logic operators. Based on the findings, UNIP is a robust and reliable method to analyse large sets of transcriptomic data. It easily detects the main complex relationships between transcriptional expression of genes specific for different conditions and also highlights structures and nodes that could be potential targets for further research
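
    The unique-connections step described in this abstract (links that appear in one consensus network but in none of the others) amounts to a set difference over edge sets. The sketch below illustrates only that step, not UNIP itself; representing each consensus network as a set of directed `(regulator, target)` tuples, along with the function name and the toy networks, is an assumption.

```python
def unique_connections(consensus_networks):
    """Map each network name to the edges found in that network only.

    consensus_networks: dict of name -> set of (regulator, target) tuples
    (an assumed representation, chosen for illustration).
    """
    unique = {}
    for name, edges in consensus_networks.items():
        # Union of the edges of every other consensus network.
        others = set().union(
            *(e for other, e in consensus_networks.items() if other != name)
        )
        unique[name] = edges - others
    return unique


# Hypothetical consensus networks for two study-clusters.
networks = {
    "cluster_1": {("g1", "g2"), ("g2", "g3")},
    "cluster_2": {("g1", "g2"), ("g4", "g5")},
}
unique = unique_connections(networks)
```

    The shared edge `("g1", "g2")` is dropped from both results, leaving one study-specific link per cluster as the starting point for building the unique-networks.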