50 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    ΠžΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½Ρ‹ΠΉ ΠΏΠΎΠ΄Ρ…ΠΎΠ΄ ΠΊ Π²Ρ‹Π±ΠΎΡ€Ρƒ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… тСкстовых коллСкциях

    Get PDF
    The problem of detecting anomalous documents in text collections is considered. The existing methods for detecting anomalies are not universal and do not show a stable result on different data sets. The accuracy of the results depends on the choice of parameters at each step of the problem solving algorithm process, and for different collections different sets of parameters are optimal. Not all of the existing algorithms for detecting anomalies work effectively with text data, which vector representation is characterized by high dimensionality with strong sparsity. The problem of finding anomalies is considered in the following statement: it is necessary to checking a new document uploaded to an applied intelligent information system for congruence with a homogeneous collection of documents stored in it. In such systems that process legal documents the following limitations are imposed on the anomaly detection methods: high accuracy, computational efficiency, reproducibility of results and explicability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of evaluating text documents on the scale of anomaly by deliberately introducing a foreign document into the collection. A strategy for detecting novelty of the document in relation to the collection is proposed, which assumes a reasonable selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods and parameters of novelty detection algorithms. The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the remoteness of documents to the center of the collection and to the foreign document; optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms have been tested: Isolation Forest, Local Outlier Factor and One-Class SVM (based on Support Vector Machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate method for detecting anomalies for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolating Forest method is proved to be effective. When vectorizing documents using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.РассматриваСтся Π·Π°Π΄Π°Ρ‡Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½Ρ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² Π² тСкстовых коллСкциях. Π‘ΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΠ΅ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ выявлСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π΅ ΡƒΠ½ΠΈΠ²Π΅Ρ€ΡΠ°Π»ΡŒΠ½Ρ‹ ΠΈ Π½Π΅ ΠΏΠΎΠΊΠ°Π·Ρ‹Π²Π°ΡŽΡ‚ ΡΡ‚Π°Π±ΠΈΠ»ΡŒΠ½Ρ‹ΠΉ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ Π½Π° Ρ€Π°Π·Π½Ρ‹Ρ… Π½Π°Π±ΠΎΡ€Π°Ρ… Π΄Π°Π½Π½Ρ‹Ρ…. Π’ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ² зависит ΠΎΡ‚ Π²Ρ‹Π±ΠΎΡ€Π° ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π½Π° ΠΊΠ°ΠΆΠ΄ΠΎΠΌ ΠΈΠ· шагов Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠ°, ΠΈ для Ρ€Π°Π·Π½Ρ‹Ρ… ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹ Ρ€Π°Π·Π»ΠΈΡ‡Π½Ρ‹Π΅ Π½Π°Π±ΠΎΡ€Ρ‹ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². НС всС ΠΈΠ· ΡΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΡ… Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ эффСктивно Ρ€Π°Π±ΠΎΡ‚Π°ΡŽΡ‚ с тСкстовыми Π΄Π°Π½Π½Ρ‹ΠΌΠΈ, Π²Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ прСдставлСниС ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Ρ… характСризуСтся большой Ρ€Π°Π·ΠΌΠ΅Ρ€Π½ΠΎΡΡ‚ΡŒΡŽ ΠΏΡ€ΠΈ сильной разрСТСнности. Π—Π°Π΄Π°Ρ‡Π° поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ рассматриваСтся Π² ΡΠ»Π΅Π΄ΡƒΡŽΡ‰Π΅ΠΉ постановкС: трСбуСтся ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΈΡ‚ΡŒ Π½ΠΎΠ²Ρ‹ΠΉ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚, Π·Π°Π³Ρ€ΡƒΠΆΠ°Π΅ΠΌΡ‹ΠΉ Π² ΠΏΡ€ΠΈΠΊΠ»Π°Π΄Π½ΡƒΡŽ ΠΈΠ½Ρ‚Π΅Π»Π»Π΅ΠΊΡ‚ΡƒΠ°Π»ΡŒΠ½ΡƒΡŽ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½ΡƒΡŽ систСму (ПИИБ), Π½Π° соотвСтствиС хранящСйся Π² Π½Π΅ΠΉ ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΉ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ². Π’ ПИИБ, ΠΎΠ±Ρ€Π°Π±Π°Ρ‚Ρ‹Π²Π°ΡŽΡ‰ΠΈΡ… ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Π΅ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρ‹, Π½Π° ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π°ΠΊΠ»Π°Π΄Ρ‹Π²Π°ΡŽΡ‚ΡΡ ΡΠ»Π΅Π΄ΡƒΡŽΡ‰ΠΈΠ΅ ограничСния: высокая Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ, Π²Ρ‹Ρ‡ΠΈΡΠ»ΠΈΡ‚Π΅Π»ΡŒΠ½Π°Ρ ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ, Π²ΠΎΡΠΏΡ€ΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ², Π° Ρ‚Π°ΠΊΠΆΠ΅ ΠΎΠ±ΡŠΡΡΠ½ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ. Π˜ΡΡΠ»Π΅Π΄ΡƒΡŽΡ‚ΡΡ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹, ΡƒΠ΄ΠΎΠ²Π»Π΅Ρ‚Π²ΠΎΡ€ΡΡŽΡ‰ΠΈΠ΅ этим условиям. Π’ Ρ€Π°Π±ΠΎΡ‚Π΅ изучаСтся Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡ‚ΡŒ ΠΎΡ†Π΅Π½ΠΊΠΈ тСкстовых Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ шкалС Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΏΡƒΡ‚Π΅ΠΌ внСдрСния Π² ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΡŽ Π·Π°Π²Π΅Π΄ΠΎΠΌΠΎ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠ³ΠΎ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π°. ΠŸΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π° стратСгия обнаруТСния Π² Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π΅ Π½ΠΎΠ²ΠΈΠ·Π½Ρ‹ ΠΏΠΎ ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡŽ ΠΊ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ, ΠΏΡ€Π΅Π΄ΠΏΠΎΠ»Π°Π³Π°ΡŽΡ‰Π°Ρ обоснованный ΠΏΠΎΠ΄Π±ΠΎΡ€ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². Показано, ΠΊΠ°ΠΊ Π½Π° Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ влияСт Π²Ρ‹Π±ΠΎΡ€ Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΏΡ€ΠΈΠ½Ρ†ΠΈΠΏΠΎΠ² Ρ‚ΠΎΠΊΠ΅Π½ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² сниТСния размСрности ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ. ЭкспСримСнт ΠΏΡ€ΠΎΠ²Π΅Π΄Π΅Π½ Π½Π° Π΄Π²ΡƒΡ… ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… коллСкциях Π½ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΠ²Π½ΠΎ-тСхничСских Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ²: стандартов Π² ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΈ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½Ρ‹Ρ… Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΠΉ ΠΈ Π² сфСрС ΠΆΠ΅Π»Π΅Π·Π½Ρ‹Ρ… Π΄ΠΎΡ€ΠΎΠ³. Использовались ΠΏΠΎΠ΄Ρ…ΠΎΠ΄Ρ‹: вычислСниС индСкса Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΊΠ°ΠΊ расстояния Π₯Π΅Π»Π»ΠΈΠ½Π³Π΅Ρ€Π° ΠΌΠ΅ΠΆΠ΄Ρƒ распрСдСлСниями близости Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΊ Ρ†Π΅Π½Ρ‚Ρ€Ρƒ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ ΠΈ ΠΊ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΌΡƒ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρƒ; оптимизация Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² зависимости ΠΎΡ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ ΠΈ сниТСния размСрности. Π’Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ пространство ΡΡ‚Ρ€ΠΎΠΈΠ»ΠΎΡΡŒ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ прСобразования TF-IDF ΠΈ тСматичСского модСлирования ARTM. Π’Π΅ΡΡ‚ΠΈΡ€ΠΎΠ²Π°Π»ΠΈΡΡŒ Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΡ‹ Isolation Forest (ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰ΠΈΠΉ лСс), Local Outlier Factor (Π»ΠΎΠΊΠ°Π»ΡŒΠ½Ρ‹ΠΉ Ρ„Π°ΠΊΡ‚ΠΎΡ€ выброса), OneClass SVM (Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄Π° ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ²). ЭкспСримСнт ΠΏΠΎΠ΄Ρ‚Π²Π΅Ρ€Π΄ΠΈΠ» ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π½ΠΎΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½ΠΎΠΉ стратСгии для опрСдСлСния подходящСго ΠΌΠ΅Ρ‚ΠΎΠ΄Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ для Π·Π°Π΄Π°Π½Π½ΠΎΠΉ тСкстовой ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ. ΠŸΡ€ΠΈ поискС Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΈ Π² Ρ€Π°ΠΌΠΊΠ°Ρ… тСматичСской кластСризации ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² эффСктивСн ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰Π΅Π³ΠΎ лСса. ΠŸΡ€ΠΈ Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ TF-IDF цСлСсообразно ΠΏΠΎΠ΄ΠΎΠ±Ρ€Π°Ρ‚ΡŒ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹Π΅ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€Ρ‹ словаря ΠΈ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚ΡŒ ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ² с ΡΠΎΠΎΡ‚Π²Π΅Ρ‚ΡΡ‚Π²ΡƒΡŽΡ‰Π΅ΠΉ Ρ„ΡƒΠ½ΠΊΡ†ΠΈΠ΅ΠΉ прСобразования ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ²ΠΎΠ³ΠΎ пространства

    Bioinformatics applied to human genomics and proteomics: development of algorithms and methods for the discovery of molecular signatures derived from omic data and for the construction of co-expression and interaction networks

    Get PDF
    [EN] The present PhD dissertation develops and applies Bioinformatic methods and tools to address key current problems in the analysis of human omic data. This PhD has been organised by main objectives into four different chapters focused on: (i) development of an algorithm for the analysis of changes and heterogeneity in large-scale omic data; (ii) development of a method for non-parametric feature selection; (iii) integration and analysis of human protein-protein interaction networks and (iv) integration and analysis of human co-expression networks derived from tissue expression data and evolutionary profiles of proteins. In the first chapter, we developed and tested a new robust algorithm in R, called DECO, for the discovery of subgroups of features and samples within large-scale omic datasets, exploring all feature differences possible heterogeneity, through the integration of both data dispersion and predictor-response information in a new statistic parameter called h (heterogeneity score). In the second chapter, we present a simple non-parametric statistic to measure the cohesiveness of categorical variables along any quantitative variable, applicable to feature selection in all types of big data sets. In the third chapter, we describe an analysis of the human interactome integrating two global datasets from high-quality proteomics technologies: HuRI (a human protein-protein interaction network generated by a systematic experimental screening based on Yeast-Two-Hybrid technology) and Cell-Atlas (a comprehensive map of subcellular localization of human proteins generated by antibody imaging). This analysis aims to create a framework for the subcellular localization characterization supported by the human protein-protein interactome. In the fourth chapter, we developed a full integration of three high-quality proteome-wide resources (Human Protein Atlas, OMA and TimeTree) to generate a robust human co-expression network across tissues assigning each human protein along the evolutionary timeline. In this way, we investigate how old in evolution and how correlated are the different human proteins, and we place all them in a common interaction network. As main general comment, all the work presented in this PhD uses and develops a wide variety of bioinformatic and statistical tools for the analysis, integration and enlighten of molecular signatures and biological networks using human omic data. Most of this data corresponds to sample cohorts generated in recent biomedical studies on specific human diseases
    corecore