50 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
ΠΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΎΠ½Π½ΡΠΉ ΠΏΠΎΠ΄Ρ ΠΎΠ΄ ΠΊ Π²ΡΠ±ΠΎΡΡ ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ² ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² ΠΎΠ΄Π½ΠΎΡΠΎΠ΄Π½ΡΡ ΡΠ΅ΠΊΡΡΠΎΠ²ΡΡ ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΡΡ
The problem of detecting anomalous documents in text collections is considered. The existing methods for detecting anomalies are not universal and do not show a stable result on different data sets. The accuracy of the results depends on the choice of parameters at each step of the problem solving algorithm process, and for different collections different sets of parameters are optimal. Not all of the existing algorithms for detecting anomalies work effectively with text data, which vector representation is characterized by high dimensionality with strong sparsity. The problem of finding anomalies is considered in the following statement: it is necessary to checking a new document uploaded to an applied intelligent information system for congruence with a homogeneous collection of documents stored in it. In such systems that process legal documents the following limitations are imposed on the anomaly detection methods: high accuracy, computational efficiency, reproducibility of results and explicability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of evaluating text documents on the scale of anomaly by deliberately introducing a foreign document into the collection. A strategy for detecting novelty of the document in relation to the collection is proposed, which assumes a reasonable selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods and parameters of novelty detection algorithms. The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the remoteness of documents to the center of the collection and to the foreign document; optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms have been tested: Isolation Forest, Local Outlier Factor and One-Class SVM (based on Support Vector Machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate method for detecting anomalies for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolating Forest method is proved to be effective. When vectorizing documents using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.Π Π°ΡΡΠΌΠ°ΡΡΠΈΠ²Π°Π΅ΡΡΡ Π·Π°Π΄Π°ΡΠ° ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΡΠ½ΡΡ
Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ² Π² ΡΠ΅ΠΊΡΡΠΎΠ²ΡΡ
ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΡΡ
. Π‘ΡΡΠ΅ΡΡΠ²ΡΡΡΠΈΠ΅ ΠΌΠ΅ΡΠΎΠ΄Ρ Π²ΡΡΠ²Π»Π΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π΅ ΡΠ½ΠΈΠ²Π΅ΡΡΠ°Π»ΡΠ½Ρ ΠΈ Π½Π΅ ΠΏΠΎΠΊΠ°Π·ΡΠ²Π°ΡΡ ΡΡΠ°Π±ΠΈΠ»ΡΠ½ΡΠΉ ΡΠ΅Π·ΡΠ»ΡΡΠ°Ρ Π½Π° ΡΠ°Π·Π½ΡΡ
Π½Π°Π±ΠΎΡΠ°Ρ
Π΄Π°Π½Π½ΡΡ
. Π’ΠΎΡΠ½ΠΎΡΡΡ ΡΠ΅Π·ΡΠ»ΡΡΠ°ΡΠΎΠ² Π·Π°Π²ΠΈΡΠΈΡ ΠΎΡ Π²ΡΠ±ΠΎΡΠ° ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠΎΠ² Π½Π° ΠΊΠ°ΠΆΠ΄ΠΎΠΌ ΠΈΠ· ΡΠ°Π³ΠΎΠ² Π°Π»Π³ΠΎΡΠΈΡΠΌΠ°, ΠΈ Π΄Π»Ρ ΡΠ°Π·Π½ΡΡ
ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΠΉ ΠΎΠΏΡΠΈΠΌΠ°Π»ΡΠ½Ρ ΡΠ°Π·Π»ΠΈΡΠ½ΡΠ΅ Π½Π°Π±ΠΎΡΡ ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠΎΠ². ΠΠ΅ Π²ΡΠ΅ ΠΈΠ· ΡΡΡΠ΅ΡΡΠ²ΡΡΡΠΈΡ
Π°Π»Π³ΠΎΡΠΈΡΠΌΠΎΠ² ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎ ΡΠ°Π±ΠΎΡΠ°ΡΡ Ρ ΡΠ΅ΠΊΡΡΠΎΠ²ΡΠΌΠΈ Π΄Π°Π½Π½ΡΠΌΠΈ, Π²Π΅ΠΊΡΠΎΡΠ½ΠΎΠ΅ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½ΠΈΠ΅ ΠΊΠΎΡΠΎΡΡΡ
Ρ
Π°ΡΠ°ΠΊΡΠ΅ΡΠΈΠ·ΡΠ΅ΡΡΡ Π±ΠΎΠ»ΡΡΠΎΠΉ ΡΠ°Π·ΠΌΠ΅ΡΠ½ΠΎΡΡΡΡ ΠΏΡΠΈ ΡΠΈΠ»ΡΠ½ΠΎΠΉ ΡΠ°Π·ΡΠ΅ΠΆΠ΅Π½Π½ΠΎΡΡΠΈ. ΠΠ°Π΄Π°ΡΠ° ΠΏΠΎΠΈΡΠΊΠ° Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ ΡΠ°ΡΡΠΌΠ°ΡΡΠΈΠ²Π°Π΅ΡΡΡ Π² ΡΠ»Π΅Π΄ΡΡΡΠ΅ΠΉ ΠΏΠΎΡΡΠ°Π½ΠΎΠ²ΠΊΠ΅: ΡΡΠ΅Π±ΡΠ΅ΡΡΡ ΠΏΡΠΎΠ²Π΅ΡΠΈΡΡ Π½ΠΎΠ²ΡΠΉ Π΄ΠΎΠΊΡΠΌΠ΅Π½Ρ, Π·Π°Π³ΡΡΠΆΠ°Π΅ΠΌΡΠΉ Π² ΠΏΡΠΈΠΊΠ»Π°Π΄Π½ΡΡ ΠΈΠ½ΡΠ΅Π»Π»Π΅ΠΊΡΡΠ°Π»ΡΠ½ΡΡ ΠΈΠ½ΡΠΎΡΠΌΠ°ΡΠΈΠΎΠ½Π½ΡΡ ΡΠΈΡΡΠ΅ΠΌΡ (ΠΠΠΠ‘), Π½Π° ΡΠΎΠΎΡΠ²Π΅ΡΡΡΠ²ΠΈΠ΅ Ρ
ΡΠ°Π½ΡΡΠ΅ΠΉΡΡ Π² Π½Π΅ΠΉ ΠΎΠ΄Π½ΠΎΡΠΎΠ΄Π½ΠΎΠΉ ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΠΈ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ². Π ΠΠΠΠ‘, ΠΎΠ±ΡΠ°Π±Π°ΡΡΠ²Π°ΡΡΠΈΡ
ΡΡΠΈΠ΄ΠΈΡΠ΅ΡΠΊΠΈ Π·Π½Π°ΡΠΈΠΌΡΠ΅ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΡ, Π½Π° ΠΌΠ΅ΡΠΎΠ΄Ρ ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π°ΠΊΠ»Π°Π΄ΡΠ²Π°ΡΡΡΡ ΡΠ»Π΅Π΄ΡΡΡΠΈΠ΅ ΠΎΠ³ΡΠ°Π½ΠΈΡΠ΅Π½ΠΈΡ: Π²ΡΡΠΎΠΊΠ°Ρ ΡΠΎΡΠ½ΠΎΡΡΡ, Π²ΡΡΠΈΡΠ»ΠΈΡΠ΅Π»ΡΠ½Π°Ρ ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎΡΡΡ, Π²ΠΎΡΠΏΡΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΠΌΠΎΡΡΡ ΡΠ΅Π·ΡΠ»ΡΡΠ°ΡΠΎΠ², Π° ΡΠ°ΠΊΠΆΠ΅ ΠΎΠ±ΡΡΡΠ½ΠΈΠΌΠΎΡΡΡ ΡΠ΅ΡΠ΅Π½ΠΈΡ. ΠΡΡΠ»Π΅Π΄ΡΡΡΡΡ ΠΌΠ΅ΡΠΎΠ΄Ρ, ΡΠ΄ΠΎΠ²Π»Π΅ΡΠ²ΠΎΡΡΡΡΠΈΠ΅ ΡΡΠΈΠΌ ΡΡΠ»ΠΎΠ²ΠΈΡΠΌ. Π ΡΠ°Π±ΠΎΡΠ΅ ΠΈΠ·ΡΡΠ°Π΅ΡΡΡ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡΡ ΠΎΡΠ΅Π½ΠΊΠΈ ΡΠ΅ΠΊΡΡΠΎΠ²ΡΡ
Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ² ΠΏΠΎ ΡΠΊΠ°Π»Π΅ Π°Π½ΠΎΠΌΠ°Π»ΡΠ½ΠΎΡΡΠΈ ΠΏΡΡΠ΅ΠΌ Π²Π½Π΅Π΄ΡΠ΅Π½ΠΈΡ Π² ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΡ Π·Π°Π²Π΅Π΄ΠΎΠΌΠΎ ΠΈΠ½ΠΎΡΠΎΠ΄Π½ΠΎΠ³ΠΎ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠ°. ΠΡΠ΅Π΄Π»ΠΎΠΆΠ΅Π½Π° ΡΡΡΠ°ΡΠ΅Π³ΠΈΡ ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π² Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠ΅ Π½ΠΎΠ²ΠΈΠ·Π½Ρ ΠΏΠΎ ΠΎΡΠ½ΠΎΡΠ΅Π½ΠΈΡ ΠΊ ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΠΈ, ΠΏΡΠ΅Π΄ΠΏΠΎΠ»Π°Π³Π°ΡΡΠ°Ρ ΠΎΠ±ΠΎΡΠ½ΠΎΠ²Π°Π½Π½ΡΠΉ ΠΏΠΎΠ΄Π±ΠΎΡ ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ² ΠΈ ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠΎΠ². ΠΠΎΠΊΠ°Π·Π°Π½ΠΎ, ΠΊΠ°ΠΊ Π½Π° ΡΠΎΡΠ½ΠΎΡΡΡ ΡΠ΅ΡΠ΅Π½ΠΈΡ Π²Π»ΠΈΡΠ΅Ρ Π²ΡΠ±ΠΎΡ Π²Π°ΡΠΈΠ°Π½ΡΠΎΠ² Π²Π΅ΠΊΡΠΎΡΠΈΠ·Π°ΡΠΈΠΈ, ΠΏΡΠΈΠ½ΡΠΈΠΏΠΎΠ² ΡΠΎΠΊΠ΅Π½ΠΈΠ·Π°ΡΠΈΠΈ, ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ² ΡΠ½ΠΈΠΆΠ΅Π½ΠΈΡ ΡΠ°Π·ΠΌΠ΅ΡΠ½ΠΎΡΡΠΈ ΠΈ ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠΎΠ² Π°Π»Π³ΠΎΡΠΈΡΠΌΠΎΠ² ΠΏΠΎΠΈΡΠΊΠ° Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ. ΠΠΊΡΠΏΠ΅ΡΠΈΠΌΠ΅Π½Ρ ΠΏΡΠΎΠ²Π΅Π΄Π΅Π½ Π½Π° Π΄Π²ΡΡ
ΠΎΠ΄Π½ΠΎΡΠΎΠ΄Π½ΡΡ
ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΡΡ
Π½ΠΎΡΠΌΠ°ΡΠΈΠ²Π½ΠΎ-ΡΠ΅Ρ
Π½ΠΈΡΠ΅ΡΠΊΠΈΡ
Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ²: ΡΡΠ°Π½Π΄Π°ΡΡΠΎΠ² Π² ΠΎΡΠ½ΠΎΡΠ΅Π½ΠΈΠΈ ΠΈΠ½ΡΠΎΡΠΌΠ°ΡΠΈΠΎΠ½Π½ΡΡ
ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΠΉ ΠΈ Π² ΡΡΠ΅ΡΠ΅ ΠΆΠ΅Π»Π΅Π·Π½ΡΡ
Π΄ΠΎΡΠΎΠ³. ΠΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°Π»ΠΈΡΡ ΠΏΠΎΠ΄Ρ
ΠΎΠ΄Ρ: Π²ΡΡΠΈΡΠ»Π΅Π½ΠΈΠ΅ ΠΈΠ½Π΄Π΅ΠΊΡΠ° Π°Π½ΠΎΠΌΠ°Π»ΡΠ½ΠΎΡΡΠΈ ΠΊΠ°ΠΊ ΡΠ°ΡΡΡΠΎΡΠ½ΠΈΡ Π₯Π΅Π»Π»ΠΈΠ½Π³Π΅ΡΠ° ΠΌΠ΅ΠΆΠ΄Ρ ΡΠ°ΡΠΏΡΠ΅Π΄Π΅Π»Π΅Π½ΠΈΡΠΌΠΈ Π±Π»ΠΈΠ·ΠΎΡΡΠΈ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ² ΠΊ ΡΠ΅Π½ΡΡΡ ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΠΈ ΠΈ ΠΊ ΠΈΠ½ΠΎΡΠΎΠ΄Π½ΠΎΠΌΡ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΡ; ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΡ Π°Π»Π³ΠΎΡΠΈΡΠΌΠΎΠ² ΠΏΠΎΠΈΡΠΊΠ° Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² Π·Π°Π²ΠΈΡΠΈΠΌΠΎΡΡΠΈ ΠΎΡ ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ² Π²Π΅ΠΊΡΠΎΡΠΈΠ·Π°ΡΠΈΠΈ ΠΈ ΡΠ½ΠΈΠΆΠ΅Π½ΠΈΡ ΡΠ°Π·ΠΌΠ΅ΡΠ½ΠΎΡΡΠΈ. ΠΠ΅ΠΊΡΠΎΡΠ½ΠΎΠ΅ ΠΏΡΠΎΡΡΡΠ°Π½ΡΡΠ²ΠΎ ΡΡΡΠΎΠΈΠ»ΠΎΡΡ Ρ ΠΏΠΎΠΌΠΎΡΡΡ ΠΏΡΠ΅ΠΎΠ±ΡΠ°Π·ΠΎΠ²Π°Π½ΠΈΡ TF-IDF ΠΈ ΡΠ΅ΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠ³ΠΎ ΠΌΠΎΠ΄Π΅Π»ΠΈΡΠΎΠ²Π°Π½ΠΈΡ ARTM. Π’Π΅ΡΡΠΈΡΠΎΠ²Π°Π»ΠΈΡΡ Π°Π»Π³ΠΎΡΠΈΡΠΌΡ Isolation Forest (ΠΈΠ·ΠΎΠ»ΠΈΡΡΡΡΠΈΠΉ Π»Π΅Ρ), Local Outlier Factor (Π»ΠΎΠΊΠ°Π»ΡΠ½ΡΠΉ ΡΠ°ΠΊΡΠΎΡ Π²ΡΠ±ΡΠΎΡΠ°), OneClass SVM (Π²Π°ΡΠΈΠ°Π½Ρ ΠΌΠ΅ΡΠΎΠ΄Π° ΠΎΠΏΠΎΡΠ½ΡΡ
Π²Π΅ΠΊΡΠΎΡΠΎΠ²). ΠΠΊΡΠΏΠ΅ΡΠΈΠΌΠ΅Π½Ρ ΠΏΠΎΠ΄ΡΠ²Π΅ΡΠ΄ΠΈΠ» ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎΡΡΡ ΠΏΡΠ΅Π΄Π»ΠΎΠΆΠ΅Π½Π½ΠΎΠΉ ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΎΠ½Π½ΠΎΠΉ ΡΡΡΠ°ΡΠ΅Π³ΠΈΠΈ Π΄Π»Ρ ΠΎΠΏΡΠ΅Π΄Π΅Π»Π΅Π½ΠΈΡ ΠΏΠΎΠ΄Ρ
ΠΎΠ΄ΡΡΠ΅Π³ΠΎ ΠΌΠ΅ΡΠΎΠ΄Π° ΠΎΠ±Π½Π°ΡΡΠΆΠ΅Π½ΠΈΡ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π΄Π»Ρ Π·Π°Π΄Π°Π½Π½ΠΎΠΉ ΡΠ΅ΠΊΡΡΠΎΠ²ΠΎΠΉ ΠΊΠΎΠ»Π»Π΅ΠΊΡΠΈΠΈ. ΠΡΠΈ ΠΏΠΎΠΈΡΠΊΠ΅ Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΈ Π² ΡΠ°ΠΌΠΊΠ°Ρ
ΡΠ΅ΠΌΠ°ΡΠΈΡΠ΅ΡΠΊΠΎΠΉ ΠΊΠ»Π°ΡΡΠ΅ΡΠΈΠ·Π°ΡΠΈΠΈ ΡΡΠΈΠ΄ΠΈΡΠ΅ΡΠΊΠΈ Π·Π½Π°ΡΠΈΠΌΡΡ
Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ² ΡΡΡΠ΅ΠΊΡΠΈΠ²Π΅Π½ ΠΌΠ΅ΡΠΎΠ΄ ΠΈΠ·ΠΎΠ»ΠΈΡΡΡΡΠ΅Π³ΠΎ Π»Π΅ΡΠ°. ΠΡΠΈ Π²Π΅ΠΊΡΠΎΡΠΈΠ·Π°ΡΠΈΠΈ Π΄ΠΎΠΊΡΠΌΠ΅Π½ΡΠΎΠ² ΠΏΠΎ TF-IDF ΡΠ΅Π»Π΅ΡΠΎΠΎΠ±ΡΠ°Π·Π½ΠΎ ΠΏΠΎΠ΄ΠΎΠ±ΡΠ°ΡΡ ΠΎΠΏΡΠΈΠΌΠ°Π»ΡΠ½ΡΠ΅ ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΡ ΡΠ»ΠΎΠ²Π°ΡΡ ΠΈ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ ΠΌΠ΅ΡΠΎΠ΄ ΠΎΠΏΠΎΡΠ½ΡΡ
Π²Π΅ΠΊΡΠΎΡΠΎΠ² Ρ ΡΠΎΠΎΡΠ²Π΅ΡΡΡΠ²ΡΡΡΠ΅ΠΉ ΡΡΠ½ΠΊΡΠΈΠ΅ΠΉ ΠΏΡΠ΅ΠΎΠ±ΡΠ°Π·ΠΎΠ²Π°Π½ΠΈΡ ΠΏΡΠΈΠ·Π½Π°ΠΊΠΎΠ²ΠΎΠ³ΠΎ ΠΏΡΠΎΡΡΡΠ°Π½ΡΡΠ²Π°
Bioinformatics applied to human genomics and proteomics: development of algorithms and methods for the discovery of molecular signatures derived from omic data and for the construction of co-expression and interaction networks
[EN] The present PhD dissertation develops and applies Bioinformatic methods and tools to address key current problems in the analysis of human omic data. This PhD has been organised by main objectives into four different chapters focused on: (i) development of an algorithm for the analysis of changes and heterogeneity in large-scale omic data; (ii) development of a method for non-parametric feature selection; (iii) integration and analysis of human protein-protein interaction networks and (iv) integration and analysis of human co-expression networks derived from tissue expression data and evolutionary profiles of proteins.
In the first chapter, we developed and tested a new robust algorithm in R, called DECO, for the discovery of subgroups of features and samples within large-scale omic datasets, exploring all feature differences possible heterogeneity, through the integration of both data dispersion and predictor-response information in a new statistic parameter called h (heterogeneity score). In the second chapter, we present a simple non-parametric statistic to measure the cohesiveness of categorical variables along any quantitative variable, applicable to feature selection in all types of big data sets. In the third chapter, we describe an analysis of the human interactome integrating two global datasets from high-quality proteomics technologies: HuRI (a human protein-protein interaction network generated by a systematic experimental screening based on Yeast-Two-Hybrid technology) and Cell-Atlas (a comprehensive map of subcellular localization of human proteins generated by antibody imaging). This analysis aims to create a framework for the subcellular localization characterization supported by the human protein-protein interactome. In the fourth chapter, we developed a full integration of three high-quality proteome-wide resources (Human Protein Atlas, OMA and TimeTree) to generate a robust human co-expression network across tissues assigning each human protein along the evolutionary timeline. In this way, we investigate how old in evolution and how correlated are the different human proteins, and we place all them in a common interaction network.
As main general comment, all the work presented in this PhD uses and develops a wide variety of bioinformatic and statistical tools for the analysis, integration and enlighten of molecular signatures and biological networks using human omic data. Most of this data corresponds to sample cohorts generated in recent biomedical studies on specific human diseases