1,114 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Integrative Transcriptomic Analysis of Long Intergenic Non-Coding RNAs in Cancer.
Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017
Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics
Recently, new kinds of high-throughput measurement techniques have emerged, enabling biological research to focus on fundamental building blocks of living organisms such as genes, proteins, and lipids. Modern data analysis techniques have emerged alongside this new type of data, referred to as omics data. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status and on learning unobservable network structures that represent functional associations in biological regulatory systems. Omics data have certain specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations and very large dimensionality, and the interest often lies in the connections between the large number of variables.
There are two major aims in this thesis. The first is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum likelihood based covariance estimation method for data with censored values is developed and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied for network construction as well as for differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
New measurement technologies have made it possible to build a more comprehensive understanding of the molecular-level processes of living organisms. The so-called omics technologies, such as genomics, proteomics and lipidomics, can produce vast amounts of measurement data on the expression or concentration levels of individual genes, proteins and lipids with unprecedented precision. At the same time, the need to develop new analysis methods has grown. Of particular interest have been the identification of biomarkers that predict the risk or prognosis of certain diseases, and the reconstruction of biological networks.
Omics datasets have several special characteristics that limit the direct and efficient application of conventional methods. The most important of these are left-censored and missing observations, and the large number of observed variables. The first aim of this dissertation is to provide tailored analysis methods for the visualisation of incomplete omics datasets and for model selection, using, for example, regularised regression models. We also describe a maximum likelihood estimator of the covariance matrix suited to censored data. The second aim is to develop new methods for examining the association structures of omics datasets. For examining, visualising and comparing more complex structures, we present several variations of an algorithm based on the partial least squares method, with which association networks can be reconstructed for both multiply imputed censored data and count data.
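To make the idea of likelihood-based handling of left-censored values more concrete, here is a minimal univariate sketch. It is not the thesis's covariance estimator: it assumes a single normally distributed variable, a known detection limit, and censored observations recorded as `None`. The function names and the toy values are hypothetical. An observed value contributes the normal density, while a censored value contributes the probability of falling below the detection limit.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(x, mu, sigma):
    """Log of P(X <= x) for a normal distribution (via the error function)."""
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return math.log(0.5 * (1.0 + math.erf(z)))


def left_censored_loglik(values, limit, mu, sigma):
    """Log-likelihood of a sample in which None marks a value below `limit`.

    Assumption for illustration: one variable, one shared detection limit.
    """
    total = 0.0
    for v in values:
        if v is None:
            # Censored: we only know the value fell below the detection limit.
            total += normal_logcdf(limit, mu, sigma)
        else:
            # Observed: use the ordinary normal density.
            total += normal_logpdf(v, mu, sigma)
    return total


# Toy data: two measured values and one value below a detection limit of 0.0.
sample = [0.0, 1.0, None]
loglik = left_censored_loglik(sample, limit=0.0, mu=0.0, sigma=1.0)
```

Maximising this function over `mu` and `sigma` would give maximum likelihood estimates for the univariate case. The multivariate covariance setting described in the abstract is considerably more involved.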
Main findings and advances in bioinformatics and biomedical engineering: IWBBIO 2018
We want to thank the reviewers of each of the papers for their great work, and the editorial team of
BMC Bioinformatics for the great interest shown in the IWBBIO Conference. Special thanks to D. Omar El Bakry for his interest and great
help in making this Special Issue. We also thank the Ministry of Spain for the economic resources provided within the project with reference
RTI2018-101674-B-I00.
In the current supplement, we are proud to present seventeen relevant contributions
from the 6th International Work-Conference on Bioinformatics and Biomedical
Engineering (IWBBIO 2018), which was held during April 25-27, 2018 in Granada (Spain).
These contributions have been chosen because of their quality and the importance of
their findings.
This research has been partially supported by the projects with references RTI2018-101674-B-I00 (Ministry of Spain) and
B-TIC-414-UGR18 (FEDER, Junta Andalucia and UGR).
Multivariate models from RNA-Seq SNVs yield candidate molecular targets for biomarker discovery: SNV-DA
Breast: up and downstream SNVs model. (CSV 22.1 kb)
Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling
Background: Nowadays, many public repositories containing large microarray gene expression datasets are
available. However, microarray technology is less powerful and accurate than more
recent Next Generation Sequencing technologies, such as RNA-Seq. Even so, information from microarrays is
reliable and robust, so it can be exploited by integrating microarray data with RNA-Seq data.
Additionally, information extraction and the acquisition of a large number of samples in RNA-Seq still entail very high costs
in terms of time and computational resources. This paper proposes a new model to find the gene signature of breast
cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained from
microarray and RNA-Seq technologies. Consequently, data integration is expected to provide more robust statistical
significance for the results obtained. Finally, a classification method is proposed in order to test the robustness of the
Differentially Expressed Genes when unseen data is presented for diagnosis.
Results: The proposed data integration allows analyzing gene expression samples coming from different
technologies. The most significant genes of the whole integrated data were obtained through the intersection of the
three gene sets, corresponding to the identified expressed genes within the microarray data itself, within the RNA-Seq
data itself, and within the integrated data from both technologies. This intersection reveals 98 possible
technology-independent biomarkers. Two different heterogeneous datasets were distinguished for the classification
tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data
for testing the classifier. Both achieved high classification accuracies, supporting the validity of the
obtained set of genes as possible biomarkers for breast cancer. Through a feature selection process, a final small
subset made up of six genes was considered for breast cancer diagnosis.
Conclusions: This work proposes a novel data integration stage in the traditional gene expression analysis pipeline
through the combination of heterogeneous data from microarray and RNA-Seq technologies. Available samples
have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a
new classification and diagnosis tool was built and its performance was validated using previously unseen samples.
This work was supported by Project TIN2015-71873-R (Spanish Ministry of
Economy and Competitiveness, MINECO, and the European Regional
Development Fund, ERDF).
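The intersection step described in the Results of the abstract above can be sketched as a plain set intersection. This is only an illustration of the idea, not the authors' pipeline: the function name is hypothetical, the gene identifiers are placeholders rather than real results, and the three input sets are assumed to have been produced beforehand by separate expression analyses.

```python
def technology_independent_genes(microarray_genes, rnaseq_genes, integrated_genes):
    """Return the genes present in all three candidate sets.

    Each argument is an iterable of gene identifiers selected from
    (1) microarray data alone, (2) RNA-Seq data alone, and
    (3) the integrated data from both technologies.
    """
    return set(microarray_genes) & set(rnaseq_genes) & set(integrated_genes)


# Placeholder identifiers for illustration only (not real findings).
microarray = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
rnaseq = {"GENE_B", "GENE_C", "GENE_E"}
integrated = {"GENE_B", "GENE_C", "GENE_D", "GENE_F"}

shared = technology_independent_genes(microarray, rnaseq, integrated)
print(sorted(shared))  # ['GENE_B', 'GENE_C']
```

Only genes selected in every analysis survive the intersection, which is why the resulting candidates can be described as technology-independent.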
- …