Stable Feature Selection for Biomarker Discovery
Feature selection techniques have long been the workhorse of biomarker
discovery applications. Surprisingly, the stability of feature
selection with respect to sampling variations has long been under-considered.
Only recently has this issue received increasing attention.
In this article, we review existing stable feature selection methods for
biomarker discovery using a generic hierarchical framework. We have two
objectives: (1) providing an overview of this new yet fast-growing topic for
convenient reference; (2) categorizing existing methods under an expandable
framework for future research and development.
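As an illustration of what "stability with respect to sampling variations" can mean in practice, the sketch below resamples a dataset with replacement, reruns a feature selector on each resample, and reports the average pairwise Jaccard similarity of the selected feature sets. The Jaccard-based score, the toy mean-difference selector, and all function names are assumptions made for this example; they are not taken from the reviewed article.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of selected feature indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Average pairwise Jaccard similarity over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def top_k_by_mean_difference(X, y, k):
    """Toy selector (an assumption): rank features by |mean(class 1) - mean(class 0)|."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def bootstrap_stability(X, y, k, n_rounds=20, seed=0):
    """Run the selector on bootstrap resamples and score how similar the results are."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    while len(subsets) < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        if 0 not in ys or 1 not in ys:
            continue  # skip degenerate resamples that contain only one class
        subsets.append(top_k_by_mean_difference([X[i] for i in idx], ys, k))
    return selection_stability(subsets)
```

A score close to 1 means the selector picks nearly the same features on every resample; a score close to 0 means the selected sets barely overlap.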
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Pathway and network analysis in proteomics
Proteomics is inherently a systems science that studies not only measured proteins and their expression in a cell, but also the interplay of proteins, protein complexes, signaling pathways, and network modules. Proteomics data have accumulated rapidly in recent years. However, proteomics data are highly variable, with results sensitive to data preparation methods, sample condition, instrument types, and analytical methods. To address this challenge in proteomics data analysis, we review current tools being developed to incorporate biological function and network topological information. We categorize these tools into four types: tools with basic functional information and few topological features (e.g., GO category analysis), tools with rich functional information and few topological features (e.g., GSEA), tools with basic functional information and rich topological features (e.g., Cytoscape), and tools with rich functional information and rich topological features (e.g., PathwayExpress). We first review the potential application of these tools to proteomics; then we review tools that can achieve automated learning of pathway modules and features, and tools that help perform integrated network visual analytics.
Computational diagnosis and risk evaluation for canine lymphoma
The canine lymphoma blood test detects the levels of two biomarkers, the
acute phase proteins (C-Reactive Protein and Haptoglobin). This test can be
used for diagnostics, for screening, and for remission monitoring as well. We
analyze clinical data, test various machine learning methods and select the
best approach to these problems. Three families of methods, decision trees, kNN
(including advanced and adaptive kNN) and probability density evaluation with
radial basis functions, are used for classification and risk estimation.
Several pre-processing approaches were implemented and compared. The best of
them are used to create the diagnostic system. For the differential diagnosis
the best solution gives the sensitivity and specificity of 83.5% and 77%,
respectively (using three input features, CRP, Haptoglobin and standard
clinical symptom). For the screening task, the decision tree method provides
the best result, with sensitivity and specificity of 81.4% and >99%,
respectively (using the same input features). If the clinical symptoms
(Lymphadenopathy) are considered as unknown then a decision tree with CRP and
Hapt only provides sensitivity 69% and specificity 83.5%. The lymphoma risk
evaluation problem is formulated and solved. The best models are selected as
the system for computational lymphoma diagnosis and for evaluating the risk of
lymphoma. These methods are implemented in dedicated web-accessible
software and are applied to the problem of monitoring dogs with lymphoma after
treatment. It detects recurrence of lymphoma up to two months prior to the
appearance of clinical signs. The risk map visualisation provides a friendly
tool for explanatory data analysis.
Comment: 24 pages, 86 references in the bibliography. Significantly extended
version with review of lymphoma biomarkers and data mining methods (three new
sections are added: 1.1. Biomarkers for canine lymphoma, 1.2. Acute phase
proteins as lymphoma biomarkers and 3.1. Data mining methods for biomarker
cancer diagnosis. Flowcharts of data analysis are included as supplementary
material (20 pages)).
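The sensitivity and specificity figures quoted in this abstract follow the usual definitions: sensitivity is the fraction of true cases that a classifier flags, and specificity is the fraction of true non-cases that it clears. The sketch below computes both from binary labels. The function name and the example labels are invented for illustration and do not come from the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = disease, 0 = healthy."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # diseased, correctly flagged
        elif truth == 1 and pred == 0:
            fn += 1  # diseased, missed
        elif truth == 0 and pred == 0:
            tn += 1  # healthy, correctly cleared
        else:
            fp += 1  # healthy, wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

For a screening task, a very high specificity such as the one reported above matters because most animals tested are healthy, so even a small false-positive rate would produce many false alarms.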
Deciphering the interplay of molecular alterations underpinning renal cell carcinoma by label-free mass spectrometry and clinical proteomics: A systems medicine approach for precision diagnosis
Renal neoplasia is the 14th most common tumor type diagnosed worldwide. It is highly heterogeneous and encompasses different subtypes. 90% of the neoplasms arise from the epithelial layer of the nephron and range from benign renal masses (renal oncocytoma, RO) to more indolent or aggressive cancers (renal cell carcinomas, RCC). Among the RCC subtypes, clear cell (ccRCC) is the most common, followed by papillary (pRCC) and chromophobe (chRCC). Despite their different outcomes, overlapping histological and morphological features make these tumors difficult to differentiate and diagnose. Therefore, new approaches for a clear and accurate diagnosis are still needed.
To this end, renal tissue biopsies diagnosed with ccRCC (n = 7), pRCC (n = 5), chRCC (n = 5), RO (n = 5) and normal adjacent tissue (NAT, n = 5) were included in this study. Because mass spectrometry (MS) is a very resourceful tool for proteome analysis and biomarker discovery, MS-based methods were used to interrogate the proteome of each tumor in order to uncover differences on which faster and more accurate diagnostics could be based.
The results achieved in this doctoral thesis include i) the development of an effective ultrasonic workflow to recover the proteome of optimal cutting temperature (OCT)-embedded tissues, ii) a novel analytical approach based on MALDI-MS profiling to distinguish chRCC from RO, iii) a 109-protein panel to discriminate between chRCC and RO and NAT, iv) a top 24-protein panel to diagnose ccRCC, pRCC, chRCC and RO based on absolute concentration values, v) the translation and validation of three promising biomarkers by immunohistochemical analysis, and vi) an approach for phosphopeptide enrichment.
This work brings new insights into the different mechanisms underlying the formation of these tumors and provides valuable information to improve clinical diagnosis by opening new avenues for immunohistochemistry and mass spectrometry-based approaches.
Metabolomics-based biomarker discovery for bee health monitoring : a proof of concept study concerning nutritional stress in Bombus terrestris
Bee pollinators are exposed to multiple natural and anthropogenic stressors. Understanding the effects of a single stressor in the complex environmental context of antagonistic/synergistic interactions is critical to pollinator monitoring and may serve as an early warning system before a pollination crisis. This study aimed to methodically improve the diagnosis of bee stressors using a simultaneous untargeted and targeted metabolomics-based approach. Analysis of 84 Bombus terrestris hemolymph samples found 8 metabolites retained as potential biomarkers that showed excellent discrimination for nutritional stress. In parallel, 8 significantly altered metabolites, as revealed by targeted profiling, were also assigned as candidate biomarkers. Furthermore, machine learning algorithms were applied to these two biomarker sets, and the eight untargeted components showed the best classification performance, with sensitivity and specificity up to 99% and 100%, respectively. Based on pathway and biochemistry analysis, we propose that gluconeogenesis contributed significantly to blood sugar stability in bumblebees maintained on a low-carbohydrate diet. Taken together, this study demonstrates that metabolomics-based biomarker discovery holds promising potential for improving bee health monitoring and for identifying stressors related to energy intake and other environmental stressors.
A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels
Background Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using mixture samples only. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue. Results The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd=2.7%), 97.6% (sd=2.8%) and 90.8% (sd=5.5%) and average specificities of 93.6% (sd=4.1%), 99% (sd=2.2%) and 79.4% (sd=9.8%) in 100 independent two-fold cross-validations. Conclusions We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis. By contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample representing control and/or case (disease) groups and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated from each sample directly. Due to the locality of decomposition, the strength of the expression of each feature across the samples can vary. Yet, the features will still be allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification. That is not the case with standard factorization methods.
Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. As opposed to standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.
Development of computational tools for the analysis of 2D-nuclear magnetic resonance data
Master's dissertation in Bioinformatics
Metabolomics is one of the omics sciences that has been gaining a lot of interest
due to its potential for correlating an organism’s biochemical activity with its phenotype.
The applications of metabolomics are being extended as new techniques reveal new
information on metabolic profiles and molecules, thus elucidating biological, chemical
and functional knowledge. The main techniques that collect data are based on mass
spectrometry and nuclear magnetic resonance (NMR) spectroscopy. The latter has
the advantage of analyzing a sample in vivo without damaging it and, while its sensitivity
is pointed out as a disadvantage, multidimensional NMR delivers a solution to this issue.
It adds layers of information, generating new data that requires advanced bioinformatics
methods in order to extract biological meaning.
Since multidimensional NMR has different approaches within itself, the need to establish an integrated framework that allows researchers to load their data and extract relevant
knowledge has become more imperative over the years. Also, establishing common data
analysis pipelines on one-dimensional and multidimensional NMR remains a challenge
in current scientific research, hindering reproducibility across research groups.
In recent work from the host group, specmine, an R package for metabolomics and
spectral data analysis/mining, has been developed to wrap and deliver key metabolomic
methods that allow a researcher to perform a complete analysis.
In this dissertation, tools integrated in specmine were developed to read, visualize
and analyze two-dimensional (2D) NMR. A new specmine structure was created for
this type of data, easing interpretation and data visualization. In terms of visualization,
a novel approach towards three-dimensional environments enables users to interact
with their data, allowing peak hovering or identification of rich resonance regions. The
selection of which samples to plot, when the user does not specify an input, is based
on a signal-to-noise ratio scale, which plots the samples with the highest and lowest signal-to-noise ratios.
A method to perform peak detection on 2D NMR based on local maximum search was
implemented to obtain a data structure that best benefits from specmine’s functionalities.
These include preprocessing, univariate and multivariate analysis as well as machine
learning and feature selection methods.
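To make the idea of peak detection by local maximum search concrete, the sketch below scans a 2D intensity matrix and keeps every point that lies above a noise threshold and is strictly greater than all of its neighbours. This is a minimal illustration under stated assumptions (an 8-neighbour window, a fixed threshold, and a hypothetical function name). It is not the specmine implementation, whose details are not given in the abstract.

```python
def detect_peaks_2d(spectrum, threshold=0.0):
    """Return (row, col, intensity) for each strict local maximum above the threshold.

    spectrum is a list of equal-length rows of intensities. A point counts as a
    peak if it exceeds the threshold and is strictly greater than all of its up
    to eight neighbours (an assumption made for this sketch).
    """
    n_rows = len(spectrum)
    n_cols = len(spectrum[0])
    peaks = []
    for i in range(n_rows):
        for j in range(n_cols):
            value = spectrum[i][j]
            if value <= threshold:
                continue  # treat low intensities as noise
            neighbours = [
                spectrum[r][c]
                for r in range(max(0, i - 1), min(n_rows, i + 2))
                for c in range(max(0, j - 1), min(n_cols, j + 2))
                if (r, c) != (i, j)
            ]
            if all(value > v for v in neighbours):
                peaks.append((i, j, value))
    return peaks


# Made-up intensity grid for illustration only.
grid = [
    [0, 0, 0, 0],
    [0, 5, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
]
peaks = detect_peaks_2d(grid, threshold=0.5)
```

The resulting peak list is a much smaller table than the full spectrum, which is the kind of simplified structure that downstream preprocessing and multivariate methods can consume.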
The 2D NMR functions were validated using experimental data from two scientific
papers, available in metabolomics databases, applying the necessary preprocessing
steps to compare spectra and results. These data gave rise to two case studies from
different NMR sources, Bruker and Varian, which reinforces specmine’s flexibility. The
case studies were carried out using mainly specmine and other packages for specific
processing steps, such as probabilistic quotient normalization. A pipeline to analyze 2D
NMR was added to specmine, in the form of a vignette, to provide a guideline for the newly
developed functionalities.
Metabolomics is one of the omics sciences that has been attracting considerable interest owing to its potential to correlate an organism's biochemical activity with its phenotype. The applications of metabolomics keep growing as new techniques reveal new information on metabolic and molecular profiles, elucidating biological, chemical and functional knowledge. The main techniques for collecting this type of data are based on mass spectrometry and nuclear magnetic resonance (NMR). The latter has the advantage of analyzing a sample in vivo without damaging it and, while its sensitivity has been pointed out as a disadvantage, the multidimensional NMR approach improves on the traditional version. By measuring other nuclei, it adds layers of information, generating a new type of data that requires advanced bioinformatics methods to extract biological meaning. The existence of several approaches to multidimensional NMR leads to a growing need for a tool that integrates this type of data, so that researchers can carry out their analysis effectively. In addition, consolidating common pipelines for analyzing one-dimensional and multidimensional NMR data remains a challenge for scientific research, hindering the reproducibility of results across research groups. In recent work by the host group, a package for the R program focused on metabolomics and data analysis/mining was developed. This package, specmine, has been improved since its development and works as a tool that brings together different methods, allowing a complete analysis of a given dataset. Based on this package, an integrated web platform, WebSpecmine, was more recently developed with the same purpose, providing an easier and friendlier user interface.
In this dissertation, tools for reading, visualizing and analyzing two-dimensional (2D) NMR were developed with their integration into specmine in mind. A new structure was added to the package, easing the interpretation and schematization of the data. As for visualization, an innovative approach to three-dimensional environments lets users interact with their data by identifying spectral regions of interest or recognizing peaks. The visualization of 2D spectra, when the user gives no specification, is based on a signal-to-noise ratio scale that initially shows the samples with the largest and smallest difference between signal and noise. A method for peak detection in 2D NMR based on a search for local maxima was also implemented. This operation aims to obtain a simplified data structure that best benefits from specmine's functionalities. These include preprocessing operations, univariate and multivariate analyses, variable selection methods and machine learning. The functions developed for 2D NMR were validated with experimental data collected from two scientific papers, available in metabolomics databases, to which the preprocessing steps needed to compare results were applied. These data gave rise to two case studies covering different NMR instruments, Bruker and Varian, thereby reinforcing specmine's flexibility regarding the data types it can read. These case studies were carried out mainly with specmine; however, external packages were needed for specific processing steps, such as probabilistic quotient normalization.
A pipeline for analyzing 2D NMR data was added to specmine in the form of a vignette, a long-form documentation format suited to packages implemented in the R program. This provides the user with a set of procedures geared toward the correct use of the implemented functionalities.
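Probabilistic quotient normalization, named in this abstract as a step handled by external packages, can be sketched as follows: build a reference spectrum (here the per-variable median across samples), compute each sample's quotients against that reference, and divide the sample by the median quotient. The function name, the choice of a median reference, and the example values are assumptions for illustration. The code does not reproduce any particular package, and the optional total-area normalization that often precedes this step is omitted.

```python
from statistics import median


def pqn_normalize(samples, reference=None):
    """Probabilistic quotient normalization of a list of spectra (lists of intensities).

    If no reference spectrum is given, the per-variable median across samples
    is used (an assumption made for this sketch).
    """
    if reference is None:
        reference = [median(column) for column in zip(*samples)]
    normalized = []
    for sample in samples:
        # Quotients of the sample against the reference, skipping zero reference values.
        quotients = [v / r for v, r in zip(sample, reference) if r != 0]
        if not quotients:
            raise ValueError("reference spectrum has no non-zero variables")
        factor = median(quotients)  # the most probable dilution factor
        if factor == 0:
            raise ValueError("median quotient is zero; sample cannot be rescaled")
        normalized.append([v / factor for v in sample])
    return normalized


# Three made-up spectra that differ only by a dilution factor.
spectra = [[1, 2, 4], [2, 4, 8], [4, 8, 16]]
result = pqn_normalize(spectra)
```

In this toy input every spectrum is a scaled copy of the same profile, so after normalization all three rows coincide with the median reference.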