864 research outputs found

    Improving the value of public RNA-seq expression data by phenotype prediction.

    Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.

    Construction of a cancer-perturbed protein-protein interaction network for discovery of apoptosis drug targets

    Background: Cancer is caused by genetic abnormalities, such as mutations of oncogenes or tumor suppressor genes, which alter downstream signal transduction pathways and protein-protein interactions. Comparisons of the interactions of proteins in cancerous and normal cells can shed light on the mechanisms of carcinogenesis. Results: We constructed initial networks of protein-protein interactions involved in the apoptosis of cancerous and normal cells by use of two human yeast two-hybrid data sets and four online databases. Next, we applied a nonlinear stochastic model, maximum likelihood parameter estimation, and the Akaike Information Criterion (AIC) to eliminate false-positive protein-protein interactions in our initial protein interaction networks by use of microarray data. Comparisons of the networks of apoptosis in HeLa (human cervical carcinoma) cells and in normal primary lung fibroblasts provided insight into the mechanism of apoptosis and allowed identification of potential drug targets. The potential targets include BCL2, caspase-3 and TP53. Our comparison of cancerous and normal cells also allowed derivation of several party hubs and date hubs in the human protein-protein interaction networks involved in caspase activation. Conclusion: Our method allows identification of cancer-perturbed protein-protein interactions involved in apoptosis and identification of potential molecular targets for development of anti-cancer drugs.

    Evidence-Based Detection of Pancreatic Canc

    This study is an effort to develop a tool for early detection of pancreatic cancer using evidential reasoning. An evidential reasoning model predicts the likelihood of an individual developing pancreatic cancer by processing the outputs of a Support Vector Classifier (SVC) together with other input factors such as smoking history, drinking history, sequencing reads, biopsy location, and family and personal health history. Certain features of the genomic data, along with the mutated gene sequences of pancreatic cancer patients, were obtained from the National Cancer Institute (NCI) Genomic Data Commons (GDC). This data was used to train the SVC. A prediction accuracy of ~85% with a ROC AUC of 83.4% was achieved. Synthetic data was assembled in different combinations to evaluate the working of the evidential reasoning model. Using this, variations in the belief interval of developing pancreatic cancer are observed. When the model is given inputs indicating a heavy smoking history and a family history of cancer, the evidential interval for pancreatic cancer increases and the machine learning model's prediction gains support. Likewise, a decrease in the quantity of genetic material and an irregularity in the cellular structure near the pancreas increase support for the machine learning classifier’s prediction of pancreatic cancer. This evidence-based approach is an attempt to diagnose pancreatic cancer at a premalignant stage. Future work includes using real sequencing reads as well as accurate habits and real medical and family histories of individuals to increase the efficiency of the evidential reasoning model. Next steps also involve trying out different machine learning models to observe their performance on the dataset considered in this study.

    NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

    Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimension and acquisition rate, which is challenging the conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them, and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms the state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. To uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach to reconstruct relationships between transcription factors and their target genes in the second work. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images

    Toward Early Detection Of Pancreatic Cancer: An Evidence-Based Approach

    This study observes how an evidential reasoning approach can be used as a diagnostic tool for early detection of pancreatic cancer. The evidential reasoning model combines the output of a linear Support Vector Classifier (SVC) with factors such as smoking history, health history, biopsy location, NGS technology used, and more to predict the likelihood of the disease. The SVC was trained using genomic data of pancreatic cancer patients derived from the National Cancer Institute (NCI) Genomic Data Commons (GDC). To test the evidential reasoning model, a variety of synthetic data was compiled to test the impact of combinations of different factors. Through experimentation, we monitored how the evidential interval for pancreatic cancer fluctuated based on the inputs that were provided. We observed how the pancreatic cancer evidential interval increased and the machine learning prediction of pancreatic cancer was supported when the input changed from a non-smoker and non-drinker to an individual with a highly active smoking and drinking history. Similarly, we observed how the evidential interval for pancreatic cancer increased significantly when the machine learning prediction for pancreatic cancer was maintained as high and the input of the quality of the sequencing read was changed from a high quantity of cytosine guanine content and homopolymer regions to a moderate quantity of cytosine guanine content and low homopolymer regions, indicating that there was initially a higher likelihood of error in the sequencing reads, resulting in a more inaccurate machine learning output. This experiment shows that an evidence-based approach has the potential to contribute as a diagnostic tool for screening for high-risk groups. Future work should focus on improving the machine learning model by using a larger pancreatic cancer genomic database. Next steps will involve programmatically analyzing real sequencing reads for irregular guanine cytosine content and high homopolymer regions.

    Better prognostic markers for nonmuscle invasive papillary urothelial carcinomas

    Bladder cancer is a common type of cancer, especially among men in developed countries. Most cancers in the urinary bladder are papillary urothelial carcinomas. They are characterized by a high recurrence frequency (up to 70%) after local resection. It is crucial for prognosis to discover these recurrent tumours at an early stage, especially before they become muscle-invasive. Reliable prognostic biomarkers for tumour recurrence and stage progression are lacking. This is why patients diagnosed with non-muscle invasive bladder cancer undergo extensive follow-up regimens with possible serious side effects and with high costs for the healthcare systems. WHO grade and tumour stage are two central biomarkers currently having great impact on both treatment decisions and follow-up regimens. However, there are concerns regarding the reproducibility of WHO grading, and stage classification is challenging in small and fragmented tumour material. In Paper I, we examined the reproducibility and the prognostic value of all the individual microscopic features making up the WHO grading system. Among thirteen extracted features there was considerable variation in both reproducibility and prognostic value. The only feature that was both reasonably reproducible and of statistically significant prognostic value was cell polarity. We concluded that further validation studies are needed on these features, and that future grading systems should be based on well-defined features with true prognostic value. With the implementation of immunotherapy, there is increasing interest in tumour immune response and the tumour microenvironment. In a search for better prognostic biomarkers for tumour recurrence and stage progression, in Paper II, we investigated the prognostic value of tumour infiltrating immune cells (CD4, CD8, CD25 and CD138) and previously investigated cell proliferation markers (Ki-67, PPH3 and MAI). Low Ki-67 and tumour multifocality were associated with increased recurrence risk. Recurrence risk was not affected by the composition of immune cells. For stage progression, the only prognostic immune cell marker was CD25. High MAI values were also strongly associated with stage progression. However, in a multivariate analysis, the most prognostic feature was a combination of MAI and CD25. BCG instillations in the bladder are indicated in intermediate and high-risk non-muscle invasive bladder cancer patients. This old-fashioned immunotherapy has been shown to reduce both recurrence and progression risk, although it is frequently followed by unpleasant side effects. As many as 30-50% of high-risk patients receiving BCG instillations fail treatment by developing high-grade recurrences. They not only suffer from unnecessary side effects but also experience a delay in further treatment. Together with colleagues at three different Dutch hospitals, in Paper III, we looked at the prognostic and predictive value of T1-substaging. A T1-tumour invades the lamina propria, and we wanted to separate those with micro- from those with extensive invasion. We found that BCG failure was more common among patients with extensive invasion. Furthermore, T1-substaging was associated with both high-grade recurrence-free and progression-free survival. Finally, in Paper IV, we wanted to investigate the prognostic value of two classical immunohistochemical markers, p53 and CK20, and compare them with previously investigated proliferation markers. p53 is a surrogate marker for mutations in the gene TP53, considered to be a main characteristic for muscle-invasive tumours. CK20 is a surrogate marker for luminal tumours in the molecular classification of bladder cancer, and is frequently used to distinguish reactive urothelial changes from urothelial carcinoma in situ. We found positivity for both p53 and CK20 to be significantly associated with stage progression, although neither performed better than WHO grade and stage. The proliferation marker MAI had the highest prognostic value in our study. No combination of variables performed better in a multivariate analysis than MAI alone.

    Deep Functional Mapping For Predicting Cancer Outcome

    The effective understanding of the biological behavior and prognosis of cancer subtypes is becoming very important in patient management. Cancer is a diverse disorder in which a significant medical progression and diagnosis for each subtype can be observed and characterized. Computer-aided diagnosis for early detection and diagnosis of many kinds of diseases has evolved in the last decade. In this research, we address challenges associated with multi-organ disease diagnosis and recommend numerous models for enhanced analysis. We concentrate on evaluating Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET) for brain, lung, and breast scans to detect, segment, and classify types of cancer from biomedical images. Moreover, histopathological and genomic classification of cancer prognosis has been considered for multi-organ disease diagnosis and biomarker recommendation. We considered multi-modal, multi-class classification during this study. We propose implementing deep learning techniques based on Convolutional Neural Networks and Generative Adversarial Networks. In our proposed research we plan to demonstrate ways to increase the performance of the disease diagnosis by focusing on a combined diagnosis of histology, image processing, and genomics. It has been observed that the combination of medical imaging and gene expression can effectively handle cancer detection with a higher diagnostic rate than considering the individual disease diagnosis. This research puts forward a blockchain-based system that facilitates interpretations and enhancements pertaining to automated biomedical systems. In this scheme, a secured sharing of the biomedical images and gene expression has been established. To maintain the secured sharing of the biomedical contents in a distributed system or among the hospitals, a blockchain-based algorithm is considered that generates a secure sequence to identify a hash key. This adaptive feature enables the algorithm to use multiple data types and combines various biomedical images and text records. All data related to patients, including identity and pathological records, are encrypted using private key cryptography based on blockchain architecture to maintain data privacy and secure sharing of the biomedical contents.

    Towards a Rigorous Assessment of Systems Biology Models: The DREAM3 Challenges

    Background: Systems biology has embraced computational modeling in response to the quantitative nature and increasing scale of contemporary data sets. The onslaught of data is accelerating as molecular profiling technology evolves. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a community effort to catalyze discussion about the design, application, and assessment of systems biology models through annual reverse-engineering challenges. Methodology and Principal Findings: We describe our assessments of the four challenges associated with the third DREAM conference which came to be known as the DREAM3 challenges: signaling cascade identification, signaling response prediction, gene expression prediction, and the DREAM3 in silico network challenge. The challenges, based on anonymized data sets, tested participants in network inference and prediction of measurements. Forty teams submitted 413 predicted networks and measurement test sets. Overall, a handful of best-performer teams were identified, while a majority of teams made predictions that were equivalent to random. Counterintuitively, combining the predictions of multiple teams (including the weaker teams) can in some cases improve predictive power beyond that of any single method. Conclusions: DREAM provides valuable feedback to practitioners of systems biology modeling. Lessons learned from the predictions of the community provide much-needed context for interpreting claims of efficacy of algorithms described in the scientific literature
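The finding that pooled predictions can outperform any single team can be illustrated with a small sketch. It assumes, beyond what the abstract states, that each team submits a confidence score per candidate network edge and that the community prediction is formed by averaging per-team ranks. The team scores and edge names are hypothetical.

```python
from statistics import mean

def rank_scores(scores):
    """Map each item to its rank (1 = highest score) within one team's submission."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: rank for rank, item in enumerate(ordered, start=1)}

def community_ranking(team_scores):
    """Average per-team ranks; return items ordered from best to worst average rank."""
    ranks = [rank_scores(s) for s in team_scores]
    items = ranks[0].keys()
    avg = {item: mean(r[item] for r in ranks) for item in items}
    return sorted(avg, key=avg.get)

# Hypothetical confidence scores from three teams for three candidate edges.
teams = [
    {"A->B": 0.9, "B->C": 0.2, "A->C": 0.5},
    {"A->B": 0.6, "B->C": 0.7, "A->C": 0.65},
    {"A->B": 0.8, "B->C": 0.3, "A->C": 0.4},
]
consensus = community_ranking(teams)
print(consensus)
```

Rank averaging is used here only because it is insensitive to each team's score scale; the actual aggregation scheme used in the DREAM3 assessment may differ.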

    De novo sequencing of circulating miRNAs identifies novel markers predicting clinical outcome of locally advanced breast cancer

    Background: MicroRNAs (miRNAs) have been recently detected in the circulation of cancer patients, where they are associated with clinical parameters. Discovery profiling of circulating small RNAs has not been reported in breast cancer (BC), and was carried out in this study to identify blood-based small RNA markers of BC clinical outcome. Methods: The pre-treatment sera of 42 stage II-III locally advanced and inflammatory BC patients who received neoadjuvant chemotherapy (NCT) followed by surgical tumor resection were analyzed for marker identification by deep sequencing all circulating small RNAs. An independent validation cohort of 26 stage II-III BC patients was used to assess the power of identified miRNA markers. Results: More than 800 miRNA species were detected in the circulation, and observed patterns showed association with histopathological profiles of BC. Groups of circulating miRNAs differentially associated with ER/PR/HER2 status and inflammatory BC were identified. The relative levels of selected miRNAs measured by PCR showed consistency with their abundance determined by deep sequencing. Two circulating miRNAs, miR-375 and miR-122, exhibited strong correlations with clinical outcomes, including NCT response and relapse with metastatic disease. In the validation cohort, higher levels of circulating miR-122 specifically predicted metastatic recurrence in stage II-III BC patients. Conclusions: Our study indicates that certain miRNAs can serve as potential blood-based biomarkers for NCT response, and that miR-122 prevalence in the circulation predicts BC metastasis in early-stage patients. These results may allow optimized chemotherapy treatments and preventive anti-metastasis interventions in future clinical applications.

    Statistical Modeling for Cellular Heterogeneity Problems in Cancer Research: Deconvolution, Gaussian Graphical Models and Logistic Regression

    Tumor tissue samples comprise a mixture of cancerous and surrounding normal cells. Investigating cellular heterogeneity in tumors is crucial to genomic analyses associated with cancer prognosis and treatment decisions, where the contamination of non-cancerous cells may substantially affect gene expression profiling in clinically derived malignant tumor samples. For this purpose, we first computationally purify tumor profiles, and then develop new statistical modeling techniques to incorporate tumor purity estimates for genetic correlation and prediction of clinical outcome in cancer research. In this thesis, we propose novel approaches to analyzing and modeling cellular heterogeneity problems using genomic data from three perspectives. First, we develop a computational tool, DeMixT, which applies a deconvolution algorithm to explicitly account for at most three cellular components associated with cancer. Compared with the experimental approach to isolate single cells, in silico dissection of tumor samples is faster and cheaper, but computational tools previously developed have limited ability to estimate cellular proportions and tumor-specific expression profiles, when neither is given with prior information. Our model allows inclusion of the infiltrating immune cells as a component as well as the tumor cells and stromal cells. We assume a linear mixture of gene expression profiles for each component satisfying a log2-normal distribution and propose an iterated conditional modes algorithm to estimate parameters. We also introduce a novel two-stage estimation procedure for the three-component deconvolution. Our method is computationally feasible and yields accurate estimates through simulations and real data analyses. The estimated cellular proportions and purified expression profiles can provide deeper insight for cancer biomarker studies. Second, we propose a novel edge regression model for undirected graphs, which incorporates subject-level covariates to estimate the conditional dependencies. Current work for constructing graphical models for multivariate data does not take into account the subject-specific information, which can bias the conditional independence structure in heterogeneous data. Especially for tumor samples with inherent contamination from normal cells, ignoring the cellular heterogeneity and modeling the population-level genomic graphs may inhibit the discovery of the true tumor graph, which would be attenuated towards the normal graph. Our model allows undirected networks to vary with the exogenous covariates and is able to borrow strength from different related graphs for estimating more robust covariate-specific graphs. Bayesian shrinkage algorithms are presented to efficiently estimate and induce sparsity for generating subject-level graphs. We demonstrate the good performance of our method through simulation studies and apply our method to cytokine measurements from blood plasma samples from hepatocellular carcinoma (HCC) patients and normal controls. Third, we build a model with respect to logistic regression that includes tumor purity as a scaling factor to improve model robustness for the purpose of both estimation and prediction. Penalized logistic regression is used to identify variables (genes) and predict clinical status with binary outcomes that are associated with cancers in high-dimensional genomic data. We aim to reduce the uncertainty introduced by cellular heterogeneity through incorporating the measure of tumor purity to quantify the power of data for each sample. We provide strategies of choosing scaling parameters. Our model is finally shown to work well through a set of simulation studies. We believe that the statistical modeling, technical pipelines and computational results included in our work will serve as a first guide for the development of statistical methods accounting for cellular heterogeneity in cancer research.
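The linear-mixture assumption behind the deconvolution can be illustrated on toy numbers. This is only a sketch of the mixing model, not DeMixT itself: it assumes the component profiles are already known and recovers the proportions by ordinary least squares, whereas the thesis estimates proportions and profiles jointly with an iterated conditional modes algorithm under a log2-normal model. All values below are hypothetical.

```python
import numpy as np

# Hypothetical expression profiles (rows = genes; columns = tumor, stroma, immune).
profiles = np.array([
    [10.0, 2.0, 1.0],
    [1.0, 8.0, 2.0],
    [2.0, 1.0, 9.0],
    [5.0, 5.0, 5.0],
])

# Hypothetical true proportions used to simulate one noise-free mixed sample.
true_props = np.array([0.6, 0.3, 0.1])
mixed = profiles @ true_props  # observed bulk expression as a linear mixture

# Recover the proportions by least squares, clip negatives, renormalize to sum to 1.
est, *_ = np.linalg.lstsq(profiles, mixed, rcond=None)
est = np.clip(est, 0.0, None)
est = est / est.sum()
print(np.round(est, 3))
```

With real data the sample is noisy and the profiles are unknown, which is why a joint estimation procedure such as the one described above is needed.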