106 research outputs found

    Novel pattern recognition approaches for transcriptomics data analysis

    We proposed a family of methods for transcriptomics and genomics data analysis based on a multi-level thresholding approach, such as OMTG for sub-grid and spot detection in DNA microarrays, and OMT for detecting significant regions based on next-generation sequencing data. Extensive experiments on real-life datasets and a comparison to other methods show that the proposed methods perform these tasks fully automatically and with a very high degree of accuracy. Moreover, unlike previous methods, the proposed approaches can be used in various types of transcriptome analysis problems, such as microarray image gridding with different resolutions and spot sizes, as well as finding the regions of DNA that interact with a protein of interest using ChIP-Seq data, without any need for parameter adjustment. We also developed constrained multi-level thresholding (CMT), an algorithm used to detect enriched regions in ChIP-Seq data with the ability to target regions within a specific range. We show that CMT has higher accuracy in detecting enriched regions (peaks) by objectively assessing its performance relative to other previously proposed peak finders. This is shown by testing three algorithms on the well-known FoxA1 dataset, four transcription factors (with a total of six antibodies) for Drosophila melanogaster, and the H3K4ac antibody dataset. Finally, we propose a tree-based approach that conducts gene selection and builds a classifier simultaneously, in order to select the minimal number of genes that would reliably predict a given breast cancer subtype. Our results support that this modified approach to gene selection yields a small subset of genes that can predict subtypes with greater than 95% overall accuracy.
In addition to providing a valuable list of targets for diagnostic purposes, the gene ontologies of the selected genes suggest that these methods have isolated a number of potential genes involved in breast cancer biology, etiology and potentially novel therapeutics.
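
    The abstract above names multi-level thresholding without giving its details. As a rough illustration only, the sketch below splits a one-dimensional histogram into k contiguous classes by dynamic programming, minimising the total within-class weighted variance. The variance criterion, the function name multilevel_thresholds, and the toy histogram are assumptions made for this example; they are not taken from OMT, OMTG, or CMT.

```python
def multilevel_thresholds(hist, k):
    """Split histogram bins into k contiguous classes.

    Assumed criterion (not from the source): minimise the total
    within-class weighted sum of squared deviations from the class mean.
    Returns (thresholds, cost), where each threshold is the index of the
    first bin of classes 2..k.
    """
    n = len(hist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of bins")

    # Prefix sums of weight, weight*position and weight*position^2.
    w = [0.0] * (n + 1)
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, h in enumerate(hist):
        w[i + 1] = w[i] + h
        s[i + 1] = s[i] + h * i
        q[i + 1] = q[i] + h * i * i

    def cost(i, j):
        """Weighted squared deviation of bins i..j-1 around their mean."""
        weight = w[j] - w[i]
        if weight == 0:
            return 0.0
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / weight

    inf = float("inf")
    # best[c][j]: minimal cost of covering bins 0..j-1 with c classes.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Walk back through the stored cut points to recover the thresholds.
    thresholds = []
    j = n
    for c in range(k, 1, -1):
        j = cut[c][j]
        thresholds.append(j)
    thresholds.reverse()
    return thresholds, best[k][n]


if __name__ == "__main__":
    # Toy histogram with three well-separated groups of bins.
    print(multilevel_thresholds([5, 5, 0, 0, 5, 5, 0, 0, 5, 5], 3))
```

    Empty bins between groups cost nothing under this criterion, so several threshold positions can be equally good; only the total cost is unique.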

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models, each contributing to the solution of a biological problem. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as the corresponding score, and also takes into account other factors, such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that the combination of proper clustering, a distance function, and index validation for clusters is suitable for identifying outlier transcripts, which trend differently from the majority of the transcripts; here the trend of a transcript is its abundance throughout the different stages of prostate cancer. We compare this model with the standard hierarchical time-series clustering method based on Euclidean distance.
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
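
    The Zseq step described in this abstract (score each sequence by its number of unique k-mers, then drop sequences whose z-score falls below a user-defined threshold) can be illustrated with a small sketch. This is not the Zseq implementation: the function names, the default k, the choice to skip k-mers containing non-ACGT characters, and the use of the population standard deviation are all assumptions, and the GC-content adjustment mentioned in the abstract is left out.

```python
from statistics import mean, pstdev


def unique_kmer_count(seq, k):
    """Count distinct k-mers in seq, skipping k-mers with ambiguous bases.

    Skipping any k-mer containing a non-ACGT character is an assumption
    made for this sketch.
    """
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sum(1 for kmer in kmers if set(kmer) <= set("ACGT"))


def zscore_filter(seqs, k=4, threshold=0.0):
    """Keep sequences whose unique-k-mer z-score is at least threshold."""
    scores = [unique_kmer_count(s, k) for s in seqs]
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (assumed)
    if sigma == 0:
        # All sequences score the same, so nothing can be told apart.
        return list(seqs)
    return [s for s, sc in zip(seqs, scores) if (sc - mu) / sigma >= threshold]


if __name__ == "__main__":
    reads = [
        "ACGTTGCAAGCTTAGC",  # 13 unique 4-mers
        "ACGGTCATGCAAGTCC",  # 13 unique 4-mers
        "AAAAAAAAAAAAAAAA",  # low complexity: 1 unique 4-mer
        "ACGTNNNNNNNNACGT",  # mostly ambiguous: 1 usable 4-mer
    ]
    print(zscore_filter(reads, k=4, threshold=0.0))
```

    Both passes are linear in the total sequence length for a fixed k, which matches the linear-time claim only in spirit; the real tool may score and filter differently.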

    Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.

    The wealth of information generated by high-throughput omics technologies in the context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods in real data have often fallen behind. This thesis focuses on the analyses of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives: i) to critically appraise established and recently developed statistical approaches in their ability to appropriately accommodate the inherently complex nature of real-world omics data and ii) to improve the current understanding of a prevalent condition by identifying biological markers predictive of disease as well as possible biological mechanisms leading to its onset. The specific disease endpoint of interest corresponds to B-cell Lymphoma, a common haematological malignancy for which many questions related to its aetiology remain unanswered. The seven chapters comprising this thesis are structured in the following manner: the first two correspond to introductory chapters where I describe the main omics technologies and statistical methods employed for their analyses. The third chapter provides a description of the epidemiological project giving rise to the study population and the disease outcome of interest.
These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration purposes. A summary of findings, concluding general remarks and discussion of open problems offering potential avenues for future research are presented in the final chapter. Open Access

    Visualization and analysis of RNA-Seq assembly graphs.

    RNA-Seq is a powerful transcriptome profiling technology enabling transcript discovery and quantification. Whilst most commonly used for gene-level quantification, the data can be used for the analysis of transcript isoforms. However, when the underlying transcript assemblies are complex, current visualization approaches can be limiting, with splicing events a challenge to interpret. Here, we report on the development of a graph-based visualization method as a complementary approach to understanding transcript diversity from short-read RNA-Seq data. Following the mapping of reads to a reference genome, a read-to-read comparison is performed on all reads mapping to a given gene, producing a weighted similarity matrix between reads. This is used to produce an RNA assembly graph, where nodes represent reads and edges similarity scores between them. The resulting graphs are visualized in 3D space to better appreciate their sometimes large and complex topology, with other information being overlaid on to nodes, e.g. transcript models. Here we demonstrate the utility of this approach, including the unusual structure of these graphs and how they can be used to identify issues in assembly, repetitive sequences within transcripts and splice variants. We believe this approach has the potential to significantly improve our understanding of transcript complexity
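
    The graph construction described in this abstract (compare every pair of reads mapping to a gene, then build a graph whose nodes are reads and whose weighted edges are similarity scores) can be sketched as follows. The abstract does not say how reads are compared, so the Jaccard index over shared k-mers used here is a stand-in assumption, as are the function names, the read identifiers, and the min_score cut-off.

```python
from itertools import combinations


def kmer_set(read, k):
    """Return the set of k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def build_read_graph(reads, k=3, min_score=0.2):
    """Build a weighted, undirected read-similarity graph.

    reads: dict mapping a read name to its sequence.
    Similarity is the Jaccard index of the two reads' k-mer sets, a
    stand-in for whatever read-to-read comparison the authors use.
    Returns an adjacency dict: graph[a][b] = similarity score.
    """
    sets = {name: kmer_set(seq, k) for name, seq in reads.items()}
    graph = {name: {} for name in reads}
    for a, b in combinations(reads, 2):
        union = sets[a] | sets[b]
        score = len(sets[a] & sets[b]) / len(union) if union else 0.0
        if score >= min_score:
            # Store the edge in both directions (undirected graph).
            graph[a][b] = score
            graph[b][a] = score
    return graph


if __name__ == "__main__":
    toy_reads = {"r1": "ACGTAC", "r2": "CGTACC", "r3": "TTTTTT"}
    for name, neighbours in build_read_graph(toy_reads).items():
        print(name, neighbours)
```

    The all-pairs loop is quadratic in the number of reads, which is acceptable for a toy example but would need indexing or pre-filtering for a highly expressed gene.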

    Quantifying Glial-Glial Interactions In Drosophila Using Automated Image Analysis

    Imaging is an immensely powerful tool in biomedical research. Technological advances in the last half century have led to the development of new tools for image analysis, with major strides being made in the last 20 years, especially with machine and deep learning. However, researchers still often hit a bottleneck during the image analysis phase of their projects that leads to delays and sometimes even limits the scope of their studies. In this thesis I demonstrate some of the issues that arise while quantifying images to answer a biological question by using a dataset of fly central nervous system images to elucidate interactions between different cells. I present an overview of the types of methods that can be used to perform this analysis, including a discussion of their advantages and disadvantages. Finally, I present steps for creating and validating an automated image analysis pipeline that was used to analyze a large section of the fly ventral nerve cord, akin to the spinal cord. Automating image quantification allowed us to maximize the size of the dataset analyzed, which revealed subtle patterns in cell-cell interactions that would not have been uncovered with manual quantification of a smaller dataset.

    Linking quantitative radiology to molecular mechanism for improved vascular disease therapy selection and follow-up

    Objective: Therapeutic advancements in atherosclerotic cardiovascular disease have improved the prevention of ischemic stroke and myocardial infarction. However, diagnostic methods for atherosclerotic plaque phenotyping to aid individualized therapy are lacking. In this thesis, we aimed to elucidate plaque biology through the analysis of computed-tomography angiography (CTA) with sufficient sensitivity and specificity to capture the differentiated drivers of the disease. We then aimed to use such data to calibrate a systems biology model of atherosclerosis with adequate granularity to be clinically relevant. Such development may be possible with computational modeling, but, given the multifactorial biology of atherosclerosis, modeling must be based on complete biological networks that capture protein-protein interactions estimated to drive disease progression. Approach and Results: We employed machine intelligence using CTA paired with a molecular assay to determine cohort-level associations and individual patient predictions. Examples of predicted transcripts included ion transporters, cytokine receptors, and a number of microRNAs. Pathway analyses elucidated enrichment of several biological processes relevant to atherosclerosis and plaque pathophysiology. The ability of the models to predict plaque gene expression from CTAs was demonstrated using sequestered patients with transcriptomes of corresponding lesions. We further performed a case study exploring the relationship between biomechanical quantities and plaque morphology, indicating the ability to determine stress and strain from tissue characteristics. Further, we used a uniquely constituted plaque proteomic dataset to create a comprehensive systems biology disease model, which was finally used to simulate responses to different drug categories in individual patients. Individual patient response was simulated for intensive lipid-lowering, anti-inflammatory, anti-diabetic, and combination therapy.
Plaque tissue was collected from 18 patients with 6735 proteins at two locations per patient. 113 pathways were identified and included in the systems biology model of endothelial cells, vascular smooth muscle cells, macrophages, lymphocytes, and the integrated intima, altogether spanning 4411 proteins, demonstrating a range of 39-96% plaque instability. Simulations of drug responses varied in patients with initially unstable lesions from high (20%, on combination therapy) to marginal improvement, whereas patients with initially stable plaques showed generally less improvement, but importantly, variation across patients. Conclusion: The results of this thesis show that atherosclerotic plaque phenotyping by multi-scale image analysis of conventional CTA can elucidate the molecular signatures that reflect atherosclerosis. We further showed that calibrated systems biology models may be used to simulate drug response in terms of atherosclerotic plaque instability at the individual level, providing a potential strategy for improved personalized management of patients with cardiovascular disease. These results hold promise for optimized and personalized therapy in the prevention of myocardial infarction and ischemic stroke, which warrants further investigations in larger cohorts.

    ESTIMATION OF LARGE-SCALE CROSS-COVARIANCE MATRIX WITH GROUP INFORMATION

    Ph.D. DOCTOR OF PHILOSOPHY

    Advancing the cell culture landscape: the instructive potential of artificial and natural geometries

    This research focuses on how surface structures can influence the behaviour of cells. There is a great diversity of surface structures, which makes the identification of an optimal physical environment for a specific phenotype difficult. Therefore, platforms that allow screening of many different designs at the same time facilitate the identification of an optimal cultural environment. Using the TopoChip, which contains 2176 unique microtopographies, structures have been identified that support the tenocyte phenotype, the primary cell type of the tendon. In addition, this also applies to mesenchymal stem cells (MSCs), which experience an activation of tendon-related genes. Furthermore, the library has been creatively expanded by using natural surface topographies that cause unique cell behaviour, such as promoting osteogenesis

    The molecular basis for pore pattern morphogenesis in diatom silica

    Biomineral-forming organisms produce inorganic materials with complex, genetically encoded morphologies that are unmatched by current synthetic chemistry. It is poorly understood which genes are involved in biomineral morphogenesis and how the encoded proteins guide this process. We addressed these questions using diatoms, which are paradigms for the self-assembly of hierarchically meso- and macroporous silica under mild reaction conditions. Proteomics analysis of the intracellular organelle for silica biosynthesis led to the identification of new biomineralization proteins. Three of these, coined dAnk1-3, contain a common protein–protein interaction domain (ankyrin repeats), indicating a role in coordinating assembly of the silica biomineralization machinery. Knocking out individual dank genes led to aberrations in silica biogenesis that are consistent with liquid–liquid phase separation as the underlying mechanism for pore pattern morphogenesis. Our work provides an unprecedented path for the synthesis of tailored mesoporous silica materials using synthetic biology.