
    A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels

    Background: Bioinformatics data analysis often uses a linear mixture model that represents each sample as an additive mixture of components. Properly constrained blind matrix factorization methods can extract those components from the mixture samples alone. However, the automatic selection of the extracted components to be retained for classification analysis remains an open issue.

    Results: The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations.

    Conclusions: We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis; existing methods, by contrast, factorize the complete dataset simultaneously. The sample model is composed of a reference sample, representing the control and/or case (disease) group, and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control-specific, case-specific or not differentially expressed (neutral). The number of components is determined by cross-validation. The automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated directly from each sample. Owing to the locality of the decomposition, the strength of expression of each feature can vary across samples, yet the feature will still be allocated to the related disease- and/or control-specific component. Since label information is not used in the selection process, the case- and control-specific components can be used for classification, which is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease-specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers; unlike standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent (neutral) features enables their removal from the disease- and control-specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.
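
    As a rough illustration of the per-sample decomposition idea (not the authors' exact algorithm), the sketch below stacks one reference sample and one test sample into a two-row mixture matrix, factorizes it with off-the-shelf non-negative matrix factorization as a stand-in for the sparseness-constrained factorization described above, and labels each component as control-specific, case-specific or neutral from its mixing weights. The function names and the weight-ratio threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def decompose_pair(reference, test, n_components=3, seed=0):
    """Factorize one (reference, test) pair: X (2 x n_features) ~ W @ H.

    NMF is used here as a simple stand-in for the sparseness-constrained
    factorization described in the abstract (an assumed simplification).
    """
    X = np.vstack([reference, test])   # 2 x n_features mixture matrix
    model = NMF(n_components=n_components, init="random",
                random_state=seed, max_iter=500)
    W = model.fit_transform(X)         # per-sample mixing weights (2 x k)
    H = model.components_              # component profiles (k x n_features)
    return W, H

def label_components(W, ratio=2.0):
    """Label each component from its mixing weights, without class labels:
    mostly reference weight -> control-specific, mostly test weight ->
    case-specific, otherwise neutral (not differentially expressed)."""
    labels = []
    for k in range(W.shape[1]):
        w_ref, w_test = W[0, k], W[1, k]
        if w_ref > ratio * w_test:
            labels.append("control-specific")
        elif w_test > ratio * w_ref:
            labels.append("case-specific")
        else:
            labels.append("neutral")
    return labels

# Toy usage with random non-negative "spectra".
rng = np.random.default_rng(0)
ref, tst = rng.random(500), rng.random(500)
W, H = decompose_pair(ref, tst)
print(label_components(W))
```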

    Computational Tools for the Untargeted Assignment of FT-MS Metabolomics Datasets

    Metabolomics is the study of metabolomes, the sets of metabolites observed in living systems. Metabolism interconverts these metabolites to provide the molecules and energy necessary for life processes. Many disease processes, including cancer, have a significant metabolic component that manifests as differences in which metabolites are present and in what quantities they are produced and utilized. Thus, using metabolomics, differences between metabolomes in disease and non-disease states can be detected, and these differences improve our understanding of disease processes at the molecular level.

    Despite the potential benefits of metabolomics, the comprehensive investigation of metabolomes remains difficult. A popular analytical technique for metabolomics is mass spectrometry. Advances in Fourier transform mass spectrometry (FT-MS) instrumentation have yielded simultaneous improvements in mass resolution, mass accuracy, and detection sensitivity. In the metabolomics field, these advantages permit more complicated but more informative experimental designs, such as the use of multiple isotope-labeled precursors in stable isotope-resolved metabolomics (SIRM) experiments. However, despite these potential applications, several outstanding problems hamper the use of FT-MS for metabolomics studies. First, artifacts and data quality problems in FT-MS spectra can confound downstream data analyses, confuse machine learning models, and complicate the robust detection and assignment of metabolite features. Second, the assignment of observed spectral features to metabolites remains difficult. Existing targeted approaches for assignment often employ databases of known metabolites; however, metabolite databases are incomplete, thus limiting or biasing assignment results. Additionally, FT-MS provides limited structural information for observed metabolites, which complicates the determination of metabolite class (e.g. lipid, sugar, etc.) for observed metabolite spectral features, a necessary step for many metabolomics experiments.

    To address these problems, a set of tools was developed. The first tool identifies artifacts with high peak density observed in many FT-MS spectra and removes them safely. Using this tool, two previously unreported types of high-peak-density artifact were identified in FT-MS spectra: fuzzy sites and partial ringing. Fuzzy sites were particularly problematic, as they confused and reduced the accuracy of machine learning models trained on datasets containing these artifacts. Second, a tool called SMIRFE was developed to assign isotope-resolved molecular formulas to observed spectral features in an untargeted manner, without a database of expected metabolites. This new untargeted method was validated on a gold-standard dataset containing both unlabeled and 15N-labeled compounds and identified 18 of 18 expected spectral features. Third, a collection of machine learning models was constructed to predict whether a molecular formula corresponds to one or more lipid categories. These models accurately predict the correct category among eight lipid categories on our training dataset of known lipid and non-lipid molecular formulas, with precisions and accuracies over 90% for most categories. These models were used to predict lipid categories for untargeted SMIRFE-derived assignments in a non-small cell lung cancer dataset. Subsequent differential abundance analysis revealed a sub-population of non-small cell lung cancer samples with a significantly increased abundance of sterol lipids. This finding implies a possible therapeutic role for statins in the treatment and/or prevention of non-small cell lung cancer. Collectively, these tools represent a pipeline for FT-MS metabolomics datasets that is compatible with isotope-labeling experiments. With these tools, more robust and untargeted metabolic analyses of disease will be possible.
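
    As a toy illustration of the third tool's idea (predicting lipid categories from molecular formulas), the sketch below derives simple element-count features from a formula string and fits an off-the-shelf classifier. The `formula_features` helper, the feature set and the four training rows are invented for illustration; they are not SMIRFE's output nor the dissertation's actual models or data.

```python
import re
from sklearn.ensemble import RandomForestClassifier

ELEMENTS = ("C", "H", "N", "O", "P", "S")

def formula_features(formula):
    """Element counts plus H/C and O/C ratios for a formula like 'C27H46O'."""
    counts = dict.fromkeys(ELEMENTS, 0)
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element in counts:
            counts[element] += int(digits) if digits else 1
    feats = [counts[e] for e in ELEMENTS]
    feats.append(counts["H"] / max(counts["C"], 1))  # H/C ratio
    feats.append(counts["O"] / max(counts["C"], 1))  # O/C ratio
    return feats

# Invented toy training data: (molecular formula, category label).
train = [
    ("C42H82NO8P", "glycerophospholipid"),
    ("C27H46O", "sterol"),
    ("C18H34O2", "fatty-acyl"),
    ("C6H12O6", "non-lipid"),
]
X = [formula_features(f) for f, _ in train]
y = [label for _, label in train]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(clf.predict([formula_features("C29H50O")]))  # e.g. expect 'sterol'
```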

    Mammography

    In this volume, the topics cover a variety of content: the bases of mammography systems, the optimization of screening mammography with reference to evidence-based research, new technologies of image acquisition and their surrounding systems, and case reports with reference to up-to-date multimodality images of breast cancer. Mammography has lagged in the transition to digital imaging systems because of the high resolution necessary for diagnosis. However, in the past ten years, technical improvements have resolved these difficulties and boosted new diagnostic systems. We hope that the reader will learn the essentials of mammography and will look forward to the new technologies. We want to express our sincere gratitude and appreciation to all the co-authors who have contributed their work to this volume.

    Computer-aided diagnosis of gynaecological abnormality using B-mode ultrasound images

    Ultrasound scanning is one of the most reliable imaging modalities for detecting and diagnosing gynaecological abnormalities. Ultrasound imaging is widely used during pregnancy and has become central to the management of problems in early pregnancy, particularly miscarriage diagnosis. Ultrasound is also considered the most important imaging modality in the evaluation of different types of ovarian tumours. The early detection of ovarian carcinoma and miscarriage continues to be a challenging task. It mostly relies on manual examination and interpretation by gynaecologists of ultrasound scan images, which may use morphology features extracted from the region of interest. Diagnosis depends on certain scoring systems that have been devised over a long time. The manual diagnostic process involves multiple subjective decisions, with increased inter- and intra-observer variation, which may lead to serious errors and health implications.

    This thesis is devoted to developing computer-based tools that use ultrasound scan images for the automatic classification of ovarian tumours (benign or malignant) and the automatic detection of miscarriage cases at early stages of pregnancy. Our intended computational tools are meant to help gynaecologists improve the accuracy of their diagnostic decisions, while serving as a tool for training radiology students/trainees in diagnosing gynaecological abnormalities. Ultimately, it is hoped that the developed techniques can be integrated into a specialised gynaecology Decision Support System. Our approach is to adopt a standard image-based pattern recognition research framework that involves extracting appropriate feature vectors to model the investigated tumours, selecting appropriate classifiers, and testing the performance of such schemes using sufficiently large and relevant datasets of ultrasound scan images. We aim to complement the automation of certain parameters that expert gynaecologists and radiologists determine manually with image-content information attributes that may not be directly accessible without advanced image transformations. This is motivated by, and benefits from, advances in computer vision that have led to the emergence of a variety of image processing/analysis techniques, together with recent advances in data mining and machine learning technologies. An expert observer makes a diagnostic decision with a level of certainty; when not entirely certain about a diagnostic decision, other experts' opinions are often sought and may be essential for diagnosing difficult “Inconclusive cases”. Here we define a quantitative measure of confidence in decisions made by automatic diagnostic schemes, independent of the accuracy of the decision. In the rest of the thesis, we report on the development of a variety of innovative diagnostic schemes and demonstrate their performance using extensive experimental work. The following is a summary of the main contributions made in this thesis.

    1. Using a combination of spatial-domain filters and operations as pre-processing procedures to enhance ultrasound images for both applications, namely miscarriage identification and ovarian tumour diagnosis. We show that the non-local means filter is effective in reducing speckle noise in ultrasound images, and together with other filters we succeed in enhancing the inner border of malignant tumours and in reliably segmenting the gestational sac (a denoising sketch follows this abstract).

    2. Developing reliable automated procedures to extract several types of features to model gestational sac dimensional measurements, a few of which are manually determined by radiologists and used by gynaecologists to identify miscarriage cases. We demonstrate that the corresponding automatic diagnostic schemes yield excellent accuracy when classified by k-Nearest Neighbours.

    3. Developing several local as well as global image-texture-based features in the spatial as well as the frequency domain. The spatial-domain features include local versions of image histograms, first-order statistical features and versions of local binary patterns. From the frequency domain, we propose a novel set of Fast Fourier Geometrical Features that encapsulates image texture information depending on all image pixel values. We demonstrate that each of these feature sets defines an ovarian tumour diagnostic scheme with relatively high power to discriminate benign from malignant tumours when classified by a Support Vector Machine. We show that the Fast Fourier Geometrical Features are the best-performing scheme, achieving more than 85% accuracy.

    4. Introducing a simple measure of confidence to quantify the goodness of the automatic diagnostic decision, regardless of decision accuracy, to emulate real-life medical diagnostics. Experimental work in this thesis demonstrates a strong link between this measure and the accuracy rate, so that a low level of confidence can raise an alarm.

    5. Conducting sufficiently intensive investigations of fusion models of multi-feature schemes at different levels. We show that feature-level fusion yields degraded performance compared to all of its single components, while score-level fusion improves results; decision-level fusion of three sets of features using the majority rule is slightly less successful. The measure of confidence is useful in resolving conflicts when two sets of features are fused at the decision level. This leads to the emergence of a Not Sure decision, which is common in medical practice. Considering the Not Sure label is good practice and an incentive to conduct more tests, rather than a misclassification, and it leads to significantly improved accuracy.

    The thesis concludes with an intensive discussion of future work that would go beyond improving the performance of the developed schemes to deal with the corresponding multi-class diagnostics essential for a comprehensive gynaecology Decision Support System as the ultimate goal.
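
    As a small illustration of the pre-processing step in contribution 1, the sketch below applies scikit-image's non-local means filter to a grayscale ultrasound frame. The file name, patch sizes and the 0.8*sigma filtering strength are illustrative assumptions, not the thesis's tuned parameters.

```python
import numpy as np
from skimage import img_as_float, io
from skimage.restoration import denoise_nl_means, estimate_sigma

# Load one grayscale ultrasound frame (path is illustrative).
img = img_as_float(io.imread("ultrasound_frame.png", as_gray=True))

# Estimate the noise level, then smooth with non-local means, which
# averages similar patches and so suppresses speckle while keeping edges.
sigma = float(np.mean(estimate_sigma(img)))
denoised = denoise_nl_means(
    img,
    h=0.8 * sigma,       # filtering strength (assumed setting)
    sigma=sigma,
    patch_size=5,        # size of compared patches
    patch_distance=6,    # search window radius
    fast_mode=True,
)
io.imsave("ultrasound_frame_denoised.png", (denoised * 255).astype(np.uint8))
```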

    Bioinformatics and Machine Learning for Cancer Biology

    Cancer is a leading cause of death worldwide, claiming millions of lives each year. Cancer biology is an essential research field for understanding how cancer develops, evolves, and responds to therapy. By taking advantage of a series of “omics” technologies (e.g., genomics, transcriptomics, and epigenomics), computational methods in bioinformatics and machine learning can help scientists and researchers decipher the complexity of cancer heterogeneity, tumorigenesis, and anticancer drug discovery. In particular, bioinformatics enables the systematic interrogation and analysis of cancer from various perspectives, including genetics, epigenetics, signaling networks, cellular behavior, clinical manifestation, and epidemiology. Moreover, thanks to the influx of next-generation sequencing (NGS) data in the postgenomic era and multiple landmark cancer-focused projects, such as The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), machine learning has a uniquely advantageous role in boosting data-driven cancer research and enabling novel methods for the prognosis, prediction, and treatment of cancer.

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: Rapid technological development in the biosciences and in computer science over the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, such as small sample sizes relative to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models that lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combining algorithms and large-scale biological datasets, including novel approaches to integrating analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge.

    Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned so that the compactness and interpretability of the resulting models are optimized in addition to predictive accuracy and robustness. The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results:

    (1) the identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining);

    (2) the prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, in collaboration with the Spanish National Cancer Centre);

    (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs);

    (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, in collaboration with the Spanish National Cancer Centre).

    In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use piece of software and has provided new biological insights in a wide variety of practical settings.
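
    To give a flavour of the ensemble/consensus idea (a generic sketch, not one of the framework's actual algorithms such as ArrayMining), the code below repeats a univariate filter over stratified bootstrap resamples and keeps only the features selected in most rounds, trading raw selection flexibility for a compact, stable feature set. All names and thresholds are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

def consensus_features(X, y, k=20, n_rounds=50, threshold=0.6, seed=0):
    """Keep features chosen by a univariate filter (ANOVA F-score) in at
    least `threshold` of `n_rounds` stratified bootstrap resamples."""
    counts = np.zeros(X.shape[1])
    for r in range(n_rounds):
        idx = resample(np.arange(len(y)), replace=True,
                       stratify=y, random_state=seed + r)
        selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        counts[selector.get_support()] += 1
    return np.flatnonzero(counts / n_rounds >= threshold)

# Toy usage: 40 samples x 100 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 1.5             # make the first 5 features informative
print(consensus_features(X, y))  # ideally recovers indices 0..4
```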

    EMT Network-based Lung Cancer Prognosis Prediction

    Network-based feature selection methods for omics data have been developed in recent years. Their performance gain, however, has been shown to depend on the datasets, networks, and evaluation metrics used, and the reproducibility and robustness of the resulting biomarkers remain to be improved. In this endeavor, one of the major challenges is the curse of dimensionality. To mitigate this issue, we proposed the Phenotype Relevant Network-based Feature Selection (PRNFS) framework. By employing a much smaller but phenotype-relevant network, we can avoid irrelevant information and select robust molecular signatures. The advantages of PRNFS were demonstrated in an application to lung cancer prognosis prediction. Specifically, we constructed epithelial-mesenchymal transition (EMT) networks and employed them for feature selection. We mapped multiple types of omics data onto them in turn to select single-omics signatures, and further integrated these into multi-omics signatures. We then introduced a multiplex network-based feature selection method to select multi-omics signatures directly. Both single-omics and multi-omics EMT signatures were evaluated on TCGA data as well as an independent multi-omics dataset. The results showed that EMT signatures achieved a significant performance gain, although the EMT networks covered less than 2.5% of the original data dimensions. Frequently selected EMT features achieved average AUC values of 0.83 on TCGA data. Employing EMT signatures on the independent dataset stratified the patients into significantly different prognostic groups. Multi-omics features showed superior performance over single-omics features on both TCGA data and the independent data. Additionally, we tested the performance of a few relational and non-relational databases for storing and retrieving omics data. Since biological data come in large volumes, at high velocity, and in wide variety, database systems that meet the needs of integrative omics data analysis are necessary. Based on the results, we provide a few recommendations for building scalable omics data infrastructures.
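
    As a generic illustration of network-based feature scoring on a small phenotype-relevant network (not the PRNFS or multiplex algorithm itself), the sketch below propagates absolute t-statistics over a gene network by iterative random-walk smoothing, so that genes whose network neighbours are also differential rise in rank. The function, the `alpha` parameter and the toy inputs are assumptions.

```python
import networkx as nx
import numpy as np
from scipy import stats

def network_smoothed_scores(X, y, genes, edges, alpha=0.5, n_iter=20):
    """Score genes by |t-statistic| diffused over a gene-gene network.

    X: samples x genes matrix, y: binary labels, genes: column names,
    edges: (gene, gene) pairs of a small phenotype-relevant network.
    """
    t = np.abs(stats.ttest_ind(X[y == 1], X[y == 0], axis=0).statistic)
    G = nx.Graph()
    G.add_nodes_from(genes)
    G.add_edges_from(edges)
    A = nx.to_numpy_array(G, nodelist=genes)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0          # isolated genes keep their own score
    W = A / deg[:, None]         # row-normalized adjacency
    s = t.copy()
    for _ in range(n_iter):      # s <- alpha * W^T s + (1 - alpha) * t
        s = alpha * (W.T @ s) + (1 - alpha) * t
    return dict(zip(genes, s))

# Toy usage on a 4-gene chain network.
rng = np.random.default_rng(0)
genes = ["g1", "g2", "g3", "g4"]
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
X = rng.normal(size=(20, 4))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] += 2.0              # make g1 differential
print(network_smoothed_scores(X, y, genes, edges))
```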