1,030 research outputs found

    A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data

    Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial (NB) distribution and high dimensionality of sequencing data can lead to low-power results and poor reproducibility. Several statistical learning algorithms have been proposed for sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the means, the dispersion parameter of the NB distribution, and the signal-to-noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machines. Based on the simulated and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) among the deregulated miRNAs of six datasets from The Cancer Genome Atlas. Taking these findings together and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.
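The Apriori step above can be sketched in a few lines. The per-dataset miRNA memberships below are hypothetical stand-ins for the six TCGA deregulation calls, and the level-wise join omits the full pruning step of production implementations:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return itemsets whose support (fraction of transactions containing
    them) meets min_support, via level-wise candidate generation."""
    n = len(transactions)
    frequent = {}
    level = sorted({frozenset([i]) for t in transactions for i in t}, key=sorted)
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        keys = sorted(kept, key=sorted)
        level = sorted({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1}, key=sorted)
        k += 1
    return frequent

# Hypothetical deregulated-miRNA calls for six cancer datasets (illustrative)
datasets = [
    {"mir-133a", "mir-133b", "mir-183", "mir-96"},
    {"mir-133a", "mir-133b", "mir-183", "mir-937", "mir-96"},
    {"mir-133a", "mir-183", "mir-96", "mir-21"},
    {"mir-133a", "mir-133b", "mir-96"},
    {"mir-133b", "mir-183", "mir-937"},
    {"mir-133a", "mir-96", "mir-183"},
]
freq = apriori([frozenset(d) for d in datasets], min_support=0.5)
```

With a 50% support threshold, itemsets such as {mir-133a, mir-96} survive while rarely co-occurring miRNAs are pruned early, which is what makes the level-wise search tractable.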

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can measure genome-scale gene expression in biological samples at a given moment, facilitating a better understanding of cell functions at the genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can detect local patterns. This unique feature makes biclustering very useful for big gene expression data, since genes that participate in a cellular process are active only under specific conditions and are thus usually co-expressed under only a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly because they lack (i) a treatment of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the transcriptional regulation signals underlying the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) a Gaussian mixture distribution and an information-divergence objective function to capture transcriptional regulation signals shared among a set of genes; (iii) a dual strategy to expand the core biclusters, aiming to rescue dropouts from the background; and (iv) a statistical framework to evaluate the significance of all identified biclusters. Validation on comprehensive datasets suggests that QUBIC2 has superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 can derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR, an R package characterized by an average 82% efficiency improvement over the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including discretization of expression data, query-based biclustering, bicluster expanding, bicluster comparison, heatmap visualization of identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering to biological data and more advanced applications to biomedical data, which will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.
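The Gaussian-mixture idea in novelty (ii) can be illustrated in miniature: discretizing one gene's expression vector into ordered latent states with a 1-D mixture model. This is only a sketch of the distributional assumption, not QUBIC2's actual truncated model or objective function, and the toy expression values are invented:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discretize_gene(expr, n_states=2, seed=0):
    """Assign each sample's expression to a latent state via a 1-D
    Gaussian mixture, relabelled so that state 0 has the lowest mean."""
    x = np.asarray(expr, dtype=float).reshape(-1, 1)
    gm = GaussianMixture(n_components=n_states, random_state=seed).fit(x)
    order = np.argsort(gm.means_.ravel())   # component indices sorted by mean
    relabel = np.empty(n_states, dtype=int)
    relabel[order] = np.arange(n_states)    # rank of each fitted component
    return relabel[gm.predict(x)]

# Toy gene: three low-expression and three high-expression samples
states = discretize_gene([0.1, 0.2, 0.15, 5.0, 5.2, 4.9])
```

The relabelling step matters in practice: mixture components come back in arbitrary order, so sorting by mean gives states a stable "low/high" interpretation across genes.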

    IsomiR Window: a system for analyzing small-RNA-seq data in an integrative and user-friendly manner

    Research Areas: Biochemistry & Molecular Biology; Biotechnology & Applied Microbiology; Mathematical & Computational Biology. Background: IsomiRs are miRNA variants that differ from their canonical forms in length and/or sequence, including additions or deletions of one or more nucleotides (nts) at the 5′ and/or 3′ end, internal editing events, or untemplated 3′ end additions. Most available tools for small RNA-seq data analysis do not allow the identification of isomiRs and often require advanced knowledge of bioinformatics. To overcome this, we have developed IsomiR Window, a platform that supports the systematic identification, quantification and functional exploration of isomiR expression in small RNA-seq datasets, accessible to users with no computational skills. Methods: IsomiR Window enables the discovery of isomiRs and the identification of all annotated non-coding RNAs in RNA-seq datasets from animals and plants. It comprises two main components: the IsomiR Window pipeline for data processing, and the IsomiR Window Browser interface. It integrates over ten third-party software packages for the analysis of small-RNA-seq data and includes a new algorithm that allows the detection of all possible types of isomiRs. These include 3′ and 5′ end isomiRs, 3′ end tailings, isomiRs with single nucleotide polymorphisms (SNPs) or potential RNA editing, as well as all possible fuzzy combinations. IsomiR Window bundles all databases required for analysis and annotation, and is freely distributed as a Linux virtual machine including all required software. Results: IsomiR Window processes several datasets in an automated manner, without restrictions on input file size. It generates high-quality interactive figures and tables which can be exported into different formats. The performance of isomiR detection and quantification was assessed using simulated small-RNA-seq data. For correctly mapped reads, it identified the different types of isomiRs with high confidence and 100% accuracy. Analysis of small RNA-seq data from basal cell carcinomas (BCCs) using IsomiR Window confirmed that miR-183-5p is up-regulated in nodular BCCs, but revealed that this effect is predominantly due to a novel 5′ end variant. This variant displays a different seed region motif and 1756 isoform-exclusive mRNA targets that are significantly associated with disease pathways, underscoring the biological relevance of isomiR-focused analysis. IsomiR Window is available at https://isomir.fc.ul.pt/.
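The coordinate-based part of isomiR calling can be sketched as a comparison of a read's alignment ends against the annotated canonical miRNA. The label names are illustrative, and the sequence-level checks that IsomiR Window performs (SNPs, editing, untemplated tailing) are omitted:

```python
def classify_isomir(read_start, read_end, mirna_start, mirna_end):
    """Label a read relative to its canonical miRNA using alignment
    coordinates on the same strand. A toy rule set: nonzero offsets at
    the 5' and/or 3' end define the variant class."""
    d5 = read_start - mirna_start   # shift at the 5' end
    d3 = read_end - mirna_end       # shift at the 3' end
    if d5 == 0 and d3 == 0:
        return "canonical"
    if d5 != 0 and d3 == 0:
        return "iso_5p"
    if d5 == 0 and d3 != 0:
        return "iso_3p"
    return "iso_5p_3p"              # variation at both ends

label = classify_isomir(998, 1022, 1000, 1022)  # read starts 2 nt upstream
```

A 5′ end shift is biologically the most consequential case, since it changes the seed region (as the BCC result above illustrates) and therefore the predicted target set.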

    Establishment of predictive blood-based signatures in medical large-scale genomic data sets: Development of novel diagnostic tests

    Increasing data availability has led to tremendous success in discovering molecular biomarkers based on high-throughput data. However, the translation of these so-called genomic signatures into clinical practice has been limited. The complexity and volume of genomic profiling require heightened attention to robust design, methodological details, and avoidance of bias. In this thesis, novel strategies aimed at closing the gap between initially promising pilot studies and the clinical application of novel biomarkers are evaluated. First, a conventional process for genomic biomarker development comprising feature selection, algorithm and parameter optimization, and performance assessment was established. Using this approach, an RNA-stabilized whole-blood diagnostic classifier for non-small cell lung cancer was built in a training set; it can be used as a biomarker to discriminate between patients and control samples. Subsequently, this optimized classifier was successfully applied to two independent and blinded validation sets. Extensive permutation analysis using random feature lists supports the specificity of the established transcriptional classifier. Next, it was demonstrated that a combined approach of clinical trial simulation and adaptive learning strategies can speed up biomarker development. As a model, genome-wide expression data from over 4,700 individuals in 37 studies addressing four clinical endpoints were used to assess over 1,800,000 classifiers. In addition to current approaches that determine optimal classifiers within a defined study setting, randomized clinical trial simulation unequivocally uncovered the overall variance in the prediction performance of potential disease classifiers when predicting the outcome of a large biomarker validation study from a pilot trial. Furthermore, the most informative features were identified by ranking features according to an individual classification performance score. Applying an adaptive learning strategy based on data extrapolation led to a data-driven prediction of the study size required for larger validation studies based on small pilot trials, together with an estimate of the expected statistical performance during validation. With these improvements, exceedingly robust and clinically applicable gene signatures for the diagnosis and detection of acute myeloid leukemia, active tuberculosis, HIV infection, and non-small cell lung cancer were established, showing disease-related enrichment of the obtained signatures and phenotype-related feature ranking. In further research, platform requirements for blood-based biomarker development were examined using microRNA expression profiling as an example. Both classification performance and technical sample handling were investigated to provide reliable strategies for platform implementation in clinical applications. Overall, all introduced methods improve and accelerate the development of biomarker signatures for molecular diagnostics and can easily be extended to other high-throughput data and other disease settings.
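The extrapolation idea can be sketched with a simple inverse-size learning-curve model. The thesis does not specify its functional form here, so acc(n) ≈ a − b/n is an assumed stand-in, fitted to invented pilot results:

```python
import math

def fit_learning_curve(sizes, accuracies):
    """Least-squares fit of acc(n) ~ a - b/n on the transform x = 1/n.
    'a' is the extrapolated asymptotic accuracy."""
    xs = [1.0 / n for n in sizes]
    mx = sum(xs) / len(xs)
    my = sum(accuracies) / len(accuracies)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, accuracies))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope
    a = my + b * mx
    return a, b

def required_size(a, b, target):
    """Smallest n whose predicted accuracy reaches 'target'
    (None if the target exceeds the asymptote a)."""
    if target >= a:
        return None
    return math.ceil(b / (a - target))

# Hypothetical pilot-study accuracies at four training-set sizes
a, b = fit_learning_curve([20, 40, 80, 160], [0.700, 0.800, 0.850, 0.875])
```

Solving acc(n) = target for n is what turns a small pilot trial into a data-driven estimate of the validation-study size, mirroring the strategy described above.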

    Algorithms for cancer genome data analysis - Learning techniques for ITH modeling and gene fusion classification

    The abstract is in the attachment.

    Prognostic Methods for Integrating Data from Complex Diseases

    Statistics in medical research surged with the development of high-throughput biotechnologies that provide thousands of measurements for each patient. These multi-layered data have the clear potential to improve disease prognosis. Data integration is increasingly becoming essential in this context, to address problems such as increasing power, resolving inconsistencies between studies, obtaining more reliable biomarkers, and gaining a broader understanding of the disease. This thesis focuses on addressing the challenges in the development of statistical methods while contributing to methodological advancements in this field. We propose a clinical data analysis framework to obtain a model with good prediction accuracy while addressing missing data and model instability. A detailed pre-processing pipeline is proposed for miRNA data that removes unwanted noise and offers improved concordance with qRT-PCR data. Platform-specific models are developed to uncover biomarkers using mRNA, protein and miRNA data, to identify the source with the most important prognostic information. This thesis explores two types of data integration: horizontal, the integration of the same type of data across studies; and vertical, the integration of data from different platforms for the same patients. We use multiple miRNA datasets to develop a meta-analysis framework addressing the challenges of horizontal data integration with a multi-step validation protocol. For vertical data integration, we extend the pre-validation principle and derive platform-dependent weights for use with the weighted Lasso. Our study revealed that integration of multi-layered data is instrumental in improving prediction accuracy and in obtaining more biologically relevant biomarkers. A novel visualisation technique for examining prediction accuracy at the patient level revealed vital findings with translational impact on personalised medicine.
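The weighted Lasso mentioned above is commonly implemented by rescaling: penalising w_j·|β_j| is equivalent to an ordinary Lasso on columns divided by w_j, with coefficients mapped back afterwards. A minimal scikit-learn sketch, using invented data and weights rather than the thesis's pre-validation-derived ones:

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, weights, alpha=0.1):
    """Weighted Lasso by reparameterisation: run a plain Lasso on the
    rescaled columns X_j / w_j, then recover beta_j = beta'_j / w_j,
    so each feature effectively carries penalty alpha * w_j."""
    w = np.asarray(weights, dtype=float)
    model = Lasso(alpha=alpha, max_iter=50_000)
    model.fit(X / w, y)          # broadcasting rescales each column
    return model.coef_ / w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=200)
# Trust the platform behind feature 0 (small weight); distrust feature 1
coef = weighted_lasso(X, y, weights=[1.0, 50.0, 1.0])
```

With a large weight, feature 1 is driven out of the model even though it is truly predictive, which is exactly the lever platform-dependent weights provide in vertical integration.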

    Systems Analytics and Integration of Big Omics Data

    A “genotype” is essentially an organism's full hereditary information, obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, and metabolism. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. This Special Issue focuses on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Doctor of Philosophy

    For decades, researchers have explored the effects of clinical and biomolecular factors on disease outcomes and have identified several candidate prognostic markers. Now, thanks to technological advances, researchers have at their disposal unprecedented quantities of biomolecular data that may add to existing knowledge about prognosis. However, commensurate challenges accompany these advances. For example, sophisticated informatics techniques are necessary to store, retrieve, and analyze large data sets. Additionally, advanced algorithms may be necessary to account for the joint effects of tens, hundreds, or thousands of variables. Moreover, it is essential that analyses evaluating such algorithms be conducted in a systematic and consistent way to ensure validity, repeatability, and comparability across studies. For this study, a novel informatics framework was developed to address these needs. Within this framework, the user can apply existing, general-purpose algorithms that are designed to make multivariate predictions for large, heterogeneous data sets. The framework also contains logic for aggregating evidence across multiple algorithms and data categories via ensemble-learning approaches. In this study, this informatics framework was applied to developing multivariate prognosis models for human glioblastoma multiforme (GBM), a highly aggressive form of brain cancer with a median survival of only 12-15 months. Data for this study came from The Cancer Genome Atlas, a publicly available repository containing clinical, treatment, histological, and biomolecular variables for hundreds of patients. A variety of variable-selection approaches and multivariate algorithms were applied in a cross-validated design, and the quality of the resulting models was measured using the error rate, the area under the receiver operating characteristic curve, and the log-rank statistic. Although the performance of the algorithms varied substantially across the data categories, some models performed well on all three metrics, particularly models based on age, treatments, and DNA methylation. Also encouragingly, the performance of ensemble-learning methods often approximated the best individual results. As multimodal data sets become more prevalent, analytic approaches that account for multiple data categories and algorithms will be increasingly relevant. This study suggests that such approaches hold promise to guide researchers and clinicians in their quest to improve outcomes for devastating diseases like GBM.
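One simple way to aggregate evidence across data categories, in the spirit of the ensemble results above, is a weighted average of per-category predicted probabilities. The category names, weights, and probabilities here are hypothetical, not values from the study:

```python
def soft_vote(probs, weights=None):
    """Combine per-category P(poor outcome) estimates for one patient by
    a (weighted) average; weights could be, e.g., each category's
    cross-validated AUC. Returns the pooled probability and the call."""
    if weights is None:
        weights = [1.0] * len(probs)
    pooled = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return pooled, pooled >= 0.5

# Hypothetical per-category models: age, treatment, DNA methylation
p, call = soft_vote([0.80, 0.55, 0.40], weights=[0.70, 0.65, 0.75])
```

Soft voting tends to approximate the best single model when one category dominates, and to stabilize predictions when categories disagree, which is consistent with the behavior reported above.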

    Statistical Methods to Enhance Clinical Prediction with High-Dimensional Data and Ordinal Response

    Technological progress today makes it possible to examine the molecular configuration of individual cells or whole tissue samples. Such high-dimensional omics data from molecular biology, produced in large quantities, can be generated at ever lower cost and are therefore used increasingly often in clinical settings. Personalised diagnosis and the prediction of treatment success based on such high-throughput data are a modern application of machine-learning techniques. In practice, clinical parameters such as health status or the side effects of a therapy are often recorded on an ordinal scale (for example good, normal, poor). It is common to treat classification problems with an ordinally scaled endpoint like general multi-class problems and thus to ignore the information contained in the ordering of the classes. Neglecting this information, however, can reduce classification performance or even produce an unfavourable, unordered classification. Classical approaches that model an ordinally scaled endpoint directly, such as the cumulative link model, typically cannot be applied to high-dimensional data. In this work we present hierarchical twoing (hi2), an algorithm for the classification of high-dimensional data into ordinally scaled categories. hi2 harnesses the power of well-understood binary classification to classify into ordinal categories. An open-source implementation of hi2 is available online. In a comparison study on real as well as simulated data with an ordinal endpoint, established methods designed specifically for ordered categories did not generally produce better results than state-of-the-art non-ordinal classifiers. An algorithm's ability to handle high-dimensional data dominated classification performance. We show that our algorithm hi2 achieves consistently good results and in many cases outperforms the other methods.
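hi2's exact hierarchy of twoing splits is not reproduced here; the sketch below shows the related cumulative reduction (one binary model per threshold, in the style of Frank and Hall), which shares hi2's core idea of attacking an ordinal endpoint with well-understood binary classifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal(X, y, classes):
    """One binary model per ordinal threshold, estimating P(y > k)."""
    return [LogisticRegression().fit(X, (y > k).astype(int))
            for k in classes[:-1]]

def predict_ordinal(models, X, classes):
    """Recombine threshold models into per-class probabilities:
    P(y = k) = P(y > k-1) - P(y > k), with monotonicity enforced."""
    gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    gt = np.minimum.accumulate(gt, axis=1)   # P(y>k) must not increase in k
    pad1 = np.ones((len(X), 1))
    pad0 = np.zeros((len(X), 1))
    probs = np.hstack([pad1, gt]) - np.hstack([gt, pad0])
    return np.asarray(classes)[probs.argmax(axis=1)]

# Toy ordinal endpoint (0 = good, 1 = normal, 2 = poor) driven by one feature
Xtr = np.linspace(-2, 2, 120).reshape(-1, 1)
ytr = np.digitize(Xtr.ravel(), [-0.7, 0.7])   # class boundaries at +/-0.7
models = fit_ordinal(Xtr, ytr, classes=[0, 1, 2])
pred = predict_ordinal(models, np.array([[-1.5], [0.0], [1.5]]), [0, 1, 2])
```

The reduction preserves the class ordering by construction, unlike a generic multi-class model, which is exactly the information an ordinal endpoint carries.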