166 research outputs found

    genomation: a toolkit to summarize, annotate and visualize genomic intervals

    Summary: Biological insights can be obtained through computational integration of genomics data sets consisting of diverse types of information. The integration is often hampered by a large variety of existing file formats, often containing similar information, and the necessity to use complicated tools to achieve the desired results. We have built an R package, genomation, to expedite the extraction of biological information from high throughput data. The package works with a variety of genomic interval file types and enables easy summarization and annotation of high throughput data sets with given genomic annotations. Availability and implementation: The software is currently distributed under MIT artistic license and freely available at http://bioinformatics.mdc-berlin.de/genomation, and through the Bioconductor framework. Contact: [email protected], [email protected], [email protected], or [email protected]

    netSmooth: Network-smoothing based imputation for single cell RNA-seq [version 3; referees: 2 approved]

    Single cell RNA-seq (scRNA-seq) experiments suffer from a range of characteristic technical biases, such as dropouts (zero or near zero counts) and high variance. Current analysis methods rely on imputing missing values by various means of local averaging or regression, often amplifying biases inherent in the data. We present netSmooth, a network-diffusion based method that uses priors for the covariance structure of gene expression profiles on scRNA-seq experiments in order to smooth expression values. We demonstrate that netSmooth improves clustering results of scRNA-seq experiments from distinct cell populations, time-course experiments, and cancer genomics. We provide an R package for our method, available at: https://github.com/BIMSBbioinfo/netSmooth
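    The network-diffusion idea in this abstract can be illustrated with a generic random walk with restart over a gene network. This is a minimal sketch under stated assumptions, not the netSmooth package's code: the function names, the toy three-gene network, and the restart weight `alpha` are all hypothetical, and isolated genes are given a self-loop so that their values stay unchanged.

```python
# Generic random-walk-with-restart smoothing on a toy gene network.
# Assumption: this illustrates network diffusion in general; it is not
# taken from the netSmooth implementation.

def column_normalize(adj):
    """Return a column-stochastic copy of a square adjacency matrix."""
    n = len(adj)
    adj = [row[:] for row in adj]
    for j in range(n):
        # Isolated gene: add a self-loop so its value is preserved.
        if sum(adj[i][j] for i in range(n)) == 0:
            adj[j][j] = 1.0
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def smooth(expr, adj, alpha=0.5, n_iter=100):
    """Iterate f <- alpha * A f + (1 - alpha) * f0 for one cell's profile."""
    a = column_normalize(adj)
    n = len(expr)
    current = list(expr)
    for _ in range(n_iter):
        current = [
            alpha * sum(a[i][j] * current[j] for j in range(n))
            + (1 - alpha) * expr[i]
            for i in range(n)
        ]
    return current


# Toy data (hypothetical): genes 0 and 1 interact, gene 2 is isolated.
network = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
cell = [4.0, 0.0, 2.0]  # gene 1 looks like a dropout
smoothed = smooth(cell, network)
print(smoothed)
```

    In this toy case the zero count of gene 1 borrows signal from its network neighbour, gene 0, while the total expression of the cell is preserved.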

    RNA polymerase II primes Polycomb-repressed developmental genes throughout terminal neuronal differentiation

    Polycomb repression in mouse embryonic stem cells (ESCs) is tightly associated with promoter co-occupancy of RNA polymerase II (RNAPII) which is thought to prime genes for activation during early development. However, it is unknown whether RNAPII poising is a general feature of Polycomb repression, or is lost during differentiation. Here, we map the genome-wide occupancy of RNAPII and Polycomb from pluripotent ESCs to non-dividing functional dopaminergic neurons. We find that poised RNAPII complexes are ubiquitously present at Polycomb-repressed genes at all stages of neuronal differentiation. We observe both loss and acquisition of RNAPII and Polycomb at specific groups of genes reflecting their silencing or activation. Strikingly, RNAPII remains poised at transcription factor genes which are silenced in neurons through Polycomb repression, and have major roles in specifying other, non-neuronal lineages. We conclude that RNAPII poising is intrinsically associated with Polycomb repression throughout differentiation. Our work suggests that the tight interplay between RNAPII poising and Polycomb repression not only instructs promoter state transitions, but also may enable promoter plasticity in differentiated cells

    Leveraging large language models for data analysis automation

    Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage of experts would be to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, questions remain unanswered about their accuracy when prompted with domain-expert questions, such as those on omics-related data analysis. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this data analysis system using various data analysis tasks for genomics. Our primary goal is to enable researchers to conduct data analysis by simply describing their objectives and the desired analyses for specific datasets through clear text. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results of the data analysis workflow for human review. Our evaluation of this system reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in executable code generation, especially for complex data analysis tasks. The best performance was seen with the self-correction mechanism, which increased the percentage of executable code relative to the simple strategy by 22.5% for tasks of complexity 2. For tasks of complexity 3, 4 and 5, this increase was 52.5%, 27.5% and 15%, respectively. A chi-squared test showed significant differences between the prompting strategies. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows
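    The error-feedback (self-correction) strategy described in this abstract can be sketched as a retry loop: generate code, try to execute it, and on failure append the error message to the next prompt. This is a hedged illustration only. mergen itself is an R package, and the names `run_with_self_correction` and `fake_llm` are hypothetical; the stand-in "model" below just returns canned strings so that the example runs without any LLM service.

```python
# Sketch of an execute-and-retry loop with error feedback.
# Assumption: `generate` is any callable mapping a prompt string to code;
# nothing here reproduces mergen's actual interface.

def run_with_self_correction(generate, prompt, max_attempts=3):
    """Return (code, namespace, attempts) for the first code that executes."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt + feedback)
        namespace = {}
        try:
            # Executing generated code is only safe inside a sandbox.
            exec(code, namespace)
        except Exception as err:
            # Feed the error back so the next attempt can correct it.
            feedback = (
                f"\nThe previous code failed with: {err!r}"
                "\nPlease return a corrected version."
            )
            continue
        return code, namespace, attempt
    raise RuntimeError(f"no executable code after {max_attempts} attempts")


def fake_llm(prompt):
    """Hypothetical stand-in for a model: broken first, fixed after feedback."""
    if "failed with" in prompt:
        return "result = sum([1, 2, 3])"
    return "result = sum([1, 2, 3]"  # missing closing parenthesis


code, namespace, attempts = run_with_self_correction(
    fake_llm, "Sum the numbers 1, 2 and 3."
)
print(attempts, namespace["result"])
```

    Capping the number of attempts keeps the loop from running indefinitely when the generated code never becomes executable.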

    CompassDock: Comprehensive Accurate Assessment Approach for Deep Learning-Based Molecular Docking in Inference and Fine-Tuning

    Datasets used for molecular docking, such as PDBBind, contain technical variability - they are noisy. Although the origins of the noise have been discussed, a comprehensive analysis of the physical, chemical, and bioactivity characteristics of the datasets is still lacking. To address this gap, we introduce the Comprehensive Accurate Assessment (Compass). Compass integrates two key components: PoseCheck, which examines ligand strain energy, protein-ligand steric clashes, and interactions, and AA-Score, a new empirical scoring function for calculating binding affinity energy. Together, these form a unified workflow that assesses both the physical/chemical properties and bioactivity favorability of ligands and protein-ligand interactions. Our analysis of the PDBBind dataset using Compass reveals substantial noise in the ground truth data. Additionally, we propose CompassDock, which incorporates the Compass module with DiffDock, the state-of-the-art deep learning-based molecular docking method, to enable accurate assessment of docked ligands during inference. Finally, we present a new paradigm for enhancing molecular docking model performance by fine-tuning with Compass Scores, which encompass binding affinity energy, strain energy, and the number of steric clashes identified by Compass. Our results show that, while fine-tuning without Compass improves the percentage of docked poses with RMSD < 2Å, it leads to a decrease in physical/chemical and bioactivity favorability. In contrast, fine-tuning with Compass shows a limited improvement in RMSD < 2Å but enhances the physical/chemical and bioactivity favorability of the ligand conformation. The source code is available publicly at https://github.com/BIMSBbioinfo/CompassDock
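    The "RMSD < 2Å" criterion used in this abstract can be made concrete with a small calculation. The sketch below assumes that predicted and reference ligand atoms are already matched one to one and share a coordinate frame; it applies no alignment or symmetry correction, and the coordinates are invented toy values rather than PDBBind data.

```python
import math

# Plain heavy-atom RMSD and the fraction of poses under a cutoff.
# Assumptions: atoms are listed in the same order in every pose, and
# no superposition or symmetry handling is performed.

def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered atom lists."""
    if len(pred) != len(ref):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pred, ref)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(ref))


def success_rate(poses, ref, threshold=2.0):
    """Fraction of poses whose RMSD to the reference is below the threshold."""
    return sum(rmsd(pose, ref) < threshold for pose in poses) / len(poses)


# Toy coordinates in angstroms (hypothetical).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
near_pose = [(x, y, z + 1.0) for x, y, z in reference]  # shifted by 1 A
far_pose = [(x + 3.0, y, z) for x, y, z in reference]   # shifted by 3 A

print(rmsd(near_pose, reference), rmsd(far_pose, reference))
print(success_rate([near_pose, far_pose], reference))
```

    As the abstract notes, a low RMSD alone says nothing about strain energy or steric clashes, which is why such a metric is usually read alongside physical checks.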

    The RNA workbench: Best practices for RNA and high-throughput sequencing bioinformatics in Galaxy

    RNA-based regulation has become a major research topic in molecular biology. The analysis of epigenetic and expression data is therefore incomplete if RNA-based regulation is not taken into account. Thus, it is increasingly important but not yet standard to combine RNA-centric data and analysis tools with other types of experimental data such as RNA-seq or ChIP-seq. Here, we present the RNA workbench, a comprehensive set of analysis tools and consolidated workflows that enable the researcher to combine these two worlds. Based on the Galaxy framework the workbench guarantees simple access, easy extension, flexible adaption to personal and security needs, and sophisticated analyses that are independent of command-line knowledge. Currently, it includes more than 50 bioinformatics tools that are dedicated to different research areas of RNA biology including RNA structure analysis, RNA alignment, RNA annotation, RNA-protein interaction, ribosome profiling, RNA-seq analysis and RNA target prediction. The workbench is developed and maintained by experts in RNA bioinformatics and the Galaxy framework. Together with the growing community evolving around this workbench, we are committed to keep the workbench up-to-date for future standards and needs, providing researchers with a reliable and robust framework for RNA data analysis

    Pathogenic mutations of human phosphorylation sites affect protein–protein interactions

    Despite their lack of a defined 3D structure, intrinsically disordered regions (IDRs) of proteins play important biological roles. Many IDRs contain short linear motifs (SLiMs) that mediate protein-protein interactions (PPIs), which can be regulated by post-translational modifications like phosphorylation. 20% of pathogenic missense mutations are found in IDRs, and understanding how such mutations affect PPIs is essential for unraveling disease mechanisms. Here, we employ peptide-based interaction proteomics to investigate 36 disease-associated mutations affecting phosphorylation sites. Our results unveil significant differences in interactomes between phosphorylated and non-phosphorylated peptides, often due to disrupted phosphorylation-dependent SLiMs. We focused on a mutation of a serine phosphorylation site in the transcription factor GATAD1, which causes dilated cardiomyopathy. We find that this phosphorylation site mediates interaction with 14-3-3 family proteins. Follow-up experiments reveal the structural basis of this interaction and suggest that 14-3-3 binding affects GATAD1 nucleocytoplasmic transport by masking a nuclear localisation signal. Our results demonstrate that pathogenic mutations of human phosphorylation sites can significantly impact protein-protein interactions, offering insights into potential molecular mechanisms underlying pathogenesis

    Transcriptional features of genomic regulatory blocks

    CAGE tag mapping of transcription start sites across different human tissues shows that genomic regulatory blocks have unique features that are the likely cause of their ability to respond to regulatory inputs from very long distances

    PiGx: reproducible genomics analysis pipelines with GNU Guix

    In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, from known sources, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations and for the wider dissemination of workflows. However, providing this type of reproducibility and traceability is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies with GNU Guix. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines called PiGx for the analysis of RNA sequencing, chromatin immunoprecipitation sequencing, bisulfite-treated DNA sequencing, and single-cell resolution RNA sequencing. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. 
Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pigx

    Predicting lethal courses in critically ill COVID-19 patients using a machine learning model trained on patients with non-COVID-19 viral pneumonia

    In a pandemic with a novel disease, disease-specific prognosis models are available only with a delay. To bridge the critical early phase, models built for similar diseases might be applied. To test the accuracy of such a knowledge transfer, we investigated how precisely lethal courses in critically ill COVID-19 patients can be predicted by a model trained on critically ill non-COVID-19 viral pneumonia patients. We trained gradient boosted decision tree models on 718 (245 deceased) non-COVID-19 viral pneumonia patients to predict individual ICU mortality and applied them to 1054 (369 deceased) COVID-19 patients. Our model showed a significantly better predictive performance (AUROC 0.86 [95% CI 0.86-0.87]) than the clinical scores APACHE2 (0.63 [95% CI 0.61-0.65]), SAPS2 (0.72 [95% CI 0.71-0.74]) and SOFA (0.76 [95% CI 0.75-0.77]), the COVID-19-specific mortality prediction models of Zhou (0.76 [95% CI 0.73-0.78]) and Wang (laboratory: 0.62 [95% CI 0.59-0.65]; clinical: 0.56 [95% CI 0.55-0.58]) and the 4C COVID-19 Mortality score (0.71 [95% CI 0.70-0.72]). We conclude that lethal courses in critically ill COVID-19 patients can be predicted by a machine learning model trained on non-COVID-19 patients. Our results suggest that in a pandemic with a novel disease, prognosis models built for similar diseases can be applied, even when the diseases differ in time courses and in rates of critical and lethal courses
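    The AUROC values compared in this abstract measure how often a model ranks a deceased patient above a surviving one. A minimal sketch of that calculation follows, using the pairwise (Mann-Whitney) definition; the function name and the four toy patients are hypothetical and unrelated to the study's data.

```python
# AUROC as the probability that a positive case outscores a negative one.
# Assumption: labels are 1 (event, e.g. death) or 0 (no event); ties
# between a positive and a negative score count as half a win.

def auroc(labels, scores):
    """Area under the ROC curve from pairwise score comparisons."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical): predicted mortality risk for four patients.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))
```

    The pairwise form costs O(n²) comparisons, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.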