34 research outputs found
Gapless provides combined scaffolding, gap filling, and assembly correction with long reads
Continuity, correctness, and completeness of genome assemblies are important for many biological projects. Long reads represent a major driver towards delivering high-quality genomes, but not everybody can achieve the necessary coverage for good long read-only assemblies. Therefore, improving existing assemblies with low-coverage long reads is a promising alternative. The improvements include correction, scaffolding, and gap filling. However, most tools perform only one of these tasks and the useful information of reads that supported the scaffolding is lost when running separate programs successively. Therefore, we propose a new tool for combined execution of all three tasks using PacBio or Oxford Nanopore reads. gapless is available at: https://github.com/schmeing/gapless
Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods; for computational methods, the focus here, it is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
ReSeq simulates realistic Illumina high-throughput sequencing data
In high-throughput sequencing data, performance comparisons between computational tools are essential for making informed decisions at each step of a project. Simulations are a critical part of method comparisons, but for standard Illumina sequencing of genomic DNA, they are often oversimplified, which leads to optimistic results for most tools. ReSeq improves the authenticity of synthetic data by extracting and reproducing key components from real data. Major advancements are the inclusion of systematic errors, a fragment-based coverage model and sampling-matrix estimates based on two-dimensional margins. These improvements lead to more faithful performance evaluations. ReSeq is available at https://github.com/schmeing/ReSeq
Divergent evolution of male-determining loci on proto-Y chromosomes of the housefly
Houseflies provide a good experimental model to study the initial evolutionary stages of a primary sex-determining locus because they possess different recently evolved proto-Y chromosomes that contain male-determining loci (M) with the same male-determining gene, Mdmd. We investigate M-loci genomically and cytogenetically, revealing distinct molecular architectures among M-loci. M on chromosome V (M^V) has two intact Mdmd copies in a palindrome. M on chromosome III (M^III) has tandem duplications containing 88 Mdmd copies (only one intact) and various repeats, including repeats that are XY-prevalent. M on chromosome II (M^II) and the Y (M^Y) share M^III-like architecture, but with fewer repeats. M^Y additionally shares M^V-specific sequence arrangements. Based on these data and karyograms using two probes, one derived from M^III and one Mdmd-specific, we infer evolutionary histories of polymorphic M-loci, which have arisen from unique translocations of Mdmd, embedded in larger DNA fragments, and diverged independently into regions of varying complexity.
Maleness-on-the-Y (MoY) orchestrates male sex determination in major agricultural fruit fly pests
In insects, rapidly evolving primary sex-determining signals are transduced by a conserved regulatory module controlling sexual differentiation. In the agricultural pest Ceratitis capitata (Mediterranean fruit fly, or Medfly), we identified a Y-linked gene, Maleness-on-the-Y (MoY), encoding a small protein that is necessary and sufficient for male development. Silencing or disruption of MoY in XY embryos causes feminization, whereas overexpression of MoY in XX embryos induces masculinization. Crosses between transformed XY females and XX males give rise to males and females, indicating that a Y chromosome can be transmitted by XY females. MoY is Y-linked and functionally conserved in other species of the Tephritidae family, highlighting its potential to serve as a tool for developing more effective control strategies against these major agricultural insect pests.
Raw and processed (filtered and annotated) scRNAseq data
Single-cell RNA-seq data generated and reported as part of the manuscript entitled "Human CD34+-derived plasmacytoid dendritic cells as surrogates for primary pDCs and potential cancer immunotherapy" by Fiore et al.
Raw and processed (filtered and annotated) data are provided, which can be directly ingested to reproduce the findings of the paper or for ab initio data reuse:
1- raw.h5ad provides the concatenated raw/unfiltered table of counts as obtained from Cell Ranger, along with relevant metadata in the standard H5AD format.
2- processed.h5ad provides raw and normalized counts for those cells that passed QC and were annotated as pDC, along with relevant metadata in the standard H5AD format.
For instance, to load the data in R, try:
library(zellkonverter)
# assumes both files are in the working directory
raw <- readH5AD("raw.h5ad")
processed <- readH5AD("processed.h5ad")
scRNAseq data generation:
Differentiated CB-DCs (3 independent donors), either left unprimed or primed with IFN, were used to enable characterization of the heterogeneity of the in vitro differentiation protocol. For comparison, primary pan-DCs (3 independent donors) were isolated from PBMCs as described above. CB-DCs and primary pan-DCs were normalized to 10,000 pDCs per well and stimulated with TLR9 or TLR7 agonists for 4 hrs or left untreated. A total of 27 samples were included for scRNAseq. Single-cell RNA-seq was performed using Chromium Connect (10x Genomics). Next GEM Automated Single Cell 5' Reagent Kits v2 (PN-1000290, 10x Genomics, Pleasanton, CA, USA) were used following the manufacturer's protocol. Roughly 8,000–10,000 cells per sample were diluted at a density of 100–800 cells/μL in PBS plus 1% BSA, as determined with a Cellometer Auto 2000 Cell Viability Counter (Nexcelom Bioscience, Lawrence, MA), and were loaded onto the chip.
The quality and concentration of both cDNA and libraries were assessed using an Agilent BioAnalyzer with High Sensitivity kit (#5067-4626, Agilent, Santa Clara, CA, USA) and a Qubit Fluorometer with dsDNA HS assay kit (#Q33230, Thermo Fisher Scientific, Waltham, MA) according to the manufacturer's recommendation. For sequencing, samples were mixed in equimolar fashion and sequenced on an Illumina NovaSeq 6000 with a targeted read depth of 20,000 reads/cell; sequencing parameters were set for Read 1 (26 cycles), i7 Index (10 cycles), i5 Index (10 cycles) and Read 2 (90 cycles). The Cell Ranger mkfastq function was used to convert the output files into FASTQ files.
scRNAseq data analysis:
For data processing and quality control, raw sequencing reads were mapped to the GRCh38 genome using the Cell Ranger Single Cell software (10x Genomics). Raw gene expression matrices generated per sample were merged and analyzed with the besca package. First, low-quality cells and potential multiplets were excluded (minimum 600 genes and 1,000 counts, maximum 6,500 genes and 60,000 counts), resulting in 4,000 to 8,000 cells per sample and a total of 183,398 cells passing quality control for downstream analysis. Filtered cells were normalized as log-transformed UMI counts per 10,000 reads [log(CP10K+1)]. After scaling the gene expression, the most variable genes per sample were calculated (minimum mean expression of 0.0125, maximum mean expression of 3 and minimum dispersion of 0.5), and those shared by at least 50% of the samples, in total 2,208 genes, were used for principal component analysis. Finally, the first 50 PCs were used as input for calculating the 10 nearest neighbors, and the neighborhood graph was then embedded into two-dimensional space using the uniform manifold approximation and projection (UMAP) algorithm. Cell clustering was performed using the Leiden algorithm. Cell type annotation was performed using the Sig-annot semi-automated besca module.
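The quality-control window, the log(CP10K+1) normalization and the "shared by at least 50% of the samples" rule described above can be illustrated with a minimal pure-Python sketch. It is not the besca implementation: the function names and the list-based data layout are assumptions made for illustration, and only the numeric thresholds come from the text.

```python
import math

# Thresholds quoted in the description; the constant names are illustrative.
MIN_GENES, MAX_GENES = 600, 6500
MIN_COUNTS, MAX_COUNTS = 1000, 60000


def passes_qc(cell_counts):
    """Return True if a cell's per-gene UMI counts fall inside the QC window."""
    n_genes = sum(1 for c in cell_counts if c > 0)   # genes detected in the cell
    n_counts = sum(cell_counts)                      # total UMI counts in the cell
    return (MIN_GENES <= n_genes <= MAX_GENES
            and MIN_COUNTS <= n_counts <= MAX_COUNTS)


def log_cp10k(cell_counts):
    """Scale a cell to 10,000 total counts, then apply log(x + 1)."""
    total = sum(cell_counts)
    if total == 0:
        raise ValueError("cell has no counts; filter before normalizing")
    return [math.log1p(c / total * 10_000) for c in cell_counts]


def shared_variable_genes(per_sample_genes, min_fraction=0.5):
    """Keep genes flagged as highly variable in at least min_fraction of samples."""
    n_samples = len(per_sample_genes)
    tally = {}
    for genes in per_sample_genes:
        for gene in set(genes):                      # count each gene once per sample
            tally[gene] = tally.get(gene, 0) + 1
    return sorted(g for g, n in tally.items() if n >= min_fraction * n_samples)
```

In this sketch a cell is a flat list of per-gene UMI counts, and each sample contributes a list of its own highly variable gene names; a real analysis would operate on a sparse count matrix instead.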
The gene sets used for different cell types can be found under: https://github.com/bedapub/besca/blob/main/besca/datasets/genesets/CellNames_scseqCMs6_sigs.gmt
First, each cluster was assigned to a cell type at different levels of granularity. Subsequently, annotations were manually inspected to resolve cluster mixtures, especially for different DC types. Cell type annotations were further curated by selecting a cluster and applying heuristic cutoffs on a combination of signature scores to reannotate individual cells. The per-cell signature scores were calculated with the scanpy function scanpy.tl.score_genes, using default parameters and besca signatures. Cells annotated as doublets were excluded from downstream analyses. In order to generate visualizations, such as the expression level of selected genes across conditions, custom scripts with mainly besca and scanpy functions were used.
For more details, please refer to the publication.
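The cutoff-based reannotation described above can be sketched as follows. This is a simplified illustration, not the authors' scripts. The score here is the mean expression of the signature genes minus the mean of a fixed control set, whereas scanpy.tl.score_genes samples its reference genes at random from expression-matched bins. All names, cluster ids, labels and cutoffs below are hypothetical.

```python
def signature_score(expression, signature, control):
    """Mean expression over signature genes minus mean over control genes."""
    sig = [expression[g] for g in signature if g in expression]
    ctl = [expression[g] for g in control if g in expression]
    if not sig or not ctl:
        raise ValueError("signature and control must each match at least one gene")
    return sum(sig) / len(sig) - sum(ctl) / len(ctl)


def reannotate(cells, cluster, rules):
    """Relabel cells of one cluster using the first rule whose cutoffs all hold.

    cells: list of dicts with keys "cluster", "label" and "scores"
    rules: ordered list of (new_label, {signature_name: minimum_score})
    """
    for cell in cells:
        if cell["cluster"] != cluster:
            continue                                 # only the selected cluster is touched
        for label, cutoffs in rules:
            if all(cell["scores"].get(name, float("-inf")) >= cut
                   for name, cut in cutoffs.items()):
                cell["label"] = label
                break                                # first matching rule wins
    return cells
```

Cells outside the selected cluster, and cells that satisfy no rule, keep their existing label, which mirrors the idea of curating one cluster at a time.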