Orienting Ordered Scaffolds: Complexity and Algorithms
Despite the recent progress in genome sequencing and assembly, many of the
currently available assembled genomes come in a draft form. Such draft genomes
consist of a large number of genomic fragments (scaffolds), whose order and/or
orientation (i.e., strand) in the genome are unknown. There exist various
scaffold assembly methods, which attempt to determine the order and orientation
of scaffolds along the genome chromosomes. Some of these methods (e.g., based
on FISH physical mapping, chromatin conformation capture, etc.) can infer the
order of scaffolds, but not necessarily their orientation. This leads to a
special case of the scaffold orientation problem (i.e., deducing the
orientation of each scaffold) with a known order of the scaffolds.
We address the problem of orienting ordered scaffolds as an optimization
problem based on given weighted orientations of scaffolds and their pairs
(e.g., coming from paired-end sequencing reads, long reads, or homologous
relations). We formalize this problem using the notion of a scaffold graph (i.e., a
graph, where vertices correspond to the assembled contigs or scaffolds and
edges represent connections between them). We prove that this problem is
NP-hard, and present a polynomial-time algorithm for solving its special case,
where the orientation of each scaffold is imposed relative to at most two other
scaffolds. We further develop an FPT algorithm for the general case of the
orienting ordered scaffolds (OOS) problem.
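As a rough illustration of the optimization view described in the abstract above, the sketch below exhaustively scores every orientation assignment on a tiny instance. The objective (sum of the weights of all satisfied single-scaffold and pairwise orientation preferences), the function name `best_orientation`, and the input encoding are assumptions made for this example and are not taken from the paper; exhaustive search is exponential and is shown only to make the problem statement concrete.

```python
from itertools import product
from typing import Dict, Tuple

Unary = Dict[Tuple[int, str], float]              # (scaffold, orientation) -> weight
Pairwise = Dict[Tuple[int, str, int, str], float] # (i, o_i, j, o_j) -> weight


def best_orientation(n: int, unary: Unary, pairwise: Pairwise):
    """Brute-force search over all 2^n orientations of n ordered scaffolds.

    Hypothetical objective: total weight of the preferences that the
    assignment satisfies. Returns (assignment, score).
    """
    best_score, best_assign = float("-inf"), None
    for assign in product("+-", repeat=n):
        # Weight of satisfied single-scaffold preferences.
        score = sum(w for (i, o), w in unary.items() if assign[i] == o)
        # Weight of satisfied pairwise preferences.
        score += sum(
            w
            for (i, oi, j, oj), w in pairwise.items()
            if assign[i] == oi and assign[j] == oj
        )
        if score > best_score:
            best_score, best_assign = score, assign
    return best_assign, best_score


# Toy instance with three scaffolds (all weights are made up).
unary = {(0, "+"): 2.0, (1, "-"): 1.0}
pairwise = {
    (0, "+", 1, "+"): 3.0,
    (1, "+", 2, "-"): 2.0,
    (1, "-", 2, "+"): 1.5,
}
print(best_orientation(3, unary, pairwise))
```

On this toy input the highest-scoring assignment orients scaffolds 0 and 1 forward and scaffold 2 in reverse, because the two pairwise preferences it satisfies outweigh the single-scaffold preference for reversing scaffold 1.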
Endometrial receptivity in women of reproductive age with "thin" and "absolutely thin" endometrium
Aim. To evaluate the expression of steroid receptors (estrogen [ER] and progesterone [PR]) in the endometrium during the implantation window in females with a history of fertility disorders in "thin" and "absolutely thin" endometrium versus healthy females.
Materials and methods. A prospective comparative study was conducted. The study group (n=42) included patients with "thin" endometrium (M-echo between 5 and 7 mm on cycle days 11–13 according to ultrasound); the comparison group (n=10) included females with "absolutely thin" endometrium (5 mm according to ultrasound in the pre-ovulatory days); females in both groups had a history of unexplained infertility and miscarriage. The control group included 16 healthy fertile females. A Pipelle biopsy of the uterine mucosa was performed on days 6–8 after ovulation, and a peripheral blood sample was obtained to measure the concentration of sex steroids (estradiol [E2] and progesterone [P]). Endometrial samples were examined by histological and immunohistochemical methods (ER, PR expression).
Results. All study participants had an ovulatory cycle of P16.1 nmol/L (days 6–8 after ovulation) and normal estrogen levels (E2, pmol/L). E2/P was similar in all cohorts (p > 0.05 for all measures). ER and PR expression in the endometrium similar to that in healthy females was detected in 20% of patients in the study and comparison groups (M-echo = 4.83.1 mm): 21% (9/42) and 20% (2/10), respectively. ER and PR expression in the endometrial glands and ER expression in the endometrial stroma were significantly different (p < 0.05) from healthy females in 79% (41/52) of patients with "thin" endometrium and 80% (8/10) of patients with "absolutely thin" endometrium. No differences in ER or PR expression in the endometrium were found among females with hypoplastic endometrium (p > 0.05).
Conclusion. The M-echo value does not accurately determine endometrial hormonal-receptor abnormalities: 20% of the study participants with hypoplastic endometrium had ER and PR expression comparable to that in healthy females. No differences were found in the expression of endometrial estrogen and progesterone receptors in females with "thin" and "absolutely thin" endometrium.
Jasmine: Population-scale structural variant comparison and analysis
The increasing availability of long reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine (https://github.com/mkirsche/Jasmine), a
fast and accurate method for SV refinement, comparison, and population analysis. Using an SV proximity
graph, Jasmine outperforms five widely-used comparison methods, including reducing the rate of Mendelian
discordance in trio datasets by more than five-fold, and reveals a set of high confidence de novo SVs
confirmed by multiple long-read technologies. We also present a harmonized callset of 205,192 SVs from 31
samples of diverse ancestry sequenced with long reads. We genotype these SVs in 444 short read samples
from the 1000 Genomes Project with both DNA and RNA sequencing data and assess their widespread impact
on gene expression, including within several medically relevant genes.
On pairwise distances and median score of three genomes under DCJ
In comparative genomics, the rearrangement distance between two genomes
(equal to the minimal number of genome rearrangements required to transform them
into a single genome) is often used for measuring their evolutionary
remoteness. Generalization of this measure to three genomes is known as the
median score (while a resulting genome is called a median genome). In contrast to
the rearrangement distance between two genomes, which can be computed in linear
time, computing the median score for three genomes is NP-hard. This inspires a
quest for simpler and faster approximations for the median score, the most
natural of which appears to be the halved sum of pairwise distances, which in
fact represents a lower bound for the median score.
In this work, we study the relationship and interplay of pairwise distances
between three genomes and their median score under the model of
Double-Cut-and-Join (DCJ) rearrangements. Most remarkably, we show that while a
rearrangement may change the sum of pairwise distances by at most 2 (and thus
change the lower bound by at most 1), even the most "powerful" rearrangements
in this respect that increase the lower bound by 1 (by moving one genome
farther away from each of the other two genomes), which we call strong, do not
necessarily affect the median score. This observation implies that the two
measures are not as well-correlated as one's intuition may suggest.
We further prove that the median score attains the lower bound exactly on the
triples of genomes that can be obtained from a single genome with strong
rearrangements. While the sum of pairwise distances with the factor 2/3
represents an upper bound for the median score, its tightness remains unclear.
Nonetheless, we show that the difference of the median score and its lower
bound is not bounded by a constant.
Comment: Proceedings of the 10th Annual RECOMB Satellite Workshop on
Comparative Genomics (RECOMB-CG), 2012 (to appear).
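The lower and upper bounds mentioned in the abstract above both follow from the triangle inequality for the DCJ distance. A short derivation sketch is given below; the notation $d(\cdot,\cdot)$ for the pairwise distance and $\mathrm{ms}(A,B,C)$ for the median score of genomes $A$, $B$, $C$ is introduced here for illustration and is not taken from the paper.

```latex
\begin{align*}
\mathrm{ms}(A,B,C) &= \min_{M}\,\bigl[d(A,M)+d(B,M)+d(C,M)\bigr],\\[2pt]
% Lower bound: apply d(X,Y) \le d(X,M)+d(M,Y) to each of the three pairs and add.
d(A,B)+d(B,C)+d(A,C) &\le 2\,\bigl[d(A,M)+d(B,M)+d(C,M)\bigr]
  \quad\text{for every genome } M,\\[2pt]
% Upper bound: M=A, M=B, M=C give the scores d(A,B)+d(A,C), d(A,B)+d(B,C),
% d(A,C)+d(B,C); the minimum of the three is at most their average.
\tfrac{1}{2}\bigl[d(A,B)+d(B,C)+d(A,C)\bigr]
  &\le \mathrm{ms}(A,B,C)
  \le \tfrac{2}{3}\bigl[d(A,B)+d(B,C)+d(A,C)\bigr].
\end{align*}
```

For hypothetical pairwise distances 4, 5, and 7, the sum is 16, so these bounds place the median score between 8 and 32/3.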
SVCollector: Optimized sample selection for cost-efficient long-read population sequencing
An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g. microarrays, exome capture, short-read WGS), from which a few individuals are selected for resequencing using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, studies of human variation have historically focused on individuals with European ancestry, but this represents a small fraction of the overall diversity. To address this goal, SVCollector (https://github.com/fritzsedlazeck/SVCollector) identifies the optimal subset of individuals for resequencing. SVCollector analyzes a population-level VCF file from a low-resolution genotyping study. It then computes a ranked list of samples that maximizes the total number of variants present in a subset of a given size. To solve this optimization problem, SVCollector implements a fast greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector to simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3K Rice Genomes Project and show that the rankings it computes are more representative than widely used naive strategies. Notably, we show that when selecting an optimal subset of 100 samples in these two cohorts, SVCollector identifies individuals from every subpopulation while naive methods yield an unbalanced selection. Finally, we show that the number of variants present in cohorts of different sizes selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
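The greedy heuristic mentioned in the abstract above can be pictured as a standard greedy maximum-coverage procedure: repeatedly pick the sample that contributes the most variants not yet covered. The sketch below is a minimal illustration under that assumption and is not SVCollector's actual code; the function name `greedy_rank`, the in-memory dictionary input (instead of a VCF file), and the tie-breaking by sample name are all choices made for this example.

```python
from typing import Dict, List, Set, Tuple


def greedy_rank(genotypes: Dict[str, Set[str]], k: int) -> List[Tuple[str, int]]:
    """Rank up to k samples greedily by the number of new variants they add.

    genotypes maps a sample name to the set of variant IDs it carries.
    Returns (sample, cumulative number of distinct variants) pairs.
    """
    covered: Set[str] = set()
    remaining = dict(genotypes)  # shallow copy so the input is not modified
    ranking: List[Tuple[str, int]] = []
    while remaining and len(ranking) < k:
        # Most new variants first; ties are broken alphabetically by name.
        best = min(remaining, key=lambda s: (-len(remaining[s] - covered), s))
        covered |= remaining.pop(best)
        ranking.append((best, len(covered)))
    return ranking


# Toy cohort (sample names and variant IDs are made up).
toy = {
    "s1": {"v1", "v2", "v3"},
    "s2": {"v3", "v4"},
    "s3": {"v1", "v2"},
    "s4": {"v4", "v5", "v6"},
}
print(greedy_rank(toy, 3))
```

In the toy cohort, `s1` and `s4` together already cover all six variants, so the third pick adds nothing new. A greedy ranking of this kind is fast but not guaranteed to be optimal, which is consistent with the abstract also describing an exact integer linear programming formulation.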
Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing
Advancing crop genomics requires efficient genetic systems enabled by high-quality personalized genome assemblies. Here, we introduce RagTag, a toolset for automating assembly scaffolding and patching, and we establish chromosome-scale reference genomes for the widely used tomato genotype M82 along with Sweet-100, a new rapid-cycling genotype that we developed to accelerate functional genomics and genome editing in tomato. This work outlines strategies to rapidly expand genetic systems and genomic resources in other plant species.
Multi-tissue integrative analysis of personal epigenomes
Evaluating the impact of genetic variants on transcriptional regulation is a central goal in biological science that has been constrained by reliance on a single reference genome. To address this, we constructed phased, diploid genomes for four cadaveric donors (using long-read sequencing) and systematically charted noncoding regulatory elements and transcriptional activity across more than 25 tissues from these donors. Integrative analysis revealed over a million variants with allele-specific activity, coordinated, locus-scale allelic imbalances, and structural variants impacting proximal chromatin structure. We relate the personal genome analysis to the ENCODE encyclopedia, annotating allele- and tissue-specific elements that are strongly enriched for variants impacting expression and disease phenotypes. These experimental and statistical approaches, and the corresponding EN-TEx resource, provide a framework for personalized functional genomics
Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples.
Many cancer genomes are extensively rearranged with aberrant chromosomal karyotypes. Deriving these karyotypes from high-throughput DNA sequencing of bulk tumor samples is complicated because most tumors are a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic mutations. We introduce a new algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes from DNA sequencing data from a bulk tumor sample. RCK leverages evolutionary constraints on the somatic mutational process in cancer to reduce ambiguity in the deconvolution of admixed sequencing data into multiple haplotype-specific cancer karyotypes. RCK models mixtures containing an arbitrary number of derived genomes and allows the incorporation of information both from short-read and long-read DNA sequencing technologies. We compare RCK to existing approaches on 17 primary and metastatic prostate cancer samples. We find that RCK infers cancer karyotypes that better explain the DNA sequencing data and conform to a reasonable evolutionary model. RCK's reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is freely available as open source software