The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization.
Sorghum bicolor is a drought-tolerant C4 grass used for the production of grain, forage, sugar, and lignocellulosic biomass and a genetic model for C4 grasses due to its relatively small genome (approximately 800 Mbp), diploid genetics, diverse germplasm, and colinearity with other C4 grass genomes. In this study, deep sequencing, genetic linkage analysis, and transcriptome data were used to produce and annotate a high-quality reference genome sequence. Reference genome sequence order was improved, 29.6 Mbp of additional sequence was incorporated, the number of genes annotated increased 24% to 34,211, average gene length and N50 increased, and error frequency was reduced 10-fold to 1 per 100 kbp. Subtelomeric repeats with characteristics of Tandem Repeats in Miniature (TRIM) elements were identified at the termini of most chromosomes. Nucleosome occupancy predictions identified nucleosomes positioned immediately downstream of transcription start sites and at different densities across chromosomes. Alignment of more than 50 resequenced genomes from diverse sorghum genotypes to the reference genome identified approximately 7.4 M single nucleotide polymorphisms (SNPs) and 1.9 M indels. Large-scale variant features in euchromatin were identified with periodicities of approximately 25 kbp. A transcriptome atlas of gene expression was constructed from 47 RNA-seq profiles of growing and developed tissues of the major plant organs (roots, leaves, stems, panicles, and seed) collected during the juvenile, vegetative, and reproductive phases. Analysis of the transcriptome data indicated that tissue type and protein kinase expression had large influences on transcriptional profile clustering. The updated assembly, annotation, and transcriptome data represent a resource for C4 grass research and crop improvement.
The Douglas-Fir Genome Sequence Reveals Specialization of the Photosynthetic Apparatus in Pinaceae.
A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceed those of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50 = 340,704 bp). Incremental improvements in sequencing and assembly technologies are in part responsible for the higher quality reference genome, but it may also be due to a slightly lower exact repeat content in Douglas-fir vs. pine and spruce. Comparative genome annotation with angiosperm species reveals gene-family expansion and contraction in Douglas-fir and other conifers, which may account for some of the major morphological and physiological differences between the two major plant groups. Notable differences in the size of the NDH-complex gene family and genes underlying the functional basis of shade tolerance/intolerance were observed. This reference genome sequence not only provides an important resource for Douglas-fir breeders and geneticists but also sheds additional light on the evolutionary processes that have led to the divergence of modern angiosperms from the more ancient gymnosperms.
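For readers unfamiliar with the N50 statistic quoted in the abstract above, the following is a minimal sketch of one common definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions, not data or code from the Douglas-fir study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Definition assumed here: sort lengths from longest to shortest and
    return the length at which the running total first reaches half of
    the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, not real data: total length is 300,
# so the threshold of 150 is reached after the contigs of 80 and 70.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

Because the statistic depends only on the length distribution, it says nothing about correctness of the assembled sequence, which is why abstracts such as this one report it alongside other quality indicators.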
Reference genome and comparative genome analysis for the WHO reference strain for Mycobacterium bovis BCG Danish, the present tuberculosis vaccine
Background: Mycobacterium bovis bacillus Calmette-Guerin (M. bovis BCG) is the only vaccine available against tuberculosis (TB). In an effort to standardize vaccine production, three substrains, i.e. BCG Danish 1331, Tokyo 172-1 and Russia BCG-1, were established as the WHO reference strains. Reference genomes exist for both BCG Tokyo 172-1 and Russia BCG-1, but not for BCG Danish. In this study, we set out to determine the completely assembled genome sequence for BCG Danish and to establish a workflow for genome characterization of engineering-derived vaccine candidate strains.
Results: By combining second-generation (Illumina) and third-generation (PacBio) sequencing in an integrated genome analysis workflow for BCG, we could construct the completely assembled genome sequence of BCG Danish 1331 (07/270) (and of an engineered derivative that is studied as an improved vaccine candidate, a SapM KO), including the resolution of the analytically challenging long duplication regions. We report the presence of a DU1-like duplication in BCG Danish 1331, although this tandem duplication was previously thought to be restricted to BCG Pasteur. Furthermore, comparative genome analyses of publicly available data for BCG substrains showed the absence of a DU1 in certain BCG Pasteur substrains and the presence of a DU1-like duplication in some BCG China substrains. By integrating publicly available data, we provide an update to the genome features of the commonly used BCG strains.
Conclusions: We demonstrate how this analysis workflow enables the resolution of genome duplications and of the genome of engineered derivatives of the BCG Danish vaccine strain. The BCG Danish WHO reference genome will serve as a reference for future engineered strains, and the established workflow can be used to enhance BCG vaccine standardization.
Complete genome sequence of the Medicago microsymbiont Ensifer (Sinorhizobium) medicae strain WSM419
Ensifer (Sinorhizobium) medicae is an effective nitrogen-fixing microsymbiont of a diverse range of annual Medicago (medic) species. Strain WSM419 is an aerobic, motile, non-spore-forming, Gram-negative rod isolated from an M. murex root nodule collected in Sardinia, Italy in 1981. WSM419 was manufactured commercially in Australia as an inoculant for annual medics from 1985 to 1993 due to its nitrogen fixation, saprophytic competence and acid tolerance properties. Here we describe the basic features of this organism, together with the complete genome sequence and annotation. This is the first report of a complete genome sequence for a microsymbiont of the group of annual medic species adapted to acid soils. We reveal that its genome size is 6,817,576 bp, encoding 6,518 protein-coding genes and 81 RNA-only encoding genes. The genome contains a chromosome of size 3,781,904 bp and 3 plasmids of size 1,570,951 bp, 1,245,408 bp and 219,313 bp. The smallest plasmid is a feature unique to this medic microsymbiont.
Efficient compression of biological sequences using a neural network
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acids models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
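To make the idea of "mixing the opinions of multiple specific models" concrete, the sketch below combines the next-symbol probabilities of a few simple context models through a single softmax layer trained online. It is an illustration under stated assumptions, not the GeCo3 or AC2 implementation: the class names `ContextModel` and `SoftmaxMixer`, the function `estimate_bits`, the add-one smoothing, the chosen orders, and the learning rate are all hypothetical, and a single linear layer stands in for the neural network used by the actual tools.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k frequency model with add-one smoothing (hypothetical stand-in expert)."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: [1] * len(ALPHABET))

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


class SoftmaxMixer:
    """Single-layer mixer: softmax over weighted log-probabilities of the experts."""

    def __init__(self, n_models, lr=0.05):
        self.w = [1.0 / n_models] * n_models
        self.lr = lr  # learning rate, an arbitrary illustrative value

    def mix(self, dists):
        logits = [
            sum(w * math.log(d[s]) for w, d in zip(self.w, dists))
            for s in range(len(ALPHABET))
        ]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, dists, mixed, target):
        # Gradient of the cross-entropy loss with respect to each expert weight.
        for i, d in enumerate(dists):
            grad = sum(
                (mixed[s] - (1.0 if s == target else 0.0)) * math.log(d[s])
                for s in range(len(ALPHABET))
            )
            self.w[i] -= self.lr * grad


def estimate_bits(sequence, orders=(1, 2, 4)):
    """Estimate the ideal code length, in bits, of `sequence` under the mixed model."""
    models = [ContextModel(k) for k in orders]
    mixer = SoftmaxMixer(len(models))
    max_order = max(orders)
    bits = 0.0
    for pos, symbol in enumerate(sequence):
        history = sequence[max(0, pos - max_order):pos]
        dists = [m.predict(history) for m in models]
        mixed = mixer.mix(dists)
        target = ALPHABET.index(symbol)
        bits += -math.log2(mixed[target])  # cost an ideal arithmetic coder would pay
        mixer.update(dists, mixed, target)
        for m in models:
            m.update(history, symbol)
    return bits


if __name__ == "__main__":
    toy = "ACGT" * 200  # highly repetitive toy input, not real genomic data
    print(estimate_bits(toy) / len(toy), "bits per base (2.0 would be no compression)")
```

The mixer only ever sees the probability vectors produced by the experts, which mirrors the portability claim made in the abstract: any model that outputs a distribution over the next symbol could, in principle, be plugged in.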
Findings: We benchmark GeCo3 as a reference-free DNA compressor
in five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 in four
datasets constituted by the pairwise compression of the chromosomes
of the genomes of several primates. GeCo3 improves compression by
12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of
this compression improvement is some additional computational time
(1.7× to 3.0× slower than GeCo2). RAM use is constant, and the tool
scales efficiently, independently of the sequence size. Overall, the
compression ratios outperform the state of the art. For AC2, the improvements and
costs over AC are similar, which allows the tool to also outperform the
state-of-the-art.
Conclusions: GeCo3 and AC2 are biosequence compressors with
a neural network mixing approach that provides additional gains over
top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
Master's in Computer and Telematics Engineering