A transcriptional sketch of a primary human breast cancer by 454 deep sequencing
Background: The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. Results: We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in eight additional primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas.
Conclusion: Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling
Community-driven development for computational biology at Sprints, Hackathons and Codefests
Background: Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches. Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer practical skills and deeper insights into a technology, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large. Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together. Results: This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions.
We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities. Conclusions: Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects
Characterization of Nucleotide Misincorporation Patterns in the Iceman's Mitochondrial DNA
BACKGROUND: The degradation of DNA represents one of the main issues in the genetic analysis of archeological specimens. In recent years, a particular kind of post-mortem DNA modification giving rise to nucleotide misincorporation ("miscoding lesions") has been the object of extensive investigations. METHODOLOGY/PRINCIPAL FINDINGS: To improve our knowledge regarding the nature and incidence of ancient DNA nucleotide misincorporations, we have utilized 6,859 (629,975 bp) mitochondrial (mt) DNA sequences obtained from the 5,350-5,100-years-old, freeze-desiccated human mummy popularly known as the Tyrolean Iceman or Otzi. To generate the sequences, we have applied a mixed PCR/pyrosequencing procedure allowing one to obtain a particularly high sequence coverage. As a control, we have produced a further 8,982 (805,155 bp) mtDNA sequences from a contemporary specimen using the same system and starting from the same template copy number as the ancient sample. From the analysis of the nucleotide misincorporation rate in ancient, modern, and putative contaminant sequences, we observed that the rate of misincorporation is significantly lower in modern and putative contaminant sequence datasets than in ancient sequences. In contrast, type 2 transitions represent the vast majority (85%) of the observed nucleotide misincorporations in ancient sequences. CONCLUSIONS/SIGNIFICANCE: This study provides a further contribution to the knowledge of nucleotide misincorporation patterns in DNA sequences obtained from freeze-preserved archeological specimens. In the Iceman system, ancient sequences can be clearly distinguished from contaminants on the basis of nucleotide misincorporation rates. This observation confirms a previous identification of the ancient mummy sequences made on a purely phylogenetic basis.
The present investigation provides further indication that the majority of ancient DNA damage is reflected by type 2 (cytosine→thymine/guanine→adenine) transitions and that type 1 transitions are essentially PCR artifacts
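The classification described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes a read that is already aligned base-for-base to its reference, and it assumes the usual convention that type 1 transitions are adenine→guanine/thymine→cytosine (the abstract itself only spells out type 2). The function names are hypothetical.

```python
from collections import Counter

# Transition classes as (reference base, observed base) pairs.
# TYPE2 follows the abstract; TYPE1 is the assumed complementary class.
TYPE1 = {("A", "G"), ("T", "C")}
TYPE2 = {("C", "T"), ("G", "A")}


def classify_misincorporations(reference, read):
    """Count mismatches between an aligned reference and read of equal length.

    Gap ('-') and ambiguous ('N') positions are skipped.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must be aligned to equal length")
    counts = Counter()
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == read_base:
            continue
        if ref_base in "-N" or read_base in "-N":
            continue
        pair = (ref_base, read_base)
        if pair in TYPE1:
            counts["type1"] += 1
        elif pair in TYPE2:
            counts["type2"] += 1
        else:
            counts["transversion"] += 1
    return counts


def misincorporation_rate(counts, aligned_bases):
    """Mismatches per aligned base; aligned_bases must be positive."""
    return sum(counts.values()) / aligned_bases
```

Comparing the per-dataset rate and the share of `type2` counts between ancient, modern and putative contaminant reads is the kind of contrast the abstract reports.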
Computational pan-genomics: Status, promises and challenges
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
BACKGROUND: Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. For querying biological annotations with Semantic Web technologies, there was no standard that described this potentially complex location information as subject-predicate-object triples. DESCRIPTION: We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. CONCLUSIONS: Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores
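To make the idea of a location expressed as triples concrete, the following Python sketch emits a small Turtle document for one feature. It is illustrative only: the function name and the example IRIs are hypothetical, and only a handful of FALDO terms (`faldo:location`, `faldo:Region`, `faldo:begin`, `faldo:end`, `faldo:ExactPosition`, `faldo:position`, `faldo:reference`, and the strand classes) are used.

```python
FALDO = "http://biohackathon.org/resource/faldo#"


def faldo_region_turtle(feature_iri, reference_iri, begin, end, forward=True):
    """Return a Turtle snippet locating one feature on one reference sequence.

    begin and end are written as given; no coordinate conversion is applied.
    """
    strand = (
        "faldo:ForwardStrandPosition" if forward else "faldo:ReverseStrandPosition"
    )

    def position(value):
        # One blank node per boundary: an exact, stranded position.
        return (
            f"[ a faldo:ExactPosition, {strand} ; "
            f"faldo:position {int(value)} ; "
            f"faldo:reference <{reference_iri}> ]"
        )

    lines = [
        f"@prefix faldo: <{FALDO}> .",
        "",
        f"<{feature_iri}> faldo:location [",
        "    a faldo:Region ;",
        f"    faldo:begin {position(begin)} ;",
        f"    faldo:end {position(end)}",
        "] .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    print(
        faldo_region_turtle(
            "http://example.org/feature/geneX",
            "http://example.org/sequence/chr1",
            1000,
            2000,
        )
    )
```

Because the position is attached to a reference sequence IRI instead of a file-specific coordinate field, the same pattern can describe annotations that originate from different file formats.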
The Ruby UCSC API: accessing the UCSC genome database using Ruby
Background: The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby. Results: The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast. The API uses the bin index, if available, when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby). Conclusions: Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/
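The bin index mentioned in this abstract can be sketched in Python as follows. This is not code from the Ruby UCSC API; it assumes the standard UCSC binning scheme (smallest bins of 128 kb, five levels, coordinates below 512 Mb) with 0-based, half-open intervals, and the function names are hypothetical.

```python
# Offsets of the first bin at each level, from the finest (128 kb) level
# to the single bin that spans the whole 512 Mb range.
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, size of the smallest bins
BIN_NEXT_SHIFT = 3    # each coarser level is 8 times larger


def bin_from_range(start, end):
    """Smallest bin that fully contains the half-open interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval exceeds the 512 Mb range of the standard scheme")


def overlapping_bins(start, end):
    """All bins, at every level, that may hold features overlapping [start, end)."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins
```

An interval query can then restrict the `bin` column to the values from `overlapping_bins` before comparing start and end coordinates, so the database only scans a few small bins per level and not the whole chromosome.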
Whole-Genome Pyrosequencing of an Epidemic Multidrug-Resistant Acinetobacter baumannii Strain Belonging to the European Clone II Group
The whole-genome sequence of an epidemic, multidrug-resistant Acinetobacter baumannii strain (strain ACICU) belonging to the European clone II group and carrying the plasmid-mediated blaOXA-58 carbapenem resistance gene was determined. The A. baumannii ACICU genome was compared with the genomes of A. baumannii ATCC 17978 and Acinetobacter baylyi ADP1, with the aim of identifying novel genes related to virulence and drug resistance. A. baumannii ACICU has a single chromosome of 3,904,116 bp (which is predicted to contain 3,758 genes) and two plasmids, pACICU1 and pACICU2, of 28,279 and 64,366 bp, respectively. Genome comparison showed 86.4% synteny with A. baumannii ATCC 17978 and 14.8% synteny with A. baylyi ADP1. A conspicuous number of transporters belonging to different superfamilies was predicted for A. baumannii ACICU. The relative number of transporters was much higher in ACICU than in ATCC 17978 and ADP1 (76.2, 57.2, and 62.5 transporters per Mb of genome, respectively). An antibiotic resistance island, AbaR2, was identified in ACICU and had plausibly evolved by reductive evolution from the AbaR1 island previously described in multiresistant strain A. baumannii AYE. Moreover, 36 putative alien islands (pAs) were detected in the ACICU genome; 24 of these had previously been described in the ATCC 17978 genome, 4 are proposed here for the first time and are present in both ATCC 17978 and ACICU, and 8 are unique to the ACICU genome. Fifteen of the pAs in the ACICU genome encode genes related to drug resistance, including membrane transporters and ex novo acquired resistance genes. These findings provide novel insight into the genetic basis of A. baumannii resistance