Improving dbNSFP
Mingyao Lu, B.S.
Advisory Professor: Xiaoming Liu, Ph.D.
The analysis and interpretation of DNA variation are central to whole exome sequencing (WES) studies. Genome research has focused largely on single nucleotide variants (SNVs). Because indels are as important as SNVs, and indels in coding regions in particular are often candidate disease-causing variants, it is necessary to expand the focus to include indel mutations.
The goal of my project is to provide an automatic annotation pipeline for WES-based disease studies by extending dbNSFP with a tool for automated indel annotation and deleteriousness prediction. Current sequencing results typically include both SNVs and indels. Although many tools are available that integrate functional predictions and annotations for SNV effects, to my knowledge there are no such tools for indels. Therefore, the aim of this thesis was to add deleteriousness prediction scores, including CADD, SIFT, and PROVEAN, to indel annotation based on gene models. All of these scores can be calculated on the fly after the required resources are installed locally. A Docker image implementing the indel annotation and deleteriousness prediction has been developed and is ready to be deployed from the cloud.
AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Finding the Core-Genes of Chloroplasts
Due to the recent evolution of sequencing techniques, the number of available
genomes is rising steadily, making large-scale genomic comparisons between
sets of closely related species possible. An interesting question to answer
is: which genes provide the functionality common to a collection of species
or, conversely, what is specific to a given species when compared to others
belonging to the same genus, family, etc.? Investigating such a problem means
finding both the core and pan genomes of a collection of species, i.e., the
genes common to all the species vs. the set of all genes in all species under
consideration. However, obtaining trustworthy core and pan genomes is not an
easy task: it requires a large amount of computation and a rigorous
methodology. Surprisingly, as far as we know, the methodology for finding core
and pan genomes has not been investigated in depth. This research work tries
to fill this gap by focusing only on chloroplast genomes, whose reasonable
sizes allow a deep study. To achieve this goal, a collection of 99
chloroplasts is considered in this article. Two methodologies have been
investigated, based respectively on sequence similarities and on gene names
taken from annotation tools. The obtained results will finally be evaluated
in terms of biological relevance.
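The core and pan genome definitions in this abstract map directly onto set intersection and set union. The following is only a minimal sketch of that definition, not the authors' pipeline: it assumes each genome has already been reduced to a set of gene names (the name-based approach the abstract mentions), and the function name `core_and_pan`, the species labels, and the toy gene assignments are hypothetical.

```python
from typing import Dict, Set, Tuple


def core_and_pan(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) gene sets for a collection of genomes.

    core = genes present in every genome (set intersection)
    pan  = genes present in at least one genome (set union)
    """
    if not genomes:
        return set(), set()
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan


# Toy input: hypothetical species with illustrative gene-name sets.
genomes = {
    "species_a": {"rbcL", "psbA", "matK"},
    "species_b": {"rbcL", "psbA", "ndhF"},
    "species_c": {"rbcL", "psbA", "matK", "ycf1"},
}

core, pan = core_and_pan(genomes)
print("core:", sorted(core))
print("pan:", sorted(pan))
```

In practice the hard part is deciding when two genes from different genomes count as "the same" (by sequence similarity or by annotation name), which is exactly the methodological question the abstract raises; the set operations only apply once that matching has been done.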
QueryOR: a comprehensive web platform for genetic variant analysis and prioritization
Background: Whole genome and exome sequencing are contributing to the extraordinary progress in the study of
human genetic variants. In this fast-developing field, appropriate and easily accessible tools are required to facilitate
data analysis.
Results: Here we describe QueryOR, a web platform suitable for searching among known candidate genes as well
as for finding novel gene-disease associations. QueryOR combines several innovative features that make it comprehensive,
flexible and easy to use. Instead of being designed on specific datasets, it works on a general XML schema specifying
formats and criteria of each data source. Thanks to this flexibility, new criteria can be easily added for future
expansion. Currently, up to 70 user-selectable criteria are available, including a wide range of gene and variant features.
Moreover, rather than progressively discarding variants taking one criterion at a time, the prioritization is achieved by a
global positive selection process that considers all transcript isoforms, thus producing reliable results. QueryOR is easy
to use, and its intuitive interface handles different kinds of inheritance as well as features related to sharing
variants in different patients. QueryOR is suitable for investigating single patients, families or cohorts.
Conclusions: QueryOR is a comprehensive and flexible web platform suited to easy, user-driven variant
prioritization. It is freely available for academic institutions at http://queryor.cribi.unipd.it/
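The contrast the abstract draws between discarding variants one criterion at a time and ranking them by a global positive selection can be illustrated with a small sketch. This is not QueryOR's implementation: the scoring rule (count the criteria met by at least one transcript isoform), the field names `effect` and `allele_freq`, and the function names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Isoform = Dict[str, object]
Criterion = Callable[[Isoform], bool]


def score_variant(isoforms: List[Isoform], criteria: List[Criterion]) -> int:
    """Count the criteria satisfied by at least one isoform of the variant."""
    return sum(1 for crit in criteria if any(crit(iso) for iso in isoforms))


def prioritize(
    variants: Dict[str, List[Isoform]], criteria: List[Criterion]
) -> List[Tuple[str, int]]:
    """Rank all variants by score (highest first) instead of filtering them out."""
    scored = [(vid, score_variant(isos, criteria)) for vid, isos in variants.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical criteria and per-isoform annotations.
criteria: List[Criterion] = [
    lambda iso: iso["effect"] == "missense",
    lambda iso: iso["allele_freq"] < 0.01,
]

variants = {
    "v1": [
        {"effect": "intronic", "allele_freq": 0.001},
        {"effect": "missense", "allele_freq": 0.001},
    ],
    "v2": [{"effect": "synonymous", "allele_freq": 0.2}],
    "v3": [{"effect": "missense", "allele_freq": 0.2}],
}

ranked = prioritize(variants, criteria)
print(ranked)
```

Under a sequential filter that looked only at the first isoform, `v1` would be discarded as intronic; scoring across all isoforms keeps it and ranks it first, which is the kind of behavior the abstract attributes to considering all transcript isoforms.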
No wisdom in the crowd: genome annotation at the time of big data - current status and future prospects
Science and engineering rely on the accumulation
and dissemination of knowledge to make discoveries
and create new designs. Discovery-driven genome
research rests on knowledge passed on via gene
annotations. In response to the deluge of sequencing
big data, standard annotation practice employs automated
procedures that rely on majority rules. We
argue this hinders progress through the generation
and propagation of errors, leading investigators into
blind alleys. More subtly, this inductive process discourages
the discovery of novelty, which remains
essential in biological research and reflects the nature
of biology itself. Annotation systems, rather than
being repositories of facts, should be tools that support
multiple modes of inference. By combining
deduction, induction and abduction, investigators can
generate hypotheses when accurate knowledge is
extracted from model databases. A key stance is to
depart from ‘the sequence tells the structure tells the
function’ fallacy, placing function first. We illustrate
our approach with examples of critical or unexpected
pathways, using MicroScope to demonstrate how
tools can be implemented following the principles we
advocate. We end with a challenge to the reader.
Improved annotation of 3' untranslated regions and complex loci by combination of strand-specific direct RNA sequencing, RNA-seq and ESTs
The reference annotations made for a genome sequence provide the framework
for all subsequent analyses of the genome. Correct annotation is particularly
important when interpreting the results of RNA-seq experiments where short
sequence reads are mapped against the genome and assigned to genes according to
the annotation. Inconsistencies in annotations between the reference and the
experimental system can lead to incorrect interpretation of the effect on RNA
expression of an experimental treatment or mutation in the system under study.
Until recently, the genome-wide annotation of 3-prime untranslated regions
received less attention than coding regions and the delineation of intron/exon
boundaries. In this paper, data produced for samples in Human, Chicken and A.
thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing
technology from Helicos Biosciences which locates 3-prime polyadenylation sites
to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine
examples are illustrated where this combination of data allowed: (1) gene and
3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb);
(2) disentangling of gene expression in complex regions; (3) clearer
interpretation of small RNA expression and (4) identification of novel genes.
While the specific examples displayed here may become obsolete as genome
sequences and their annotations are refined, the principles laid out in this
paper will be of general use both to those annotating genomes and those seeking
to interpret existing publicly available annotations in the context of their
own experimental data.
Comment: 44 pages, 9 figures
Bioconductor: open software development for computational biology and bioinformatics.
The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.