BIOZON: a system for unification, management and analysis of heterogeneous biological data

Authors: Birkland, Aaron; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increases rapidly. Among the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability. DESCRIPTION: Here we present a system (Biozon) that addresses these problems and offers biologists a new knowledge resource to navigate through and explore. Biozon unifies multiple biological databases consisting of a variety of data types (such as DNA sequences, proteins, interactions and cellular pathways). It is fundamentally different from previous efforts as it uses a single extensive and tightly connected graph schema wrapped with a hierarchical ontology of documents and relations. Beyond warehousing existing data, Biozon computes and stores novel derived data, such as similarity relationships and functional predictions. The integration of similarity data allows propagation of knowledge through inference and fuzzy searches. Sophisticated query methods that span multiple data types were implemented, and first-of-a-kind biological ranking systems were explored and integrated. CONCLUSION: The Biozon system is an extensive knowledge resource of heterogeneous biological data. Currently, it holds more than 100 million biological documents and 6.5 billion relations between them. The database is accessible through an advanced web interface that supports complex queries, "fuzzy" searches, data materialization and more, online at

Protein family comparison using statistical models and predicted structural information

Authors: Chung, Richard; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/11/2004

BACKGROUND: This paper presents a simple method to increase the sensitivity of protein family comparisons by incorporating secondary structure (SS) information. We build upon the effective information-theoretic approach to profile-profile comparison described in [Yona & Levitt 2002]. Our method augments profile columns using PSIPRED secondary structure predictions and assesses statistical similarity using information-theoretic principles. RESULTS: Our tests show that this tool detects more similarities between protein families of distant homology than the previous primary sequence-based method. A very significant improvement in performance is observed when the real secondary structure is used. CONCLUSIONS: Integration of primary and secondary structure information can substantially improve detection of relationships between remotely related protein families.

Automation of gene assignments to metabolic pathways using high-throughput expression data

Authors: Popescu, Liviu; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Accurate assignment of genes to pathways is essential in order to understand the functional role of genes and to map the existing pathways in a given genome. Existing algorithms predict pathways by extrapolating experimental data in one organism to other organisms for which this data is not available. However, current systems assign all genes that belong to a specific EC family to all the pathways that contain the corresponding enzymatic reaction, and thus introduce ambiguity. RESULTS: Here we describe an algorithm for assignment of genes to cellular pathways that addresses this problem by selectively assigning specific genes to pathways. Our algorithm uses the set of experimentally elucidated metabolic pathways from MetaCyc, together with statistical models of enzyme families and expression data, to assign genes to enzyme families and pathways by optimizing correlated co-expression while minimizing conflicts due to shared assignments among pathways. Our algorithm also identifies alternative ("backup") genes and addresses the multi-domain nature of proteins. We apply our model to assign genes to pathways in the yeast genome and compare the results for genes that were assigned experimentally. Our assignments are consistent with the experimentally verified assignments and reflect characteristic properties of cellular pathways. CONCLUSION: We present an algorithm for automatic assignment of genes to metabolic pathways. The algorithm utilizes expression data and reduces the ambiguity that characterizes assignments that are based only on EC numbers.
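The co-expression idea in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it only scores candidate genes for one enzyme slot by their mean Pearson correlation with genes already placed in the pathway, and it ignores conflict resolution between pathways, backup genes and multi-domain proteins. The gene names, expression values and the `best_candidate` helper are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as 0.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def best_candidate(candidates, pathway_genes, expression):
    """Pick the candidate with the highest mean correlation to the pathway genes."""
    def score(gene):
        total = sum(pearson(expression[gene], expression[p]) for p in pathway_genes)
        return total / len(pathway_genes)
    return max(candidates, key=score)

# Hypothetical expression profiles over four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "path1": [2.0, 4.0, 6.0, 8.0],
    "path2": [1.0, 3.0, 5.0, 8.0],
}
chosen = best_candidate(["geneA", "geneB"], ["path1", "path2"], expression)
```

Here `geneA` rises with both pathway genes while `geneB` falls, so `geneA` is selected for the slot.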

The distance-profile representation and its application to detection of distantly related protein families

Authors: Ku, Chin-Jen; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Detecting homology between remotely related protein families is an important problem in computational biology, since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space, and the computed scores are often noisy and frequently fail to recognize distantly related protein families. RESULTS: We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins, measured by the statistical difference between their corresponding probability distributions. CONCLUSION: Using several performance measures, we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSI-BLAST and FUGUE.
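A minimal sketch of the representation described in this abstract follows, under stated assumptions. The abstract does not say how the vectors are sparsified or which statistical difference is used; keeping the top-k scores and comparing with the Jensen-Shannon divergence are illustrative choices, and the scores and function names are hypothetical.

```python
import math

def to_distribution(scores, keep=3):
    """Keep the `keep` largest non-negative scores, zero the rest, normalize to sum 1."""
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep])
    sparse = [s if i in top else 0.0 for i, s in enumerate(scores)]
    total = sum(sparse)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    return [s / total for s in sparse]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        # Terms with x_i = 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical association scores of two query proteins against five templates.
query_1 = to_distribution([9.0, 7.5, 0.2, 0.1, 6.0])
query_2 = to_distribution([8.0, 7.0, 0.3, 0.2, 5.5])
distance = js_divergence(query_1, query_2)
```

Two proteins that score highly against the same templates end up with a small divergence even if their direct alignment score is weak, which is the intuition behind using the profile as a fingerprint.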

Hubs of knowledge: using the functional link structure in Biozon to mine for biologically significant entities

Authors: Isganitis, Timothy; Shafer, Paul; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Existing biological databases support a variety of queries such as keyword or definition search. However, they do not provide any measure of relevance for the instances reported, and result sets are usually sorted arbitrarily. RESULTS: We describe a system that builds upon the complex infrastructure of the Biozon database and applies methods similar to those of Google to rank documents that match queries. We explore different prominence models and study the spectral properties of the corresponding data graphs. We evaluate the information content of principal and non-principal eigenspaces, and test various scoring functions which combine contributions from multiple eigenspaces. We also test the effect of similarity data and other variations which are unique to the biological knowledge domain on the quality of the results. Query result sets are assessed using a probabilistic approach that measures the significance of coherence between directly connected nodes in the data graph. This model allows us, for the first time, to compare different prominence models quantitatively and effectively and to observe unique trends. CONCLUSION: Our tests show that the ranked query results outperform unsorted results with respect to our significance measure and the top ranked entities are typically linked to many other biological entities. Our study resulted in a working ranking system of biological entities that was integrated into Biozon at
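For readers unfamiliar with link-based ranking, the sketch below shows a standard PageRank power iteration on a toy graph. It illustrates only the principal-eigenvector idea; the prominence models, non-principal eigenspaces and significance measure studied in the paper are not reproduced. The node names, damping factor and `pagerank` function are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-10):
    """Rank nodes of a directed graph given as {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            for t in targets:
                new[t] += damping * rank[v] / len(targets)
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# Hypothetical data graph: two proteins linked to one pathway, one DNA record.
graph = {
    "dna_C": ["protein_A"],
    "protein_A": ["pathway_X"],
    "protein_B": ["pathway_X"],
    "pathway_X": [],
}
ranks = pagerank(graph)
top_entity = max(ranks, key=ranks.get)
```

In this toy graph the pathway node receives links from both proteins and therefore ranks highest, mirroring the abstract's observation that top-ranked entities tend to be linked to many others.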

EST2Prot: Mapping EST sequences to proteins

Authors: Lin, David M; Shafer, Paul; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. RESULTS: We describe a system (EST2Prot) that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. CONCLUSION: EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real time through a simple web interface. The system is part of the Biozon database and is accessible at

Novel subdomains of the mouse olfactory bulb defined by molecular heterogeneity in the nascent external plexiform and glomerular layers

Authors: Lin, David M; Shafer, Paul; Sickles, Heather M; Williams, Eric O; Xiao, Yuanyuan; Yang, Jean YH; Yona, Golan
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: In the mouse olfactory system, the role of the olfactory bulb in guiding olfactory sensory neuron (OSN) axons to their targets is poorly understood. What cell types within the bulb are necessary for targeting is unknown. What genes are important for this process is also unknown. Although projection neurons are not required, other cell-types within the external plexiform and glomerular layers also form synapses with OSNs. We hypothesized that these cells are important for targeting, and express spatially differentially expressed guidance cues that act to guide OSN axons within the bulb. RESULTS: We used laser microdissection and microarray analysis to find genes that are differentially expressed along the dorsal-ventral, medial-lateral, and anterior-posterior axes of the bulb. The expression patterns of these genes divide the bulb into previously unrecognized subdomains. Interestingly, some genes are expressed in both the medial and lateral bulb, showing for the first time the existence of symmetric expression along this axis. We use a regeneration paradigm to show that several of these genes are altered in expression in response to deafferentation, consistent with the interpretation that they are expressed in cells that interact with OSNs. CONCLUSION: We demonstrate that the nascent external plexiform and glomerular layers of the bulb can be divided into multiple domains based on the expression of these genes, several of which are known to function in axon guidance, synaptogenesis, and angiogenesis. These genes represent candidate guidance cues that may act to guide OSN axons within the bulb during targeting.
