Domain fusion analysis by applying relational algebra to protein sequence and domain databases
BACKGROUND: Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages these efforts will become increasingly powerful. RESULTS: This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at . CONCLUSION: As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time
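The relational formulation described in this abstract can be illustrated with a small sketch. It is not the authors' query: the table `hit`, its columns, and the function `find_domain_fusions` are hypothetical names chosen for illustration, and the rule used here (two proteins are linked when each carries only one of two domains that co-occur in a third, composite protein) is a simplified reading of domain fusion analysis. SQLite stands in for any database that supports standard SQL.

```python
import sqlite3


def find_domain_fusions(rows, organism):
    """Return (protein_a, protein_b, composite) triples.

    rows: iterable of (protein, organism, domain) domain assignments
          (hypothetical schema, e.g. derived from Pfam predictions).
    organism: organism whose proteins are reported as linked partners.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE hit (protein TEXT, organism TEXT, domain TEXT)")
    con.executemany("INSERT INTO hit VALUES (?, ?, ?)", rows)

    # c1/c2: one composite protein carrying two different domains.
    # a: a protein with the first domain but not the second.
    # b: a protein with the second domain but not the first.
    query = """
        SELECT DISTINCT a.protein, b.protein, c1.protein
        FROM hit AS c1
        JOIN hit AS c2 ON c2.protein = c1.protein AND c1.domain < c2.domain
        JOIN hit AS a  ON a.domain = c1.domain AND a.protein <> c1.protein
        JOIN hit AS b  ON b.domain = c2.domain AND b.protein <> c1.protein
        WHERE a.organism = ? AND b.organism = ?
          AND a.protein <> b.protein
          AND NOT EXISTS (SELECT 1 FROM hit AS x
                          WHERE x.protein = a.protein AND x.domain = c2.domain)
          AND NOT EXISTS (SELECT 1 FROM hit AS y
                          WHERE y.protein = b.protein AND y.domain = c1.domain)
        ORDER BY a.protein, b.protein, c1.protein
    """
    result = con.execute(query, (organism, organism)).fetchall()
    con.close()
    return result


# Toy data (invented identifiers, not taken from the paper).
example_rows = [
    ("P1", "yeast", "PF_A"),
    ("P2", "yeast", "PF_B"),
    ("P3", "yeast", "PF_A"),
    ("P3", "yeast", "PF_B"),
    ("C1", "ecoli", "PF_A"),
    ("C1", "ecoli", "PF_B"),
]
fusions = find_domain_fusions(example_rows, "yeast")
```

Because the whole analysis is a single declarative query over a table of domain assignments, rerunning it after the table is refreshed is enough to update the predictions, which is the property the conclusion of the abstract emphasizes.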
A Data Transformation System for Biological Data Sources
Scientific data of importance to biologists in the Human Genome Project resides not only in conventional databases, but in structured files maintained in a number of different formats (e.g. ASN.1 and ACE) as well as sequence analysis packages (e.g. BLAST and FASTA). These formats and packages contain a number of data types not found in conventional databases, such as lists and variants, and may be deeply nested. We present in this paper techniques for querying and transforming such data, and illustrate their use in a prototype system developed in conjunction with the Human Genome Center for Chromosome 22. We also describe optimizations performed by the system, a crucial issue for bulk data
SAFE Software and FED Database to Uncover Protein-Protein Interactions using Gene Fusion Analysis
Domain Fusion Analysis takes advantage of the fact that certain proteins in a given proteome A, are found to have statistically significant similarity with two separate proteins in another proteome B. In other words, the result of a fusion event between two separate proteins in proteome B is a specific full-length protein in proteome A. In such a case, it can be safely concluded that the protein pair has a common biological function or even interacts physically. In this paper, we present the Fusion Events Database (FED), a database for the maintenance and retrieval of fusion data both in prokaryotic and eukaryotic organisms and the Software for the Analysis of Fusion Events (SAFE), a computational platform implemented for the automated detection, filtering and visualization of fusion events (both available at: http://www.bioacademy.gr/bioinformatics/projects/ProteinFusion/index.htm). Finally, we analyze the proteomes of three microorganisms using these tools in order to demonstrate their functionality
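The detection rule stated in this abstract (one protein in proteome A showing significant similarity to two separate proteins in proteome B) can be sketched as follows. This is an illustrative sketch only and not the SAFE implementation: the `Hit` record, the `detect_fusions` function, the E-value cutoff, and the overlap rule are all assumptions. Further filtering, such as excluding component pairs that are similar to each other, is omitted.

```python
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise similarity hit (hypothetical record layout)."""
    composite: str   # protein in proteome A
    component: str   # protein in proteome B
    start: int       # aligned region on the composite, 1-based inclusive
    end: int
    evalue: float


def detect_fusions(hits, max_evalue=1e-10, max_overlap=0):
    """Return (composite, component_1, component_2) candidate fusion events.

    A composite is reported when two different components match it
    significantly on regions overlapping by at most max_overlap residues.
    """
    by_composite = {}
    for h in hits:
        if h.evalue <= max_evalue:  # keep only significant hits
            by_composite.setdefault(h.composite, []).append(h)

    events = []
    for composite, group in sorted(by_composite.items()):
        ordered = sorted(group, key=lambda h: h.start)
        for h1, h2 in combinations(ordered, 2):
            if h1.component == h2.component:
                continue
            # Length of the shared region; zero or negative means disjoint.
            overlap = min(h1.end, h2.end) - max(h1.start, h2.start) + 1
            if overlap <= max_overlap:
                events.append((composite, h1.component, h2.component))
    return events


# Toy hits (invented identifiers and scores).
example_hits = [
    Hit("A1", "B1", 1, 200, 1e-50),
    Hit("A1", "B2", 210, 400, 1e-40),
    Hit("A2", "B3", 1, 300, 1e-30),
    Hit("A2", "B4", 50, 280, 1e-30),   # overlaps B3, so no fusion call
    Hit("A3", "B5", 1, 100, 1e-3),     # not significant
    Hit("A3", "B6", 150, 300, 1e-20),
]
events = detect_fusions(example_hits)
```

Requiring the two matched regions to be essentially disjoint is one simple way to separate a genuine two-part composite from a single region that merely resembles two related proteins.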
BISON: bio-interface for the semi-global analysis of network patterns
BACKGROUND: The large amount of genomics data that has accumulated over the past decade requires extensive data mining. However, the global nature of data mining, which includes pattern mining, poses difficulties for users who want to study specific questions in a more local environment. This creates a need for techniques that allow a localized analysis of globally determined patterns. RESULTS: We developed a tool that determines and evaluates global patterns based on protein property and network information, while providing all the benefits of a perspective that is targeted at biologist users with specific goals and interests. Our tool uses our own data mining techniques, integrated into current visualization and navigation techniques. The functionality of the tool is discussed in the context of the transcriptional network of regulation in the enteric bacterium Escherichia coli. Two biological questions were asked: (i) Which functional categories of proteins (identified by hidden Markov models) are regulated by a regulator with a specific domain? (ii) Which regulators are involved in the regulation of proteins that contain a common hidden Markov model? Using these examples, we explain the gene-centered and pattern-centered analysis that the tool permits. CONCLUSION: In summary, we have a tool that can be used for a wide variety of applications in biology, medicine, or agriculture. The pattern mining engine is global in the way that patterns are determined across the entire network. The tool still permits a localized analysis for users who want to analyze a subportion of the total network. We have named the tool BISON (Bio-Interface for the Semi-global analysis Of Network patterns)
CODA: Accurate Detection of Functional Associations between Proteins in Eukaryotic Genomes Using Domain Fusion
Background: In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is however known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom. Methodology/Principal Findings: We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human with comparison to state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication. Conclusions/Significance: The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/
Large Scale Data Analytics with Language Integrated Query
Databases can easily reach petabytes (1,048,576 gigabytes) in scale. A system to enable users to efficiently retrieve or query data from multiple databases simultaneously is needed. This research introduces a new, cloud-based query framework, designed and built using Language Integrated Query, to query existing data sources without the need to integrate or restructure existing databases. Protein data obtained through the query framework proves its feasibility and cost effectiveness
Identification of Potential Drug Targets Implicated in Parkinson's Disease from Human Genome: Insights of Using Fused Domains in Hypothetical Proteins as Probes
High-throughput genome sequencing has led to data explosion in sequence databanks, with an imbalance of sequence-structure-function relationships, resulting in a substantial fraction of proteins known as hypothetical proteins. Functions of such proteins can be assigned based on the analysis and characterization of the domains that they are made up of. Domains are basic evolutionary units of proteins and most proteins contain multiple domains. A subset of multidomain proteins is fused domains (overlapping domains), wherein sequence overlaps between two or more domains occur. These fused domains are a result of gene fusion events and their implication in diseases is well established. Hence, an attempt has been made in this paper to identify the fused domain containing hypothetical proteins from human genome homologous to parkinsonian targets present in KEGG database. The results of this research identified 18 hypothetical proteins, with domains fused with ubiquitin domains and having homology with targets present in parkinsonian pathway
XML-based approaches for the integration of heterogeneous bio-molecular data
Background: Today's public database infrastructure spans a very large collection of heterogeneous biological data, opening new opportunities for molecular biology, bio-medical and bioinformatics research, but also raising new problems for their integration and computational processing. Results: In this paper we survey the most interesting and novel approaches for the representation, integration and management of different kinds of biological data by exploiting XML and the related recommendations and approaches. Moreover, we present new and interesting cutting edge approaches for the appropriate management of heterogeneous biological data represented through XML. Conclusion: XML has succeeded in the integration of heterogeneous biomolecular information, and has established itself as the syntactic glue for biological data sources. Nevertheless, a large variety of XML-based data formats have been proposed, making effective integration of bioinformatics data schemes difficult. The adoption of a few semantic-rich standard formats is urgent to achieve a seamless integration of the current biological resources.
Molecular characterization and evolutionary plasticity of protein-protein interfaces
Abstract
The sequencing of the human genome provides the parts list for understanding cellular processes. However, as 70% of eukaryotic genes work through multi-protein systems, it is only through detailed study of the interactions of these components, that a more complete, systems-level understanding can be gained. This thesis is centred on the establishment of PICCOLO - a comprehensive database of structurally characterized protein interactions. In generating the resource, issues of interface definition, quaternary structure, data redundancy, structural environment and interaction type are addressed. The resource enables a variety of analyses to be performed concerning interface properties including residue propensity, hydropathy, polarity, interface size, sequence entropy and residue contact preference.
PICCOLO has been applied to probing the patterns of substitutions that are accepted in protein interfaces across evolution, and whether these patterns are distinguishable from those seen in other structural environments. The derivation of a high-quality set of multiple structural alignments in the form of the database TOCCATA, a prerequisite for such analysis, is described, as well as procedures to derive environment-specific substitution tables.
The Blundell group has contributed a series of methods to predict the likely effect of non-synonymous Single Nucleotide Polymorphisms (nsSNPs) on protein stability, function and interactions in order to triage the large volumes of data created from high-throughput genetic screening studies, enabling prioritization of those nsSNPs most likely to be phenotypically detrimental. PICCOLO's contribution to these predictions is described.
Historically there has been little focus on protein-protein interactions as drug targets for small-molecule therapeutics. However, alanine-scanning mutagenesis studies have revealed that only a subset of residues contribute the greater part of free energy to binding - so-called "hot-spots". Molecular characterization of hot-spots performed using PICCOLO probes the molecular basis underlying this important phenomenon, leading to the possibility of predictive methods to identify hot-spots 'in silico'