635 research outputs found

SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters

Author: Lefkowitz Elliot J
Wang Chunlin
Publication venue: BioMed Central
Publication date: 01/01/2004
Field of study

BACKGROUND: Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Searching algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases due to the emergence of high throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high performance computation becomes necessary. RESULTS: We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes into consideration load balancing between each node on the cluster to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates should be accommodated easily. Benchmark experiments using QS-search to optimize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. 
We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST) that provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. CONCLUSIONS: Used together, QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist
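
The query segmentation idea described in this abstract can be sketched as follows. This is a minimal illustration, not SS-Wrapper's actual code: the abstract does not state which load-balancing heuristic is used, so the greedy longest-first assignment and the function names `parse_fasta` and `segment_queries` are assumptions made for the example.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def segment_queries(records: List[Record], n_nodes: int):
    """Split queries into n_nodes groups of roughly equal total length.

    Hypothetical heuristic: visit queries longest first and give each one
    to the node that currently holds the fewest residues.
    """
    bins: List[List[Record]] = [[] for _ in range(n_nodes)]
    loads = [0] * n_nodes
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = loads.index(min(loads))  # least-loaded node so far
        bins[i].append((header, seq))
        loads[i] += len(seq)
    return bins, loads
```

Each group would then be written to its own query file and passed unchanged to the wrapped search program on one node, which is what lets the wrapper leave the original program unmodified.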

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

High performance reconfigurable architectures for biological sequence alignment

Author: Isa Mohammad Nazrin
Publication venue: The University of Edinburgh
Publication date: 01/07/2013
Field of study

Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignments. Sequence alignment is a fundamental tool in molecular biology for searching for homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on quality of life, such as facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and drug engineering. With the vast growth of genomic data, searching for sequence homology over huge databases (often measured in gigabytes) cannot produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the human genome project (HGP), supercomputers and other parallel architectures such as special purpose Very Large Scale Integration (VLSI) chips, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there is always a trade-off between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low-level programming model compared with off-the-shelf processors such as standard microprocessors and GPUs. Due to the aforementioned limitations, the need has arisen for optimized FPGA core implementations, which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely-used sequence alignment algorithms: the Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm.
The three novel aspects of this research are firstly that the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state-of-the-art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted into the hardware architectures. Here, when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bound PE configuration time and the parallel PE configuration approach irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted onboard memory resources. Finally, a new performance metric is devised, which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) takes out advantages of the area and lithography technology of any FPGA resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 respectively for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In terms of speed-up improvements, comparisons were made on performance of the designed cores against their corresponding software and the reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x as compared to the SSEARCH 35 software. 
For the profile HMM-based sequence alignment, the designed core achieved speed-ups of 103x and 8.3x against HMMER 2.0 and the latest version of HMMER (version 3.0) respectively. On the other hand, the implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. In terms of comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved 11x speed-up after taking out the advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performance; it was noted that the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high performance computation, with the advantage of a smaller area footprint, and that they represent an economical ‘green’ solution compared to the other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort
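
For readers unfamiliar with the first of the three algorithms, the sketch below computes the Smith-Waterman local alignment score with affine gap penalties in software, using the standard three-matrix (Gotoh) recurrence. It is a plain reference formulation, not the thesis's hardware design; the scoring values and the function name `smith_waterman_affine` are assumptions chosen for illustration.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Return the best local alignment score of a and b.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    H holds the best score ending at cell (i, j); E and F hold the best
    scores ending in a gap along the row and along the column.
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)
    f_prev = [neg_inf] * (m + 1)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf  # gap state along the current row
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Each inner-loop iteration is one cell update, so a run over sequences of lengths n and m performs n * m cell updates; dividing that count by the run time in seconds and by 10^9 gives the GCUPS figure quoted in the abstract.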

Edinburgh Research Archive

Automated Genome-Wide Protein Domain Exploration

Author: Rekepalli Bhanu Prasad
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2007
Field of study

Exploiting the exponentially growing genomics and proteomics data requires high quality, automated analysis. Protein domain modeling is a key area of molecular biology as it unravels the mysteries of evolution, protein structures, and protein functions. A plethora of sequences exist in protein databases with incomplete domain knowledge. Hence this research explores automated bioinformatics tools for faster protein domain analysis. Automated tool chains described in this dissertation generate new protein domain models thus enabling more effective genome-wide protein domain analysis. To validate the new tool chains, the Shewanella oneidensis and Escherichia coli genomes were processed, resulting in a new peptide domain database, detection of poor domain models, and identification of likely new domains. The automated tool chains will require months or years to model a small genome when executing on a single workstation. Therefore the dissertation investigates approaches with grid computing and parallel processing to significantly accelerate these bioinformatics tool chains

University of Tennessee, Knoxville: Trace

Bioinformatics

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains its application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here

Directory of Open Access Books (DOAB)

Efficient Learning Machines

Author: Awad Mariette
Khanna Rahul
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Computer science

OAPEN Library

Process mining : conformance and extension

Author: Rozinat A.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2010
Field of study

Today’s business processes are realized by a complex sequence of tasks that are performed throughout an organization, often involving people from different departments and multiple IT systems. For example, an insurance company has a process to handle insurance claims for their clients, and a hospital has processes to diagnose and treat patients. Because there are many activities performed by different people throughout the organization, there is a lack of transparency about how exactly these processes are executed. However, understanding the process reality (the "as is" process) is the first necessary step to save cost, increase quality, or ensure compliance. The field of process mining aims to assist in creating process transparency by automatically analyzing processes based on existing IT data. Most processes are supported by IT systems nowadays. For example, Enterprise Resource Planning (ERP) systems such as SAP log all transaction information, and Customer Relationship Management (CRM) systems are used to keep track of all interactions with customers. Process mining techniques use these low-level log data (so-called event logs) to automatically generate process maps that visualize the process reality from different perspectives. For example, it is possible to automatically create process models that describe the causal dependencies between activities in the process. So far, process mining research has mostly focused on the discovery aspect (i.e., the extraction of models from event logs). This dissertation broadens the field of process mining to include the aspect of conformance and extension. Conformance aims at the detection of deviations from documented procedures by comparing the real process (as recorded in the event log) with an existing model that describes the assumed or intended process. Conformance is relevant for two reasons: 1. Most organizations document their processes in some form. 
For example, process models are created manually to understand and improve the process, comply with regulations, or for certification purposes. In the presence of existing models, it is often more important to point out the deviations from these existing models than to discover completely new models. Discrepancies emerge because business processes change, or because the models did not accurately reflect the real process in the first place (due to the manual and subjective creation of these models). If the existing models do not correspond to the actual processes, then they have little value. 2. Automatically discovered process models typically do not completely "fit" the event logs from which they were created. These discrepancies are due to noise and/or limitations of the discovery techniques used. Furthermore, in the context of complex and diverse process environments the discovered models often need to be simplified to obtain useful insights. Therefore, it is crucial to be able to check how much a discovered process model actually represents the real process. Conformance techniques can be used to quantify the representativeness of a mined model before drawing further conclusions. They thus constitute an important quality measurement to effectively use process discovery techniques in a practical setting. Once one is confident in the quality of an existing or discovered model, extension aims at the enrichment of these models by the integration of additional characteristics such as time, cost, or resource utilization. By extracting additional information from an event log and projecting it onto an existing model, bottlenecks can be highlighted and correlations with other process perspectives can be identified. Such an integrated view on the process is needed to understand root causes for potential problems and actually make process improvements.
Furthermore, extension techniques can be used to create integrated simulation models from event logs that resemble the real process more closely than manually created simulation models. In Part II of this thesis, we provide a comprehensive framework for the conformance checking of process models. First, we identify the evaluation dimensions fitness, decision/generalization, and structure as the relevant conformance dimensions. We develop several Petri-net based approaches to measure conformance in these dimensions and describe five case studies in which we successfully applied these conformance checking techniques to real and artificial examples. Furthermore, we provide a detailed literature review of related conformance measurement approaches (Chapter 4). Then, we study existing model evaluation approaches from the field of data mining. We develop three data mining-inspired evaluation approaches for discovered process models, one based on Cross Validation (CV), one based on the Minimal Description Length (MDL) principle, and one using methods based on Hidden Markov Models (HMMs). We conclude that process model evaluation faces similar yet different challenges compared to traditional data mining. Additional challenges emerge from the sequential nature of the data and the higher-level process models, which include concurrent dynamic behavior (Chapter 5). Finally, we point out current shortcomings and identify general challenges for conformance checking techniques. These challenges relate to the applicability of the conformance metric, the metric quality, and the bridging of different process modeling languages. We develop a flexible, language-independent conformance checking approach that provides a starting point to effectively address these challenges (Chapter 6). In Part III, we develop a concrete extension approach, provide a general model for process extensions, and apply our approach for the creation of simulation models.
First, we develop a Petri-net based decision mining approach that aims at the discovery of decision rules at process choice points based on data attributes in the event log. While we leverage classification techniques from the data mining domain to actually infer the rules, we identify the challenges that relate to the initial formulation of the learning problem from a process perspective. We develop a simple approach to partially overcome these challenges, and we apply it in a case study (Chapter 7). Then, we develop a general model for process extensions to create integrated models including process, data, time, and resource perspective. We develop a concrete representation based on Coloured Petri-nets (CPNs) to implement and deploy this model for simulation purposes (Chapter 8). Finally, we evaluate the quality of automatically discovered simulation models in two case studies and extend our approach to allow for operational decision making by incorporating the current process state as a non-empty starting point in the simulation (Chapter 9). Chapter 10 concludes this thesis with a detailed summary of the contributions and a list of limitations and future challenges. The work presented in this dissertation is supported and accompanied by concrete implementations, which have been integrated in the ProM and ProMimport frameworks. Appendix A provides a comprehensive overview about the functionality of the developed software. The results presented in this dissertation have been presented in more than twenty peer-reviewed scientific publications, including several high-quality journals
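
To make the fitness dimension of Petri-net based conformance checking concrete, here is a small token-replay sketch. The abstract does not give a formula, so the one used here (averaging the share of non-missing consumed tokens and non-remaining produced tokens) is an assumption based on a common token-replay formulation. The net encoding and the name `replay_fitness` are hypothetical, and the sketch assumes every event in a trace names a transition of the net.

```python
from collections import Counter


def replay_fitness(net, initial, final, traces):
    """Replay traces on a Petri net and return a fitness value in [0, 1].

    net:     {activity: (input_places, output_places)}
    initial: {place: token_count} marking at the start of every trace
    final:   {place: token_count} tokens expected to be consumed at the end
    """
    produced = consumed = missing = remaining = 0
    for trace in traces:
        marking = Counter(initial)
        produced += sum(initial.values())  # initial tokens count as produced
        for activity in trace:
            inputs, outputs = net[activity]
            for place in inputs:
                if marking[place] > 0:
                    marking[place] -= 1
                else:
                    missing += 1  # token had to be created artificially
                consumed += 1
            for place in outputs:
                marking[place] += 1
                produced += 1
        for place, count in final.items():
            for _ in range(count):
                if marking[place] > 0:
                    marking[place] -= 1
                else:
                    missing += 1
                consumed += 1
        remaining += sum(marking.values())  # tokens left behind
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)
```

For a sequential net a, b, c, the trace a, b, c replays without missing or remaining tokens and scores 1.0, while the trace a, c skips b and scores 2/3.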

Repository TU/e

Pure OAI Repository

Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL): adapting the Partial Phylogenetic Profiling algorithm to scan sequences for signatures that predict protein function

Author: Haft Daniel H
Rusch Douglas B
Selengut Jeremy D
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Springer - Publisher Connector

PubMed Central

Adaptive Computing Systems for Aerospace

Author: Panerati Jacopo
Publication venue
Publication date: 01/05/2017
Field of study

SUMMARY Because of their growing complexity, modern computer systems require new methodologies to automate their design and improve their performance. Space, in particular, is a very hostile environment for sustaining the performance of these systems: without protection from ionizing radiation and particles, CMOS-based electronics can suffer transient errors, performance degradation, and accelerated wear that ultimately cause system failure. The approaches traditionally adopted to guarantee system reliability and extend system lifetime are based on redundancy, generally established at design time. However, these solutions are costly and sometimes ineffective, since they increase the size and complexity of the system, exposing it to higher risks of overheating and errors. The consequences of these limits are all the more significant when they apply to critical systems (e.g., time-constrained systems or those with limited access), which must be able to make decisions without human intervention. Given these needs and limits, the development of computer systems with adaptive capabilities for aerospace can be considered the most appropriate solution for high-performance embedded devices. Self-adaptive computing offers unmatched potential for creating a generation of smarter, more reliable computers. Moreover, it meets the modern need to design and program computer systems capable of meeting conflicting objectives. Drawing on the fields of artificial intelligence and reconfigurable systems, we aim to develop self-adaptive computer systems for aerospace that address current challenges and needs.
Our goal is to improve the efficiency of these systems, their fault tolerance, and their computing capacity. To reach this goal, we first carry out an experimental and comparative analysis of the most popular algorithms for multi-objective design-space exploration. The algorithms were collected through a review of the most recent literature and include heuristic, evolutionary, and statistical methods. Analyzing and comparing them identifies the strengths and limits of each and thus yields guidelines for an optimal choice of exploration algorithms. To create an autonomous optimization system that allows trade-offs among several objectives, we exploit the capabilities of probabilistic graphical models. We introduce a methodology based on dynamic hidden Markov models that balances the availability and lifetime of a multiprocessor system. This is achieved by estimating the occurrence of permanent errors among transient errors and by dynamically migrating computation to spare resources in case of failure. The dynamic nature of the model makes it adaptable to different mission profiles and error rates. The results show that we are able to extend the system's lifetime while keeping availability close to the ideal case. Because of the strict timing constraints imposed by aerospace systems, we also study the optimization of fault tolerance under real-time execution requirements. We propose a methodology to improve the reliability of computation in the presence of transient errors for the real-time tasks of a homogeneous multiprocessor system with voltage and frequency scaling capabilities.
In this setting, we define a new probabilistic trade-off between energy consumption and error tolerance. Since we recognize that resilience is a pervasive property of interest (for example, for the design and analysis of generic complex systems), we adapt a formal definition of it to a probabilistic framework again derived from hidden Markov models. This framework lets us realistically model the stochastic evolution and partial observability of real-world phenomena. We propose an algorithm for the efficient exact computation of the essential inference step required to verify generic properties. To demonstrate the flexibility of this approach, we validate it, among other settings, in the context of a reconfigurable computing system for aerospace. Finally, we extend the scope of our research to robotics and multi-agent systems, two topics of growing popularity in space exploration. We address the problem of assessing and maintaining connectivity in the distributed, self-adaptive context of swarm robotics. We examine the limits of existing solutions and propose a new methodology for creating connected complex geometries that handle several tasks simultaneously. Additional contributions in several areas are summarized in the appendices, namely: (i) the design of CubeSats, (ii) the modelling of space radiation for error injection in FPGAs, and (iii) probabilistic timing analysis for real-time systems.
In our view, this research is a useful stepping stone toward the creation of a new generation of computer systems that carry out their tasks autonomously and reliably, fostering simpler and less costly space exploration.

ABSTRACT Today's computer systems are growing more and more complex at a pace that requires the development of novel and more effective methodologies to automate their design. Space, in particular, represents a challenging environment: without protection from ionizing and particle radiation, CMOS-based electronics are subject to transient faults, performance degradation, accelerated wear, and, ultimately, system failure. Traditional approaches adopted to guarantee reliability and extended lifetime are based on redundancy that is established at design-time. These solutions are expensive and sometimes inefficient, as they increase the complexity and size of a system, exposing it to higher risks of overheating and of incurring radiation-induced errors. Moreover, critical systems (e.g., time-constrained ones and those where access is limited) must be able to cope with pivotal situations without relying on human intervention. Hence, the emerging interest in computer systems with adaptive capabilities as the most suitable solution for novel high-performance embedded devices for aerospace. Self-adaptive computing carries unmatched potential and great promise for the creation of a new generation of smart, more reliable computers, and it addresses the challenge of designing and programming modern and future computer systems that must meet conflicting goals. Drawing from the fields of artificial intelligence and reconfigurable systems, we aim at developing self-adaptive computer systems for aerospace. Our goal is to improve their efficiency, fault-tolerance, and computational capabilities.
The first step in this research is the experimental analysis of the most popular multi-objective design-space exploration algorithms for high-level design. These algorithms were collected from the recent literature and include heuristic, evolutionary, and statistical methods. Their comparison provides insights that we use to define guidelines for the choice of the most appropriate optimization algorithms, given the features of the design space. For the creation of a self-managing optimization framework that enables the adaptive trade-off of multiple objectives, we leverage the tools of probabilistic graphical models. We introduce a mechanism based on dynamic hidden Markov models that balances the availability and lifetime of multiprocessor systems. This is achieved by estimating the occurrence of permanent faults amid transient faults, and by dynamically migrating the computation to excess resources when a failure occurs. The dynamic nature of the model makes it adjustable to different mission profiles and fault rates. The results show that we are able to lead systems to extended lifetimes, while keeping their availability close to ideal. On account of the stringent timing constraints imposed by aerospace systems, we then investigate the optimization of fault-tolerance under real-time requirements. We propose a methodology to improve the reliability of computation in the presence of transient errors when considering the mapping of real-time tasks on a homogeneous multiprocessor system with voltage and frequency scaling capabilities. In this framework, we take advantage of probability theory to define a novel trade-off between power consumption and fault-tolerance. As we recognize that resilience is a pervasive property of interest (e.g., for the design and analysis of generic complex systems), we adapt a formal definition of it to one more probabilistic framework derived from hidden Markov models.
This allows us to realistically model the stochastic evolution and partial observability of complex real-world environments. Within this framework, we propose an efficient algorithm for the exact computation of the essential inference step required to construct generic property checking. To demonstrate the flexibility of this approach, we validate it in the context, among others, of a self-aware, reconfigurable computing system for aerospace. Finally, we move the scope of our research towards robotics and multi-agent systems: a topic of thriving popularity for space exploration. We tackle the problem of connectivity assessment and maintenance in the distributed and self-adaptive context of swarm robotics. We review the limitations of existing solutions and propose a novel methodology to create connected complex geometries for multiple task coverage. Additional contributions in the areas of (i) CubeSat design, (ii) the modelling of space radiation for FPGA fault-injection, and (iii) probabilistic timing analysis for real-time systems are summarized in the appendices. In the author's opinion, this research provides a number of useful stepping stones for the creation of a new generation of computing systems that autonomously and reliably perform their tasks for longer periods of time, fostering simpler and cheaper space exploration
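
The idea of estimating a permanent fault amid transient faults with a hidden Markov model can be illustrated with standard forward filtering. This sketch is not the thesis's dynamic model: the two-state layout (healthy, permanently faulty), the probabilities in the usage note, and the name `forward_filter` are all assumptions picked for the example.

```python
def forward_filter(observations, initial, transition, emission):
    """Return P(state_t | obs_1..obs_t) for every step t.

    initial[i]       prior probability of hidden state i
    transition[i][j] probability of moving from state i to state j
    emission[i][o]   probability of observing symbol o in state i
    Assumes every observation has non-zero probability under the model.
    """
    n_states = len(initial)
    belief = list(initial)
    history = []
    for t, obs in enumerate(observations):
        if t > 0:
            # Prediction step: propagate the belief through the transitions.
            belief = [
                sum(belief[i] * transition[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Update step: weight by the observation likelihood and normalize.
        weighted = [belief[j] * emission[j][obs] for j in range(n_states)]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        history.append(belief)
    return history
```

With state 0 as healthy and state 1 as permanently faulty, a prior of [0.99, 0.01], transitions [[0.99, 0.01], [0.0, 1.0]], and emissions [[0.95, 0.05], [0.2, 0.8]] over the symbols (0 = correct result, 1 = error), a run of consecutive errors pushes the belief in a permanent fault steadily upward, which is the kind of signal that could trigger migration to spare resources.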

PolyPublie

Parallelization of dynamic programming recurrences in computational biology

Author: Jacob Arpith
Publication venue: Washington University Open Scholarship
Publication date: 01/01/2010
Field of study

The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors.
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms
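
As a reference point for the kind of recurrence this work parallelizes, the sketch below is a sequential software version of the Nussinov base-pair maximization recurrence. It is not the hardware design from the dissertation; the set of allowed pairs, the `min_loop` parameter, and the function name `nussinov` are assumptions made for illustration.

```python
def nussinov(seq, min_loop=0):
    """Return the maximum number of nested base pairs in an RNA sequence.

    Recurrence for the subsequence i..j:
      N[i][j] = max(N[i][j-1],
                    max over k pairing with j of N[i][k-1] + 1 + N[k+1][j-1])
    min_loop is the minimum number of unpaired bases between a pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    table = [[0] * n for _ in range(n)]
    # Cells with the same span depend only on shorter spans, so each
    # anti-diagonal could be filled in parallel.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = table[i][j - 1]  # case: base j stays unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = table[i][k - 1] if k > i else 0
                    inner = table[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            table[i][j] = best
    return table[0][n - 1]
```

The independence of cells along each anti-diagonal is the fine-grained parallelism that array-style accelerators exploit.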

Washington University St. Louis: Open Scholarship

Protein structure prediction and structure-based protein function annotation

Author: Roy Ambrish
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2011
Field of study

Nature tends to modify rather than invent function of protein molecules, and the log of the modifications is encrypted in the gene sequence. Analysis of these modification events in evolutionarily related genes is important for assigning function to hypothetical genes and their products surging in databases, and to improve our understanding of the bioverse. However, random mutations occurring during evolution chisel the sequence to an extent that both decrypting these codes and identifying evolutionary relatives from sequence alone becomes difficult. Thankfully, even after many changes at the sequence level, the protein three-dimensional structures are often conserved and hence protein structural similarity usually provides more clues on evolution of functionally related proteins. In this dissertation, I study the design of three bioinformatics modules that form a new hierarchical approach for structure prediction and function annotation of proteins based on the sequence-to-structure-to-function paradigm. First, we design an online platform for structure prediction of protein molecules using multiple threading alignments and iterative structural assembly simulations (I-TASSER). I review the components of this module and have added features that provide function annotation to the protein sequences and help to combine experimental and biological data for improving the structure modeling accuracy. The online service of the system has been supporting more than 20,000 biologists from over 100 countries. Next, we design a new comparative approach (COFACTOR) to identify the location of ligand binding sites on these modeled protein structures and spot the functional residue constellations using an innovative global-to-local structural alignment procedure and functional sites in known protein structures.
Based on both large-scale benchmarking and blind tests (CASP), the method demonstrates significant advantages over the state-of-the-art methods of the field in recognizing ligand-binding residues for both metal and non-metal ligands. The major advantage of the method is the optimal combination of the local and global protein structural alignments, which helps to recognize functionally conserved structural motifs among proteins that have taken different evolutionary paths. We further extend the COFACTOR global-to-local approach to annotate the gene-ontology and enzyme classifications of protein molecules. Here, we added two new components to COFACTOR. First, we developed a new global structural match algorithm that enables better structural search. Second, a sensitive technique was proposed for constructing local 3D-signature motifs of template proteins that lack known functional sites, which allows us to perform query-template local structural similarity comparisons with all template proteins. A scoring scheme that combines the confidence score of structure prediction with the global-local similarity score is used for assigning a confidence score to each of the predicted functions. Large scale benchmarking shows that the predicted functions have remarkably improved precision and recall rates and also higher prediction coverage than the state-of-the-art sequence based methods. To explore the applicability of the method for real-world cases, we applied the method to a subset of ORFs from Chlamydia trachomatis and the functional annotations provided new testable hypotheses for improving the understanding of this phylogenetically distinct bacterium

KU ScholarWorks