88 research outputs found

    High-Throughput 3D Homology Detection via NMR Resonance Assignment

    One goal of the structural genomics initiative is the identification of new protein folds. Sequence-based structural homology prediction methods are an important means for prioritizing unknown proteins for structure determination. However, an important challenge remains: two highly dissimilar sequences can have similar folds. How can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure, called HD, for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies 3D models in a protein structural database whose geometries best fit the unassigned experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or homology modelling. The algorithm runs in $O(pn^{5/2} \log(cn) + p \log p)$ time, where $p$ is the number of proteins in the database, $n$ is the number of residues in the target protein, and $c$ is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR data from 3 different proteins against a database of 4,500 representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of larger proteins, with as little as 10-30% sequence homology between the target protein (or sub-domain) and the computed model. In particular, we report no false negatives or false positives despite significant percentages of missing experimental data

    More Reliable Protein NMR Peak Assignment via Improved 2-Interval Scheduling

    Protein NMR peak assignment refers to the process of assigning a group of “spin systems” obtained experimentally to a protein sequence of amino acids. The automation of this process is still an unsolved and challenging problem in NMR protein structure determination. Recently, protein backbone NMR peak assignment has been formulated as an interval scheduling problem, where a protein sequence of amino acids is viewed as a discrete time interval (the amino acids of the sequence correspond one-to-one to the time units of the interval), each subset S of spin systems that are known to originate from consecutive amino acids of the sequence is viewed as a “job” j_S, the preference of assigning S to a subsequence P of consecutive amino acids of the sequence is viewed as the profit of executing job j_S in the subinterval corresponding to P, and the goal is to maximize the total profit of executing the jobs (on a single machine) during the interval. The interval scheduling problem is Max SNP-hard in general. Typically the jobs that require one or two consecutive time units are the most difficult to assign/schedule. To solve these most difficult assignments, we present an efficient 7/13-approximation algorithm. Combining this algorithm with a greedy filtering strategy for handling long jobs (i.e. jobs that need more than two consecutive time units), we obtained a new efficient heuristic for protein NMR peak assignment. Our study using experimental data shows that the new heuristic produces the best peak assignment in most of the cases, compared with the NMR peak assignment algorithms in the literature. The 7/13-approximation algorithm is also the first approximation algorithm for a nontrivial case of the classical (weighted) interval scheduling problem that breaks the ratio 2 barrier
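
As a rough illustration of the scheduling formulation described in this abstract, the sketch below solves a tiny instance exactly by exhaustive search. It is not the 7/13-approximation algorithm or the greedy filtering heuristic, and it is only practical for toy inputs. The function name `best_schedule`, the job encoding (a length plus a table of profits per start position), and the example numbers are all assumptions made for illustration.

```python
from functools import lru_cache


def best_schedule(num_units, jobs):
    """Exhaustively solve a toy interval scheduling instance.

    num_units: number of time units (amino acids in the sequence).
    jobs: list of (length, profits) pairs, where profits maps an allowed
          start unit to the profit of running the job from that unit.
          (Hypothetical encoding, chosen for this sketch only.)
    Returns (best total profit, {job index: chosen start unit}).
    """

    @lru_cache(maxsize=None)
    def solve(job_idx, occupied):
        # occupied is a bitmask of time units already used by earlier jobs.
        if job_idx == len(jobs):
            return 0, ()
        length, profits = jobs[job_idx]
        # Option 1: leave this job unscheduled.
        best_profit, best_plan = solve(job_idx + 1, occupied)
        # Option 2: try every allowed start position that fits and is free.
        for start, profit in profits.items():
            if start < 0 or start + length > num_units:
                continue
            mask = ((1 << length) - 1) << start
            if occupied & mask:
                continue
            sub_profit, sub_plan = solve(job_idx + 1, occupied | mask)
            if profit + sub_profit > best_profit:
                best_profit = profit + sub_profit
                best_plan = ((job_idx, start),) + sub_plan
        return best_profit, best_plan

    total, plan = solve(0, 0)
    return total, dict(plan)


if __name__ == "__main__":
    # Three spin-system groups on a five-residue sequence (made-up profits).
    toy_jobs = [
        (2, {0: 5, 3: 4}),
        (1, {1: 3, 2: 6}),
        (2, {1: 7, 3: 2}),
    ]
    print(best_schedule(5, toy_jobs))
```

The search is exponential in the number of jobs, which is consistent with the hardness result quoted above and is why approximation algorithms and heuristics are of interest for realistic inputs.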

    Combining automated peak tracking in SAR by NMR with structure-based backbone assignment from 15N-NOESY

    BACKGROUND: Chemical shift mapping is an important technique in NMR-based drug screening for identifying the atoms of a target protein that potentially bind to a drug molecule upon the molecule's introduction in increasing concentrations. The goal is to obtain a mapping of peaks with known residue assignment from the reference spectrum of the unbound protein to peaks with unknown assignment in the target spectrum of the bound protein. Although a series of perturbed spectra helps to trace a path from reference peaks to target peaks, a one-to-one mapping generally is not possible, especially for large proteins, due to errors such as noise peaks, missing peaks, peaks that disappear and then reappear, overlapped peaks, and new peaks not associated with any peaks in the reference. Due to these difficulties, the mapping is typically done manually or semi-automatically, which is not efficient for high-throughput drug screening. RESULTS: We present PeakWalker, a novel peak walking algorithm for fast-exchange systems that models the errors explicitly and performs many-to-one mapping. On the proteins hBcl(XL), UbcH5B, and histone H1, it achieves an average accuracy of over 95% with less than 1.5 residues predicted per target peak. Given these mappings as input, we present PeakAssigner, a novel combined structure-based backbone resonance and NOE assignment algorithm that uses just 15N-NOESY, while avoiding TOCSY experiments and 13C-labeling, to resolve the ambiguities for a one-to-one mapping. On the three proteins, it achieves an average accuracy of 94% or better. CONCLUSIONS: Our mathematical programming approach for modeling chemical shift mapping as a graph problem, while modeling the errors directly, is potentially a time- and cost-effective first step for high-throughput drug screening based on limited NMR data and homologous 3D structures

    Fast and Robust Mathematical Modeling of NMR Assignment Problems

    NMR spectroscopy is used not only for protein structure determination, but also for drug screening and studies of dynamics and interactions. In both cases, one of the main bottleneck steps is backbone assignment. When a homologous structure is available, it can accelerate assignment. Such structure-based methods are the focus of this thesis. This thesis aims for fast and robust methods for NMR assignment problems; in particular, structure-based backbone assignment and chemical shift mapping. For speed, we identified situations where the number of 15N-labeled experiments for structure-based assignment can be reduced; in particular, when a homologous assignment or chemical shift mapping information is available. For robustness, we modeled and directly addressed the errors. Binary integer linear programming, a well-studied method in operations research, was used to model the problems and provide practically efficient solutions with optimality guarantees. Our approach improved on the most robust method for structure-based backbone assignment on 15N-labeled data by improving the accuracy by 10% on average on 9 proteins, and then by handling typing errors, which had previously been ignored. We show that such errors can have a large impact on the accuracy, decreasing it from 95% or greater to between 40% and 75%. On automatically picked peaks, which are much noisier than manually picked peaks, we achieved an accuracy of 97% on ubiquitin. In chemical shift mapping, the peak tracking is often done manually because the problem is inherently visual. We developed a computer vision approach for tracking the peak movements with average accuracy of over 95% on three proteins with less than 1.5 residues predicted per peak. One of the proteins tested is larger than any tested by existing automated methods, and it has more titration peak lists. 
We then combined peak tracking with backbone assignment to take into account contact information, which resulted in an average accuracy of 94% on one-to-one assignments for these three proteins. Finally, we applied peak tracking and backbone assignment to protein-ligand docking to illustrate the potential for fast 3D complex determination
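
For readers unfamiliar with binary integer linear programming, a generic assignment model of the kind this abstract alludes to might look like the following. This is a textbook-style sketch and not the thesis's actual formulation: the symbols $P$ (set of peaks), $R$ (set of residues), $s_{ij}$ (score for assigning peak $i$ to residue $j$), and $x_{ij}$ (binary decision variable) are all assumptions introduced here.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i \in P} \sum_{j \in R} s_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j \in R} x_{ij} \le 1 \quad \forall i \in P, \\
& \sum_{i \in P} x_{ij} \le 1 \quad \forall j \in R, \\
& x_{ij} \in \{0, 1\} \quad \forall i \in P,\ j \in R.
\end{aligned}
```

The first two constraint families allow each peak at most one residue and each residue at most one peak. Constraints that couple pairs of assignments, such as the contact information mentioned above, would presumably appear as additional rows that are not shown here.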

    New Approaches to Protein NMR Automation

    The three-dimensional structure of a protein molecule is the key to understanding its biological and physiological properties. A major problem in bioinformatics is to efficiently determine the three-dimensional structures of query proteins. Protein NMR structure determination is one of the main experimental methods and is comprised of: (i) protein sample production and isotope labelling, (ii) collecting NMR spectra, and (iii) analysis of the spectra to produce the protein structure. In protein NMR, the three-dimensional structure is determined by exploiting a set of distance restraints between spatially proximate atoms. Currently, no practical automated protein NMR method exists that is without human intervention. We first propose a complete automated protein NMR pipeline, which can efficiently be used to determine the structures of moderate sized proteins. Second, we propose a novel and efficient semidefinite programming-based (SDP) protein structure determination method. The proposed automated protein NMR pipeline consists of three modules: (i) an automated peak picking method, called PICKY, (ii) a backbone chemical shift assignment method, called IPASS, and (iii) a protein structure determination method, called FALCON-NMR. When tested on four real protein data sets, this pipeline can produce structures with reasonable accuracies, starting from NMR spectra. This general method can be applied to other macromolecule structure determination methods. For example, a promising application is RNA NMR-assisted secondary structure determination. In the second part of this thesis, due to the shortcomings of FALCON-NMR, we propose a novel SDP-based protein structure determination method from NMR data, called SPROS. Most of the existing prominent protein NMR structure determination methods are based on molecular dynamics coupled with a simulated annealing schedule. 
In these methods, an objective function representing the error between observed and given distance restraints is minimized; these objective functions are highly non-convex and difficult to optimize. Euclidean distance geometry methods based on SDP provide a natural formulation for realizing a three-dimensional structure from a set of given distance constraints. However, the complexity of the SDP solvers increases cubically with the input matrix size, i.e., the number of atoms in the protein, and the number of constraints. In fact, the complexity of SDP solvers is a major obstacle in their applicability to the protein NMR problem. To overcome these limitations, the SPROS method models the protein molecule as a set of intersecting two- and three-dimensional cliques. We adapt and extend a technique called semidefinite facial reduction for the SDP matrix size reduction, which makes the SDP problem size approximately one quarter of the original problem. The reduced problem is solved nearly one hundred times faster and is more robust against numerical problems. Reasonably accurate results were obtained when SPROS was applied to a set of 20 real protein data sets

    De novo sequencing of heparan sulfate saccharides using high-resolution tandem mass spectrometry

    Heparan sulfate (HS) is a class of linear, sulfated polysaccharides located on cell surfaces, in secretory granules, and in extracellular matrices found in all animal organ systems. It consists of alternately repeating disaccharide units, expressed in animal species ranging from hydra to higher vertebrates including humans. HS binds and mediates the biological activities of over 300 proteins, including growth factors, enzymes, chemokines, cytokines, adhesion and structural proteins, lipoproteins and amyloid proteins. The binding events largely depend on the fine structure - the arrangement of sulfate groups and other variations - on HS chains. With the activated electron dissociation (ExD) high-resolution tandem mass spectrometry technique, researchers acquire rich structural information about the HS molecule. Using this technique, covalent bonds of the HS oligosaccharide ions are dissociated in the mass spectrometer. However, this information is complex, owing to the large number of product ions, and contains a degree of ambiguity due to the overlapping of product ion masses and lability of sulfate groups; as a result, there is a serious barrier to manual interpretation of the spectra. The interpretation of such data creates a serious bottleneck to the understanding of the biological roles of HS. In order to solve this problem, I designed HS-SEQ - the first HS sequencing algorithm using high-resolution tandem mass spectrometry. HS-SEQ allows rapid and confident sequencing of HS chains from millions of candidate structures and I validated its performance using multiple known pure standards. In many cases, HS oligosaccharides exist as mixtures of sulfation positional isomers. I therefore designed MULTI-HS-SEQ, an extended version of HS-SEQ targeting spectra coming from more than one HS sequence. I also developed several pre-processing and post-processing modules to support the automatic identification of HS structure. 
These methods and tools demonstrated the capacity for large-scale HS sequencing, which should contribute to clarifying the rich information encoded by HS chains as well as developing tailored HS drugs to target a wide spectrum of diseases

    Operations research: from computational biology to sensor network

    In this dissertation we discuss the deployment of combinatorial optimization methods for modeling and solving real-life problems, with a particular emphasis on two biological problems arising from a common scenario: the reconstruction of the three-dimensional shape of a biological molecule from Nuclear Magnetic Resonance (NMR) data. The first topic is the 3D assignment pathway problem (APP) for an RNA molecule. We prove that APP is NP-hard, and show a formulation of it based on edge-colored graphs. Taking into account that interactions between consecutive nuclei in the NMR spectrum are different according to the type of residue along the RNA chain, each color in the graph represents a type of interaction. Thus, we can represent the sequence of interactions as the problem of finding a longest (Hamiltonian) path whose edges follow a given order of colors (i.e., the orderly colored longest path). We introduce three alternative IP formulations of APP obtained with a max flow problem on a directed graph with packing constraints over the partitions, which have been compared among themselves. Since the last two models work on cyclic graphs, for them we proposed an algorithm based on the solution of their relaxation combined with the separation of cycle inequalities in a Branch & Cut scheme. The second topic is the discretizable distance geometry problem (DDGP), which is a formulation on a discrete search space of the well-known distance geometry problem (DGP). The DGP consists in seeking the embedding in space of an undirected graph, given a set of Euclidean distances between certain pairs of vertices. DGP has two important applications: (i) finding the three-dimensional conformation of a molecule from a subset of interatomic distances, called the Molecular Distance Geometry Problem, and (ii) the Sensor Network Localization Problem. 
We describe a Branch & Prune (BP) algorithm tailored for this problem, and two versions of it solving the DDGP both in protein modeling and in sensor network localization frameworks. BP is an exact and exhaustive combinatorial algorithm that examines all the valid embeddings of a given weighted graph G=(V,E,d), under the hypothesis of existence of a given order on V. By comparing the two versions of BP to well-known algorithms we are able to show the efficiency of BP in both contexts, provided that the order imposed on V is maintained
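
The branch-and-prune idea can be sketched in a simplified two-dimensional setting. The code below is an illustrative toy and not the dissertation's algorithm: it assumes every vertex after the first two has known distances to its two immediate predecessors in the order (which yields at most two candidate positions by circle intersection), and it uses any further known distances to earlier vertices for pruning. The names `circle_intersections` and `branch_and_prune`, the dictionary encoding of distances, and the tolerance value are assumptions.

```python
import math


def circle_intersections(p, r, q, s, eps=1e-9):
    """Return the points at distance r from p and distance s from q (0 to 2 points)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    d = math.hypot(dx, dy)
    if d < eps:
        return []  # coincident centres: no discrete candidate set
    a = (r * r - s * s + d * d) / (2 * d)  # signed offset from p along the line p -> q
    h2 = r * r - a * a
    if h2 < -eps:
        return []  # circles do not meet
    h = math.sqrt(max(h2, 0.0))
    mx, my = p[0] + a * dx / d, p[1] + a * dy / d
    points = [(mx - h * dy / d, my + h * dx / d)]
    if h > eps:
        points.append((mx + h * dy / d, my - h * dx / d))
    return points


def branch_and_prune(n, dist, tol=1e-6):
    """Enumerate all 2D embeddings of vertices 0..n-1 consistent with dist.

    dist maps (i, j) with i < j to a known distance. This sketch assumes
    (i-1, i) is present for every i >= 1 and (i-2, i) for every i >= 2.
    Vertex 0 is fixed at the origin and vertex 1 on the positive x-axis.
    """
    solutions = []
    coords = [(0.0, 0.0), (dist[(0, 1)], 0.0)]

    def place(i):
        if i == n:
            solutions.append(list(coords))
            return
        # Branch: at most two candidates from the two discretization distances.
        candidates = circle_intersections(
            coords[i - 1], dist[(i - 1, i)], coords[i - 2], dist[(i - 2, i)]
        )
        for c in candidates:
            # Prune: check every other known distance to an earlier vertex.
            feasible = all(
                abs(math.dist(c, coords[j]) - d) <= tol
                for (j, k), d in dist.items()
                if k == i and j < i - 2
            )
            if feasible:
                coords.append(c)
                place(i + 1)
                coords.pop()

    place(2)
    return solutions


if __name__ == "__main__":
    # Unit square visited in the order 0, 1, 2, 3 (made-up example data).
    r2 = math.sqrt(2.0)
    square = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): r2,
              (2, 3): 1.0, (1, 3): r2, (0, 3): 1.0}
    for embedding in branch_and_prune(4, square):
        print(embedding)
```

In this toy instance the extra distance between vertices 0 and 3 prunes half of the branches, leaving two embeddings that are mirror images of each other. The sketch depends on the vertex order in the way the abstract notes: a different order may not supply the distances needed to discretize the search.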