
    De Novo Protein Structure Modeling and Energy Function Design

    The two major challenges in protein structure prediction are (1) the lack of an accurate energy function and (2) the lack of an efficient search algorithm. A protein energy function that accurately describes the interaction between residues can guide the optimization of a protein conformation, as well as select native or native-like structures from numerous possible conformations. An efficient search algorithm must be able to reduce the conformational space to a reasonable size without missing the native conformation. My PhD research focused on these two directions. A protein energy function, the distance and orientation dependent energy function of amino acid key blocks (DOKB), which contains a distance term, an orientation term, and a highly packed term, was proposed to evaluate the stability of proteins. In this energy function, the key blocks of each amino acid were used to represent each residue, and a novel reference state was used to normalize block distributions. The dependence of the orientation term on the distance term was revealed, representing the preference for different orientations at different distances between key blocks. Compared with four widely used energy functions on six general benchmark decoy sets, the DOKB appeared to perform very well in recognizing native conformations. Additionally, the highly packed term in the DOKB played an important role in stabilizing protein structures containing highly packed residues. The cluster potential adjusted the reference state of highly packed areas and significantly improved the recognition of the native conformations in the ig_structal data set. The DOKB is not only an alternative protein energy function for protein structure prediction, but it also provides a different view of the interaction between residues. The top-k search algorithm was optimized for use with proteins containing both α-helices and β-sheets.
Secondary structure elements (SSEs) are visible in cryo-electron microscopy (cryo-EM) density maps. Combined with the SSEs predicted from a protein sequence, it is feasible to determine the topologies, that is, the order and direction of the SSEs in the cryo-EM density map with respect to the SSEs in the protein sequence. Our group member Dr. Al Nasr proposed the top-k search algorithm, which searches for the top-k possible topologies for a target protein. It was the most effective algorithm so far. However, this algorithm only works well for pure α-helix proteins due to the complexity of the topologies of β-sheets. Based on the known protein structures in the Protein Data Bank (PDB), we noticed that some topologies in β-sheets had a high preference, whereas some topologies never appeared. The preference for different β-sheet topologies was introduced into the optimized top-k search algorithm to adjust the edge weights between nodes. Compared with the previous results, this optimization significantly improved the performance of the top-k algorithm on proteins containing both α-helices and β-sheets.

    Diversifying Top-K Results

    Top-k query processing finds a list of k results that have the largest scores w.r.t. the user-given query, under the assumption that all k results are independent of each other. In practice, some of the top-k results returned can be very similar to each other and are therefore redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions for diversified top-k search assume that the scores of all search results are given, and some works solve the diversity problem for a specific setting and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search by simply applying three new functions: a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely div-astar, div-dp, and div-cut, to solve the div-search-current() problem. div-astar is an A*-based algorithm; div-dp decomposes the results into components that are searched independently using div-astar and combined using dynamic programming; div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for the diversified top-k search problem in seconds even for k as large as 2,000. Comment: VLDB201
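
To make the problem statement concrete, here is a minimal sketch of diversified top-k selection on an already generated result set. It assumes the objective is to pick at most k results, no two of which are similar, with maximum total score, and that all scores are positive. The function name `diversified_top_k` and the toy data are hypothetical. It is a brute-force search for tiny inputs, not div-astar, div-dp, or div-cut.

```python
from itertools import combinations


def diversified_top_k(scores, similar, k):
    """Exhaustively pick at most k pairwise non-similar results with max total score.

    scores:  dict mapping result id -> positive score
    similar: set of frozenset({a, b}) pairs considered similar (assumed input format)
    """
    items = list(scores)
    best_set, best_total = (), 0
    for size in range(1, k + 1):
        for combo in combinations(items, size):
            # Reject any candidate set that contains a similar pair.
            if any(frozenset(pair) in similar for pair in combinations(combo, 2)):
                continue
            total = sum(scores[i] for i in combo)
            if total > best_total:
                best_set, best_total = combo, total
    return set(best_set), best_total


# Toy example: "a" is similar to both "b" and "c".
scores = {"a": 10, "b": 9, "c": 8, "d": 3}
similar = {frozenset(("a", "b")), frozenset(("a", "c"))}

best, total = diversified_top_k(scores, similar, k=2)
print(sorted(best), total)  # ['b', 'c'] 17
```

In this toy input the plain top-2 by score is {a, b}, which is redundant, and a greedy pick of "a" followed by the best compatible result yields {a, d} with total 13. The exhaustive search returns {b, c} with total 17, which illustrates why an exact method is needed when optimality matters.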

    Dash: A Novel Search Engine for Database-Generated Dynamic Web Pages

    Office of Research, Singapore Management Universit

    Scalable Empirical Dynamic Modeling With Parallel Computing and Approximate k-NN Search

    Empirical Dynamic Modeling (EDM) is a mathematical framework for modeling and predicting non-linear time series data. Although EDM is increasingly adopted in various research fields, its application to large-scale data has been limited due to its high computational cost. This article presents kEDM, a high-performance implementation of EDM for analyzing large-scale time series datasets. kEDM adopts the Kokkos performance-portable programming model to efficiently run on both CPU and GPU while sharing a single code base. We also conduct hardware-specific optimization of performance-critical kernels. kEDM achieved up to 6.58× speedup in pairwise causal inference of real-world biology datasets compared to an existing EDM implementation. Furthermore, we integrate multiple approximate k-NN search algorithms into EDM to enable the analysis of extremely large datasets that were intractable with conventional EDM based on exhaustive k-NN search. EDM-based time series forecasting enhanced with approximate k-NN search demonstrated up to 790× speedup compared to conventional Simplex projection with less than 1% increase in MAPE.
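
For orientation, the following is a minimal sketch of conventional Simplex projection with exhaustive k-NN search, the baseline the abstract refers to. It is not kEDM code and does not use Kokkos. The function names, the parameters `E`, `tau`, and `Tp`, and the use of E + 1 neighbors with exponential distance weights follow the textbook formulation and are assumptions relative to the abstract.

```python
import math


def delay_embed(series, E, tau=1):
    """Return delay vectors (x[t], x[t - tau], ..., x[t - (E - 1) * tau])."""
    start = (E - 1) * tau
    return [tuple(series[t - j * tau] for j in range(E))
            for t in range(start, len(series))]


def simplex_forecast(series, E=2, tau=1, Tp=1):
    """Forecast the value Tp >= 1 steps after the last observation."""
    start = (E - 1) * tau
    points = delay_embed(series, E, tau)
    target = points[-1]
    # Library: every embedded point whose value Tp steps ahead is known.
    # Computing the distance to all of them is the exhaustive k-NN step.
    library = [(math.dist(p, target), series[start + i + Tp])
               for i, p in enumerate(points[:-Tp])]
    library.sort(key=lambda pair: pair[0])
    neighbors = library[:E + 1]
    d_min = neighbors[0][0]
    if d_min == 0:
        # Exact matches exist: average their futures to avoid dividing by zero.
        exact = [value for dist, value in neighbors if dist == 0]
        return sum(exact) / len(exact)
    weights = [math.exp(-dist / d_min) for dist, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)


# Toy example: a logistic-map series (chaotic, deterministic).
x = [0.4]
for _ in range(199):
    x.append(3.8 * x[-1] * (1.0 - x[-1]))
print(simplex_forecast(x[:-1], E=2), "vs actual", x[-1])
```

The sort over the whole library is where the cost concentrates for long series. Swapping that step for an approximate nearest-neighbor index is the kind of change the abstract describes, at the price of a small loss in forecast accuracy.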

    Scaling Manifold Ranking Based Image Retrieval

    Manifold Ranking is a graph-based ranking algorithm that has been successfully applied to retrieve images from multimedia databases. Given a query image, Manifold Ranking computes the ranking scores of images in the database by exploiting the relationships among them expressed in the form of a graph. Since Manifold Ranking effectively utilizes the global structure of the graph, it is significantly better at finding intuitive results compared with current approaches. Fundamentally, Manifold Ranking requires an inverse matrix to compute ranking scores and so needs O(n^3) time, where n is the number of images. Manifold Ranking, unfortunately, does not scale to support databases with large numbers of images. Our solution, Mogul, is based on two ideas: (1) it efficiently computes ranking scores using sparse matrices, and (2) it skips unnecessary score computations by estimating upper bounding scores. These two ideas reduce the time complexity of Mogul to O(n) from the O(n^3) of the inverse matrix approach. Experiments show that Mogul is much faster and gives significantly better retrieval quality than a state-of-the-art approximation approach.
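
The O(n^3) baseline mentioned above can be sketched as follows. The sketch assumes the standard Manifold Ranking formulation, in which the scores f solve (I - αS) f = y, with S the symmetrically normalized affinity matrix and y the query indicator vector. That formula, the names `normalized_affinity`, `solve`, and `manifold_ranking`, and the toy graph are assumptions, not taken from the abstract. It does not implement Mogul.

```python
import math


def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2}; assumes every node has at least one edge."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def solve(A, b):
    """Dense Gaussian elimination with partial pivoting: the O(n^3) step."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def manifold_ranking(W, query, alpha=0.9):
    """Ranking scores f solving (I - alpha * S) f = y for a single query node."""
    n = len(W)
    S = normalized_affinity(W)
    A = [[(1.0 if i == j else 0.0) - alpha * S[i][j] for j in range(n)]
         for i in range(n)]
    y = [1.0 if i == query else 0.0 for i in range(n)]
    return solve(A, y)


# Toy graph: a path 0 - 1 - 2 - 3, with node 0 as the query image.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = manifold_ranking(W, query=0, alpha=0.5)
print(sorted(range(len(W)), key=lambda i: -scores[i]))
```

The triple loop in `solve` is what makes the dense approach cubic in the number of images. Methods that exploit sparsity and prune score computations, as the abstract describes for Mogul, aim to avoid exactly this step.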

    Finding Most Popular Indoor Semantic Locations Using Uncertain Mobility Data
