116 research outputs found
Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects its underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
In the second part of this thesis I describe the identification of hidden
specificity-determining residues (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution.
Comparative Genomics of Microbial Chemoreceptor Sequence, Structure, and Function
Microbial chemotaxis receptors (chemoreceptors) are complex proteins that sense the external environment and signal for flagella-mediated motility, serving as the GPS of the cell. In order to sense a myriad of physicochemical signals and adapt to diverse environmental niches, sensory regions of chemoreceptors are frenetically duplicated, mutated, or lost. Conversely, the chemoreceptor signaling region is a highly conserved protein domain. Extreme conservation of this domain is necessary because it determines very specific helical secondary, tertiary, and quaternary structures of the protein while simultaneously choreographing a network of interactions with the adaptor protein CheW and the histidine kinase CheA. This dichotomous nature has split the chemoreceptor community into two major camps, studying either an organism’s sensory capabilities and physiology or the molecular signal transduction mechanism. Fortunately, the current vast wealth of sequencing data has enabled comparative study of chemoreceptors. Comparative genomics can serve as a bridge between these communities, connecting sequence, structure, and function through comprehensive studies on scales ranging from minute and molecular to global and ecological. Herein are four works in which comparative genomics illuminates unanswered questions across the broad chemoreceptor landscape. First, we used evolutionary histories to refine chemoreceptor interactions in Thermotoga maritima, pairing phylogenetics with X-ray crystallography. Next, we uncovered the origin of a unique chemoreceptor, isolated only from hypervirulent strains of Campylobacter jejuni, by comparing chemoreceptor signaling and sensory regions from Campylobacter and Helicobacter. We then selected the opportunistic human pathogen Pseudomonas aeruginosa to address the question of assigning multiple chemoreceptors to multiple chemotaxis pathways within the same organism. We assigned all P. aeruginosa receptors to pathways using a novel in silico approach by incorporating sequence information spanning the entire taxonomic order Pseudomonadales and beyond. Finally, we surveyed the chemotaxis systems of all environmental, commensal, laboratory, and pathogenic strains of the ubiquitous Escherichia coli, where we discovered an ancestral chemoreceptor gene loss event that may have predisposed a well-studied subpopulation to adopt extra-intestinal pathogenic lifestyles. Overall, comparative genomics is a cutting-edge method for comprehensive chemoreceptor study that is poised to promote synergy within and expand the significance of the chemoreceptor field.
Doctor of Philosophy
dissertation
Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth curves in the speed and greater power efficiency of the SIMD streaming processors, such as GPUs, have gained them an increasingly important role in the high-performance computing area. Combining suitable parallel algorithms and these streaming processors, we can develop very efficient numerical solvers of PDEs. The contributions of this dissertation are twofold: proposal of two general strategies to design efficient PDE solvers on GPUs and the specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies, the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation on fully unstructured meshes efficiently. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving.
Mapping numerical software onto distributed memory parallel systems
The aim of this thesis is to further the use of parallel computers, in particular distributed memory systems, by proving strategies for parallelisation and developing the core component of tools to aid scalar software porting. The ported code must not only efficiently exploit available parallel processing speed and distributed memory, but also enable existing users of the scalar code to use the parallel version with identical inputs and allow maintenance to be performed by the scalar code author in conjunction with the parallel code.
The data partition strategy has been used to parallelise an in-house solidification modelling code where all requirements for the parallel software were successfully met. To confirm the success of this parallelisation strategy, a much sterner test was used, parallelising the HARWELL-FLOW3D fluid flow package. The performance results of the parallel version clearly vindicate the conclusions of the first example. Speedup efficiencies of around 80 percent have been achieved on fifty processors for sizable models. In both these tests, the alterations to the code were fairly minor, maintaining the structure and style of the original scalar code which can easily be recognised by its original author.
The alterations made to these codes indicated the potential for parallelising tools since the alterations were fairly minor and usually mechanical in nature. The current generation of parallelising compilers relies heavily on heuristic guidance in parallel code generation and other decisions that may be better made by a human. As a result, the code they produce will almost certainly be inferior to manually produced code. Also, in order not to sacrifice parallel code quality when using tools, the scalar code analysis to identify inherent parallelism in an application code, as used in parallelising compilers, has been extended to eliminate dependencies conservatively assumed, since these dependencies can greatly inhibit parallelisation.
Extra information has been extracted both from control flow and from processing symbolic information. The tests devised to utilise this information enable the non-existence of a significant number of previously assumed dependencies to be proved. In some cases, the number of true dependencies has been more than halved.
The dependence graph produced is of sufficient quality to greatly aid the parallelisation, with user interaction and interpretation, parallelism detection and code transformation validity being less inhibited by assumed dependencies. The use of tools rather than the black box approach removes the handicaps associated with using heuristic methods, if any relevant heuristic methods exist.
On the evolution of genetic diversity in RNA virus species : uncovering barriers to genetic divergence and gene length in picorna- and nidoviruses
This thesis combines the use of standard bioinformatics analyses with the development of new computational techniques to study the evolution and genetic diversity of picornaviruses and nidoviruses. It integrates two lines of research (genetics-based virus classification and evolutionary dynamics of gene length) and aims at unveiling commonalities in the biology of these and other RNA viruses as well as assisting applied research in virology.
NBIC, European Union
UBL - phd migration 201
3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)
This manuscript includes recent scientific work regarding the Intel Single-chip Cloud Computer and describes novel approaches for programming and run-time organization.
Complete Model-Based Testing Applied to the Railway Domain
Testing is the most important verification technique to assert the correctness of an embedded system. Model-based testing (MBT) is a popular approach that generates test cases from models automatically. For the verification of safety-critical systems, complete MBT strategies are most promising. Complete testing strategies can guarantee that all errors of a certain kind are revealed by the generated test suite, given that the system-under-test fulfils several hypotheses. This work presents a complete testing strategy which is based on equivalence class abstraction. Using this approach, reactive systems, with a potentially infinite input domain but finitely many internal states, can be abstracted to finite-state machines. This allows for the generation of finite test suites providing completeness. However, for a system-under-test, it is hard to prove the validity of the hypotheses which justify the completeness of the applied testing strategy. Therefore, we experimentally evaluate the fault-detection capabilities of our equivalence class testing strategy in this work. We use a novel mutation-analysis strategy which introduces artificial errors to a SystemC model to mimic typical HW/SW integration errors. We provide experimental results that show the adequacy of our approach considering case studies from the railway domain (i.e., a speed-monitoring function and an interlocking-system controller) and from the automotive domain (i.e., an airbag controller). Furthermore, we present extensions to the equivalence class testing strategy. We show that a combination with randomisation and boundary-value selection is able to significantly increase the probability of detecting HW/SW integration errors.
Software for Exascale Computing - SPPEXA 2016-2019
This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.
Computer Science Research Institute 2004 annual report of activities.
This report summarizes the activities of the Computer Science Research Institute (CSRI) at Sandia National Laboratories during the period January 1, 2004 to December 31, 2004. During this period the CSRI hosted 166 visitors representing 81 universities, companies and laboratories. Of these, 65 were summer students or faculty. The CSRI partially sponsored 2 workshops and also organized and was the primary host for 4 workshops. These 4 CSRI-sponsored workshops had 140 participants: 74 from universities, companies and laboratories, and 66 from Sandia. Finally, the CSRI sponsored 14 long-term collaborative research projects and 5 sabbaticals.
The State of the Art in Multilayer Network Visualization
Modelling relationships between entities in real-world systems with a simple
graph is a standard approach. However, reality is better embraced as several
interdependent subsystems (or layers). Recently the concept of a multilayer
network model has emerged from the field of complex systems. This model can be
applied to a wide range of real-world datasets. Examples of multilayer networks
can be found in the domains of life sciences, sociology, digital humanities and
more. Within the domain of graph visualization there are many systems which
visualize datasets having many characteristics of multilayer graphs. This
report provides a state of the art and a structured analysis of contemporary
multilayer network visualization, not only for researchers in visualization,
but also for those who aim to visualize multilayer networks in the domain of
complex systems, as well as those developing systems across application
domains. We have explored the visualization literature to survey visualization
techniques suitable for multilayer graph visualization, as well as tools,
tasks, and analytic techniques from within application domains. This report
also identifies the outstanding challenges for multilayer graph visualization
and suggests future research directions for addressing them.