30 research outputs found

    Large scale parallel state space search utilizing graphics processing units and solid state disks

    Get PDF
    The evolution of science is a double-track process composed of theoretical insights on the one hand and practical inventions on the other. While new theoretical insights usually motivate hardware developers to build systems that follow the theory, in some cases available hardware forces theoretical research to predict the results to expect. Progress in computer science relies on two aspects: processing information and storing it. Improving one side without touching the other merely shifts the problem without producing a real alternative solution. Reducing the time needed to solve a challenge may answer long-running problems, but it fails for problems that require large amounts of storage; conversely, increasing the available storage allows harder problems to be solved, given enough time. This work studies two recent hardware developments and utilizes them in the domain of graph search: the trend to move information storage away from magnetic disks towards electronic media, and the tendency to parallelize computation to speed up information processing. Storing information on rotating magnetic disks has been the standard for years and has reached a point where storage capacity can be regarded as practically unlimited, since new drives can be added instantly at low cost. However, while the available capacity grows every year, the transfer speed does not. At the beginning of this work, solid state media appeared on the market, slowly displacing hard disks in speed-demanding applications. Today, at the completion of this work, solid state drives are replacing magnetic disks in mobile computing, and computing centers use them as caching media to speed up information retrieval. The reason is their huge advantage in random access, where the speed does not drop as sharply as with magnetic drives. While storing and retrieving huge amounts of information is one side of the coin, the other is processing speed. Here the trend of increasing the clock frequency of single processors stagnated in 2006, and manufacturers started to combine multiple cores in one processor. While a CPU is a general-purpose processor, the manufacturers of graphics processing units (GPUs) face the challenge of performing the same computation for a large number of image points. Parallelization offers huge advantages here, so modern graphics cards have evolved into highly parallel computing devices with several hundred cores. The challenge is to utilize these processors in domains other than graphics processing. One of the most widely used tasks in computer science is search. Graph search is the crucial aspect not only in disciplines with an obvious search component but also in software testing. Strategies that allow larger graphs to be examined, be it by reducing the number of considered nodes or by increasing the search speed, have to be developed to meet the rising challenges. This work enhances search in multiple scientific domains such as explicit-state Model Checking, Action Planning, Game Solving and Probabilistic Model Checking, proposing strategies to find solutions to the corresponding search problems. Providing a universal search strategy that utilizes solid state media and graphics processing units in all environments is not possible due to the heterogeneity of the domains. 
    Thus, this work presents a tool kit of strategies tied together in a universal three-stage strategy. In the first stage the edges leaving a node are determined; in the second stage the algorithm follows these edges to generate nodes. The duplicate detection in stage three compares all newly generated nodes to existing ones and avoids multiple expansions. For each stage at least two strategies are proposed, and decision hints are given to simplify the selection of the proper strategy. After describing the strategies, the kit is evaluated in four domains, explaining the choice of strategy, evaluating its outcome and giving directions for future work on the topic.
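    To make the three-stage decomposition concrete, the following Python sketch runs the stages over one breadth-first layer of an in-memory graph. It is only an illustration under assumed interfaces: successors is a hypothetical callback returning the applicable operators of a node, and the thesis actually distributes these stages across GPU cores and external (solid state) memory rather than a Python set.

        def three_stage_layered_search(start, successors):
            # Layered search organised in the three stages described above.
            seen = {start}                      # all nodes generated so far
            frontier = [start]
            layers = 0
            while frontier:
                # Stage 1: determine the edges (operators) leaving each frontier node.
                edges = [(node, op) for node in frontier for op in successors(node)]
                # Stage 2: follow the edges to generate successor nodes.
                generated = [op(node) for node, op in edges]
                # Stage 3: duplicate detection against all previously seen nodes.
                frontier = []
                for child in generated:
                    if child not in seen:
                        seen.add(child)
                        frontier.append(child)
                layers += 1
            return layers, len(seen)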

    Pan-genome Search and Storage

    Get PDF
    Holley G. Pan-genome Search and Storage. Bielefeld: Universität Bielefeld; 2018. High Throughput Sequencing (HTS) technologies are constantly improving and making genome sequencing more affordable. However, HTS sequencers can only produce short overlapping genome fragments that are erroneous and cover the sequenced genomes unevenly. These genome fragments are assembled based on their overlaps to produce larger contiguous sequences. Since de novo genome assembly is computationally intensive, some species have a reference genome used as a guide for assembling genome fragments from the same species or as a basis for comparative genomics methods. Yet, assembling a genome is an error-prone process depending on the quality of the sequencing data and the heuristics used during the assembly. Furthermore, analyses based on a reference are biased towards the reference. Finally, a single reference cannot reflect the dynamics and diversity of a population of genomes. Overcoming these issues requires moving away from the single-genome reference-centric paradigm and taking advantage of the multiple sequenced genomes available for each species. For this purpose, pan-genomes were introduced as sets of genomes from different strains of the same species. A pan-genome is represented by a multi-genome index exploiting the similarity and redundancy of the genomes it contains. Still, pan-genomes are more difficult to analyze than single genomes because of the large amount of data to be stored and indexed. Current data structures for pan-genome indexing do not fulfill all requirements for pan-genome analysis. Indeed, these data structures are often immutable, while the size of a pan-genome grows constantly with newly sequenced genomes. Frequently, these data structures consider only assemblies as input, while unassembled genome fragments abound in databases. Also, indexing variants and similarities between the genomes of a pan-genome usually requires time- and memory-consuming algorithms such as sequence alignments. Sometimes, pan-genome analysis tools just assume variants and similarities are provided as input. While data structures already exist for pan-genome indexing, no solution is currently proposed for genome fragment compression in a pan-genome context. Indeed, it is often of interest to transmit and store all genome fragments of a pan-genome. However, HTS-specific compression tools are not dynamic and cannot update a compressed archive of genome fragments with new fragments of a genome without decompression. Hence, those tools are poorly adapted to the transmission and storage of genome fragments in a pan-genome context. In this thesis, we aim to provide scalable solutions for pan-genome indexing and storage. We first address the problem of pan-genome indexing by proposing a new alignment-free, reference-free and incremental data structure that accepts genome fragments as well as assemblies as input: the Bloom Filter Trie (BFT). The BFT is a tree data structure representing a colored de Bruijn graph in which k-mers, words of length k from the input genomes, are associated with sets of colors representing the genomes in which they occur. The BFT makes extensive use of Bloom filters to navigate in the tree and optimize the graph traversal. A "bursting" method is employed to perform an efficient path and level compaction of the tree. We show that the BFT outperforms a data structure that has similar features but is based on an approximation of the set of indexed k-mers. 
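    The core abstraction the BFT implements is a map from k-mers to the set of genomes ("colors") containing them. The hash-map sketch below is only meant to illustrate that abstraction on toy data; it reproduces none of the Bloom filter navigation or burst compaction that make the actual BFT compact.

        from collections import defaultdict

        def colored_kmer_index(genomes, k):
            # Map each k-mer to the set of colors (genome identifiers) it occurs in.
            colors = defaultdict(set)
            for color, seq in enumerate(genomes):
                for i in range(len(seq) - k + 1):
                    colors[seq[i:i + k]].add(color)
            return colors

        # Toy example: three short "genomes" sharing some 4-mers.
        index = colored_kmer_index(["ACGTACGT", "ACGTTGCA", "TTGCAACG"], k=4)
        print(index["ACGT"])   # {0, 1}: this k-mer occurs in the first two genomes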
    Secondly, we address the problem of genome fragment compression in a pan-genome context by proposing a new abstract data structure, the guided de Bruijn graph. It augments the de Bruijn graph with k-mer partitions such that graph traversal is guided to reconstruct the genome fragments exactly during decompression. Different techniques are proposed to optimize the storage of fragments in the graph and the encoding of the partitions. We show that the BFT described previously has all the features required to index a guided de Bruijn graph, and it is used in the implementation of our compression method, named DARRC. The evaluation of DARRC on a large pan-genome dataset, compared to state-of-the-art HTS-specific and general-purpose compression tools, shows a 30% compression ratio improvement over the second best performing tool in this evaluation.
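    The guiding idea can be illustrated with a toy scheme that is not DARRC's actual encoding: a fragment is stored as its first (k-1)-mer, its length, and the branch choices taken wherever the de Bruijn graph offers more than one successor, so unambiguous edges cost nothing to store. All names below are hypothetical.

        def build_graph(reads, k):
            # De Bruijn graph as a map: (k-1)-mer -> sorted list of possible next bases.
            graph = {}
            for read in reads:
                for i in range(len(read) - k + 1):
                    graph.setdefault(read[i:i + k - 1], set()).add(read[i + k - 1])
            return {node: sorted(nexts) for node, nexts in graph.items()}

        def compress(read, k, graph):
            # Keep the first (k-1)-mer, the read length, and one choice per branching node.
            node, choices = read[:k - 1], []
            for base in read[k - 1:]:
                nexts = graph[node]
                if len(nexts) > 1:
                    choices.append(nexts.index(base))
                node = node[1:] + base
            return read[:k - 1], len(read), choices

        def decompress(first, length, choices, graph):
            # Walk the graph again, consuming a stored choice only at branching nodes.
            node, out, it = first, list(first), iter(choices)
            while len(out) < length:
                nexts = graph[node]
                base = nexts[next(it)] if len(nexts) > 1 else nexts[0]
                out.append(base)
                node = node[1:] + base
            return "".join(out)

        reads = ["ACGTACGG", "ACGTTACG"]
        g = build_graph(reads, k=4)
        assert all(decompress(*compress(r, 4, g), g) == r for r in reads)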

    Location-based web search and mobile applications

    Get PDF

    Proteome characterizations of microbial systems using MS-based experimental and informatics approaches to examine key metabolic pathways, proteins of unknown function, and phenotypic adaptation

    Get PDF
    Microbes express complex phenotypes and coordinate activities to build microbial communities. Recent work has focused on understanding the ability of microbial systems to efficiently utilize cellulosic biomass to produce bioenergy-related products. In order to maximize the yield of these bioenergy-related products from a microbial system, it is necessary to understand the underlying molecular mechanisms. The ability of mass spectrometry to precisely identify thousands of proteins from a bacterial source has established mass spectrometry-based proteomics as an indispensable tool for various biological disciplines. This dissertation developed and optimized various proteomics experimental and informatic protocols, and integrated the resulting data with metabolomics, transcriptomics, and genomics in order to understand the systems biology of bioenergy-relevant organisms. Integration of these omics technologies led to an improved understanding of microbial cell-to-cell communication in response to external stimuli, of microbial adaptation during deconstruction of lignocellulosic biomass, and of proteome diversity when an organism is subjected to different growth conditions. Integrated omics revealed that Clostridium thermocellum accumulates long-chain, branched fatty acids over time in response to cytotoxic inhibitors released during the deconstruction and utilization of switchgrass. This striking feature implies a restructuring of C. thermocellum's cellular membrane as the culture progresses. The membrane remodeling was further examined in a study involving the swarming and swimming phenotypes of Paenibacillus polymyxa. The possible roles of phospholipids, hydrolytic enzymes, surfactin, flagellar assembly, chemotaxis and glycerol metabolism in swarming motility were investigated by integrating lipidomics with proteomics. Extracellular proteome analysis of Caldicellulosiruptor bescii revealed secretome plasticity based on the complexity (mono-/disaccharides vs. polysaccharides) and type of carbon (C5 vs. C6) available to the microorganism. This study further opened an avenue of research to characterize proteins of unknown function (PUFs) specific to growth conditions. To gain a better understanding of the possible functions of PUFs in C. thermocellum, a time-course analysis of C. thermocellum was conducted. Based on the concept of guilt-by-association, protein intensities and their co-expression were used to tease out the functional aspects of PUFs. Clustering trends and network analysis were used to infer potential functions of PUFs, and selected PUFs were further interrogated using phylogeny and structural modeling.
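    A minimal sketch of the guilt-by-association step, under assumed inputs: intensities is a hypothetical proteins-by-time-points matrix of log abundances, and the specific choices (Pearson correlation distance, average-linkage hierarchical clustering) are illustrative rather than the pipeline actually used in the dissertation.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import pdist

        def coexpression_clusters(intensities, n_clusters):
            # Proteins whose abundance profiles rise and fall together are grouped;
            # proteins of unknown function can then borrow the annotations that
            # dominate their cluster ("guilt by association").
            dist = pdist(intensities, metric="correlation")   # 1 - Pearson r
            tree = linkage(dist, method="average")
            return fcluster(tree, t=n_clusters, criterion="maxclust")

        profiles = np.array([[1.0, 2.0, 4.0, 8.0],    # protein A, rising
                             [1.1, 2.1, 3.9, 8.2],    # protein B, rising (groups with A)
                             [9.0, 5.0, 2.0, 1.0]])   # protein C, falling
        print(coexpression_clusters(profiles, n_clusters=2))  # e.g. [1 1 2]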

    The Data Science Design Manual

    Get PDF

    Biohacking and code convergence : a transductive ethnography

    Full text link
    This thesis unfolds in a space of discourses and claims-making practices at the intersection of contemporary Euro-American amateur computing and biotech cultures. The research problem taking shape at this cultural crossing examines metaphors and analogies caught in heavy traffic, amid imposing lines of communication, linking information technologies and biotechnologies as sites of media expression. The examination traces the lines of force, the expressive mediations in these sites, through their manifestations as codes, both computational and genetic, and recognizes the analogical expressivity of codes as a process of convergence. Emerging slowly from the 1940s and 1950s onward, converging visions of the codes eased the entry of personal computers into markets as well as into hackers' garages, where computing tinkerers claimed them as a space of freedom of information, and above all of innovation. More than fifty years later, the analogy between computational and genetic codes drives claims to freedom, this time informing new consumer biotech applications as well as the activity of biohackers, those garage tinkerers of synthetic biology. Biohacking practices are thus understood as individuations: ongoing attempts to resolve frictions, tensions working through the claims of amateur computing and biotech cultures. One way of modulating these tensions is embodied in a process known as forking, viewed here as the experience of a bifurcation. In other words, forking is defined here as a passage towards a critical threshold, inflecting technology and biology in several modes. Forking informs (that is, both permits and constrains) differing collective visions of informational openness. Forking also intervenes on the planes of the semio-materialities and agencies invested in biotech and computing practices. Taken as a process of co-constitution and differentiation of collective action, these movements of bifurcation invite the following three questions: 1) How does forking catalyze the resolution of the tensions at work in the claims of biohacking practices? 2) In this resolution process, in what ways do the claims change phase, bifurcate and transform, sometimes to the point of radically altering those practices? 3) What new problems emerge from these solutions? The research effort found these questions, as well as the corresponding planes of semio-material and collective action, embodied in three ethnographic experiences spread over three years (2012-2015): the first in a New York community biotechnology laboratory, the second in the emergence of an amateur biotechnology group in Montreal, and the third in Cork, Ireland, within the world's first synthetic biology startup accelerator. The logic of the inquiry is neither strictly inductive nor deductive, but transductive. It borrows from Gilbert Simondon's philosophy of communication and information and discovers epistemology as an act of creation operating in relational milieus. 
    The transductive heuristic offers unusual encounters between the metaphors and analogies of codes. These surprising encounters staged the experience of code convergence as a set of writing games, and they resurfaced in the ethnographic research as transductive processes. This dissertation examines creative practices and discourses intersecting computer and biotech cultures. It queries influential metaphors and analogies on both sides of the intersection, and their positioning of biotech and information technologies as expression media. It follows mediations across their incarnations as codes, both computational and biological, and situates their analogical expressivity and programmability as a process of code convergence. Converging visions of technological freedom facilitated the entrance of computers in 1960s Western hobbyist hacker circles, as well as in consumer markets. Almost fifty years later, the analogy drives claims to freedom of information, and freedom of innovation, from biohacker hobbyist groups to new biotech consumer markets. Such biohacking practices are understood as individuations: as ongoing attempts to resolve frictions, tensions working through claims to freedom and openness animating software and biotech cultures. Tensions get modulated in many ways. One of them, otherwise known as "forking," refers here to a critical bifurcation allowing for differing iterations of biotechnical and computational configurations. Forking informs (that is, simultaneously affords and constrains) differing collective visions of openness. Forking also operates on the materiality and agency invested in biotechnical and computational practices. Taken as a significant process of co-constitution and differentiation in collective action, bifurcation invites the following three questions: 1) How does forking solve tensions working through claims to biotech freedom? 2) In this solving process, how can claims bifurcate and transform to the point of radically altering biotech practices? 3) What new problems do these solutions call into existence? This research found these questions, and both scales of material action and agency, incarnated in three extensive ethnographic journeys spanning three years (2012-2015): the first in a Brooklyn-based biotech community laboratory, the second in the early days of a biotech community group in Montreal, and the third in the world's first synthetic biology startup accelerator in Cork, Ireland. The inquiry's guiding empirical logic is neither solely deductive nor inductive, but transductive. It borrows from Gilbert Simondon's philosophy of communication and information to experience epistemology as an act of analogical creation involving the radical, irreversible transformation of knower and known. Transductive heuristics offer unconventional encounters with practices, metaphors and analogies of code. In the end, transductive methods acknowledge code convergence as metastable writing games, and ethnographic research itself as a transductive process.

    A Novel Technique for Compressing Pattern Databases in the Pancake Sorting Problems

    No full text
    In this paper we present a lossless technique to compress pattern databases (PDBs) for the Pancake Sorting problems. This compression technique, together with the choice of zero-cost operators in the construction of additive PDBs, greatly reduces the memory required for PDBs in these problems, making otherwise intractable instances efficiently solvable. Using this method we can also construct problem-size-independent PDBs, which removes the need to build new PDBs for problems with different numbers of pancakes. In addition to the compression technique, by maximizing over the heuristic values of the additive PDBs and a modified version of the gap heuristic, we obtain powerful heuristics for the burnt pancake problem.
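    For reference, the standard (unburnt) gap heuristic that the paper's modified version builds on counts adjacent pancakes whose sizes differ by more than one, since each such gap needs at least one flip to repair. A minimal sketch follows; the paper's burnt-pancake variant and the PDB compression itself are not reproduced here.

        def gap_heuristic(stack):
            # stack[0] is the top pancake; sizes are 1..n, and n + 1 stands for the plate.
            n = len(stack)
            padded = list(stack) + [n + 1]
            return sum(1 for a, b in zip(padded, padded[1:]) if abs(a - b) > 1)

        print(gap_heuristic((3, 1, 2, 4)))   # 2: gaps between 3|1 and 2|4
        print(gap_heuristic((1, 2, 3, 4)))   # 0: already sorted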

    The 2nd International Electronic Conference on Applied Sciences

    Get PDF
    This book is focused on the works presented at the 2nd International Electronic Conference on Applied Sciences, organized by the journal Applied Sciences from 15 to 31 October 2021 on the MDPI Sciforum platform. Two decades have passed since the start of the 21st century, and science and technology are developing ever faster today than in the previous century. The field of science is expanding, and the structure of science is becoming ever richer. Because of this expansion and increasingly fine structure, researchers may lose themselves in the deep forest of the ever-increasing frontiers and sub-fields being created. This international conference on the applied sciences was started to help scientists follow the growth of these frontiers by breaking down barriers and connecting the many sub-fields, cutting a path through this vast forest. The conference allows researchers to see these frontiers and their surrounding (or quite distant) fields and sub-fields, and gives them the opportunity to incubate and develop their knowledge even further with the aid of this multi-dimensional network.

    Techniques of design optimisation for algorithms implemented in software

    Get PDF
    The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General Purpose Graphics Processors (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal. The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation of lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs) could be parallelised, whether GPGPUs could be used to implement that algorithm, and whether a software implementation, with or without GPGPU acceleration, could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed and the algorithm was parallelised. The algorithm was implemented in software for GPGPU and optimised. During the optimisation process, profiling revealed less than optimal device utilisation, but no further optimisations resulted in an improvement in speed: the design had hit a local maximum of performance. Analysis of the arithmetic intensity and data flow exposed flaws in kernel occupancy, the standard metric used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation and providing sufficient throughput for the next generation of image sensors as well. The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination, which is involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize greater than 10^18 variants. Although originally thought to be random, the process clearly has non-random underlying mechanisms that preferentially create a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) of how this special population is created, while generating the data needed to develop novel hypotheses.
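    The reasoning behind moving away from occupancy as the primary metric can be illustrated with a roofline-style estimate. The numbers below are illustrative only, not measurements from the thesis: they show how fusing kernels, which raises arithmetic intensity by keeping intermediate results on-chip, moves a bandwidth-bound kernel closer to the compute peak even if occupancy drops.

        def roofline_gflops(flops_per_byte, peak_gflops, peak_bandwidth_gbs):
            # A kernel cannot exceed the lower of the compute peak and the product of
            # its arithmetic intensity (FLOPs per byte of DRAM traffic) and bandwidth.
            return min(peak_gflops, flops_per_byte * peak_bandwidth_gbs)

        # Hypothetical device: 1000 GFLOP/s peak, 150 GB/s DRAM bandwidth.
        print(roofline_gflops(0.5, 1000, 150))   # 75.0  -> bandwidth bound, more occupancy won't help
        print(roofline_gflops(4.0, 1000, 150))   # 600.0 -> fused kernel, far closer to peak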

    The Third NASA Goddard Conference on Mass Storage Systems and Technologies

    Get PDF
    This report contains copies of nearly all of the technical papers and viewgraphs presented at the Goddard Conference on Mass Storage Systems and Technologies held in October 1993. The conference served as an informational exchange forum for topics primarily relating to the ingestion and management of massive amounts of data and the attendant problems involved. Discussion topics included the necessary use of computers in the solution of today's enormously complex problems, the need for greatly increased storage densities in both optical and magnetic recording media, currently popular storage media and the risk factors of magnetic media storage, and data archiving standards, including a talk on the current status of the IEEE Storage Systems Reference Model (RM). Additional topics addressed system performance, data storage system concepts, communications technologies, data distribution systems, data compression, and error detection and correction.