Efficient analysis and storage of large-scale genomic data
The impending advent of population-scale sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. To enable the use of this new operator, and of the canonical population count, on any target machine, I developed a unified low-level library that uses CPU dispatching to select the optimal method at run-time, contingent on the available instruction set architecture and the given input size. As a proof-of-principle application, I apply the positional population count to computing quality-control summary statistics for terabyte-scale sequencing readsets, with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersections using these operators, and apply it to compute genome-wide linkage disequilibrium in datasets with up to 67 million samples, achieving up to >60-fold speed improvements for dense genotypic vectors, and up to >250,000-fold savings in memory and >100,000-fold speed improvements for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data, together with graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets, with specialized algorithms for the genotype component of such datasets that achieve >10,000-fold savings in memory compared to the current interchange format.
Wellcome Trust
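The two core operators above can be sketched as scalar reference code; the function names below are illustrative, and the thesis's actual library replaces these loops with vectorized, CPU-dispatched kernels.

```python
def positional_popcount(words, width=8):
    """For each bit position 0..width-1, count how many of the
    packed words have that bit set.  Scalar reference only; the
    SIMD versions described in the text are far faster."""
    counts = [0] * width
    for w in words:
        for i in range(width):
            counts[i] += (w >> i) & 1
    return counts

def intersection_cardinality(a, b):
    """Cardinality of a set intersection over bit-packed vectors:
    the population count of the pairwise AND, as used here for
    linkage-disequilibrium computation."""
    return sum(bin(x & y).count("1") for x, y in zip(a, b))
```

On bit-packed genotype vectors, the AND-then-popcount form turns a per-sample comparison into a handful of word operations, which is what makes the reported speed-ups possible.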
GPU-Acceleration of In-Memory Data Analytics
Hardware advances strongly influence database system design. The flattening of CPU core speeds makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU's specialized memory and threading model. Given increasing memory capacities and users' need for fast interaction with their data, we focus on in-memory analytics.
Our techniques span different steps of the data processing pipeline: (1) data preprocessing, (2) query compilation, and (3) algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout for numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over state-of-the-art multi-core CPU libraries and is suitable for any massively parallel processor.
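Planning conjunctive filters typically reduces to ordering the predicates. A minimal sketch, using the classic rank = (selectivity − 1) / cost ordering rule for short-circuiting conjunctions (a textbook heuristic, not necessarily the exact planner this thesis implements; the predicate names and values are invented):

```python
def order_predicates(preds):
    """Order short-circuiting conjunctive predicates by the classic
    rank = (selectivity - 1) / cost rule, ascending, which minimizes
    the expected evaluation cost per tuple.  `preds` is a list of
    (name, selectivity, cost) tuples."""
    return sorted(preds, key=lambda p: (p[1] - 1.0) / p[2])

plan = order_predicates([
    ("match_url", 0.9, 5.0),   # expensive and barely selective
    ("len_lt_10", 0.1, 1.0),   # cheap and highly selective
    ("flag_set",  0.5, 1.0),
])
```

Cheap, highly selective predicates run first, so most tuples are rejected before the expensive predicate is ever evaluated.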
Fast Packet Processing on High Performance Architectures
The rapid growth of the Internet and the fast emergence of new network applications have brought great challenges and complex issues to deploying high-speed, QoS-guaranteed IP networks. For this reason, packet classification and network intrusion detection have assumed a key role in modern communication networks in order to provide QoS and security. In this thesis we describe a number of the most advanced solutions to these tasks. We introduce NetFPGA and Network Processors as reference platforms for both the design and the implementation of the solutions and algorithms described in this thesis. The rise in link capacity reduces the time available to network devices for packet processing. For this reason, we show different solutions which, either by heuristics and randomization or by smart construction of state machines, allow IP lookup, packet classification, and deep packet inspection to run fast on real devices based on high-speed platforms such as NetFPGA or Network Processors.
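The IP-lookup step mentioned above is, at its core, a longest-prefix match. A minimal software baseline is a binary trie; the hardware-oriented schemes in the thesis accelerate variants of this idea (class and method names here are illustrative):

```python
class TrieNode:
    __slots__ = ("children", "nexthop")
    def __init__(self):
        self.children = [None, None]  # one child per bit value
        self.nexthop = None           # set if a prefix ends here

class LPMTable:
    """Binary-trie longest-prefix match over bit lists."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, nexthop):
        node = self.root
        for b in prefix_bits:
            if node.children[b] is None:
                node.children[b] = TrieNode()
            node = node.children[b]
        node.nexthop = nexthop

    def lookup(self, addr_bits):
        # Walk the address bits, remembering the last prefix seen.
        node, best = self.root, None
        for b in addr_bits:
            if node.nexthop is not None:
                best = node.nexthop
            node = node.children[b]
            if node is None:
                return best
        if node.nexthop is not None:
            best = node.nexthop
        return best
```

Each lookup costs one trie step per address bit, which is exactly the per-packet budget that rising link capacities squeeze, hence the heuristic and hardware approaches surveyed in the thesis.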
Electronic instructional materials and course requirements "Computer science" for specialty: 1-53 01 01 «Automation of technological processes and production»
The purpose of the electronic instructional materials and course requirements for the discipline «Computer science» (EIMCR) is to develop theoretical, systemic, and practical knowledge in different fields of computer science. Structure and presentation of the educational material: the EIMCR includes the following sections: theoretical, practical, knowledge control, and auxiliary. The theoretical section presents lecture material in accordance with the main sections and topics of the syllabus. The practical section contains materials for conducting practical classes aimed at developing modern computational thinking, basic skills in computing, and decision-making in the fundamentals of computer theory and many fields of computer science. The knowledge control section contains guidelines for completing the control work, aimed at developing the skills of independent work on the course under study and of selecting, analysing, and writing out the necessary material, as well as the correct execution of the tasks, together with the list of questions for the course credit. The auxiliary section contains the following elements of the syllabus: the explanatory note; the thematic lecture plan; tables of the distribution of classroom hours by topic; and the informational and methodological part. The EIMCR contains active links for quickly finding the necessary material.
Object-based video representations: shape compression and object segmentation
Object-based video representations are considered to be useful for easing the process of multimedia content production and enhancing user interactivity in multimedia productions. Object-based video presents several new technical challenges, however.
Firstly, as with conventional video representations, compression of the video data is a requirement. For object-based representations, it is necessary to compress the shape of each video object as it moves in time. This amounts to the compression of moving binary images, which is achieved by a technique called context-based arithmetic encoding. The technique is applied to rectangular pixel blocks and as such is consistent with the standard tools of video compression. The block-based application also facilitates the exploitation of temporal redundancy in the sequence of binary shapes. For the first time, context-based arithmetic encoding is used in conjunction with motion compensation to provide inter-frame compression. The method described in this thesis has been thoroughly tested throughout the MPEG-4 core-experiment process and, owing to favourable results, has been adopted as part of the MPEG-4 video standard.
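The essence of context-based coding is that each shape pixel is coded under a probability model selected by its already-coded neighbours. A toy sketch with a hypothetical 4-pixel causal template (the actual MPEG-4 CAE templates are larger, and the arithmetic coder itself is omitted here):

```python
def context_index(shape, x, y,
                  template=((-1, 0), (0, -1), (-1, -1), (1, -1))):
    """Build a context number from already-coded neighbour pixels.
    `shape` is a 2-D list of 0/1 values; pixels outside the frame
    read as 0.  The 4-pixel causal template is a stand-in for the
    larger templates used by MPEG-4 CAE."""
    idx = 0
    for dx, dy in template:
        nx, ny = x + dx, y + dy
        inside = 0 <= ny < len(shape) and 0 <= nx < len(shape[0])
        idx = (idx << 1) | (shape[ny][nx] if inside else 0)
    return idx

class AdaptiveModel:
    """Per-context adaptive 0/1 counts; p(1|context) would drive
    the arithmetic coder."""
    def __init__(self, n_contexts):
        self.c0 = [1] * n_contexts  # Laplace-style initial counts
        self.c1 = [1] * n_contexts
    def p1(self, ctx):
        return self.c1[ctx] / (self.c0[ctx] + self.c1[ctx])
    def update(self, ctx, bit):
        (self.c1 if bit else self.c0)[ctx] += 1
```

Because neighbouring shape pixels are strongly correlated, most contexts become heavily skewed toward 0 or 1, which is what lets the arithmetic coder spend well under one bit per pixel.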
The second challenge lies in the acquisition of the video objects. Under normal conditions, a video sequence is captured as a sequence of frames, and there is no inherent information about what objects are in the sequence, let alone information relating to the shape of each object. Some means of segmenting semantic objects from general video sequences is required. For this purpose, several image-analysis tools may be of help; in particular, it is believed that video-object tracking algorithms will be important. A new tracking algorithm is developed based on piecewise-polynomial motion representations and statistical estimation tools, e.g. the expectation-maximisation method and the minimum description length principle.
Optimizing the recovery of data consistency gossip algorithms on distributed object-store systems (CEPH)
The volume of data on the Internet is growing rapidly, and systems for storing and preserving this sheer volume of information are on the rise. Ceph is a distributed storage system for handling large amounts of data; it was initially developed by Sage Weil (Red Hat) and has gained popularity over the years. Ceph is used for big-data storage in large organizations such as Cisco, CERN, and Deutsche Telekom. Although it is a popular system, as with any other distributed system its individual components fail over the course of time, and recovery mechanisms must then take place to resolve any issues. In this thesis, we introduce a new way to synchronize the data between replicas to make the data consistent, by identifying and filtering out unchanged objects. The current recovery algorithm in Ceph is a durable yet simplistic implementation with regard to disk access and memory consumption. As technology evolves and faster storage solutions emerge (e.g. PCIe SSDs, NVMe), practices such as write-ahead logging (WAL) for data consistency can also introduce new problems. Logging thousands of write operations per second under a degraded cluster can rapidly increase memory consumption and fail a storage node (degraded is a cluster state in which a storage node is down for any reason). Although Ceph now supports an upper limit on the number of entries in its WAL, this limit is often reached, and it invalidates the log entirely because any new entries will be lost. The system is then left to check every object of the replicas in order to synchronize them, which is a very slow process. Hence, we introduce Merkle trees as an alternative to Bloom filters, so that the recovery procedure can identify regions where objects were not modified and thus reduce the recovery time. The recovery process has an observable impact on users' I/O bandwidth, and the overall user experience can be improved by reducing the cluster's recovery times. Our benchmarks show a performance increase of 10% to 400%, varying with how many objects were affected during the downtime of one or more nodes.
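The Merkle-tree idea above can be sketched as follows: hash object chunks into a tree, then compare two trees top-down so that identical subtrees are skipped and only differing leaves are re-synced. A minimal illustration, not Ceph's actual on-disk structures (the helper names are invented):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(chunks):
    """Bottom-up Merkle tree over object chunks.  Returns a list of
    levels, levels[0] being the leaf hashes and levels[-1] the root."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(h(level[i] + level[i + 1]))
            else:
                nxt.append(h(level[i]))  # odd node carried up
        level = nxt
        levels.append(level)
    return levels

def diff_leaves(a, b):
    """Indices of leaves whose hashes differ between two equal-shape
    trees -- the only regions the recovery pass must re-sync."""
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]
```

If the roots match, the replicas are already consistent and recovery does no per-object work at all, which is precisely the case the full-scan approach pays for anyway.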
Physiological system modelling
Computer graphics has a major impact on our day-to-day life. It is used in diverse areas such as displaying the results of engineering and scientific computations, visualization, producing television commercials and feature films, simulation and analysis of real-world problems, computer-aided design, and graphical user interfaces that increase the communication bandwidth between humans and machines. Scientific visualization is a well-established method for the analysis of data originating from scientific computations, simulations, or measurements. This report presents the 3Dgen software, developed by the author in the C language using OpenGL. 3Dgen was used to visualize three-dimensional cylindrical models such as pipes, and also, to a limited extent, for virtual endoscopy. Using the software, a model is created from centreline data entered by the user or taken from the output of another program, stored in a plain text file. The model is constructed by drawing surface polygons between adjacent centreline points. The software allows the user to view the internal and external surfaces of the model. It was designed to run on more than one operating system with minimal installation procedures, and since its size is very small it can be stored on a 1.44-megabyte floppy diskette. Depending on the processing speed of the PC, the software can generate models of any length and size. Compared to other packages, 3Dgen has minimal input procedures and was able to generate models with smooth bends. It has both modelling and virtual-exploration features. For models with sharp bends, the software generates an overshoot.
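The construction of surface polygons between adjacent centreline points can be sketched as placing a ring of vertices around each point and joining neighbouring rings with quads. A simplified version that assumes the centreline runs roughly along the Z axis (3Dgen orients each ring along the local centreline direction, which this sketch omits; all names are illustrative):

```python
import math

def tube_rings(centreline, radius=1.0, segments=8):
    """Place a ring of `segments` vertices around each centreline
    point, in the XY plane."""
    rings = []
    for cx, cy, cz in centreline:
        ring = [(cx + radius * math.cos(2 * math.pi * k / segments),
                 cy + radius * math.sin(2 * math.pi * k / segments),
                 cz)
                for k in range(segments)]
        rings.append(ring)
    return rings

def quads_between(ring_a, ring_b):
    """Surface polygons joining two adjacent rings."""
    n = len(ring_a)
    return [(ring_a[k], ring_a[(k + 1) % n],
             ring_b[(k + 1) % n], ring_b[k])
            for k in range(n)]
```

Rendering the quads with outward or inward normals gives the external or internal surface view, respectively; at sharp bends, adjacent rings can intersect, which is the overshoot noted above.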
Remote Sensing Data Compression
A huge amount of data is nowadays acquired by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored, and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection on which this book is based touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images remains a current topic, since such images constitute data arrays of extremely large size, rich in information that can be retrieved for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transmission from UAV-based acquisition platforms, as well as the use of FPGAs and neural networks, have become very important. Finally, attempts to apply compressive-sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.
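Lossless compression of multi- and hyperspectral images typically exploits the strong correlation between adjacent spectral bands: each band is predicted from the previous one, and only the residuals are entropy-coded. A minimal sketch of this general idea (not any specific method from the collection; the entropy coder is omitted):

```python
def band_residuals(bands):
    """Inter-band predictive coding: keep the first band as-is and
    replace each later band by its difference from the previous one.
    Residuals of correlated bands are small and compress well."""
    first = bands[0]
    rows, cols = len(first), len(first[0])
    residuals = [[[bands[b][r][c] - bands[b - 1][r][c]
                   for c in range(cols)]
                  for r in range(rows)]
                 for b in range(1, len(bands))]
    return first, residuals

def reconstruct(first, residuals):
    """Exact (lossless) inverse of band_residuals."""
    bands = [first]
    for res in residuals:
        prev = bands[-1]
        bands.append([[prev[r][c] + res[r][c]
                       for c in range(len(res[0]))]
                      for r in range(len(res))])
    return bands
```

The round trip is exact, which is the defining property of lossless coding; near-lossless schemes additionally quantize the residuals within a guaranteed error bound.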
Learning with Labeled and Unlabeled Data
In this paper, on the one hand, we aim to give a review of the literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being part of the author's first-year PhD report, the paper serves as a frame to bundle related work by the author, as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as a basis for a genuine method. In the literature review, we try to cover the wide variety of (recent) work and to classify this work into meaningful categories. We also mention work done on related problems and suggest some ideas towards a synthesis. Finally, we discuss some caveats and trade-offs of central importance to the problem.
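One of the simplest baseline schemes of the kind described above is self-training: fit on the labeled data, pseudo-label the unlabeled points the model is most confident about, and refit. A toy sketch with a 1-NN "model" on scalar inputs and a distance-margin confidence (purely illustrative; not a method proposed in the paper):

```python
def self_train(labeled, unlabeled, rounds=3, threshold=0.8):
    """Self-training baseline.  `labeled` is a list of (x, label)
    pairs, `unlabeled` a list of x.  Each round, unlabeled points
    whose nearest-class margin exceeds `threshold` are pseudo-labeled
    and added to the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        newly = []
        for x in pool:
            # distance to the nearest labeled point of each class
            dists = {}
            for lx, ly in labeled:
                d = abs(x - lx)
                if ly not in dists or d < dists[ly]:
                    dists[ly] = d
            best = min(dists, key=dists.get)
            others = [v for k, v in dists.items() if k != best]
            conf = (1.0 if not others else
                    min(others) / (min(others) + dists[best] + 1e-12))
            if conf >= threshold:
                newly.append((x, best))
        if not newly:
            break
        labeled.extend(newly)
        taken = {p for p, _ in newly}
        pool = [x for x in pool if x not in taken]
    return labeled
```

The scheme's well-known caveat, discussed in the review, is that confident but wrong pseudo-labels reinforce themselves, which is why the confidence threshold and the choice of base learner matter.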