
    METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

    High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation, including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy was evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity and evolution and to classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and disrupted genes arising from draft genome sequencing or from pseudogenes. An evaluation of pan-genomes indicates that such anomalies are common and that alternative annotations suggested by the tool can improve annotation consistency and quality. Finally, I describe the Cloud Virtual Resource (CloVR), a desktop application for automated sequence analysis that improves the usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources required for microbial sequence analysis.
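
    As a rough illustration of the coarse-grained parallelism described for the distributed aligner, the sketch below farms out pairwise genome comparisons across local CPUs. The align_pair function and the FASTA file names are hypothetical placeholders, not part of Mugsy's actual interface; this is a minimal sketch of the general idea, not the tool's implementation.

        # Minimal sketch: distribute pairwise genome comparisons across CPUs.
        # align_pair is a hypothetical stand-in for an external aligner call;
        # Mugsy's real pipeline and interfaces differ.
        from itertools import combinations
        from multiprocessing import Pool

        def align_pair(pair):
            genome_a, genome_b = pair
            # Placeholder: run a pairwise aligner here and return a result label.
            return f"aligned {genome_a} vs {genome_b}"

        if __name__ == "__main__":
            genomes = ["strain1.fasta", "strain2.fasta", "strain3.fasta", "strain4.fasta"]
            pairs = list(combinations(genomes, 2))   # all-vs-all pairwise jobs
            with Pool(processes=4) as pool:
                for result in pool.map(align_pair, pairs):
                    print(result)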

    Socially Trusted Collaborative Edge Computing in Ultra Dense Networks

    Small cell base stations (SBSs) endowed with cloud-like computing capabilities are considered a key enabler of edge computing (EC), which provides ultra-low latency and location awareness for a variety of emerging mobile applications and the Internet of Things. However, due to the limited computation resources of an individual SBS, providing high-quality computation services to its users faces significant challenges when the SBS is overloaded with an excessive amount of computation workload. In this paper, we propose collaborative edge computing among SBSs by forming SBS coalitions to share computation resources with each other, thereby accommodating more computation workload in the edge system and reducing reliance on the remote cloud. A novel SBS coalition formation algorithm is developed based on coalitional game theory to cope with various new challenges in small-cell-based edge systems, including the co-provisioning of radio access and computing services, cooperation incentives, and potential security risks. To address these challenges, the proposed method (1) allows collaboration at both the user-SBS association stage and the SBS peer offloading stage by exploiting the ultra dense deployment of SBSs, (2) develops a payment-based incentive mechanism that implements proportionally fair utility division to form stable SBS coalitions, and (3) builds a social trust network for managing security risks among SBSs due to collaboration. Systematic simulations in practical scenarios are carried out to evaluate the efficacy and performance of the proposed method, showing that substantial edge computing performance improvements can be achieved.
    Comment: arXiv admin note: text overlap with arXiv:1010.4501 by other authors
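
    To make the idea of utility division concrete, here is a toy sketch of one possible proportional rule, in which a coalition's total utility is split in proportion to each SBS's standalone utility. The exact fairness criterion, utility model, and payment mechanism in the paper may well differ; the function and numbers below are illustrative assumptions only.

        # Toy sketch: split a coalition's total utility among member SBSs
        # in proportion to their standalone utilities (illustrative rule only).
        def proportional_split(coalition_utility, standalone_utilities):
            total = sum(standalone_utilities.values())
            if total == 0:
                # Degenerate case: share equally if no SBS has standalone utility.
                share = coalition_utility / len(standalone_utilities)
                return {sbs: share for sbs in standalone_utilities}
            return {
                sbs: coalition_utility * u / total
                for sbs, u in standalone_utilities.items()
            }

        # Example: three SBSs pooling resources in one coalition.
        standalone = {"SBS-1": 4.0, "SBS-2": 2.0, "SBS-3": 2.0}
        print(proportional_split(12.0, standalone))
        # -> {'SBS-1': 6.0, 'SBS-2': 3.0, 'SBS-3': 3.0}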

    Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective

    Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost-efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged and produced a large number of publications and initiatives. In this paper, we rely on a structured methodology to contextualize and provide a critical analysis of the current knowledge on privacy-enhancing technologies used for testing, storing, and sharing genomic data, using a representative sample of the work published in the past decade. We identify and discuss limitations, technical challenges, and issues faced by the community, focusing in particular on those that are inherently tied to the nature of the problem and are harder for the community alone to address. Finally, we report on the importance and difficulty of the identified challenges based on an online survey of genome data privacy experts.
    Comment: To appear in the Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2019, Issue
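
    For readers unfamiliar with the kind of privacy-enhancing technologies such surveys cover, the following toy sketch shows one generic building block, additive secret sharing, which lets two non-colluding servers compute an aggregate allele count without either seeing individual genotypes. It is a didactic example under simplified assumptions, not a technique attributed to any specific system reviewed in the paper.

        # Toy sketch: 2-party additive secret sharing of genotype values
        # (0, 1, or 2 copies of a variant allele) so that an aggregate count
        # can be computed without either server seeing raw genotypes.
        import random

        PRIME = 2_147_483_647  # large prime modulus for the shares

        def share(value):
            r = random.randrange(PRIME)
            return r, (value - r) % PRIME          # share_1, share_2

        genotypes = [0, 1, 2, 1, 0, 2]             # hypothetical participants
        shares_1, shares_2 = zip(*(share(g) for g in genotypes))

        # Each server sums only its own shares; recombining the two partial
        # sums reveals the aggregate count but not the individual values.
        aggregate = (sum(shares_1) + sum(shares_2)) % PRIME
        print(aggregate)                           # -> 6, the total allele count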

    High performance computing in the cloud

    In recent years, interest in both scientific and business workflows has increased. A workflow is composed of a series of tools that should be executed in a predefined order to perform an analysis. Traditionally, these workflows were executed manually, sending the output of one tool to the next one in the analysis process. Many applications for executing workflows automatically have appeared recently; these applications ease the work of users while executing their analyses. In addition, from a computational point of view, some workflows require a significant amount of resources. Consequently, workflow execution has moved from single workstations to distributed environments such as grids or clouds. Data management and task scheduling are required to execute workflows efficiently in such environments. In this thesis, we propose a cloud-based HPC environment, focusing on task scheduling, resource auto-scaling, data management, and simplified access to the resources through software clients. First, the cloud computing infrastructure is devised, which includes the base software (i.e. OpenStack) plus several additional modules aimed at improving authentication (i.e. LDAP) and data management (i.e. GridFTP, Globus Online and CloudFuse). Second, built on top of this infrastructure, the TORQUE distributed resource manager and the Maui scheduler have been configured to schedule and distribute tasks to the cloud-based workers. To reduce the number of idle nodes and the cost incurred by active cloud resources, we also propose a configurable auto-scaling technique that is able to scale the execution cluster depending on the workload. Additionally, in order to simplify task submission to the TORQUE execution cluster, we have interconnected the Galaxy workflow management system with it, so users benefit from a simple way to execute their tasks. Finally, we conducted an experimental evaluation, composed of a number of different studies with synthetic and real-world applications, to show the behaviour of the auto-scaled execution cluster managed by TORQUE and Maui. All experiments were performed using an OpenStack cloud computing environment, and the benchmarked applications come from a benchmarking suite specially designed for workflow scheduling in cloud computing environments. CyberShake, LIGO and Montage were the selected synthetic applications from the benchmarking suite. GECKO and a GWAS pipeline represent the real-world test use cases, both having a diverse and heterogeneous set of tasks.

    The numerous technological advances in data acquisition techniques allow the massive production of enormous amounts of data in diverse fields such as astronomy, health and social networks. Nowadays, only a small part of these data can be analysed because of the lack of computational resources. High Performance Computing (HPC) strategies represent the only choice for analysing such an overwhelming amount of data. However, in general, HPC techniques require the use of large and expensive computing and storage infrastructures, usually not affordable or available for most users. Cloud computing, where users pay for the resources they need and when they actually need them, appears as an interesting alternative. Besides the savings in hardware infrastructure, cloud computing offers further advantages, such as the removal of installation, administration and supply requirements. In addition, it enables users to use better hardware than they could usually afford, to scale resources depending on their needs, and to obtain greater fault tolerance, among other benefits. The efficient utilisation of HPC resources becomes a fundamental task, particularly in cloud computing. The cost of using HPC resources must be considered, especially in cloud-based infrastructures, where users have to pay for storing, transferring and analysing data. Therefore, the use of generic task scheduling and auto-scaling techniques is essential to exploit the computational resources efficiently. It is equally important to make these tasks user-friendly through the development of tools/applications (software clients) that act as an interface between the user and the infrastructure.
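
    The configurable auto-scaling idea can be illustrated with a simple threshold policy: grow the cluster when jobs queue up and shrink it when workers sit idle. The function below is a minimal, self-contained sketch under that assumption; the thesis's actual technique, thresholds, and integration with TORQUE/Maui and the OpenStack API are not reproduced here.

        # Minimal sketch of a threshold-based auto-scaling policy for an
        # execution cluster. A real deployment would query the resource
        # manager for queue state and call the cloud API to add/remove nodes.
        def autoscale_step(queued, idle, active, min_workers=1, max_workers=16):
            """Return the desired number of workers for the next interval."""
            if queued > 0 and active < max_workers:
                return active + 1    # scale out: pending work, capacity left
            if queued == 0 and idle > 0 and active > min_workers:
                return active - 1    # scale in: no backlog, idle capacity
            return active            # otherwise hold steady

        # Example: 5 queued jobs, no idle workers, 3 active -> grow to 4.
        print(autoscale_step(queued=5, idle=0, active=3))   # -> 4
        print(autoscale_step(queued=0, idle=2, active=4))   # -> 3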

    Framing Apache Spark in life sciences

    Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has led researchers to face technical and infrastructural challenges in storing, sharing, and analysing such data. In fact, these tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms able to adapt the computation to the data on on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
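
    As a minimal taste of the programming model the review discusses, the PySpark sketch below loads a small tabular dataset and aggregates it as a distributed job. The column names and sample rows are made up for illustration; a real analysis would read from HDFS, object storage, or similar.

        # Minimal PySpark sketch: aggregate tabular life-science records.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("variants-per-chromosome").getOrCreate()

        rows = [("chr1", "rs1", 0.12), ("chr1", "rs2", 0.34), ("chr2", "rs3", 0.05)]
        df = spark.createDataFrame(rows, ["chrom", "variant_id", "allele_freq"])

        # groupBy/count runs as a distributed job over the cluster's executors.
        df.groupBy("chrom").count().show()

        spark.stop()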

    Keystroke Biometrics in Response to Fake News Propagation in a Global Pandemic

    This work proposes and analyzes the use of keystroke biometrics for content de-anonymization. Fake news has become a powerful tool to manipulate public opinion, especially during major events. In particular, the massive spread of fake news during the COVID-19 pandemic has forced governments and companies to fight against misinformation. In this context, the ability to link multiple accounts or profiles that spread such malicious content on the Internet while hiding in anonymity would enable proactive identification and blacklisting. Behavioral biometrics can be powerful tools in this fight. In this work, we have analyzed how the latest advances in keystroke biometric recognition can help to link behavioral typing patterns in experiments involving 100,000 users and more than 1 million typed sequences. Our proposed system is based on Recurrent Neural Networks adapted to the context of content de-anonymization. Given the challenge of linking the typed content of a target user within a pool of candidate profiles, our results show that keystroke recognition can be used to reduce the list of candidate profiles by more than 90%. In addition, when keystroke dynamics are combined with auxiliary data (such as location), our system achieves a Rank-1 identification performance of 52.6% and 10.9% for background candidate lists of 1K and 100K profiles, respectively.
    Comment: arXiv admin note: text overlap with arXiv:2004.0362
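
    For intuition only, the sketch below shows how a recurrent network can map a sequence of per-keystroke timing features to a fixed-size embedding, with two samples compared by embedding distance. This is not the authors' architecture or training setup; the feature dimension, network sizes, and random inputs are assumptions made for the example.

        # Illustrative sketch (not the paper's model): an LSTM encodes keystroke
        # timing sequences into embeddings that can be compared for linking.
        import torch
        import torch.nn as nn

        class KeystrokeEncoder(nn.Module):
            def __init__(self, feat_dim=4, hidden=64, emb_dim=32):
                super().__init__()
                self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.proj = nn.Linear(hidden, emb_dim)

            def forward(self, x):              # x: (batch, seq_len, feat_dim)
                _, (h_n, _) = self.lstm(x)
                return self.proj(h_n[-1])      # (batch, emb_dim)

        encoder = KeystrokeEncoder()
        sample_a = torch.randn(1, 50, 4)       # 50 keystrokes, 4 timing features
        sample_b = torch.randn(1, 50, 4)
        dist = torch.dist(encoder(sample_a), encoder(sample_b))
        print(float(dist))                     # smaller -> more likely same typist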

    HIGH QUALITY HUMAN 3D BODY MODELING, TRACKING AND APPLICATION

    Geometric reconstruction of dynamic objects is a fundamental task of computer vision and graphics, and modeling the human body with high fidelity is considered a core part of this problem. Traditional human shape and motion capture techniques require an array of surrounding cameras or require subjects to wear reflective markers, limiting working space and portability. In this dissertation, a complete pipeline is designed, from geometric modeling of a detailed 3D full human body and capturing its shape dynamics over time with a flexible setup, to guiding clothes/person re-targeting with such data-driven models. Since the mechanical movement of the human body can be considered articulated motion, which readily drives skin animation but makes the reverse process of recovering parameters from images without manual intervention difficult, we present a novel parametric model, GMM-BlendSCAPE, which jointly takes the linear skinning model and the prior art of BlendSCAPE (Blend Shape Completion and Animation for PEople) into consideration, and we develop a Gaussian Mixture Model (GMM) to infer both body shape and pose from incomplete observations. We show the increased accuracy of joint and skin surface estimation using our model compared to skeleton-based motion tracking. To model the detailed body, we start by capturing high-quality partial 3D scans with a single-view commercial depth camera. Based on GMM-BlendSCAPE, we can then reconstruct multiple complete static models with large pose differences via our novel non-rigid registration algorithm. With vertex correspondences established, these models can be further converted into a personalized drivable template and used for robust pose tracking in a similar GMM framework. Moreover, we design a general-purpose real-time non-rigid deformation algorithm to accelerate this registration. Last but not least, we demonstrate a novel virtual clothes try-on application based on our personalized model, utilizing both image and depth cues to synthesize and re-target clothes for single-view videos of different people.
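
    As a rough illustration of how a GMM can act as a prior over body parameters, the sketch below fits a mixture model to hypothetical pose/shape vectors and scores a candidate pose by its log-likelihood. The data, dimensionality, and number of components are assumptions for the example; the actual GMM-BlendSCAPE formulation is more involved and is not reproduced here.

        # Illustrative sketch only: a GMM over (hypothetical) pose/shape
        # parameter vectors, used as a plausibility prior for candidate poses.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        train_params = rng.normal(size=(500, 10))   # stand-in for registered poses

        gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
        gmm.fit(train_params)

        candidate = rng.normal(size=(1, 10))        # a candidate pose estimate
        print(gmm.score_samples(candidate))         # higher -> more plausible pose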