1,308 research outputs found

    Revisiting 1-Copy equivalence in clustered databases

    Get PDF
    Recently renewed interest in scalable database systems for shared nothing clusters has been supported by replication protocols based on group communication that are aimed at seamlessly extending the native consistency criteria of centralized database management systems. By using a read-one/write-all-available approach and avoiding the fine-grained synchronization associated with traditional distributed locking, one needs just a single distributed interaction step for each update transaction. Therefore the system can easily be scaled to a large number of replicas, especially, with read intensive loads typical of Web server support environments.In this paper we point out that 1-copy equivalence for causal consistency, which is subsumed by both serializability and snapshot isolation criteria, depends on basic session guarantees that are costly to ensure in clusters, especially in a multi-tier environment. We then point out a simple solution that guarantees causal consistency in the Database State Machine protocol and evaluate its performance, thus highlighting the cost of seamlessly providing common consistency criteria of centralized databases in a clustered environment.(undefined

    A simple approach to shared storage database servers

    Get PDF
    This paper introduces a generic technique to obtain a shared-storage database cluster from an off-the-shelf database management system, without needing to heavily refactor server software to deal with distributed locking, buffer invalidation, and recovery from partial cluster failure. Instead, the core of the proposal is the combination of a replication protocol and a surprisingly simple modification to the common copy-on-write logical volume management technique: One of the servers is allowed to skip copy-on-write and directly update the original backing store. This makes it possible to use any shared-nothing database server software in a shared or partially shared storage configuration, thus allowing large cluster configurations with a small number of copies of data.(undefined

    On the cost of database clusters reconfiguration

    Get PDF
    Database clusters based on share-nothing replication techniques are currently widely accepted as a practical solution to scalability and availability of the data tier. A key issue when planning such systems is the ability to meet service level agreements when load spikes occur or cluster nodes fail. This translates into the ability to provision and deploy additional nodes. Many current research efforts focus on designing autonomic controllers to perform such reconfiguration, tuned to quickly react to system changes and spawn new replicas based on resource usage and performance measurements. In contrast, we are concerned about the inherent impact of deploying an additional node to an online cluster, considering both the time required to finish such an action as well as the impact on resource usage and performance of the cluster as a whole. If noticeable, such impact hinders the practicability of self-management techniques, since it adds an additional dimension that has to be accounted for. Our approach is to systematically benchmark a number of different reconfiguration scenarios to assess the cost of bringing a new replica online. We consider factors such as: workload characteristics, incremental and parallel recovery, flow control and outdatedness of the recovering replica. As a result, we show that research should be refocused from optimizing the capture and transmition of changes to applying them, which in a realistic setting dominates the cost of the recovery operation.Work supported by the Spanish Government under research grant TIN2006-14738-C02-02

    Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

    Full text link
    There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics

    A formal characterization of SI-based ROWA replication protocols

    Full text link
    Snapshot isolation (SI) is commonly used in some commercial DBMSs with a multiversion concurrency control mechanism since it never blocks read-only transactions. Recent database replication protocols have been designed using SI replicas where transactions are firstly executed in a delegate replica and their updates (if any) are propagated to the rest of the replicas at commit time; i.e. they follow the Read One Write All (ROWA) approach. This paper provides a formalization that shows the correctness of abstract protocols which cover these replication proposals. These abstract protocols differ in the properties demanded for achieving a global SI level and those needed for its generalized SI (GSI) variant Âż allowing reads from old snapshots. Additionally, we propose two more relaxed properties that also ensure a global GSI level. Thus, some applications can further optimize their performance in a replicated system while obtaining GSI. © 2010 Elsevier B.V. All rights reserved.The authors wish to thank the reviewers for their valuable comments that helped us to greatly improve the quality and readability of this paper. This work has been supported by the Spanish Government under research grant TIN2009-14460-C03. Besides, the authors wish to thank the reviewers for their valuable comments that helped us to greatly improve the quality and readability of this paper.ArmendĂĄriz-Iñigo, J.; JuĂĄrez-RodrĂ­guez, J.; GonzĂĄlez De MendĂ­vil, J.; Garitagoitia, J.; IrĂșn Briz, L.; Muñoz EscoĂ­, FD. (2011). A formal characterization of SI-based ROWA replication protocols. Data and Knowledge Engineering. 70(1):21-34. doi:10.1016/j.datak.2010.07.012S213470

    Inter-individual variation of the human epigenome & applications

    Get PDF
    Genome-wide association studies (GWAS) have led to the discovery of genetic variants influencing human phenotypes in health and disease. However, almost two decades later, most human traits can still not be accurately predicted from common genetic variants. Moreover, genetic variants discovered via GWAS mostly map to the non-coding genome and have historically resisted interpretation via mechanistic models. Alternatively, the epigenome lies in the cross-roads between genetics and the environment. Thus, there is great excitement towards the mapping of epigenetic inter-individual variation since its study may link environmental factors to human traits that remain unexplained by genetic variants. For instance, the environmental component of the epigenome may serve as a source of biomarkers for accurate, robust and interpretable phenotypic prediction on low-heritability traits that cannot be attained by classical genetic-based models. Additionally, its research may provide mechanisms of action for genetic associations at non-coding regions that mediate their effect via the epigenome. The aim of this thesis was to explore epigenetic inter-individual variation and to mitigate some of the methodological limitations faced towards its future valorisation.Chapter 1 is dedicated to the scope and aims of the thesis. It begins by describing historical milestones and basic concepts in human genetics, statistical genetics, the heritability problem and polygenic risk scores. It then moves towards epigenetics, covering the several dimensions it encompasses. It subsequently focuses on DNA methylation with topics like mitotic stability, epigenetic reprogramming, X-inactivation or imprinting. This is followed by concepts from epigenetic epidemiology such as epigenome-wide association studies (EWAS), epigenetic clocks, Mendelian randomization, methylation risk scores and methylation quantitative trait loci (mQTL). The chapter ends by introducing the aims of the thesis.Chapter 2 focuses on stochastic epigenetic inter-individual variation resulting from processes occurring post-twinning, during embryonic development and early life. Specifically, it describes the discovery and characterisation of hundreds of variably methylated CpGs in the blood of healthy adolescent monozygotic (MZ) twins showing equivalent variation among co-twins and unrelated individuals (evCpGs) that could not be explained only by measurement error on the DNA methylation microarray. DNA methylation levels at evCpGs were shown to be stable short-term but susceptible to aging and epigenetic drift in the long-term. The identified sites were significantly enriched at the clustered protocadherin loci, known for stochastic methylation in neurons in the context of embryonic neurodevelopment. Critically, evCpGs were capable of clustering technical and longitudinal replicates while differentiating young MZ twins. Thus, discovered evCpGs can be considered as a first prototype towards universal epigenetic fingerprint, relevant in the discrimination of MZ twins for forensic purposes, currently impossible with standard DNA profiling. Besides, DNA methylation microarrays are the preferred technology for EWAS and mQTL mapping studies. However, their probe design inherently assumes that the assayed genomic DNA is identical to the reference genome, leading to genetic artifacts whenever this assumption is not fulfilled. Building upon the previous experience analysing microarray data, Chapter 3 covers the development and benchmarking of UMtools, an R-package for the quantification and qualification of genetic artifacts on DNA methylation microarrays based on the unprocessed fluorescence intensity signals. These tools were used to assemble an atlas on genetic artifacts encountered on DNA methylation microarrays, including interactions between artifacts or with X-inactivation, imprinting and tissue-specific regulation. Additionally, to distinguish artifacts from genuine epigenetic variation, a co-methylation-based approach was proposed. Overall, this study revealed that genetic artifacts continue to filter through into the reported literature since current methodologies to address them have overlooked this challenge.Furthermore, EWAS, mQTL and allele-specific methylation (ASM) mapping studies have all been employed to map epigenetic variation but require matching phenotypic/genotypic data and can only map specific components of epigenetic inter-individual variation. Inspired by the previously proposed co-methylation strategy, Chapter 4 describes a novel method to simultaneously map inter-haplotype, inter-cell and inter-individual variation without these requirements. Specifically, binomial likelihood function-based bootstrap hypothesis test for co-methylation within reads (Binokulars) is a randomization test that can identify jointly regulated CpGs (JRCs) from pooled whole genome bisulfite sequencing (WGBS) data by solely relying on joint DNA methylation information available in reads spanning multiple CpGs. Binokulars was tested on pooled WGBS data in whole blood, sperm and combined, and benchmarked against EWAS and ASM. Our comparisons revealed that Binokulars can integrate a wide range of epigenetic phenomena under the same umbrella since it simultaneously discovered regions associated with imprinting, cell type- and tissue-specific regulation, mQTL, ageing or even unknown epigenetic processes. Finally, we verified examples of mQTL and polymorphic imprinting by employing another novel tool, JRC_sorter, to classify regions based on epigenotype models and non-pooled WGBS data in cord blood. In the future, we envision how this cost-effective approach can be applied on larger pools to simultaneously highlight regions of interest in the methylome, a highly relevant task in the light of the post-GWAS era.Moving towards future applications of epigenetic inter-individual variation, Chapters 5 and 6 are dedicated to solving some of methodological issues faced in translational epigenomics.Firstly, due to its simplicity and well-known properties, linear regression is the starting point methodology when performing prediction of a continuous outcome given a set of predictors. However, linear regression is incompatible with missing data, a common phenomenon and a huge threat to the integrity of data analysis in empirical sciences, including (epi)genomics. Chapter 5 describes the development of combinatorial linear models (cmb-lm), an imputation-free, CPU/RAM-efficient and privacy-preserving statistical method for linear regression prediction on datasets with missing values. Cmb-lm provide prediction errors that take into account the pattern of missing values in the incomplete data, even at extreme missingness. As a proof-of-concept, we tested cmb-lm in the context of epigenetic ageing clocks, one of the most popular applications of epigenetic inter-individual variation. Overall, cmb-lm offer a simple and flexible methodology with a wide range of applications that can provide a smooth transition towards the valorisation of linear models in the real world, where missing data is almost inevitable. Beyond microarrays, due to its high accuracy, reliability and sample multiplexing capabilities, massively parallel sequencing (MPS) is currently the preferred methodology of choice to translate prediction models for traits of interests into practice. At the same time, tobacco smoking is a frequent habit sustained by more than 1.3 billion people in 2020 and a leading (and preventable) health risk factor in the modern world. Predicting smoking habits from a persistent biomarker, such as DNA methylation, is not only relevant to account for self-reporting bias in public health and personalized medicine studies, but may also allow broadening forensic DNA phenotyping. Previously, a model to predict whether someone is a current, former, or never smoker had been published based on solely 13 CpGs from the hundreds of thousands included in the DNA methylation microarray. However, a matching lab tool with lower marker throughput, and higher accuracy and sensitivity was missing towards translating the model in practice. Chapter 6 describes the development of an MPS assay and data analysis pipeline to quantify DNA methylation on these 13 smoking-associated biomarkers for the prediction of smoking status. Though our systematic evaluation on DNA standards of known methylation levels revealed marker-specific amplification bias, our novel tool was still able to provide highly accurate and reproducible DNA methylation quantification and smoking habit prediction. Overall, our MPS assay allows the technological transfer of DNA methylation microarray findings and models to practical settings, one step closer towards future applications.Finally, Chapter 7 provides a general discussion on the results and topics discussed across Chapters 2-6. It begins by summarizing the main findings across the thesis, including proposals for follow-up studies. It then covers technical limitations pertaining bisulfite conversion and DNA methylation microarrays, but also more general considerations such as restricted data access. This chapter ends by covering the outlook of this PhD thesis, including topics such as bisulfite-free methods, third-generation sequencing, single-cell methylomics, multi-omics and systems biology.<br/

    Comparing Optimistic Database Replication Techniques

    Get PDF
    International audienceReplication is attractive for scaling databases up, as it does not require costly equipment and it enables fault tolerance. However, as the latency gap between local and remote accesses continues to widen, maintaining consistency between replicas remains a performance and complexity bottleneck. Optimistic replication (OR) addresses these problems. In OR, a database tentatively executes transactions against its local cache; databases reconcile a posteriori to agree on a common schedule of committed transactions. We present three OR protocols based on the deferred update scheme. The first two are representative of the state the art. The third is new; we describe it in detail. As all three protocols are expressed within a common formal framework, we are able to compare them, to identify similarities and differences, and to introduce common variants. We show that our protocol behaves better than the other two, with respect to latency, message cost and abort rate

    Inter-individual variation of the human epigenome &amp; applications

    Get PDF

    Correlating tephras and cryptotephras using glass compositional analyses and numerical and statistical methods:Review and evaluation

    Get PDF
    We define tephras and cryptotephras and their components (mainly ash-sized particles of glass ± crystals in distal deposits) and summarize the basis of tephrochronology as a chronostratigraphic correlational and dating tool for palaeoenvironmental, geological, and archaeological research. We then document and appraise recent advances in analytical methods used to determine the major, minor, and trace elements of individual glass shards from tephra or cryptotephra deposits to aid their correlation and application. Protocols developed recently for the electron probe microanalysis of major elements in individual glass shards help to improve data quality and standardize reporting procedures. A narrow electron beam (diameter ~3-5 ÎŒm) can now be used to analyze smaller glass shards than previously attainable. Reliable analyses of ‘microshards’ (defined here as glass shards <32 ”m in diameter) using narrow beams are useful for fine-grained samples from distal or ultra-distal geographic locations, and for vesicular or microlite-rich glass shards or small melt inclusions. Caveats apply, however, in the microprobe analysis of very small microshards (<=~5 ”m in diameter), where particle geometry becomes important, and of microlite-rich glass shards where the potential problem of secondary fluorescence across phase boundaries needs to be recognised. Trace element analyses of individual glass shards using laser ablation inductively coupled plasma-mass spectrometry (LA-ICP-MS), with crater diameters of 20 ÎŒm and 10 ÎŒm, are now effectively routine, giving detection limits well below 1 ppm. Smaller ablation craters (<10 ÎŒm) can be subject to significant element fractionation during analysis, but the systematic relationship of such fractionation with glass composition suggests that analyses for some elements at these resolutions may be quantifiable. In undertaking analyses, either by microprobe or LA-ICP-MS, reference material data acquired using the same procedure, and preferably from the same analytical session, should be presented alongside new analytical data. In part 2 of the review, we describe, critically assess, and recommend ways in which tephras or cryptotephras can be correlated (in conjunction with other information) using numerical or statistical analyses of compositional data. Statistical methods provide a less subjective means of dealing with analytical data pertaining to tephra components (usually glass or crystals/phenocrysts) than heuristic alternatives. They enable a better understanding of relationships among the data from multiple viewpoints to be developed and help quantify the degree of uncertainty in establishing correlations. In common with other scientific hypothesis testing, it is easier to infer using such analysis that two or more tephras are different rather than the same. Adding stratigraphic, chronological, spatial, or palaeoenvironmental data (i.e. multiple criteria) is usually necessary and allows for more robust correlations to be made. A two-stage approach is useful, the first focussed on differences in the mean composition of samples, or their range, which can be visualised graphically via scatterplot matrices or bivariate plots coupled with the use of statistical tools such as distance measures, similarity coefficients, hierarchical cluster analysis (informed by distance measures or similarity or cophenetic coefficients), and principal components analysis (PCA). Some statistical methods (cluster analysis, discriminant analysis) are referred to as ‘machine learning’ in the computing literature. The second stage examines sample variance and the degree of compositional similarity so that sample equivalence or otherwise can be established on a statistical basis. This stage may involve discriminant function analysis (DFA), support vector machines (SVMs), canonical variates analysis (CVA), and ANOVA or MANOVA (or its two-sample special case, the Hotelling two-sample TÂČ test). Randomization tests can be used where distributional assumptions such as multivariate normality underlying parametric tests are doubtful. Compositional data may be transformed and scaled before being subjected to multivariate statistical procedures including calculation of distance matrices, hierarchical cluster analysis, and PCA. Such transformations may make the assumption of multivariate normality more appropriate. A sequential procedure using Mahalanobis distance and the Hotelling two-sample TÂČ test is illustrated using glass major element data from trachytic to phonolitic Kenyan tephras. All these methods require a broad range of high-quality compositional data which can be used to compare ‘unknowns’ with reference (training) sets that are sufficiently complete to account for all possible correlatives, including tephras with heterogeneous glasses that contain multiple compositional groups. Currently, incomplete databases are tending to limit correlation efficacy. The development of an open, online global database to facilitate progress towards integrated, high-quality tephrostratigraphic frameworks for different regions is encouraged
    • 

    corecore