156 research outputs found

    Compile-Time Query Optimization for Big Data Analytics

    Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. 
    DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
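
    The predicate push-down mentioned in the abstract can be illustrated with a small sketch. This is not DIQL or its optimizer: it is plain Python over lists of dicts, and the names `hash_join`, `join_then_filter`, `filter_then_join`, and the sample tables are assumptions made for illustration. It only shows that filtering one input before an equi-join yields the same rows as filtering after it, while joining fewer rows.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dicts and return the merged rows."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Preserve the order of the left input; unmatched left rows are dropped.
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def join_then_filter(orders, customers):
    """Unoptimized plan: join everything, then apply the predicate."""
    joined = hash_join(orders, customers, "cust_id", "id")
    return [row for row in joined if row["amount"] > 100]


def filter_then_join(orders, customers):
    """Rewritten plan: the predicate only reads `orders`, so apply it first."""
    big_orders = [o for o in orders if o["amount"] > 100]
    return hash_join(big_orders, customers, "cust_id", "id")


# Hypothetical sample data.
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [
    {"cust_id": 1, "amount": 250},
    {"cust_id": 1, "amount": 40},
    {"cust_id": 2, "amount": 120},
    {"cust_id": 3, "amount": 900},  # no matching customer
]
```

    Both plans return the same two rows here; the rewrite is valid because the predicate refers only to columns of one join input.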

    Applicative Bidirectional Programming with Lenses

    A bidirectional transformation is a pair of mappings between source and view data objects, one in each direction. When the view is modified, the source is updated accordingly with respect to some laws. One way to reduce the development and maintenance effort of bidirectional transformations is to have specialized languages in which the resulting programs are bidirectional by construction, giving rise to the paradigm of bidirectional programming. In this paper, we develop a framework for applicative-style and higher-order bidirectional programming, in which we can write bidirectional transformations as unidirectional programs in standard functional languages, opening up access to the bundle of language features previously only available to conventional unidirectional languages. Our framework essentially bridges two very different approaches to bidirectional programming, namely the lens framework and Voigtländer's semantic bidirectionalization, creating a new programming style that is able to bag benefits from both.
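
    As a rough illustration of the lens side of this picture, the sketch below models a lens as a `get`/`put` pair and checks the two standard well-behavedness laws (GetPut and PutGet) on a sample value. It is a minimal Python sketch, not the paper's applicative framework; the class `Lens` and the helpers `first`, `second`, `get_put_holds`, and `put_get_holds` are names assumed here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Lens:
    """A lens: `get` extracts a view, `put` writes a view back into a source."""
    get: Callable[[Any], Any]
    put: Callable[[Any, Any], Any]

    def compose(self, inner: "Lens") -> "Lens":
        """Focus with `self` first, then with `inner` inside that view."""
        return Lens(
            get=lambda s: inner.get(self.get(s)),
            put=lambda s, v: self.put(s, inner.put(self.get(s), v)),
        )


def first() -> Lens:
    """Lens onto the first component of a pair."""
    return Lens(get=lambda p: p[0], put=lambda p, v: (v, p[1]))


def second() -> Lens:
    """Lens onto the second component of a pair."""
    return Lens(get=lambda p: p[1], put=lambda p, v: (p[0], v))


def get_put_holds(lens: Lens, source: Any) -> bool:
    """GetPut: putting back an unmodified view leaves the source unchanged."""
    return lens.put(source, lens.get(source)) == source


def put_get_holds(lens: Lens, source: Any, view: Any) -> bool:
    """PutGet: after putting a view, getting returns exactly that view."""
    return lens.get(lens.put(source, view)) == view
```

    For example, `first().compose(second())` focuses on the `2` inside `((1, 2), 3)`, and writing `9` through it yields `((1, 9), 3)`. The law checks test single inputs only; they do not prove the laws in general.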

    Translation of Array-based Loop Programs to Optimized SQL-based Distributed Programs

    Many data analysis programs are expressed in terms of array operations in sequential loops. However, these programs do not scale well to large amounts of data that cannot fit in the memory of a single computer, and they have to be rewritten to work on Big Data analysis platforms, such as Map-Reduce and Spark. We present a novel framework, called SQLgen, that automatically translates sequential loops on arrays to distributed data-parallel programs, specifically Spark SQL programs. We further extend this framework by introducing OSQLgen, which automatically parallelizes array-based loop programs to distributed data-parallel programs on block arrays. First, our framework translates the sequential loops on arrays to monoid comprehensions and then to Spark SQL. For SQLgen, the SQL is over coordinate arrays, while for OSQLgen, it is over block arrays. As block arrays are more compact than coordinate arrays, computations on block matrices are significantly faster than on arrays in the coordinate format. Since not all array-based loops can be translated to SQL on block arrays, we focus on certain patterns of loops that match an algebraic structure known as a semiring. Many linear algebra operations, such as matrix multiplication required in many machine learning algorithms, as well as many graph programs that are equivalent to a semiring can be translated to distributed data-parallel programs on block arrays using OSQLgen, thus giving us a substantial performance gain. Finally, to evaluate our framework, we compare the performance of OSQLgen with GraphX, GraphFrames, MLlib, and hand-written Spark SQL programs on coordinate and block arrays on various real-world problems.
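
    To make the semiring idea concrete, the following sketch multiplies two sparse matrices stored in coordinate format (a dict from `(row, column)` to value), parameterized by a semiring. It is a single-machine Python illustration, not SQLgen or OSQLgen output; `Semiring`, `PLUS_TIMES`, `MIN_PLUS`, and `matmul` are names assumed here. The loop mirrors a join on the shared index followed by a group-by aggregation, which is roughly the shape such a computation takes in SQL.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Coord = Dict[Tuple[int, int], Any]  # coordinate format: (row, col) -> value


@dataclass(frozen=True)
class Semiring:
    zero: Any                        # identity of `add`
    add: Callable[[Any, Any], Any]   # aggregation operator
    mul: Callable[[Any, Any], Any]   # combination operator


# Ordinary arithmetic: standard matrix multiplication.
PLUS_TIMES = Semiring(zero=0, add=lambda a, b: a + b, mul=lambda a, b: a * b)
# Tropical semiring: one step of shortest-path relaxation.
MIN_PLUS = Semiring(zero=float("inf"), add=min, mul=lambda a, b: a + b)


def matmul(a: Coord, b: Coord, sr: Semiring) -> Coord:
    """Compute C[i, j] = add over k of mul(A[i, k], B[k, j])."""
    # Index B by row so that entries with a matching k are found directly
    # (the "join" on the shared index).
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))

    out: Coord = {}
    for (i, k), x in a.items():
        for j, y in b_by_row.get(k, ()):
            # The "group by (i, j)" aggregation, starting from the identity.
            out[(i, j)] = sr.add(out.get((i, j), sr.zero), sr.mul(x, y))
    return out
```

    With `PLUS_TIMES` this is ordinary sparse matrix multiplication; with `MIN_PLUS`, multiplying an edge-weight matrix by itself gives the cheapest two-edge path between each pair of nodes.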

    How functional programming mattered

    In 1989, when functional programming was still considered a niche topic, Hughes wrote a visionary paper arguing convincingly ‘why functional programming matters’. More than two decades have passed. Has functional programming really mattered? Our answer is a resounding ‘Yes!’. Functional programming is now at the forefront of a new generation of programming technologies, and enjoying increasing popularity and influence. In this paper, we review the impact of functional programming, focusing on how it has changed the way we may construct programs, the way we may verify programs, and fundamentally the way we may think about programs.

    Transcriptional regulation of Elf-1: locus-wide analysis reveals four distinct promoters, a tissue-specific enhancer, control by PU.1 and the importance of Elf-1 downregulation for erythroid maturation

    Ets transcription factors play important roles during the development and maintenance of the haematopoietic system. One such factor, Elf-1 (E74-like factor 1) controls the expression of multiple essential haematopoietic regulators including Scl/Tal1, Lmo2 and PU.1. However, to integrate Elf-1 into the wider regulatory hierarchies controlling haematopoietic development and differentiation, regulatory elements as well as upstream regulators of Elf-1 need to be identified. Here, we have used locus-wide comparative genomic analysis coupled with chromatin immunoprecipitation (ChIP-chip) assays which resulted in the identification of five distinct regulatory regions directing expression of Elf-1. Further, ChIP-chip assays followed by functional validation demonstrated that the key haematopoietic transcription factor PU.1 is a major upstream regulator of Elf-1. Finally, overexpression studies in a well-characterized erythroid differentiation assay from primary murine fetal liver cells demonstrated that Elf-1 downregulation is necessary for terminal erythroid differentiation. Given the known activation of PU.1 by Elf-1 and our newly identified reciprocal activation of Elf-1 by PU.1, identification of an inhibitory role for Elf-1 has significant implications for our understanding of how PU.1 controls myeloid–erythroid differentiation. Our findings therefore not only represent the first report of Elf-1 regulation but also enhance our understanding of the wider regulatory networks that control haematopoiesis

    A novel G-quadruplex-forming GGA repeat region in the c-myb promoter is a critical regulator of promoter activity

    The c-myb promoter contains multiple GGA repeats beginning 17 bp downstream of the transcription initiation site. GGA repeats have been previously shown to form unusual DNA structures in solution. Results from chemical footprinting, circular dichroism and RNA and DNA polymerase arrest assays on oligonucleotides representing the GGA repeat region of the c-myb promoter demonstrate that the element is able to form tetrad:heptad:heptad:tetrad (T:H:H:T) G-quadruplex structures by stacking two tetrad:heptad G-quadruplexes formed by two of the three (GGA)4 repeats. Deletion of one or two (GGA)4 motifs destabilizes this secondary structure and increases c-myb promoter activity, indicating that the G-quadruplexes formed in the c-myb GGA repeat region may act as a negative regulator of the c-myb promoter. Complete deletion of the c-myb GGA repeat region abolishes c-myb promoter activity, indicating dual roles of the c-myb GGA repeat element as both a transcriptional repressor and an activator. Furthermore, we demonstrated that Myc-associated zinc finger protein (MAZ) represses c-myb promoter activity and binds to the c-myb T:H:H:T G-quadruplexes. Our findings show that the T:H:H:T G-quadruplex-forming region in the c-myb promoter is a critical cis-acting element and may repress c-myb promoter activity through MAZ interaction with G-quadruplexes in the c-myb promoter

    EZH2 modulates angiogenesis in vitro and in a mouse model of limb ischemia

    Epigenetic mechanisms may regulate the expression of pro-angiogenic genes, thus affecting reparative angiogenesis in ischemic limbs. The enhancer of zeste homolog-2 (EZH2) induces the trimethylation of lysine 27 on histone H3 (H3K27me3), which represses gene transcription. We explored (i) if EZH2 expression is regulated by hypoxia and ischemia; (ii) the impact of EZH2 on the expression of two pro-angiogenic genes: eNOS and BDNF; (iii) the functional effect of EZH2 inhibition on cultured endothelial cells (ECs); (iv) the therapeutic potential of EZH2 inhibition in a mouse model of limb ischemia (LI). EZH2 expression was increased in cultured ECs exposed to hypoxia (control: normoxia) and in ECs extracted from mouse ischemic limb muscles (control: absence of ischemia). EZH2 increased the H3K27me3 abundance on regulatory regions of eNOS and BDNF promoters. In vitro RNA silencing or pharmacological inhibition by 3-deazaneplanocin (DZNep) of EZH2 increased eNOS and BDNF mRNA and protein levels and enhanced functional capacities (migration, angiogenesis) of ECs under either normoxia or hypoxia. In mice with experimentally induced LI, DZNep increased angiogenesis in ischemic muscles, the circulating levels of pro-angiogenic hematopoietic cells and blood flow recovery. Targeting EZH2 for inhibition may open new therapeutic avenues for patients with limb ischemia.

    Using the Parametricity Theorem for Program Fusion

    Program fusion techniques have long been proposed as an effective means of improving program performance and of eliminating unnecessary intermediate data structures. This paper proposes a new approach to program fusion that is based entirely on the type signatures of programs. First, for each function, a recursive skeleton is extracted that captures its pattern of recursion. Then, the parametricity theorem of this skeleton is derived, which provides a rule for fusing this function with any function. This method generalizes other approaches that use fixed parametricity theorems to fuse programs.
    1 Introduction
    There has been much recent work on using higher-order operators, such as fold [11] and build [8, 5], to automate program fusion [2] and deforestation [13]. Even though these methods do a good job of fusing programs, they are only effective if programs are expressed in terms of these operators. This limits their applicability to conventional functional languages. To ameliorate this pr..

    Supporting Bulk Synchronous Parallelism in Map-Reduce Queries

    One of the major drawbacks of the Map-Reduce (MR) model is that, to simplify reliability and fault tolerance, it does not preserve data in memory across consecutive MR jobs: an MR job must dump its data to the distributed file system before they can be read by the next MR job. This restriction imposes a high overhead on complex MR workflows and graph algorithms, such as PageRank, which require repetitive MR jobs. The Bulk Synchronous Parallelism (BSP) programming model, on the other hand, has been recently advocated as an alternative to the MR model that does not suffer from this restriction and, under certain circumstances, allows complex repetitive algorithms to run entirely in the collective memory of a cluster. We present a framework for translating complex declarative queries for scientific and graph data analysis applications to both MR and BSP evaluation plans, leaving the choice to be made at run-time based on the available resources. If the resources are sufficient, the query will be evaluated entirely in memory based on the BSP model; otherwise, the same query will be evaluated based on the MR model.