
    Putting Lipstick on Pig: Enabling Database-Style Workflow Provenance

    Workflow provenance typically assumes that each module is a “black box”, so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut) and supporting “what-if” workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.
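
    To make the coarse- vs. fine-grained distinction concrete, the following is a minimal Python sketch of the idea behind ZoomOut and ZoomIn: zooming out collapses a module to a black box where every output depends on all inputs, while zooming in exposes per-output dependencies. All names and the data layout are invented for illustration; they do not reflect Lipstick's actual representation or API.

        # Fine-grained view: each output records only the inputs it truly
        # depends on (hypothetical toy data, not Lipstick's data model).
        fine_grained = {
            "out1": {"in1"},           # out1 depends on in1 only
            "out2": {"in2", "in3"},    # out2 depends on in2 and in3
        }
        MODULE_INPUTS = {"in1", "in2", "in3"}

        def zoom_out(fine, module_inputs):
            """Collapse the module to a black box: every output depends on all inputs."""
            return {out: set(module_inputs) for out in fine}

        def zoom_in(fine):
            """Expose the module's internal fine-grained dependencies."""
            return dict(fine)

        print(zoom_in(fine_grained))                  # per-output dependencies
        print(zoom_out(fine_grained, MODULE_INPUTS))  # black-box view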

    Scalable Querying of Nested Data

    While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions. In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
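
    As a rough illustration of the general shredding idea, here is a toy Python sketch that splits a nested collection into flat relations linked by synthetic labels, so each relation can be partitioned independently (which is what makes skew handling for inner bags possible). The encoding and names are invented for illustration; the framework's actual representation and query compilation are more sophisticated.

        nested = [
            {"dept": "cs",   "employees": [{"name": "a"}, {"name": "b"}]},
            {"dept": "math", "employees": [{"name": "c"}]},
        ]

        def shred(collection, inner_key):
            """Split a nested collection into a top-level and an inner flat relation."""
            top, inner = [], []
            for label, rec in enumerate(collection):
                flat = {k: v for k, v in rec.items() if k != inner_key}
                flat[inner_key] = label              # nested bag replaced by a label
                top.append(flat)
                for child in rec[inner_key]:
                    inner.append({"label": label, **child})
            return top, inner

        top, inner = shred(nested, "employees")
        print(top)    # [{'dept': 'cs', 'employees': 0}, {'dept': 'math', 'employees': 1}]
        print(inner)  # [{'label': 0, 'name': 'a'}, {'label': 0, 'name': 'b'},
                      #  {'label': 1, 'name': 'c'}]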

    Approximate Data Analytics Systems

    Today, most modern online services use big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency is often impractical because the data grows exponentially, even faster than Moore’s law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than an exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a subset of the input data instead of the entire dataset. Unfortunately, advances in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient resource utilization. To achieve these goals, we designed and built the following approximate data analytics systems:
    • StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of input data streams. For this system, we designed an online adaptive stratified reservoir sampling algorithm that produces approximate output with bounded error.
    • IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox: a data stream analytics system for privacy-preserving and approximate computing. It supports high-utility, low-latency data analytics while preserving users’ privacy, by combining privacy-preserving data analytics with approximate computing.
    • ApproxJoin: an approximate distributed join system. It improves the performance of joins, critical but expensive operations in big data systems. For this system, we employed a sketching technique (Bloom filters) to avoid shuffling non-joinable data items through the network, and proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output.
    Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems achieve significant performance speedups over state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems let users systematically trade off accuracy against throughput/latency, and they require no or only minor modifications to existing applications.
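
    For intuition, here is a simplified, fixed-size Python sketch of stratified reservoir sampling, the core mechanism the abstract attributes to StreamApprox. The online adaptivity and error-bound machinery of the actual algorithm are omitted, and all names and data are invented for illustration.

        import random
        from collections import defaultdict

        def stratified_reservoir(stream, key, per_stratum=100):
            """Keep a fixed-size uniform reservoir per stratum of the stream."""
            reservoirs = defaultdict(list)   # stratum -> sampled items
            seen = defaultdict(int)          # stratum -> items observed so far
            for item in stream:
                s = key(item)
                seen[s] += 1
                if len(reservoirs[s]) < per_stratum:
                    reservoirs[s].append(item)
                else:
                    # classic reservoir step: keep item with prob per_stratum/seen[s]
                    j = random.randrange(seen[s])
                    if j < per_stratum:
                        reservoirs[s][j] = item
            return reservoirs, seen

        # Toy usage: stratify a (sensor_id, value) stream by sensor id. Scaling a
        # per-stratum sample sum by seen[s] / len(reservoirs[s]) gives an unbiased
        # estimate of that stratum's true total.
        stream = [("a", random.random()) for _ in range(1000)] + \
                 [("b", random.random()) for _ in range(10)]
        samples, counts = stratified_reservoir(stream, key=lambda t: t[0], per_stratum=5)

    Stratifying before sampling is what protects rare substreams (like sensor "b" above): a single global reservoir would almost never retain their items, whereas a per-stratum reservoir guarantees each substream is represented.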

    Metadata-driven data integration

    Joint doctorate (cotutelle) between Universitat Politècnica de Catalunya and Université Libre de Bruxelles, IT4BI-DC programme for the joint Ph.D. degree in computer science.
    Data has an undeniable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we are now witnessing a shift towards huge and heterogeneous amounts of data: indeed, 90% of the data in the world has been generated in the last two years. Thus, to carry out these data exploitation tasks, organizations must first perform data integration, combining data from multiple sources to yield a unified view over them. Yet, integrating massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings. This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of the shared metadata repository. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities. We begin by proposing a software reference architecture for semantic-aware data-intensive systems. This architecture serves as a blueprint for deploying a stack of systems, with the metadata repository at its core. Next, we propose a graph-based metadata model as a formalism for metadata management, focusing on support for schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the proposed metadata model; we additionally consider semantic heterogeneities in the data sources, which the algorithms are capable of resolving automatically. Finally, the thesis focuses on the materialized integration activity and, to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis contribute to the field of data integration in contemporary data-intensive ecosystems.
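
    As a rough illustration of how a shared metadata repository can drive virtual integration, here is a toy Python sketch in which a graph of attribute mappings rewrites a unified-view projection into per-source query fragments. The model, names, and rewriting logic are invented for illustration and are far simpler than the metadata model and algorithms proposed in the thesis.

        # Toy metadata repository: unified-view attributes mapped to the
        # source attributes that provide them (all names are hypothetical).
        metadata_graph = {
            "customer.name":  [("crm_db", "clients.full_name"),
                               ("web_logs", "user.display_name")],
            "customer.email": [("crm_db", "clients.email")],
        }

        def rewrite(unified_attrs):
            """Rewrite a unified-view projection into per-source query fragments."""
            fragments = {}
            for attr in unified_attrs:
                for source, src_attr in metadata_graph.get(attr, []):
                    fragments.setdefault(source, []).append(src_attr)
            return fragments

        print(rewrite(["customer.name", "customer.email"]))
        # {'crm_db': ['clients.full_name', 'clients.email'],
        #  'web_logs': ['user.display_name']}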

    IMPRoving Outcomes for children exposed to domestic ViolencE (IMPROVE): an evidence synthesis

    Background: Exposure to domestic violence and abuse (DVA) during childhood and adolescence increases the risk of negative outcomes across the lifespan.
    Objectives: To synthesise evidence on the clinical effectiveness, cost-effectiveness and acceptability of interventions for children exposed to DVA, with the aim of making recommendations for further research.
    Design: (1) A systematic review of controlled trials of interventions; (2) a systematic review of qualitative studies of participant and professional experience of interventions; (3) a network meta-analysis (NMA) of controlled trials and cost-effectiveness analysis; (4) an overview of current UK provision of interventions; and (5) consultations with young people, parents, service providers and commissioners.
    Settings: North America (11 studies), the Netherlands (1) and Israel (1) for the review of controlled trials; the USA (4) and the UK (1) for the review of qualitative studies; and the UK for the overview of current provision and the consultations.
    Participants: A total of 1345 children for the review of controlled trials; 100 children, 202 parents and 39 professionals for the review of qualitative studies; and 16 young people, six parents and 20 service providers and commissioners for the consultations.
    Interventions: Psychotherapeutic; advocacy; parenting skills and advocacy; psychoeducation; psychoeducation and advocacy; guided self-help.
    Main outcome measures: Internalising symptoms and externalising behaviour, mood, depression symptoms and diagnosis, post-traumatic stress disorder symptoms and self-esteem for the review of controlled trials and the NMA; views about and experience of interventions for the review of qualitative studies and the consultations.
    Data sources: MEDLINE, Cumulative Index to Nursing and Allied Health Literature, PsycINFO, EMBASE, Cochrane Central Register of Controlled Trials, Science Citation Index, Applied Social Sciences Index and Abstracts, International Bibliography of the Social Sciences, Social Services Abstracts, Social Care Online, Sociological Abstracts, Social Science Citation Index, the World Health Organization trials portal and clinicaltrials.gov.
    Review methods: A narrative review; a NMA and incremental cost-effectiveness analysis; and a qualitative synthesis.
    Results: The evidence base on targeted interventions was small, with limited settings and types of interventions; children were mostly < 14 years of age, and there was an absence of comparative studies. The interventions evaluated in trials were mostly psychotherapeutic and psychoeducational interventions delivered to the non-abusive parent and child, usually based on the child’s exposure to DVA (not specific clinical or broader social needs). Qualitative studies largely focused on psychoeducational interventions, some of which included the abusive parent. The evidence for clinical effectiveness was as follows: 11 trials reported improvements in behavioural or mental health outcomes, with modest effect sizes but significant heterogeneity and high or unclear risk of bias. Psychoeducational group-based interventions delivered to the child were found to be more effective for improving mental health outcomes than other types of intervention. Interventions delivered to (non-abusive) parents and to children were most likely to be effective for improving behavioural outcomes. However, there is a large degree of uncertainty around comparisons, particularly with regard to mental health outcomes. In terms of evidence of cost-effectiveness, there were no economic studies of interventions; cost-effectiveness was modelled on the basis of the NMA, estimating differences between types of interventions. The outcomes measured in trials were largely confined to children’s mental health and behavioural symptoms and disorders, although stakeholders’ concepts of success were broader, suggesting that a wider range of outcomes should be measured in trials. Group-based psychoeducational interventions delivered to children and non-abusive parents in parallel were largely acceptable to all stakeholders; there is limited evidence for the acceptability of other types of intervention. In terms of the UK evidence base and service delivery landscape, there were no UK-based trials, few qualitative studies and little widespread service evaluation. Most programmes are group-based psychoeducational interventions; however, the funding crisis in the DVA sector is significantly undermining programme delivery.
    Conclusions: The evidence base regarding the acceptability, clinical effectiveness and cost-effectiveness of interventions to improve outcomes for children exposed to DVA is underdeveloped. There is an urgent need for more high-quality studies, particularly trials, designed to produce actionable, generalisable findings that can be implemented in real-world settings and that can inform decisions about which interventions to commission and scale. We suggest that there is a need to pause the development of new interventions and to focus on the systematic evaluation of existing programmes. With regard to the UK, we have identified three types of programme that could justifiably be prioritised for further study: psychoeducation delivered to mothers and children, or to children alone; parent skills training in combination with advocacy; and interventions involving the abusive parent/caregiver. We also suggest that key stakeholders need to come together to explicitly identify and address the structural, practical and cultural barriers that may have hampered the development of the UK evidence base to date.
    Future work recommendations: There is a need for well-designed, well-conducted and well-reported UK-based randomised controlled trials with cost-effectiveness analyses and nested qualitative studies. Development of consensus in the field about core outcome data sets is required. There is a need for further exploration of the acceptability and effectiveness of interventions for specific groups of children and young people (i.e. based on ethnicity, age, trauma exposure and clinical profile). There is also a need to investigate the context in which interventions are delivered, including the organisational setting and the broader community context, and to evaluate the qualities, qualifications and disciplines of the personnel delivering interventions. We recommend prioritising psychoeducational interventions and parent skills training delivered in combination with advocacy in the next phase of trials, as well as exploratory trials of interventions that engage both the abusive and the non-abusive parent.
    Study registration: This study is registered as PROSPERO CRD42013004348 and PROSPERO CRD420130043489.
    Funding: The National Institute for Health Research Public Health Research programme.