
    On Shapley Value in Data Assemblage Under Independent Utility

    In many applications, an organization may want to acquire data from many data owners. Data marketplaces allow data owners to form coalitions and produce the assembled data sets needed by data buyers. To encourage such coalitions, it is critical to allocate revenue to data owners fairly, according to their contributions. Although Shapley fairness and its alternatives have been well explored in the literature to facilitate revenue allocation in data assemblage, computing the exact Shapley value for many data owners and large assembled data sets remains challenging due to the combinatorial nature of the Shapley value. In this paper, we explore the decomposability of utility in data assemblage by formulating the independent utility assumption. We argue that independent utility holds in many applications. Moreover, we identify interesting properties of independent utility and develop fast techniques for computing the exact Shapley value under independent utility. Our experimental results on a series of benchmark data sets show that our new approach not only guarantees the exactness of the Shapley value but also achieves faster computation by orders of magnitude. Comment: Accepted by VLDB 202
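    To make the combinatorial blow-up concrete, here is a minimal sketch contrasting brute-force Shapley computation with a per-record decomposition under one simple reading of independent utility: the coalition's utility is a sum of per-record values, and a record is available as soon as any owner holding it joins the coalition. Under that assumption, non-holders are null players and the k holders of a record split its value equally. This is an illustration only; the paper's actual formulation and algorithms may differ, and all names below are illustrative.

        import math
        from itertools import permutations

        def shapley_bruteforce(owners, utility):
            """Exact Shapley values by averaging marginal contributions over all
            n! orderings of owners; infeasible beyond a handful of owners."""
            values = {o: 0.0 for o in owners}
            for order in permutations(owners):
                coalition, prev = set(), 0.0
                for o in order:
                    coalition.add(o)
                    cur = utility(frozenset(coalition))
                    values[o] += cur - prev
                    prev = cur
            n_fact = math.factorial(len(owners))
            return {o: v / n_fact for o, v in values.items()}

        def shapley_independent(records):
            """Per-record decomposition under the assumed independent utility:
            each record's value is split equally among the owners holding it."""
            values = {}
            for value, holders in records:
                for o in holders:
                    values[o] = values.get(o, 0.0) + value / len(holders)
            return values

        # One record worth 6.0 held by owners A and B; owner C holds nothing.
        records = [(6.0, ["A", "B"])]
        def utility(coalition):
            return sum(v for v, holders in records if any(o in coalition for o in holders))
        print(shapley_bruteforce(["A", "B", "C"], utility))  # {'A': 3.0, 'B': 3.0, 'C': 0.0}
        print(shapley_independent(records))                  # {'A': 3.0, 'B': 3.0}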

    Histogram techniques for cost estimation in query optimization.

    Yu Xiaohui. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 98-115). Abstracts in English and Chinese. Contents:
    Chapter 1 - Introduction
    Chapter 2 - Related Work: Query Optimization; Query Rewriting (Optimizing Multi-Block Queries, Semantic Query Optimization, Query Rewriting in Starburst); Plan Generation (Dynamic Programming Approach, Join Query Processing, Queries with Aggregates); Statistics and Cost Estimation; Histogram Techniques (Definitions, Trivial Histograms, Heuristic-based Histograms, V-Optimal Histograms, Wavelet-based Histograms, Multidimensional Histograms, Global Histograms)
    Chapter 3 - New Histogram Techniques: Piecewise Linear Histograms (Construction, Usage, Error Measures, Experiments, Conclusion); A-Optimal Histograms (A-Optimal(mean), A-Optimal(median), A-Optimal(median-cf) Histograms, Experiments)
    Chapter 4 - Global Histograms: Wavelet-based Global Histograms I and II; Piecewise Linear Global Histograms; A-Optimal Global Histograms (Experiments)
    Chapter 5 - Dynamic Maintenance: Problem Definition; Refining Bucket Coefficients; Restructuring; Experiments
    Chapter 6 - Conclusions
    Bibliography
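    Since only the table of contents survives here, the following is a generic illustration of what such histograms are for: estimating the selectivity of a range predicate during cost-based query optimization. The sketch uses a plain equi-width histogram, not the thesis's piecewise-linear or A-optimal variants, and all names are illustrative.

        import random

        class EquiWidthHistogram:
            """Equi-width histogram over a numeric column; estimates the selectivity
            of a range predicate assuming values are uniform within each bucket."""

            def __init__(self, values, num_buckets=10):
                self.lo, self.hi = min(values), max(values)
                self.width = (self.hi - self.lo) / num_buckets or 1.0
                self.counts = [0] * num_buckets
                for v in values:
                    idx = min(int((v - self.lo) / self.width), num_buckets - 1)
                    self.counts[idx] += 1

            def estimate_range(self, a, b):
                """Estimated number of rows with a <= value <= b."""
                est = 0.0
                for i, c in enumerate(self.counts):
                    b_lo = self.lo + i * self.width
                    b_hi = b_lo + self.width
                    overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
                    est += c * (overlap / self.width)
                return est

        # Example: estimated selectivity of "salary BETWEEN 40000 AND 60000".
        salaries = [random.gauss(50000, 15000) for _ in range(10000)]
        h = EquiWidthHistogram(salaries, num_buckets=20)
        print(h.estimate_range(40000, 60000) / len(salaries))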

    Using crowdsourced geospatial data to aid in nuclear proliferation monitoring

    In 2014, a Defense Science Board Task Force was convened to assess and explore new technologies that would aid in nuclear proliferation monitoring. One of its recommendations was for the Director of National Intelligence to explore ways that crowdsourced geospatial imagery technologies could aid existing governmental efforts. Our research builds directly on this recommendation and provides feedback on some of the most successful examples of crowdsourced geospatial data (CGD). As of 2016, Special Operations Command (SOCOM) has assumed the new role of primary U.S. agency responsible for counter-proliferation. Historically, this institution has always relied on other organizations for the execution of its myriad mission sets. SOCOM's unique ability to build relationships makes it particularly suited to the task of harnessing CGD technologies and employing them in the capacity that our research recommends. Furthermore, CGD is a low-cost, high-impact tool that is already being employed by commercial companies and non-profit groups around the world. By employing CGD, a wider whole-of-government effort can be created that provides a long-term, cohesive engagement plan for facilitating a multi-faceted nuclear proliferation monitoring process. http://archive.org/details/usingcrowdsource1094551570 Major, United States Army. Approved for public release; distribution is unlimited.

    Conceptual model for usable multi-modal mobile assistance during Umrah

    Performing Umrah is highly demanding and takes place in very crowded environments. In response, many efforts have been initiated to overcome the difficulties faced by pilgrims. However, those efforts focus on providing initial perspective and background knowledge before pilgrims travel to Mecca. Findings of a preliminary study show that those efforts do not support multi-modality for user interaction. Nowadays, the computational capabilities of mobile phones enable them to serve people in various aspects of daily life, and mobile phone penetration has increased dramatically in the last decade. Hence, this study aims to propose a comprehensive conceptual model for usable multimodal mobile assistance during Umrah, called Multi-modal Mobile Assistance during Umrah (MMA-U). Four (4) supporting objectives are formulated, and the Design Science Research Methodology has been adopted. For the usability of MMA-U, a Systematic Literature Review (SLR) indicates ten (10) attributes: usefulness, error rate, simplicity, reliability, ease of use, safety, flexibility, accessibility, attitude, and acceptability. Meanwhile, the content and comparative analysis result in five (5) components that construct the conceptual model of MMA-U: structural, content composition, design principles, development approach, technology, and the design and usability theories. The MMA-U was then reviewed and well received by 15 experts. Later, the MMA-U was incorporated into a prototype called Personal Digital Mutawwif (PDM), which was developed for the purpose of user testing in the field. The findings indicate that PDM facilitates the execution of Umrah and successfully meets pilgrims' needs and expectations. The pilgrims were satisfied, felt that they needed PDM, and would recommend it to their friends, which indicates that using PDM is safe and suitable while performing Umrah. In conclusion, the theoretical contribution, the conceptual model of MMA-U, provides guidelines for developing multimodal mobile applications to assist pilgrims during Umrah.

    Natural language interface to relational database: a simplified customization approach

    Natural language interfaces to databases (NLIDB) allow end-users with no knowledge of a formal language like SQL to query databases. One of the main open problems currently investigated is the development of NLIDB systems that are easily portable across several domains. The present study focuses on the development and evaluation of methods that simplify the customization of NLIDBs targeting relational databases without sacrificing coverage and accuracy. This goal is approached through the introduction of two authoring frameworks that aim to reduce the workload required to port an NLIDB to a new domain. The first authoring approach, called top-down, assumes the existence of a corpus of unannotated natural language sample questions used to pre-harvest key lexical terms and thereby simplify customization. The top-down approach further reduces the configuration workload by automatically including in the configuration model the semantics for the negative forms of verbs and the comparative and superlative forms of adjectives. The second authoring approach, called bottom-up, explores the possibility of building a configuration model with no manual customization, using only the information from the database schema and an off-the-shelf dictionary; a minimal sketch of this idea is given below. The evaluation of the prototype system on geo-query, a benchmark query corpus, has shown that the top-down approach significantly reduces the customization workload: 93% of the entries defining the meaning of verbs and adjectives, which represent the hard part of the work, were automatically generated by the system; only 26 straightforward mappings and 3 manual definitions of meaning were required for customization. The top-down approach correctly answered 74.5% of the questions. The bottom-up approach, however, correctly answered only 1/3 of the questions, due to an insufficient lexicon and missing semantics; the use of an external lexicon did not improve the system's accuracy. The bottom-up model nevertheless correctly answered 3/4 of the 105 simple retrieval questions in the query corpus that do not require nesting. Therefore, the bottom-up approach can be useful for building an initial lightweight configuration model that can be incrementally refined, for example by using the failed queries to train a top-down model. The experimental results for the top-down approach suggest that it is indeed possible to construct a portable NLIDB that reduces the configuration effort while maintaining decent coverage and accuracy.
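    The sketch below illustrates only the bottom-up idea: derive a keyword lexicon with no manual customization directly from table and column names, then map question tokens to schema elements. The schema and function names are hypothetical (the thesis evaluates on the geo-query corpus), and the real system's parsing, semantic composition, and off-the-shelf dictionary are not shown.

        import re

        # Hypothetical schema for illustration only.
        SCHEMA = {
            "state": ["name", "population", "area", "capital"],
            "city":  ["name", "population", "state_name"],
            "river": ["name", "length", "traverses"],
        }

        def build_lexicon(schema):
            """Bottom-up idea: derive a keyword -> schema-element lexicon directly
            from table and column names, with no manual customization."""
            lexicon = {}
            for table, columns in schema.items():
                lexicon[table] = ("table", table)
                for col in columns:
                    for token in col.split("_"):      # e.g. "state_name" -> "state", "name"
                        lexicon.setdefault(token, ("column", f"{table}.{col}"))
            return lexicon

        def match_question(question, lexicon):
            """Map question keywords to schema elements; a full NLIDB would feed
            these matches into a parser that assembles an SQL query."""
            tokens = re.findall(r"[a-z]+", question.lower())
            return [(t, lexicon[t]) for t in tokens if t in lexicon]

        lex = build_lexicon(SCHEMA)
        print(match_question("What is the population of the largest state?", lex))
        # -> [('population', ('column', 'state.population')), ('state', ('table', 'state'))]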

    Personalized large scale classification of public tenders on Hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders based on state-of-the-art Hadoop big data technology. The objective was to identify with high recall the public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can readily take advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
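    The abstract relies on Bi-Normal Separation (BNS), Forman's feature-scoring measure for text classification, which scores a term as |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF, tpr is the fraction of positive documents containing the term, and fpr the fraction of negatives. Below is a minimal scoring sketch only (using SciPy); the thesis's actual Hadoop/Pig/Mahout pipeline is not reproduced, and the clamping constant is a common practical choice rather than a value taken from the thesis.

        from scipy.stats import norm

        def bns_score(tp, fp, pos, neg, eps=0.0005):
            """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, with rates clamped
            away from 0 and 1 so the inverse normal CDF stays finite."""
            tpr = min(max(tp / pos, eps), 1 - eps)
            fpr = min(max(fp / neg, eps), 1 - eps)
            return abs(norm.ppf(tpr) - norm.ppf(fpr))

        # Toy counts: a term appears in 80 of 100 relevant tenders
        # and in 50 of 900 irrelevant ones.
        print(bns_score(tp=80, fp=50, pos=100, neg=900))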

    Disjunctively incomplete information in relational databases: modeling and related issues

    In this dissertation, the issues related to information incompleteness in relational databases are explored. The dissertation can be divided into two parts. The first part extends the relational natural join operator and the update operations of insertion and deletion to I-tables, an extended relational model representing inclusively indefinite and maybe information, in a semantically correct manner. Rudimentary or naive algorithms for computing natural joins on I-tables require an exponential number of pair-up operations and block accesses proportional to the size of the I-tables, due to the combinatorial nature of natural joins on I-tables; the problem therefore becomes intractable for large I-tables. This dissertation proposes an algorithm for computing natural joins under the extended model that reduces the number of pair-up operations to linear complexity in general, and to polynomial complexity in the worst case, with respect to the size of the I-tables. In addition, the algorithm reduces the number of block accesses to linear complexity with respect to the size of the I-tables. The second part concerns the modeling aspect of incomplete databases. An extended relational model, called E-table, is proposed. An E-table is capable of representing exclusively disjunctive information, that is, disjunctions of the form P_1 ∥ P_2 ∥ ... ∥ P_n, where ∥ denotes a generalized logical exclusive or indicating that exactly one of the P_i's can be true. The information content of an E-table is precisely defined, and the relational operators of selection, projection, difference, union, intersection, and Cartesian product are extended to E-tables in a semantically correct manner. Conditions under which redundancies can arise due to the presence of exclusively disjunctive information are characterized, and a procedure for resolving redundancies is presented. Finally, the dissertation concludes with a discussion of directions for further research in the area of incomplete information modeling; in particular, a sketch of a relational model, IE-table (Inclusive and Exclusive table), for representing both inclusively and exclusively disjunctive information is provided.
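    A minimal sketch of how exclusively disjunctive values might be represented and queried, with an extended selection that splits its result into sure and maybe answers. This is an illustration only, not the dissertation's formal E-table semantics, and every name below is hypothetical.

        # An exclusively disjunctive value: exactly one candidate is the true value.
        class XorValue:
            def __init__(self, *candidates):
                self.candidates = set(candidates)

            def __repr__(self):
                return " || ".join(map(str, sorted(self.candidates)))

        def select(rows, column, predicate):
            """Extended selection over rows whose fields may be XorValues.
            Returns (sure, maybe): rows that certainly satisfy the predicate vs.
            rows that satisfy it only under some choice of the disjunctive value."""
            sure, maybe = [], []
            for row in rows:
                v = row[column]
                candidates = v.candidates if isinstance(v, XorValue) else {v}
                hits = {c for c in candidates if predicate(c)}
                if hits == candidates:
                    sure.append(row)
                elif hits:
                    maybe.append(row)
            return sure, maybe

        # An employee whose department is known to be exactly one of two values.
        rows = [
            {"name": "alice", "dept": XorValue("sales", "marketing")},
            {"name": "bob",   "dept": "sales"},
        ]
        sure, maybe = select(rows, "dept", lambda d: d == "sales")
        print(sure)   # bob certainly matches
        print(maybe)  # alice matches only if her true department is "sales"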

    Computer-language based data prefetching techniques

    Data prefetching has long been used as a technique to improve access times to persistent data. It is based on retrieving data records from persistent storage into main memory before the records are needed. Data prefetching has been applied to a wide variety of persistent storage systems, from file systems to Relational Database Management Systems and NoSQL databases, with the aim of reducing access times to the data maintained by the system and thus improving the execution times of the applications using this data. However, most existing solutions to data prefetching are based on information that can be retrieved from the storage system itself, whether in the form of heuristics based on the data schema or data access patterns detected by monitoring access to the system. These approaches have several disadvantages in terms of the rigidity of the heuristics they use, the accuracy of the predictions they make, and/or the time they need to make these predictions, a process often performed while the applications are accessing the data, causing considerable overhead. In light of the above, this thesis proposes two novel approaches to data prefetching based on predictions made by analyzing the instructions and statements of the computer languages used to access persistent data. The proposed approaches take into consideration how the data is accessed by the higher-level applications, make accurate predictions, and are performed without causing any additional overhead. The first approach analyzes the instructions of applications written in object-oriented languages in order to prefetch data from Persistent Object Stores. It is based on static code analysis performed prior to application execution and hence adds no overhead; it also includes strategies for dealing with cases that require runtime information unavailable before execution. We integrate this analysis approach into an existing Persistent Object Store and run a series of extensive experiments to measure the improvement obtained by prefetching the objects predicted by the approach. The second approach analyzes statements and historic logs of the declarative query language SPARQL in order to prefetch data from RDF triplestores. It measures two types of similarity between SPARQL queries in order to detect recurring query patterns in the historic logs, then uses the detected patterns to predict subsequent queries and launches them before they are requested, prefetching the data they need. Our evaluation of the proposed approach shows that it makes high-accuracy predictions and can achieve a high cache hit rate when caching the results of the predicted queries.
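    A minimal sketch of the second approach's core idea as described: compare SPARQL queries by their triple patterns, mine a historic log for recurring successor pairs, and predict the likely next query so it can be launched early to warm the cache. Jaccard similarity over crudely extracted patterns is used here as one plausible measure; the thesis defines its own two similarity measures and a full prediction pipeline, and all names and thresholds below are illustrative.

        import re
        from collections import Counter, defaultdict

        def triple_patterns(sparql):
            """Crude extraction of triple patterns from a SPARQL WHERE clause.
            A real system would use a proper SPARQL parser (e.g. rdflib)."""
            body = re.search(r"\{(.*)\}", sparql, re.S)
            return frozenset(p.strip() for p in body.group(1).split(".") if p.strip())

        def jaccard(a, b):
            """Structural similarity between two queries' triple-pattern sets."""
            return len(a & b) / len(a | b) if a | b else 0.0

        def build_successor_model(query_log):
            """Count, over a historic log, which query (pattern set) follows which."""
            successors = defaultdict(Counter)
            patterns = [triple_patterns(q) for q in query_log]
            for prev, nxt in zip(patterns, patterns[1:]):
                successors[prev][nxt] += 1
            return successors

        def predict_next(current_query, successors, threshold=0.8):
            """Find the most similar previously seen query and return its most
            frequent successor; the caller would launch that query ahead of time."""
            cur = triple_patterns(current_query)
            best = max(successors, key=lambda p: jaccard(cur, p), default=None)
            if best is not None and jaccard(cur, best) >= threshold and successors[best]:
                return successors[best].most_common(1)[0][0]
            return None

        # Tiny demo log: an "author" lookup is repeatedly followed by a "books" lookup.
        log = ["SELECT ?p WHERE { ?s a :Author . ?s :name ?p }",
               "SELECT ?b WHERE { ?b :writtenBy ?s . ?s :name ?p }"] * 3
        model = build_successor_model(log)
        print(predict_next("SELECT ?name WHERE { ?s a :Author . ?s :name ?p }", model))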