28 research outputs found

    Scalable dataspace construction

    Get PDF
    The conference aimed to support and stimulate active, productive research that strengthens the technical foundations and skills of engineers and scientists on the continent, leading to new small and medium enterprises within the African sub-continent. It also sought to encourage the emergence of functionally skilled technocrats within the continent. This paper proposes the design and implementation of scalable dataspaces based on efficient data structures. Dataspaces are likely to exhibit a multidimensional structure due to the unpredictable neighbour relationships between participants, coupled with the continuous exponential growth of data. Layered range trees are incorporated into the proposed solution as multidimensional binary trees used to perform d-dimensional orthogonal range indexing and searching. Furthermore, the solution is readily extensible to multiple dimensions, raising the possibility of volume searches and even extension to attribute space. We begin with a study of the relevant literature and existing dataspace designs. A scalable design and implementation is then presented. Finally, we conduct an experimental evaluation to demonstrate the performance of the proposed techniques. The design of a scalable dataspace is important in order to bridge the gap resulting from the lack of coexistence of data entities in the spatial domain, as a key milestone towards pay-as-you-go systems integration.
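
    The layered range tree mentioned above is a standard computational-geometry structure; the sketch below shows the non-layered variant of the same idea in Python (a balanced tree on x whose nodes keep y-sorted lists), answering 2-D orthogonal range queries. The point data and class names are illustrative, not taken from the paper, and the fractional-cascading layer that gives the "layered" tree its extra speed-up is omitted.

        import bisect

        class Node:
            """A 2-D range tree node: a BST on x whose every node keeps the
            points of its subtree sorted by y (the associated structure)."""
            def __init__(self, pts):
                pts = sorted(pts)                                # sort by x (then y)
                mid = len(pts) // 2
                self.point = pts[mid]                            # median splits on x
                self.xmin, self.xmax = pts[0][0], pts[-1][0]
                self.by_y = sorted(pts, key=lambda p: p[1])
                self.ys = [p[1] for p in self.by_y]
                self.left = Node(pts[:mid]) if mid else None
                self.right = Node(pts[mid + 1:]) if mid + 1 < len(pts) else None

        def query(node, x1, x2, y1, y2, out):
            """Report all points with x in [x1, x2] and y in [y1, y2]."""
            if node is None or node.xmax < x1 or node.xmin > x2:
                return                                           # subtree outside the x-slab
            if x1 <= node.xmin and node.xmax <= x2:
                # canonical subtree: fully inside the x-slab, binary search on y
                lo = bisect.bisect_left(node.ys, y1)
                hi = bisect.bisect_right(node.ys, y2)
                out.extend(node.by_y[lo:hi])
                return
            if x1 <= node.point[0] <= x2 and y1 <= node.point[1] <= y2:
                out.append(node.point)
            query(node.left, x1, x2, y1, y2, out)
            query(node.right, x1, x2, y1, y2, out)

        points = [(2, 3), (5, 1), (7, 8), (4, 6), (9, 2)]
        hits = []
        query(Node(points), 3, 8, 2, 7, hits)
        print(hits)                                              # [(4, 6)]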

    A best-effort integration framework for imperfect information spaces

    Get PDF
    Entity resolution (ER) with imperfection management has been accepted as a major aspect of integrating heterogeneous information sources whose entities exhibit varied identifiers, abbreviated names, and multi-valued attributes. Many novel integration applications, such as personal information management and web-scale information management, require the ability to represent and manipulate imperfect data. This requirement raises issues that range from starting with imperfect data to producing a probabilistic database. However, the classical data integration (CDI) framework fails to cope with this requirement for explicit imperfect-information management. This paper introduces an alternative integration framework based on a best-effort perspective to support the automation of instance integration. The new framework explicitly incorporates probabilistic management into the ER tasks. The probabilistic management includes a new probabilistic global entity, a new pair-wise source-to-target ER process, and probabilistic decision-model logic as alternatives. The paper presents how these processes operate together to address current challenges in integrating heterogeneous sources.
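
    As a rough illustration of the probabilistic pair-wise source-to-target ER idea (a sketch under assumed names, thresholds, and a simple string comparator, not the framework's actual logic), a matcher can retain every sufficiently plausible target entity with a probability instead of forcing a single merge decision:

        from difflib import SequenceMatcher

        def name_similarity(a, b):
            """Crude string similarity in [0, 1]; real systems would combine
            several attribute-level comparators."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def probabilistic_matches(source_record, target_entities, floor=0.5):
            """Return (target, probability) pairs above a plausibility floor,
            normalised so the retained alternatives sum to 1."""
            scored = [(t, name_similarity(source_record["name"], t["name"]))
                      for t in target_entities]
            kept = [(t, s) for t, s in scored if s >= floor]
            total = sum(s for _, s in kept)
            return [(t, s / total) for t, s in kept] if total else []

        # hypothetical source record and global target entities
        source = {"name": "Jon A. Smith"}
        targets = [{"name": "John Smith"}, {"name": "J. Smith"}, {"name": "Jane Smyth"}]
        for entity, prob in probabilistic_matches(source, targets):
            print(entity["name"], round(prob, 2))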

    Anti-surge Control Technologies of Large-sized Chinese Gas Compression Pump

    Get PDF
    In this paper, we take the gas compression pump (GCP) of Jinan Iron & Steel Co. Ltd.'s gas-steam combined cycle power generation plant as the object of study. This is the first application of a Chinese large-sized gas compression pump in the field of steam-gas combined cycle power generation, and it thus initiates the application of Chinese large-sized gas compression pump technologies. To broaden the pump's relatively narrow working area in which the centrifugal compressor functions stably, and to minimize the surge incurred therein, we established an anti-surge mathematical model and an anti-surge control algorithm applicable to this gas compression pump. Using default settings, we tested the anti-surge safety curve and derived the estimated anti-surge safety curve expected in application; we then obtained the anti-surge safety curve of the gas compression pump by fitting the test results to the estimated curve. In practice, this gas compression pump not only satisfied the control-function requirements but also reached a leading level within its industry in China.
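
    The generic control idea described above can be sketched as follows (illustrative coefficients and signals only; none of the fitted curves or set-points of the Jinan installation are reproduced here): fit a surge line to test points, shift it by a safety margin to obtain the control line, and open a recycle valve proportionally as the operating point approaches that line.

        import numpy as np

        # surge line: minimum safe flow for a given pressure ratio, fitted as a
        # quadratic to anti-surge test points (assumed data)
        test_pressure_ratio = np.array([1.2, 1.6, 2.0, 2.4])
        test_surge_flow     = np.array([40.0, 55.0, 72.0, 92.0])   # m^3/min at surge
        coeffs = np.polyfit(test_pressure_ratio, test_surge_flow, 2)

        SAFETY_MARGIN = 1.10   # control line 10 % to the right of the surge line

        def recycle_valve_opening(flow, pressure_ratio, gain=5.0):
            """Proportional anti-surge action: 0 (closed) .. 1 (fully open)."""
            control_flow = SAFETY_MARGIN * np.polyval(coeffs, pressure_ratio)
            deviation = (control_flow - flow) / control_flow   # > 0 means too close to surge
            return float(np.clip(gain * deviation, 0.0, 1.0))

        print(recycle_valve_opening(flow=70.0, pressure_ratio=2.0))   # near the control line
        print(recycle_valve_opening(flow=95.0, pressure_ratio=2.0))   # safely away, valve stays shut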

    Private Set Generation with Discriminative Information

    Full text link
    Differentially private data generation techniques have become a promising solution to the data privacy challenge -- they enable the sharing of data while complying with rigorous privacy guarantees, which is essential for scientific progress in sensitive domains. Unfortunately, restricted by the inherent complexity of modeling high-dimensional distributions, existing private generative models struggle with the utility of synthetic samples. In contrast to existing works that aim at fitting the complete data distribution, we directly optimize for a small set of samples that are representative of the distribution under the supervision of discriminative information from downstream tasks, which is generally an easier task and more suitable for private training. Our work provides an alternative view for differentially private generation of high-dimensional data and introduces a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches. Comment: NeurIPS 2022, 19 pages
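
    A hedged PyTorch sketch of the general recipe (toy data, assumed clipping and noise settings, and a fixed linear classifier rather than the authors' networks): a small learnable synthetic set is optimised so that its classifier gradients match clipped, noised gradients computed on the private data, i.e. the discriminative signal drives the synthetic samples instead of a full density model.

        import torch

        torch.manual_seed(0)
        d, n_priv, n_syn = 5, 128, 10

        X_priv = torch.randn(n_priv, d)                    # toy "private" data
        y_priv = (X_priv[:, 0] > 0).float()

        X_syn = torch.nn.Parameter(torch.randn(n_syn, d))  # learnable synthetic samples
        y_syn = torch.tensor([0.0, 1.0] * (n_syn // 2))    # fixed balanced labels

        w = torch.zeros(d, requires_grad=True)             # simple linear classifier
        opt = torch.optim.Adam([X_syn], lr=0.05)
        loss_fn = torch.nn.BCEWithLogitsLoss()

        def noisy_private_grad(clip=1.0, sigma=0.5):
            """Average of per-example clipped gradients plus Gaussian noise (DP-SGD style)."""
            total = torch.zeros(d)
            for xi, yi in zip(X_priv, y_priv):
                g, = torch.autograd.grad(loss_fn((xi @ w).view(1), yi.view(1)), w)
                total += g * min(1.0, clip / (g.norm().item() + 1e-12))
            return total / n_priv + (sigma * clip / n_priv) * torch.randn(d)

        for step in range(50):
            g_priv = noisy_private_grad().detach()
            g_syn, = torch.autograd.grad(loss_fn(X_syn @ w, y_syn), y_syn.new_tensor(0) * 0 + w if False else w, create_graph=True)
            match_loss = ((g_syn - g_priv) ** 2).sum()     # gradient-matching objective
            opt.zero_grad()
            match_loss.backward()
            opt.step()
        # (the actual method also trains the classifier on the synthetic set between
        #  matching steps and tracks the privacy budget; both are omitted here)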

    Query-Time Data Integration

    Get PDF
    Today, data is collected in ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods are compensated for by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
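
    The top-k entity augmentation step can be illustrated with a small, hedged sketch (hypothetical web tables and a simple greedy set-cover heuristic, not the thesis' algorithm): each alternative solution covers the queried entities with few sources, and sources already used by earlier solutions are penalised so that the k results stay mutually diverse.

        def greedy_cover(entities, sources, already_used, penalty=0.5):
            """One augmentation: greedy set cover; each source maps entity -> value."""
            remaining, solution = set(entities), []
            while remaining:
                def score(item):
                    name, mapping = item
                    gain = len(remaining & set(mapping))
                    return gain * (penalty if name in already_used else 1.0)
                name, mapping = max(sources.items(), key=score)
                covered = remaining & set(mapping)
                if not covered:
                    break                           # no source provides anything further
                solution.append(name)
                remaining -= covered
            return solution, remaining

        def top_k_augmentations(entities, sources, k=3):
            used, results = set(), []
            for _ in range(k):
                solution, uncovered = greedy_cover(entities, sources, used)
                results.append((solution, uncovered))
                used.update(solution)               # push later solutions toward other sources
            return results

        # hypothetical web tables offering the missing 'population' attribute (millions)
        sources = {
            "web_table_A": {"Berlin": 3.7, "Hamburg": 1.9},
            "web_table_B": {"Berlin": 3.6, "Munich": 1.5, "Hamburg": 1.8},
            "web_table_C": {"Munich": 1.5},
        }
        for solution, missing in top_k_augmentations(["Berlin", "Hamburg", "Munich"], sources):
            print(solution, "uncovered:", missing)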

    A Reference System Architecture with Data Sovereignty for Human-Centric Data Ecosystems

    Get PDF
    Since the European information economy faces insufficient access to and joint utilization of data, data ecosystems increasingly emerge as economical solutions in B2B environments. By contrast, in B2C settings, concepts for sharing and monetizing personal data have not yet prevailed, impeding growth and innovation. A major obstacle is European data protection law, which merely ascribes to human data subjects a need for data privacy while largely neglecting their economic, participatory claims to data. The study reports on a design science research (DSR) approach addressing this gap and proposes an abstract reference system architecture for an ecosystem centered on humans and their personal data. In this DSR approach, multiple methods are embedded to iteratively build and evaluate the artifact, i.e., structured literature reviews, design recovery, prototyping, and expert interviews. The managerial contributions embody novel design knowledge about the conceptual development of human-centric B2C data ecosystems, considering their legal, ethical, economic, and technical constraints.

    Interaction-Based Creation and Maintenance of Continuously Usable Trace Links

    Get PDF
    Traceability is a major concern for all software engineering artefacts. The core of traceability is the trace links between the artefacts. Of the links between all kinds of artefacts, trace links between requirements and source code are fundamental, since they connect the user's point of view of a requirement with its actual implementation. Trace links are important for many software engineering tasks such as maintenance, program comprehension, and verification. Furthermore, the direct availability of trace links during a project improves the performance of developers. The manual creation of trace links is too time-consuming to be practical. Thus, traceability research has a strong focus on automatic trace link creation. The most common automatic trace link creation methods use information retrieval techniques to measure the textual similarity between artefacts. The results of the textual similarity measurement are then used to decide on the creation of links between artefacts. The application of such information retrieval techniques produces many wrong link candidates and requires further expert knowledge to make the automatically created links usable, since it is necessary to manually vet the link candidates. This prevents the use of information retrieval techniques to create trace links continuously and provide them directly to developers during a project. Thus, this thesis addresses the problem of continuously providing good-quality trace links to developers during a project and of maintaining these links as artefacts change. To achieve this, a novel automatic trace link creation approach called Interaction Log Recording-based Trace Link Creation (ILog) has been designed and evaluated. ILog utilizes the interactions of developers with source code while implementing requirements. In addition, ILog uses the common development convention of providing issue identifiers in commit messages to assign recorded interactions to requirements. Thus, ILog avoids additional manual effort from developers for link creation. ILog has been implemented in a set of tools. The tools enable the recording of interactions in different integrated development environments and the subsequent creation of trace links. Trace links are created between source code files which have been touched by interactions and the current requirement being worked on. The trace links which are initially created in this way are further improved by utilizing interaction data such as interaction duration, frequency, and type, as well as source code structure, i.e. source code references between source code files involved in trace links. ILog's link improvement removes potentially wrong links and subsequently adds further correct links. ILog was evaluated in three empirical studies using gold standards created by experts. One of the studies used data from an open source project. In the two other studies, student projects involving a real-world customer were used. The results of the studies showed that ILog can create trace links with perfect precision and good recall, which enables the direct usage of the links. The studies also showed that the ILog approach has better precision and recall than other automatic trace link creation approaches, such as information retrieval. To identify trace link maintenance capabilities suitable for integration into ILog, a systematic literature review about trace link maintenance was performed.
In the systematic literature review, the trace link maintenance approaches that were found are discussed on the basis of a standardized trace link maintenance process. Furthermore, the extension of ILog with suitable trace link maintenance capabilities from the approaches found is illustrated.
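
    A minimal sketch of the interaction-based linking idea (the data model, regular expression, and thresholds are illustrative assumptions, not the ILog tools themselves): interactions recorded in the IDE are assigned to the requirement named by the issue identifier in the commit message, and a link to a touched source file is kept only if its accumulated interaction evidence is strong enough.

        import re
        from collections import defaultdict

        def requirement_of(commit_message):
            """Extract an issue identifier such as 'PROJ-123' from the commit message."""
            m = re.search(r"[A-Z]+-\d+", commit_message)
            return m.group(0) if m else None

        def create_trace_links(commits, min_duration=30, min_events=2):
            """commits: list of dicts with 'message' and 'interactions'
            (file, seconds, kind). Returns requirement -> set of linked source files."""
            evidence = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # req -> file -> [sec, events]
            for commit in commits:
                req = requirement_of(commit["message"])
                if req is None:
                    continue
                for file, seconds, kind in commit["interactions"]:
                    evidence[req][file][0] += seconds
                    evidence[req][file][1] += 1
            return {req: {f for f, (sec, n) in files.items()
                          if sec >= min_duration and n >= min_events}
                    for req, files in evidence.items()}

        commits = [
            {"message": "PROJ-42 add login check",
             "interactions": [("Login.java", 300, "edit"), ("Util.java", 10, "select")]},
            {"message": "PROJ-42 fix validation",
             "interactions": [("Login.java", 120, "edit"), ("Util.java", 15, "select")]},
        ]
        print(create_trace_links(commits))   # PROJ-42 -> {'Login.java'} under these thresholds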

    Querying and mining heterogeneous spatial, social, and temporal data

    Get PDF

    DBFIRE: retrieval of documents related to database queries.

    Get PDF
    Databases and documents are commonly kept separate in organizations, managed by Database Management Systems (DBMSs) and Information Retrieval (IR) systems, respectively. This separation stems from the nature of the data handled: structured in the first case, unstructured in the second. While DBMSs process exact queries over databases, IR systems retrieve documents through keyword searches, which are inherently imprecise. Nevertheless, integrating these systems can bring great benefits to users, since, within the same organization, databases and documents frequently refer to common entities. One possibility for integration is the retrieval of documents associated with a given database query. For example, considering the query "Which customers have contracts above X reais?", how can we retrieve documents that may be associated with this query, such as those customers' contracts or open proposals for new sales, among other documents? The solution proposed in this thesis is based on a special query-expansion approach for document retrieval: an initial set of keywords is expanded with potentially useful terms contained in the result of a database query; the resulting set of keywords is then submitted to an IR system to retrieve the documents of interest for the query. A new way of ranking the expansion terms is also proposed: on the assumption that a database query exactly represents the user's information need, term selection is measured by the term's diffusion across the query result. This measure is used not only to select the best terms but also to establish their relative weights in the expansion. To validate the proposed method, experiments were carried out in two distinct domains, with results showing significant improvements in retrieving documents related to the queries compared with other prominent models in the literature.
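
    The expansion step can be sketched as follows (a hedged illustration with assumed column names, stopwords, and weighting, not the DBFIRE implementation): terms taken from the database query result are ranked by their diffusion, i.e. the fraction of result rows in which they occur, and the best ones are appended to the user's keywords with that fraction as their weight.

        from collections import Counter

        def expansion_terms(result_rows, top_n=5, stopwords=frozenset({"de", "do", "da"})):
            """result_rows: list of dicts (one per row of the SQL result).
            Returns (term, weight) pairs, weight = fraction of rows containing the term."""
            row_count = len(result_rows)
            diffusion = Counter()
            for row in result_rows:
                terms = {t.lower() for v in row.values() for t in str(v).split()}
                diffusion.update(t for t in terms if t not in stopwords and len(t) > 2)
            return [(term, rows_with / row_count)
                    for term, rows_with in diffusion.most_common(top_n)]

        def expanded_query(keywords, result_rows):
            """Combine the user's keywords (weight 1.0) with weighted expansion terms."""
            return [(k, 1.0) for k in keywords] + expansion_terms(result_rows)

        rows = [{"cliente": "ACME Ltda", "contrato": "Contrato 2021 consultoria"},
                {"cliente": "Beta SA",   "contrato": "Contrato 2022 manutencao"}]
        print(expanded_query(["contratos", "clientes"], rows))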