
    A Study on Efficient and Secure Set Similarity Joins

    筑波大学 (University of Tsukuba)201

    Approximate sequence alignment

    Given a collection of strings and a query string, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to the query string. In this paper, we focus on edit distance as the measure of similarity between two strings. Existing q-gram based methods use inverted lists to index the q-grams of the given string collection. These methods begin by generating the q-grams of the query string, disjoint or overlapping, and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed to segment inverted lists in order to obtain relatively shorter lists, thus reducing the merging cost. The filtering technique we propose in this thesis, called position restricted alignment, combines the well-known length filtering and position filtering to provide more aggressive pruning. We then provide an indexing scheme that integrates the storage of the inverted lists with the proposed filter, which enables us to auto-filter the inverted lists. We evaluate the effectiveness of the proposed approach experimentally.
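
    The q-gram pipeline described in this abstract can be sketched as follows. This is a minimal illustration of the standard length, position, and count filters over a positional inverted index, not the thesis's position restricted alignment technique. All function names (`qgrams`, `build_index`, `search`) are hypothetical, and the candidate loop scans every string for simplicity rather than merging lists efficiently.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping q-grams of s together with their start positions."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic-programming edit distance (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def build_index(strings, q):
    """Inverted index: q-gram -> list of (string id, position)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in qgrams(s, q):
            index[gram].append((sid, pos))
    return index


def search(query, strings, index, q, k):
    """Return all strings within edit distance k of query."""
    counts = defaultdict(int)
    for gram, qpos in qgrams(query, q):
        for sid, pos in index.get(gram, ()):
            if abs(len(strings[sid]) - len(query)) > k:
                continue  # length filter: lengths may differ by at most k
            if abs(pos - qpos) > k:
                continue  # position filter: a matching q-gram shifts by at most k
            counts[sid] += 1

    results = []
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:
            continue
        # Count filter: strings within distance k share at least this many q-grams.
        needed = max(len(s), len(query)) - q + 1 - k * q
        if needed > 0 and counts[sid] < needed:
            continue
        if edit_distance(query, s) <= k:  # verify the surviving candidates
            results.append(s)
    return results
```

    The filters only discard strings that cannot be within distance k, so the final verification step decides the answer; when the count bound is not positive, the sketch falls back to verifying the string directly.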

    TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching

    Set similarity join is an important problem with many applications in data discovery, cleaning and integration. To increase robustness, fuzzy set similarity join calculates the similarity of two sets based on maximum weighted bipartite matching instead of set overlap. This allows pairs of elements, represented as sets or strings, to match approximately rather than exactly, e.g., based on Jaccard similarity or edit distance. However, this significantly increases the verification cost, which makes efficient and effective filtering techniques that reduce the number of candidate pairs even more important. The current state-of-the-art algorithm relies on similarity computations between pairs of elements to filter candidates. In this paper, we propose token-based instead of element-based filtering, showing that it is significantly more lightweight, while offering similar or even better pruning effectiveness. Moreover, we address the top-k variant of the problem, alleviating the need for a user-specified similarity threshold. We also propose early termination to reduce the cost of verification. Our experimental results on six real-world datasets show that our approach always outperforms the state of the art, being an order of magnitude faster on average.
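
    To make the matching-based similarity concrete, here is a small sketch of the verification step only, not the token-based filtering the paper proposes. It assumes elements are token sets compared with Jaccard similarity, and it assumes a Jaccard-style normalization of the matching score, which the abstract does not spell out. The matching is found by brute force over permutations, which is exact but only practical for tiny sets; the function names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_matching_score(R, S):
    """Weight of a maximum weighted bipartite matching between elements of R and S.

    Brute force: try every assignment of the smaller side to distinct elements
    of the larger side. Edge weights are non-negative, so matching every
    element of the smaller side never lowers the total.
    """
    if len(R) > len(S):
        R, S = S, R
    best = 0.0
    for perm in permutations(range(len(S)), len(R)):
        score = sum(jaccard(R[i], S[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best


def fuzzy_jaccard(R, S):
    """Assumed normalization: matching score / (|R| + |S| - matching score)."""
    m = max_matching_score(R, S)
    denom = len(R) + len(S) - m
    return m / denom if denom else 1.0


# Two sets whose elements are themselves token sets.
R = [{"a", "b"}, {"c"}]
S = [{"a", "b"}, {"c", "d"}]
similarity = fuzzy_jaccard(R, S)  # matching score 1.0 + 0.5 = 1.5
```

    With exact set overlap, only the element {"a", "b"} would count as shared; the matching also credits the partial agreement between {"c"} and {"c", "d"}. A real implementation would replace the brute-force search with a polynomial-time assignment algorithm.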

    Data generator for evaluating ETL process quality

    Obtaining the right set of data for evaluating the fulfillment of different quality factors in extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to privacy constraints, while manually providing a synthetic dataset is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, and hence misses the plethora of possible test cases. To facilitate such a demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework, and here we report our experimental findings showing the effectiveness and scalability of our approach. Peer Reviewed. Postprint (author's final draft).

    Integration of Legacy Appliances into Home Energy Management Systems

    The progressive installation of renewable energy sources requires the coordination of energy-consuming devices. At the consumer level, this coordination can be done by a home energy management system (HEMS). Interoperability issues need to be solved among smart appliances as well as between smart and non-smart, i.e., legacy, devices. We expect current standardization efforts to soon provide technologies for designing smart appliances that cope with the current interoperability issues. Nevertheless, common electrical devices affect energy consumption significantly and therefore deserve consideration within energy management applications. This paper discusses the integration of smart and legacy devices into a generic system architecture and, subsequently, elaborates the requirements and components necessary to realize such an architecture, including an application of load detection for identifying running loads and integrating them into existing HEM systems. We assess the feasibility of such an approach with a case study based on a measurement campaign in real households. We show how the information on detected appliances can be extracted in order to create device profiles that allow for their integration and management within a HEMS.