
    Towards a unified framework for string similarity joins

    A similarity join aims to find all similar pairs between two collections of records. Established algorithms utilise different similarity measures, either syntactic or semantic, to quantify the similarity between two records. However, when records are similar through a mixture of syntactic and semantic relations, a single measure is inadequate to capture the real similarity between records and hence fails to deliver high-quality join results. In this paper, we study a unified framework for finding similar records by combining multiple similarity measures. To achieve this goal, we first develop a new similarity framework that simultaneously unifies three existing kinds of similarity measures: syntactic (typographic) similarity, synonym-based similarity, and taxonomy-based similarity. We then prove that finding the maximum unified similarity between two strings is NP-hard in general, and we develop an approximation algorithm that runs in polynomial time with a non-trivial approximation guarantee. To support efficient string joins based on our unified similarity measure, we adopt the filter-and-verification framework and propose a new signature structure, called pebble, which can simultaneously handle multiple similarity measures. The salient feature of our approach is that it can judiciously select the best pebble signatures and overlap thresholds to maximise the filtering power. Extensive experiments show that our methods are capable of finding similar records that exhibit mixed types of similarity relations, while offering high efficiency and scalability for similarity joins. The implementation can be downloaded at https://github.com/HY-UDBMS/AU-Join. Peer reviewed.
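    The filter-and-verification pipeline mentioned in the abstract can be pictured with a minimal sketch: signatures prune the candidate space, and surviving pairs are verified against a combined similarity. The sketch below only illustrates that generic pipeline, not the paper's pebble signatures or its unified measure; the synonym table, the scoring function, and the threshold are assumptions made for the example.

```python
# Minimal sketch of a filter-and-verification similarity join (not the paper's
# pebble-based method): candidate pairs are generated via shared token signatures,
# then verified with a combined syntactic/semantic score. All names are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher

SYNONYMS = {"ny": "new york", "nyc": "new york"}  # toy synonym table (assumed)

def normalize(token):
    return SYNONYMS.get(token, token)

def signatures(record):
    # Use normalized tokens as signatures; the paper's "pebbles" generalize this idea.
    return {normalize(t) for t in record.lower().split()}

def combined_similarity(r, s):
    # Toy "unified" score: the best of a raw edit-based similarity and the similarity
    # after synonym normalization; the real framework also covers taxonomy relations.
    raw = SequenceMatcher(None, r.lower(), s.lower()).ratio()
    norm = SequenceMatcher(None,
                           " ".join(sorted(signatures(r))),
                           " ".join(sorted(signatures(s)))).ratio()
    return max(raw, norm)

def similarity_join(R, S, threshold=0.8):
    # Filter: index S by signature and collect pairs sharing at least one signature.
    index = defaultdict(set)
    for j, s in enumerate(S):
        for sig in signatures(s):
            index[sig].add(j)
    candidates = {(i, j) for i, r in enumerate(R) for sig in signatures(r) for j in index[sig]}
    # Verification: keep pairs whose combined similarity clears the threshold.
    return [(R[i], S[j]) for i, j in sorted(candidates)
            if combined_similarity(R[i], S[j]) >= threshold]

print(similarity_join(["NYC taxi data"], ["new york taxi data"], threshold=0.7))
```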

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.) and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, which also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 2018
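    As a rough illustration of the profiling idea described above (grouping strings by syntactic shape and describing each group with a regex-like pattern), the sketch below abstracts every string to a pattern of character classes and clusters identical patterns. It is a simplification, not FlashProfile's PROSE-based synthesis, and the pattern language used here is an assumption.

```python
# Simplified illustration of syntactic profiling (not FlashProfile's PROSE-based
# synthesis): each string is abstracted to a regex-like pattern of character
# classes, and identical patterns form one cluster of the profile.
import re
from collections import Counter

def abstract(s):
    # Collapse runs of digits, letters, and whitespace into class tokens;
    # keep punctuation literally. A real profiler learns richer patterns.
    out = []
    for run in re.findall(r"\d+|[A-Za-z]+|\s+|.", s):
        if run.isdigit():
            out.append(r"\d+")
        elif run.isalpha():
            out.append(r"[A-Za-z]+")
        elif run.isspace():
            out.append(r"\s+")
        else:
            out.append(re.escape(run))
    return "".join(out)

def profile(strings):
    # Returns (pattern, count) pairs, most frequent syntactic format first.
    return Counter(abstract(s) for s in strings).most_common()

dates = ["2024-01-05", "2024-1-5", "Jan 5, 2024", "05/01/2024"]
for pattern, count in profile(dates):
    print(count, pattern)
```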

    The System Kato: Detecting Cases of Plagiarism for Answer-Set Programs

    Plagiarism detection is a growing need among educational institutions, and solutions for different purposes exist. An important field in this direction is the detection of source-code plagiarism. In this paper, we present the tool Kato for supporting the detection of this kind of plagiarism in the area of answer-set programming (ASP). Currently, the tool is implemented for DLV programs, but it is designed to handle other logic-programming dialects as well. We review the basic features of Kato, introduce its theoretical underpinnings, and discuss an application of Kato for plagiarism detection in the context of courses on logic programming at the Vienna University of Technology.
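    The abstract does not spell out Kato's algorithm, so the following is only a generic sketch of how source-code similarity for logic programs might be scored when screening for plagiarism: rules are normalised (comments and whitespace stripped, variables renamed canonically) and programs compared by rule-set overlap. All details are assumptions made for illustration.

```python
# Generic sketch of source-code similarity for logic programs, in the spirit of
# plagiarism screening; this is NOT Kato's actual algorithm, just an illustration.
import re

def normalize_rule(rule):
    # Strip comments and whitespace, then rename variables (uppercase identifiers
    # in DLV/ASP syntax) to canonical names V0, V1, ... in order of appearance.
    rule = re.sub(r"%.*", "", rule)          # drop line comments
    rule = re.sub(r"\s+", "", rule)          # drop whitespace
    mapping = {}
    def rename(m):
        return mapping.setdefault(m.group(0), f"V{len(mapping)}")
    return re.sub(r"\b[A-Z_][A-Za-z0-9_]*", rename, rule)

def rule_set(program):
    return {normalize_rule(r) for r in program.split(".") if r.strip()}

def similarity(p1, p2):
    # Jaccard overlap of normalized rules; high values flag candidate plagiarism pairs.
    a, b = rule_set(p1), rule_set(p2)
    return len(a & b) / len(a | b) if a | b else 0.0

p1 = "reach(X,Y) :- edge(X,Y). reach(X,Z) :- reach(X,Y), edge(Y,Z)."
p2 = "reach(A,B) :- edge(A,B). reach(A,C) :- reach(A,B), edge(B,C)."
print(similarity(p1, p2))  # 1.0: identical up to variable renaming
```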

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularising a system; it relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a mere collection of tokens, however, loses the semantic information embedded within the identifiers. We address this problem by introducing context models for source-code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is based on the flow of data between identifiers: a module is represented as a dependency graph whose nodes correspond to identifiers and whose edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects and show that introducing contexts for identifiers improves the quality of the modularization of the software systems. Both context models give results superior to the plain vector representation of documents; in some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that topics inferred from the contextual representations are more meaningful than those inferred from the plain representation of the documents. The proposed approach of introducing a context model for source-code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization, and topic analysis.
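    The data-flow notion of context can be pictured with a small sketch: a module becomes a graph whose nodes are identifiers and whose edges record that one identifier's value flows into another. The paper's implementation targets Java projects; the sketch below uses Python source and its standard ast module purely for illustration, and the example snippet being analysed is invented.

```python
# Minimal sketch of a data-flow context model: build a dependency graph whose
# nodes are identifiers and whose edges record value flow through assignments.
# Illustrative only; the paper's approach is applied to Java systems.
import ast
from collections import defaultdict

def dataflow_graph(source):
    graph = defaultdict(set)  # identifier -> identifiers it depends on
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            used = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            for t in targets:
                graph[t].update(used)
    return graph

code = """
price = base_cost * tax_rate
total = price + shipping
"""
for ident, deps in dataflow_graph(code).items():
    print(ident, "<-", sorted(deps))
# price <- ['base_cost', 'tax_rate']
# total <- ['price', 'shipping']
```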

    TüSBL: a similarity-based chunk parser for robust syntactic processing

    Chunk parsing has focused on the recognition of partial constituent structures at the level of individual chunks. Little attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances. The TüSBL parser augments current chunk parsing techniques with a tree-construction component that extends partial chunk parses to complete tree structures, including recursive phrase structure as well as function-argument structure. TüSBL's tree-construction algorithm relies on techniques from memory-based learning that allow similarity-based classification of a given input structure relative to a pre-stored set of tree instances from a fully annotated treebank. A quantitative evaluation of TüSBL has been conducted using a semi-automatically constructed treebank of German that consists of approx. 67,000 fully annotated sentences. The basic PARSEVAL measures were used, although they were developed for parsers whose main goal is a complete analysis that spans the entire input. This runs counter to the basic philosophy underlying TüSBL, whose main goal is robustness of partially analyzed structures.
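    The memory-based, similarity-based classification that TüSBL relies on can be sketched as nearest-neighbour retrieval over stored instances: a new chunk sequence inherits the structure of the most similar instance in the instance base. The instance format, the similarity function, and the toy data below are assumptions; this is not TüSBL's actual tree-construction algorithm.

```python
# Hedged sketch of memory-based (instance-based) classification for
# similarity-based parsing: a new chunk sequence is matched against stored
# treebank instances and inherits the structure of the nearest one.
from difflib import SequenceMatcher

# Toy instance base: (chunk-tag sequence, associated tree skeleton) — assumed format.
INSTANCES = [
    (("NC", "VC", "NC"), "(S (NP ...) (VP ... (NP ...)))"),
    (("NC", "VC", "PC"), "(S (NP ...) (VP ... (PP ...)))"),
]

def similarity(seq_a, seq_b):
    # Sequence overlap as a crude similarity between chunk-tag sequences.
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def classify(chunk_sequence):
    # Return the stored tree of the most similar instance (1-nearest neighbour).
    return max(INSTANCES, key=lambda inst: similarity(chunk_sequence, inst[0]))[1]

print(classify(("NC", "VC", "NC", "NC")))  # falls back to the closest stored analysis
```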

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Information extraction, data integration, and uncertain data management are different areas of research that have received considerable attention over the last two decades. Much research has tackled these areas individually. However, information extraction systems should be integrated with data integration methods in order to make use of the extracted information. Handling uncertainty in the extraction and integration processes is an important issue for enhancing the quality of the data in such integrated systems. This article presents the state of the art of these areas of research, shows their common grounds, and discusses how to integrate information extraction and data integration under the umbrella of uncertainty management.

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Word reordering is one of the most difficult aspects of statistical machine translation (SMT) and an important factor in its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be driven mostly by empirical trials. To orient the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature on advanced reordering modeling. We then ask why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them. Comment: 44 pages, to appear in Computational Linguistics.
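    The survey emphasises measuring the amount of reordering in a language pair. One common way to quantify it, used here purely as an assumed example rather than as the survey's own metric, is the Kendall's tau distance over the permutation that a word alignment induces on source positions.

```python
# Kendall's tau distance over an alignment-induced permutation: one common (assumed)
# way to measure how much reordering a language pair exhibits.
# 0.0 means monotone order; values near 1.0 mean heavy reordering.
from itertools import combinations

def kendall_tau_distance(permutation):
    # Fraction of word pairs whose relative order is swapped between source and target.
    pairs = list(combinations(range(len(permutation)), 2))
    swapped = sum(1 for i, j in pairs if permutation[i] > permutation[j])
    return swapped / len(pairs) if pairs else 0.0

# Target-side positions of source words 0..n-1 (a toy one-to-one alignment).
monotone = [0, 1, 2, 3]        # fully monotone translation
long_range = [3, 0, 1, 2]      # e.g. one word moved across the whole clause
print(kendall_tau_distance(monotone))    # 0.0
print(kendall_tau_distance(long_range))  # 0.5
```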