22,397 research outputs found

    Source File Set Search for Clone-and-Own Reuse Analysis

    Get PDF
    Clone-and-own approach is a natural way of source code reuse for software developers. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify an original version of the component and understand how the cloned component is different from the original one. Although developers may record the original version information in a version control system and/or directory names, such information is often either unavailable or incomplete. In this research, we propose a code search method that takes as input a set of source files and extracts all the components including similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using b-bit minwise hashing technique. We use an aggregated file similarity for ranking components. To evaluate the effectiveness of this tool, we analyzed 75 cloned components in Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top-five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hash is 0.773, according to the ground truth recorded in the source code repositories.Comment: 14th International Conference on Mining Software Repositorie

    Sketch-based Influence Maximization and Computation: Scaling up with Guarantees

    Full text link
    Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces. Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1-1/e) and actually produces a sequence of nodes, with each prefix having approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.Comment: 10 pages, 5 figures. Appeared at the 23rd Conference on Information and Knowledge Management (CIKM 2014) in Shanghai, Chin

    Handling Massive N-Gram Datasets Efficiently

    Get PDF
    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
    • …
    corecore