
    TopSig: Topology Preserving Document Signatures

    Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but so far they have not been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from a theoretical perspective they position the file signature model in the class of Vector Space retrieval models. Comment: 12 pages, 8 figures, CIKM 201
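
    For illustration, the sketch below shows the general idea of random-projection bit signatures: each term contributes a pseudo-random ±1 vector, the sum is reduced to sign bits, and documents are compared by Hamming distance. It is not the TopSig algorithm itself; the signature width, hashing scheme, and absence of term weighting are illustrative assumptions.

```python
# Minimal sketch of a random-projection document signature (illustrative only,
# not TopSig): terms map to pseudo-random +/-1 vectors, the sum is reduced to
# sign bits, and documents are compared by Hamming distance.
import hashlib
import numpy as np

SIG_BITS = 64  # signature width; an arbitrary illustrative choice

def term_vector(term: str, bits: int = SIG_BITS) -> np.ndarray:
    """Deterministic pseudo-random +/-1 vector for a term, seeded by its hash."""
    seed = int.from_bytes(hashlib.sha1(term.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=bits)

def document_signature(text: str) -> np.ndarray:
    """Sum the term vectors of a document and keep only the sign of each dimension."""
    acc = np.zeros(SIG_BITS)
    for term in text.lower().split():
        acc += term_vector(term)
    return (acc >= 0).astype(np.uint8)  # one bit per dimension

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

sig1 = document_signature("signature files for text retrieval")
sig2 = document_signature("inverted files for text retrieval")
print(hamming_distance(sig1, sig2))  # smaller distance suggests more similar documents
```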

    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    The drastic growth of the information accessible on the World Wide Web has made the use of automated tools to locate information resources of interest, and to track and analyze them, a necessity. Web mining is the branch of data mining that deals with the analysis of the World Wide Web. It draws on concepts from several areas, such as data mining, Internet technology, the World Wide Web and, more recently, the Semantic Web. Web mining can be defined as the procedure of discovering hidden yet potentially beneficial knowledge from the data accessible on the web. It comprises three sub-areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from web pages and other web objects. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. Web usage mining is the process of mining the usage patterns created by users accessing web pages. Search engine technology has developed alongside the World Wide Web, and search engines are the chief gateways for accessing information on the web. The ability to locate content of particular interest amid a huge heap of pages has made businesses beneficial and productive. Search engines respond to queries by employing web crawling, which populates an indexed repository of web pages. Crawlers construct a local repository of the segment of the web that they visit by navigating the web graph and retrieving pages. There are two main types of crawling: generic and focused. Generic crawlers crawl documents and links of diverse topics, while focused crawlers limit the pages they visit with the aid of previously obtained specialized knowledge. Systems that index, mine, and otherwise analyze pages (such as search engines) take their input from the repositories of web pages built by web crawlers. The rapid development of the Internet and the growing need to integrate heterogeneous data are accompanied by the problem of near-duplicate data. Even though near-duplicate pages are not bit-wise identical, they are remarkably similar. Duplicate and near-duplicate web pages increase index storage space, slow down serving or increase serving costs, and annoy users, causing significant problems for web search engines. It is therefore essential to design algorithms to detect such pages.
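
    As a rough illustration of near-duplicate detection, the sketch below compares documents by word shingles and Jaccard similarity against a fixed threshold. The shingle size and threshold are arbitrary placeholders, not values drawn from the survey.

```python
# Minimal sketch of near-duplicate detection via word shingling and a Jaccard
# similarity threshold (illustrative only; k and the threshold are arbitrary).
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return pairs of document ids whose shingle-set similarity meets the threshold."""
    sigs = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sigs)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(sigs[a], sigs[b]) >= threshold:
                pairs.append((a, b))
    return pairs

pages = {
    "p1": "web mining is the branch of data mining that deals with the web",
    "p2": "web mining is the branch of data mining dealing with the web",
    "p3": "search engines respond to queries using an indexed repository",
}
print(near_duplicates(pages, threshold=0.5))  # p1 and p2 are near duplicates
```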

    Ground correlation investigation of thruster spacecraft interactions to be measured on the IAPS flight test

    Preliminary ground correlation testing has been conducted with an 8 cm mercury ion thruster and diagnostic instrumentation replicating, to a large extent, the IAPS flight test hardware, configuration, and electrical grounding/isolation. Thruster efflux deposition retained at 25 C was measured and characterized. Thruster ion efflux was characterized with retarding potential analyzers. Thruster-generated plasma currents, the spacecraft common (SCC) potential, and ambient plasma properties were evaluated with a spacecraft potential probe (SPP). All the measured thruster/spacecraft interactions, or their IAPS measurements, depend critically on the SCC potential, which can be controlled by a neutralizer ground switch and by operation of the SPP.

    An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection

    As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore 7 C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files – a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at this level. We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some do so to a greater degree. This includes divergence in file lengths and token usage frequency. Additionally, none of the datasets contain the entirety of the C/C++ vocabulary; these missing tokens account for up to 11% of all token usage. Second, we find that all the datasets contain duplication, some a significant amount. In the Juliet dataset, we describe augmentations of test cases that make the dataset susceptible to data leakage. This augmentation occurs so frequently that a random 80/20 split has roughly 58% overlap between the test and training data. Finally, we collect and process a large dataset of C code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.
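
    To illustrate the kind of leakage described above, the sketch below fingerprints token sequences by hashing and counts a test file as leaked when its fingerprint also appears on the training side of a random 80/20 split. The fingerprinting and split procedure are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch of measuring train/test leakage caused by duplicate files
# (illustrative only; not the procedure used in the paper).
import hashlib
import random

def token_fingerprint(tokens: list) -> str:
    """Hash a token sequence so exact duplicates collapse to the same key."""
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def test_overlap_fraction(files: list, seed: int = 0) -> float:
    """Fraction of test files whose token sequence also appears in the training split."""
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    train_keys = {token_fingerprint(t) for t in train}
    if not test:
        return 0.0
    leaked = sum(token_fingerprint(t) in train_keys for t in test)
    return leaked / len(test)

# Toy corpus with heavy duplication; duplicates that straddle the split count as leakage.
corpus = [["int", "main", "(", ")", "{", "}"]] * 8 + [["void", "f", "(", ")", ";"]] * 2
print(test_overlap_fraction(corpus))
```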

    Extensive Copy-Number Variation of Young Genes across Stickleback Populations

    MM received funding from the Max Planck innovation funds for this project. PGDF was supported by a Marie Curie European Reintegration Grant (proposal nr 270891). CE was supported by German Science Foundation grants (DFG, EI 841/4-1 and EI 841/6-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    BagMinHash - Minwise Hashing Algorithm for Weighted Sets

    Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests finally verifies the new algorithm and also reveals limitations of other approaches published in the recent past. Comment: 10 pages, KDD 201
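
    For background, the sketch below shows classic unweighted minwise hashing: each signature component is the minimum of a seeded hash over the set's elements, and the fraction of matching components estimates the Jaccard similarity. It does not implement BagMinHash itself; the hash construction and signature length are illustrative assumptions.

```python
# Minimal sketch of classic (unweighted) minwise hashing for Jaccard estimation
# (background illustration only; not the BagMinHash algorithm for weighted sets).
import hashlib

def _h(element: str, seed: int) -> int:
    """Seeded 64-bit hash of an element."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{element}".encode()).digest()[:8], "big")

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """One signature component per seeded hash: the minimum hash over the set."""
    return [min(_h(x, seed) for x in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of equal components approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox sleeps beside the lazy dog".split())
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
# Exact Jaccard here is 6/10 = 0.6; the estimate converges as num_hashes grows.
```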