255,521 research outputs found

    Web document classification using topic modeling based document ranking

    In this paper, we propose a web document ranking method based on topic modeling for effective information collection and classification. The proposed ranking technique avoids duplicated crawling when crawling at high speed: it makes it feasible to remove redundant documents, classify documents efficiently, and confirm that the crawler service is running. The method enables rapid collection of large numbers of web documents, so users can efficiently search web pages whose data is constantly updated. In addition, retrieval efficiency improves because new information can be automatically classified and transmitted. By expanding the method to big-data-based web pages and adapting it to various websites, even more effective information retrieval is expected to be possible.
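The abstract's core idea, projecting each crawled document onto topic vectors and dropping near-duplicates before ranking, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the fixed keyword-list "topics", the cosine duplicate threshold, and the ranking heuristic are all assumptions made for the example.

```python
from math import sqrt

def topic_vector(text, topics):
    """Project a document onto fixed topic keyword lists, giving a
    crude, normalized topic-distribution vector (assumed stand-in for
    a real topic model such as LDA)."""
    words = text.lower().split()
    vec = [sum(words.count(w) for w in kws) for kws in topics.values()]
    total = sum(vec) or 1
    return [v / total for v in vec]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def rank_and_dedup(docs, topics, dup_threshold=0.95):
    """Keep the first document of each near-duplicate topic profile,
    then rank survivors by how concentrated their topic mass is."""
    kept, vectors = [], []
    for doc in docs:
        v = topic_vector(doc, topics)
        if all(cosine(v, seen) < dup_threshold for seen in vectors):
            kept.append(doc)
            vectors.append(v)
    return sorted(kept, key=lambda d: -max(topic_vector(d, topics)))
```

Documents whose topic profiles are almost identical are treated as redundant crawls and skipped, which is the duplicate-avoidance behavior the abstract describes.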

    PACRR: A Position-Aware Neural IR Model for Relevance Matching

    In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous work has successfully captured unigram term matches, how to fully exploit position-dependent information such as proximity and term dependencies remains insufficiently explored. In this work, we propose a novel neural IR model named PACRR that aims to better model position-dependent interactions between a query and a document. Extensive experiments on six years of TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.
    Comment: To appear in EMNLP201
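The starting point of models like PACRR is a query-by-document matrix of term-embedding similarities, from which the strongest matches per query term are pooled. The sketch below shows only that generic idea under toy 2-dimensional embeddings; the actual model's convolutional layers over n-gram windows, its pooling sizes, and its scoring network are not reproduced here.

```python
def sim_matrix(query_terms, doc_terms, embeddings):
    """|q| x |d| matrix of cosine similarities between term embeddings."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0
    return [[cos(embeddings[q], embeddings[d]) for d in doc_terms]
            for q in query_terms]

def kmax_pool(matrix, k=2):
    """For each query term, keep its k strongest matches anywhere in
    the document -- a simplified stand-in for the pooling applied after
    position-aware convolutions capture proximity patterns."""
    return [sorted(row, reverse=True)[:k] for row in matrix]
```

Keeping the positions of the matches (rather than only their values, as here) is what lets a position-aware model exploit proximity and term dependencies.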

    Classifying Web Exploits with Topic Modeling

    This short empirical paper investigates how well topic modeling and database metadata characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, an accuracy of nearly 0.9 is obtained in the empirical experiment. Text mining and topic modeling are a significant factor behind this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, and offers a few scholarly observations about the potential for semi-automatic classification of exploits in existing tracking infrastructures.
    Comment: Proceedings of the 2017 28th International Workshop on Database and Expert Systems Applications (DEXA). http://ieeexplore.ieee.org/abstract/document/8049693
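One simple way text-derived features feed an exploit classifier is nearest-centroid assignment over term-count vectors. This is only a schematic example; the vocabulary, labels, and distance metric are assumptions for illustration and are not the paper's pipeline.

```python
def features(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary (a crude
    stand-in for topic-model and metadata features)."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(text, centroids, vocab):
    """Assign the label whose class centroid is closest in Euclidean
    distance to the document's feature vector."""
    v = features(text, vocab)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda label: dist(v, centroids[label]))
```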

    A semantic similarity approach to electronic document modeling and integration

    The World Wide Web is an enormous collection of information resources serving various purposes. However, the diversity of Web information and its formats makes it very difficult for users to search efficiently and obtain the information they require, because most information uploaded to the Web is unstructured or semi-structured. Many meta-data models have been proposed in response to this problem; they attempt to provide a general description of Web information in order to improve its structure. Although such ill-structured documents make up the largest portion of Web information and Web resources, few meta-data models deal with them by analyzing their semantic relations with one another. In this paper, we consider this huge set of Web information, called electronic documents, and propose a meta-data model called the EDM (Electronic Document Metadata) model. Using this model, we can extract semantic characteristics from electronic documents and use them to form a semantic electronic document model. This model, in turn, provides a basis for analyzing semantic similarity between electronic documents and for electronic document integration. Such document modeling and integration supports further manipulation of electronic documents, such as document exchange, searching, and evolution.
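The step from pairwise semantic similarity to document integration can be sketched as grouping documents whose extracted characteristics overlap enough. This is a hypothetical illustration, not the EDM model itself: Jaccard similarity over keyword sets, the 0.5 threshold, and greedy grouping are all assumed for the example.

```python
def similarity(meta_a, meta_b):
    """Jaccard similarity over extracted semantic characteristics
    (e.g. keywords or entities) of two electronic documents."""
    a, b = set(meta_a), set(meta_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def integrate(doc_meta, threshold=0.5):
    """Greedily merge documents whose semantic similarity to any member
    of an existing group reaches the threshold, yielding groups
    suitable for joint search, exchange, or integration."""
    groups = []
    for name, meta in doc_meta.items():
        for group in groups:
            if any(similarity(meta, doc_meta[m]) >= threshold
                   for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```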

    The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines

    Web-based single sign-on (SSO) services such as Google Sign-In and Log In with PayPal are based on the OpenID Connect protocol. This protocol enables so-called relying parties to delegate user authentication to so-called identity providers. OpenID Connect is one of the newest and most widely deployed single sign-on protocols on the web. Despite its importance, it has not received much attention from security researchers so far and, in particular, has not undergone any rigorous security analysis. In this paper, we carry out the first in-depth security analysis of OpenID Connect. To this end, we use a comprehensive generic model of the web to develop a detailed formal model of OpenID Connect. Based on this model, we precisely formalize and prove central security properties of OpenID Connect, including authentication, authorization, and session integrity. In our modeling of OpenID Connect, we employ security measures that avoid previously discovered attacks on OpenID Connect as well as new attack variants that we document for the first time in this paper. Based on these security measures, we propose security guidelines for implementers of OpenID Connect. Our formal analysis demonstrates that these guidelines are in fact effective and sufficient.
    Comment: An abridged version appears in CSF 2017. Parts of this work extend the web model presented in arXiv:1411.7210, arXiv:1403.1866, arXiv:1508.01719, and arXiv:1601.0122
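Two of the standard defenses this line of analysis relies on are the `state` parameter (session integrity) and the `nonce` claim (binding the ID token to the request). The relying-party side of an authorization-code-flow request and the basic ID token claim checks can be sketched as below. The URLs and client identifier are placeholders, and signature verification of the token is deliberately out of scope; this is not the paper's formal model, just an illustrative sketch.

```python
import secrets
from urllib.parse import urlencode

def build_auth_request(authorize_url, client_id, redirect_uri):
    """Construct an OpenID Connect authorization request (code flow).
    `state` guards against session-integrity attacks such as CSRF;
    `nonce` ties the eventual ID token back to this very request."""
    state = secrets.token_urlsafe(16)
    nonce = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "scope": "openid",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "nonce": nonce,
    }
    return authorize_url + "?" + urlencode(params), state, nonce

def check_id_token(claims, issuer, client_id, expected_nonce):
    """Minimal relying-party checks on a decoded ID token payload:
    issuer, audience, and nonce must all match (signature verification
    is assumed to have happened before this point)."""
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else (aud or [])
    return (claims.get("iss") == issuer
            and client_id in audiences
            and claims.get("nonce") == expected_nonce)
```

A token from the wrong issuer, for another client, or carrying a stale nonce fails these checks, which is the kind of property the paper's guidelines are meant to enforce.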