Web document classification using topic modeling based document ranking
In this paper, we propose a web document ranking method that uses topic modeling for effective information collection and classification. The ranking technique is applied during high-speed crawling to avoid fetching duplicate documents. Through the proposed ranking technique, it is feasible to remove redundant documents, classify documents efficiently, and verify that the crawler service is running. The proposed method enables rapid collection of large numbers of web documents, and users can efficiently search web pages whose data are constantly updated. In addition, retrieval efficiency improves because new information can be automatically classified and transmitted. By extending the method to big-data-scale web pages and adapting it to a variety of websites, we expect even more effective information retrieval to be possible.
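The abstract above does not give the authors' algorithm, but the core idea of avoiding duplicated crawling via topic modeling can be sketched as follows: represent each document by its topic distribution (e.g. from LDA) and keep a document only if it is not too similar, by cosine similarity, to any document already kept. The vectors and threshold below are illustrative assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two topic-distribution vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_duplicates(topic_vectors, threshold=0.95):
    # Keep a document only if it is dissimilar to every kept document.
    kept = []
    for vec in topic_vectors:
        if all(cosine(vec, k) < threshold for k in kept):
            kept.append(vec)
    return kept

docs = [
    [0.80, 0.10, 0.10],  # doc A
    [0.79, 0.11, 0.10],  # near-duplicate of A (filtered out)
    [0.10, 0.80, 0.10],  # different topic mix (kept)
]
unique = filter_duplicates(docs)
print(len(unique))  # → 2
```

In a real crawler the topic vectors would come from a trained topic model and the threshold would be tuned against a labeled sample of duplicates.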
PACRR: A Position-Aware Neural IR Model for Relevance Matching
In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous work has successfully captured unigram term matches, how to fully exploit position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR that aims at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years of TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.
Comment: To appear in EMNLP 2017
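PACRR's input representation, as described in the paper, is a query-by-document similarity matrix built from term embeddings, from which the model extracts position-aware matching signals. The toy sketch below (not the authors' code) builds such a matrix with cosine similarity and takes the k strongest matches per query term, a simplified stand-in for the model's k-max pooling; the 2-d embeddings are illustrative assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def sim_matrix(query_vecs, doc_vecs):
    # One row per query term, one column per document term.
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]

def k_max(row, k=2):
    # Keep the k strongest matches for a query term.
    return sorted(row, reverse=True)[:k]

query = [[1.0, 0.0], [0.0, 1.0]]            # toy query-term embeddings
doc = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]  # toy document-term embeddings
M = sim_matrix(query, doc)
signals = [k_max(row) for row in M]          # per-query-term match signals
```

In the full model these signals are produced by convolutions over the matrix, so that proximity and term order (the "position-dependent" information above) influence the final relevance score.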
Classifying Web Exploits with Topic Modeling
This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset comprised of over 36 thousand PoC exploits, an accuracy rate near 0.9 is obtained in the empirical experiment. Text mining and topic modeling are a significant factor behind this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, providing also a few scholarly observations about the potential for semi-automatic classification of exploits in existing tracking infrastructures.
Comment: Proceedings of the 2017 28th International Workshop on Database and Expert Systems Applications (DEXA). http://ieeexplore.ieee.org/abstract/document/8049693
A semantic similarity approach to electronic document modeling and integration
The World Wide Web is an enormous collection of information resources serving various purposes. However, the diversity of Web information and its related formats makes it very difficult for users to efficiently search for and obtain the information they require. The difficulty arises because most of the information uploaded to the Web is unstructured or semi-structured. Many meta-data models have been proposed in response to this problem; they attempt to provide a general description of Web information in order to improve its structure. Although ill-structured Web documents make up the largest portion of Web information, few meta-data models deal with them by analyzing their semantic relations with each other. In this paper, we consider this huge set of Web information, called electronic documents, and propose a meta-data model called the EDM (Electronic Document Metadata) model. Using this model, we can extract semantic characteristics from electronic documents and use them to form a semantic electronic document model. This model, in turn, provides a basis for analyzing semantic similarity between electronic documents and for electronic document integration. Such document modeling and integration supports further manipulation of electronic documents, such as document exchange, searching, and evolution.
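The abstract does not specify the EDM similarity measure, but the underlying idea (compare documents by their extracted semantic characteristics to drive integration) can be illustrated with a simple set-overlap score. Jaccard similarity over hypothetical keyword sets stands in here for whatever measure the paper actually defines.

```python
def jaccard(a: set, b: set) -> float:
    # Overlap of two characteristic sets: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical semantic characteristics extracted from two documents.
doc1 = {"web", "metadata", "semantic", "integration"}
doc2 = {"web", "metadata", "search"}

score = jaccard(doc1, doc2)  # 2 shared terms / 5 distinct terms
print(round(score, 2))  # → 0.4
```

Documents whose score exceeds some threshold would be candidates for integration, exchange, or merged search results in the sense described above.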
The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines
Web-based single sign-on (SSO) services such as Google Sign-In and Log In with PayPal are based on the OpenID Connect protocol. This protocol enables so-called relying parties to delegate user authentication to so-called identity providers. OpenID Connect is one of the newest and most widely deployed single sign-on protocols on the web. Despite its importance, it has not received much attention from security researchers so far and, in particular, has not undergone any rigorous security analysis.
In this paper, we carry out the first in-depth security analysis of OpenID Connect. To this end, we use a comprehensive generic model of the web to develop a detailed formal model of OpenID Connect. Based on this model, we then precisely formalize and prove central security properties for OpenID Connect, including authentication, authorization, and session integrity properties.
In our modeling of OpenID Connect, we employ security measures in order to avoid attacks on OpenID Connect that have been discovered previously, as well as new attack variants that we document for the first time in this paper. Based on these security measures, we propose security guidelines for implementors of OpenID Connect. Our formal analysis demonstrates that these guidelines are in fact effective and sufficient.
Comment: An abridged version appears in CSF 2017. Parts of this work extend the web model presented in arXiv:1411.7210, arXiv:1403.1866, arXiv:1508.01719, and arXiv:1601.0122
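To make the delegation step concrete: in the OpenID Connect authorization code flow, the relying party redirects the user to the identity provider's authorization endpoint. The sketch below builds that request URL; the endpoint and client values are placeholders, while the parameter names (`response_type`, `scope`, `client_id`, `redirect_uri`, `state`, `nonce`) come from the OpenID Connect Core specification. The `state` and `nonce` values are among the measures relevant to the session-integrity and authentication properties analyzed in the paper.

```python
import secrets
from urllib.parse import urlencode

def build_auth_request(authz_endpoint, client_id, redirect_uri):
    # Assemble the authorization request for the code flow.
    params = {
        "response_type": "code",             # authorization code flow
        "scope": "openid",                   # mandatory for OpenID Connect
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": secrets.token_urlsafe(16),  # ties the response to this browser session
        "nonce": secrets.token_urlsafe(16),  # binds the ID token to this request
    }
    return f"{authz_endpoint}?{urlencode(params)}"

url = build_auth_request(
    "https://idp.example.com/authorize",  # placeholder identity provider
    "example-client-id",                  # placeholder client registration
    "https://rp.example.com/callback",
)
```

On the way back, the relying party must check that the returned `state` matches the stored one and that the ID token carries the expected `nonce`; omitting either check enables the kinds of session-integrity attacks the paper's guidelines rule out.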