Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, current search engines, built around the standard "page view," are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., a phone number, a paper PDF, a date), today's engines only take us indirectly to pages. My Ph.D. study focuses on a novel type of Web search that is aware of the data entities inside pages, a significant departure from traditional document retrieval. We study the essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We also report on a prototype system built to show the initial promise of the proposal. We then distill and abstract the essential computation requirements of entity search. From the dual views of reasoning, entity as input and entity as output, we propose a dual-inversion framework with two indexing and partition schemes for efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query-log data. The results obtained so far show the clear promise of entity-aware search in its usefulness, effectiveness, efficiency, and scalability.
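The idea of combining local (within-page) and global (cross-page) evidence when ranking entities can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual Impression Model or EntityRank scoring: local evidence is taken as proximity of an entity mention to the query keyword, and global evidence as the number of pages confirming the entity; the `#`-prefix entity marker and the toy corpus are assumptions for illustration.

```python
from collections import defaultdict
from math import log

def rank_entities(pages, keyword):
    """Toy entity ranker combining local proximity evidence with
    global corpus frequency (a simplified stand-in for EntityRank)."""
    local = defaultdict(float)     # per-entity proximity-to-keyword score
    pages_with = defaultdict(int)  # global: number of pages mentioning entity
    for tokens in pages:
        seen = set()
        for i, tok in enumerate(tokens):
            if tok.startswith("#"):  # "#..." marks an entity instance
                seen.add(tok)
                # local evidence: closeness to the query keyword in this page
                for j, other in enumerate(tokens):
                    if other == keyword:
                        local[tok] += 1.0 / (1 + abs(i - j))
        for e in seen:
            pages_with[e] += 1
    # global evidence: entities confirmed by many pages rank higher
    scores = {e: local[e] * log(1 + pages_with[e]) for e in pages_with}
    return sorted(scores, key=scores.get, reverse=True)

pages = [
    ["call", "#555-1234", "for", "support"],
    ["support", "line", "#555-1234"],
    ["unrelated", "#999-0000", "text"],
]
print(rank_entities(pages, "support"))  # "#555-1234" ranks first
```

The multiplication of the two evidence types mirrors the abstract's point that neither local context nor global corroboration alone suffices for ranking entities.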
A Study on Ranking Method in Retrieving Web Pages Based on Content and Link Analysis: Combination of Fourier Domain Scoring and Pagerank Scoring
The ranking module is an important component of the search process: it sorts through the relevant pages. Since a collection of Web pages carries additional information inherent in the hyperlink structure of the Web, this structure can be represented as a link score and combined with the content score produced by the usual information retrieval techniques. In this paper we report our study of a ranking score for Web pages that combines link analysis, PageRank Scoring, with content analysis, Fourier Domain Scoring. Our experiments use a collection of Web pages related to the subject of Statistics from Wikipedia, with the objective of checking the correctness and evaluating the performance of the combined ranking method. Evaluation of PageRank Scoring shows that the highest-scoring pages do not always relate to Statistics. Since links within Wikipedia articles exist so that users are always one click away from more information on any point that has a link attached, it is possible that topics unrelated to Statistics are frequently mentioned in the collection. The combined method, in which the link score is given a weight proportional to the content score of a Web page, does affect the retrieval results.
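The combination scheme described here can be sketched as a weighted sum of a content score and a link score. The sketch below uses a standard power-iteration PageRank; the content scores are a placeholder dictionary standing in for Fourier Domain Scoring, which is not reproduced here, and the weight `alpha` and the example graph are assumptions for illustration.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-links]}."""
    nodes = list(links)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = d * pr[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: spread its mass evenly
                for m in nodes:
                    new[m] += d * pr[n] / len(nodes)
        pr = new
    return pr

def combined_score(content, pr, alpha=0.7):
    """Weighted combination of content score and link score; alpha is
    the proportional weight given to content evidence."""
    return {p: alpha * content.get(p, 0.0) + (1 - alpha) * pr[p] for p in pr}

links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
content = {"A": 0.2, "B": 0.9, "C": 0.1}   # hypothetical content scores
scores = combined_score(content, pagerank(links))
```

As the abstract observes, a pure link score can favor heavily linked but topically unrelated pages, which is why the content weighting matters in the combination.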
Relating Web pages to enable information-gathering tasks
We argue that relationships between Web pages are functions of the user's
intent. We identify a class of Web tasks - information-gathering - that can be
facilitated by a search engine that provides links to pages which are related
to the page the user is currently viewing. We define three kinds of intentional
relationships that correspond to whether the user is a) seeking sources of
information, b) reading pages which provide information, or c) surfing through
pages as part of an extended information-gathering process. We show that these
three relationships can be productively mined using a combination of textual
and link information and provide three scoring mechanisms that correspond to
them: {\em SeekRel}, {\em FactRel} and {\em SurfRel}. These scoring mechanisms
incorporate both textual and link information. We build a set of capacitated
subnetworks - each corresponding to a particular keyword - that mirror the
interconnection structure of the World Wide Web. The scores are obtained by
computing flows on these subnetworks. The capacities of the links are derived
from the {\em hub} and {\em authority} values of the nodes they connect,
following the work of Kleinberg (1998) on assigning authority to pages in
hyperlinked environments. We evaluated our scoring mechanism by running
experiments on four data sets taken from the Web. We present user evaluations
of the relevance of the top results returned by our scoring mechanisms and
compare those to the top results returned by Google's Similar Pages feature,
and the {\em Companion} algorithm proposed by Dean and Henzinger (1999).
Comment: In Proceedings of ACM Hypertext 200
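The hub and authority values from which the link capacities are derived come from Kleinberg's HITS algorithm, which the following sketch computes by mutual reinforcement: a good authority is pointed to by good hubs, and a good hub points to good authorities. The example graph is an assumption for illustration; the paper's flow computation on the capacitated subnetworks is not reproduced here.

```python
def hits(links, iters=50):
    """Kleinberg's HITS: iteratively refine hub and authority scores
    over an adjacency dict {node: [out-links]}. Returns (hub, auth)
    dicts, each L2-normalised after every iteration."""
    nodes = set(links) | {m for outs in links.values() for m in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # authority update: sum of hub scores of pages linking in
        auth = {n: sum(hub[m] for m, outs in links.items() if n in outs)
                for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # hub update: sum of authority scores of pages linked out to
        hub = {n: sum(auth[m] for m in links.get(n, [])) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

links = {"h1": ["a"], "h2": ["a", "b"]}   # hypothetical link graph
hub, auth = hits(links)
```

In this toy graph, "a" emerges as the strongest authority (pointed to by both hubs) and "h2" as the strongest hub; a capacity for each link would then be derived from the hub and authority values of its endpoints.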
Fully Automated Fact Checking Using External Sources
Given the constantly growing proliferation of false claims online in recent
years, there has also been a growing research interest in automatically
distinguishing false rumors from factually true claims. Here, we propose a
general-purpose framework for fully-automatic fact checking using external
sources, tapping the potential of the entire Web as a knowledge source to
confirm or reject a claim. Our framework uses a deep neural network with LSTM
text encoding to combine semantic kernels with task-specific embeddings that
encode a claim together with pieces of potentially-relevant text fragments from
the Web, taking the source reliability into account. The evaluation results
show good performance on two different tasks and datasets: (i) rumor detection
and (ii) fact checking of the answers to a question in community question
answering forums.
Comment: RANLP-201
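The core decision the framework makes, whether retrieved Web text supports a claim, weighted by source reliability, can be illustrated with a deliberately simplified sketch. It substitutes bag-of-words cosine similarity for the paper's LSTM encodings and semantic kernels, and the snippets, reliability scores, and threshold are all assumptions for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def verdict(claim, snippets, threshold=0.4):
    """Reliability-weighted support score; snippets is a list of
    (text, reliability) pairs with reliability in [0, 1]."""
    support = sum(r * cosine(claim, text) for text, r in snippets)
    total = sum(r for _, r in snippets) or 1.0
    score = support / total
    return ("supported" if score >= threshold else "unverified", score)

claim = "the eiffel tower is in paris"
evidence = [("the eiffel tower is located in paris france", 0.9),
            ("stock prices fell sharply today", 0.5)]
```

Weighting each snippet's similarity by its source reliability captures the abstract's point that where evidence comes from matters as much as what it says.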
Extending Yioop! With Geographical Location Local Search
It is often useful when doing an Internet search to get results based on our current location. For example, we might want such results when we search for restaurants, car service centers, or hospitals. Current open source search engines, like those based on Nutch, do not provide this facility. Commercial engines like Google and Yahoo! do, so it would be useful to incorporate it into an open source alternative. The goal of this project is to include location-aware search in Yioop! (Pollett, 2012) by using geographical data from OpenStreetMap ("Open Street Map wiki", 2012) and the hostip.info ("DMOZ", n.d.) database to geolocate IP addresses.
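Geolocating an IP address against a database of the hostip.info kind typically reduces to a range lookup: convert the address to an integer and binary-search a sorted table of (start, end, location) rows. The rows below are hypothetical illustrations, not real hostip.info data, and this sketch is not Yioop!'s actual implementation.

```python
import bisect
from ipaddress import IPv4Address

# Hypothetical extract of an IP-to-location table, sorted by range start.
RANGES = [
    (int(IPv4Address("5.62.0.0")), int(IPv4Address("5.62.255.255")), "London"),
    (int(IPv4Address("8.8.8.0")), int(IPv4Address("8.8.8.255")), "Mountain View"),
    (int(IPv4Address("41.0.0.0")), int(IPv4Address("41.0.255.255")), "Johannesburg"),
]
STARTS = [r[0] for r in RANGES]

def geolocate(ip):
    """Binary-search the sorted ranges for the address; None if no match."""
    n = int(IPv4Address(ip))
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

The search engine can then bias result ranking toward pages whose associated location is near the city returned for the searcher's IP.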
Analyzing Android Browser Apps for file:// Vulnerabilities
Securing browsers in mobile devices is very challenging, because these
browser apps usually provide browsing services to other apps in the same
device. A malicious app installed in a device can potentially obtain sensitive
information through a browser app. In this paper, we identify four types of
attacks in Android, collectively known as FileCross, that exploit the
vulnerable handling of file:// URLs to obtain users' private files, such as cookies, bookmarks,
and browsing histories. We design an automated system to dynamically test 115
browser apps collected from Google Play and find that 64 of them are vulnerable
to the attacks. Among them are the popular Firefox, Baidu and Maxthon browsers,
and the more application-specific ones, including UC Browser HD for tablet
users, Wikipedia Browser, and Kids Safe Browser. A detailed analysis of these
browsers further shows that 26 browsers (23%) expose their browsing interfaces
unintentionally. In response to our reports, the developers concerned promptly
patched their browsers by forbidding file:// access to private file zones,
disabling JavaScript execution in file:// URLs, or even blocking external
file:// URLs. We employ the same system to validate the ten patches received
from the developers and find one still failing to block the vulnerability.
Comment: The paper has been accepted by ISC'14 as a regular paper (see
https://daoyuan14.github.io/). This is a Technical Report version for
reference