Emo, love and god: making sense of Urban Dictionary, a crowd-sourced online dictionary
The Internet facilitates large-scale collaborative projects, and the emergence of Web 2.0 platforms, where producers and consumers of content unify, has drastically changed the information market. On the one hand, the promise of the ‘wisdom of the crowd’ has inspired successful projects such as Wikipedia, which has become the primary source of crowd-based information in many languages. On the other hand, the decentralized and often unmonitored environment of such projects may make them susceptible to low-quality content. In this work, we focus on Urban Dictionary, a crowd-sourced online dictionary. We combine computational methods with qualitative annotation and shed light on the overall features of Urban Dictionary in terms of growth, coverage and types of content. We measure a high presence of opinion-focused entries, as opposed to the meaning-focused entries that we expect from traditional dictionaries. Furthermore, Urban Dictionary covers many informal, unfamiliar words as well as proper nouns. Urban Dictionary also contains offensive content, but highly offensive content tends to receive lower scores through the dictionary’s voting system. The low threshold to include new material in Urban Dictionary enables quick recording of new words and new meanings, but the resulting heterogeneous content can pose challenges in using Urban Dictionary as a source to study language innovation.
Are You Finding the Right Person? A Name Translation System Towards Web 2.0
In a multilingual world, information available in global information systems is increasing rapidly. Searching for proper names in a foreign language has become an important task in multilingual search and knowledge discovery. However, these names are the most difficult to handle because they are often unknown words that cannot be found in a translation dictionary, and even human experts cannot handle the variation generated during translation. Furthermore, existing research on name translation has focused on translation algorithms, while user experience during name translation and name search is often ignored. With Web technology moving towards Web 2.0, creating a platform that allows easier distributed collaboration and information sharing, we seek methods to incorporate Web 2.0 technologies into a name translation system. In this research, we review challenges in name translation and propose an interactive name translation and search system: NameTran. This system takes English names and translates them into Chinese using a hybrid approach that combines Hidden Markov Model-based (HMM-based) transliteration with web mining. Evaluation results showed that web mining consistently boosted the performance of a pure HMM approach. Our system achieved top-1 accuracy of 0.64 and top-8 accuracy of 0.96. To cope with changing popularity and variation in name translations, we demonstrated the feasibility of allowing users to rank translations, with the new ranking serving as feedback to the originally trained HMM. We believe that such user input will significantly improve system usability.
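The top-1 and top-8 figures quoted in this abstract are top-k accuracies. As a rough illustration of how such a metric is usually computed (a minimal sketch, not the NameTran evaluation code; the function name and the placeholder candidate strings are hypothetical), a query counts as a hit when its reference translation appears among the first k ranked candidates:

```python
def top_k_accuracy(ranked_candidates, references, k):
    """Fraction of queries whose reference is among the first k candidates.

    ranked_candidates: one list of candidate translations per query, best first.
    references: the single accepted translation for each query.
    """
    if len(ranked_candidates) != len(references):
        raise ValueError("one reference per query is required")
    if not references:
        return 0.0
    hits = sum(ref in cands[:k]
               for cands, ref in zip(ranked_candidates, references))
    return hits / len(references)


# Placeholder data for three queries (purely illustrative strings).
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n"]]
gold = ["a", "y", "q"]

top1 = top_k_accuracy(ranked, gold, k=1)  # only the first query is a hit
top2 = top_k_accuracy(ranked, gold, k=2)  # the first two queries are hits
```

Because a larger k can only add hits, top-8 accuracy is never lower than top-1 accuracy, which is consistent with the 0.64 and 0.96 values reported above.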
User experiments with the Eurovision cross-language image retrieval system
In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval.
The system is evaluated by multilingual users for two search tasks with the system configured in
English and five other languages. To our knowledge this is the first published set of user
experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual
search engine using little knowledge of any language other than English, (2) categorizing images
assists the user's search, and (3) there are differences in the way users search between the proposed
search tasks. Based on the two search tasks and user feedback, we describe important aspects of
any CL image retrieval system.
The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines
Web-based single sign-on (SSO) services such as Google Sign-In and Log In
with Paypal are based on the OpenID Connect protocol. This protocol enables
so-called relying parties to delegate user authentication to so-called identity
providers. OpenID Connect is one of the newest and most widely deployed single
sign-on protocols on the web. Despite its importance, it has not received much
attention from security researchers so far, and in particular, has not
undergone any rigorous security analysis.
In this paper, we carry out the first in-depth security analysis of OpenID
Connect. To this end, we use a comprehensive generic model of the web to
develop a detailed formal model of OpenID Connect. Based on this model, we then
precisely formalize and prove central security properties for OpenID Connect,
including authentication, authorization, and session integrity properties.
In our modeling of OpenID Connect, we employ security measures in order to
avoid attacks on OpenID Connect that have been discovered previously and new
attack variants that we document for the first time in this paper. Based on
these security measures, we propose security guidelines for implementors of
OpenID Connect. Our formal analysis demonstrates that these guidelines are in
fact effective and sufficient.
Comment: An abridged version appears in CSF 2017. Parts of this work extend
the web model presented in arXiv:1411.7210, arXiv:1403.1866,
arXiv:1508.01719, and arXiv:1601.0122
Comparing and Combining Sentiment Analysis Methods
Several messages express opinions about events, products, and services,
political views or even their author's emotional state and mood. Sentiment
analysis has been used in several applications including analysis of the
repercussions of events in social networks, analysis of opinions about products
and services, and simply to better understand aspects of social communication
in Online Social Networks (OSNs). There are multiple methods for measuring
sentiments, including lexical-based approaches and supervised machine learning
methods. Despite the wide use and popularity of some methods, it is unclear
which method is better for identifying the polarity (i.e., positive or
negative) of a message as the current literature does not provide a method of
comparison among existing methods. Such a comparison is crucial for
understanding the potential limitations, advantages, and disadvantages of
popular methods in analyzing the content of OSNs messages. Our study aims at
filling this gap by presenting comparisons of eight popular sentiment analysis
methods in terms of coverage (i.e., the fraction of messages whose sentiment is
identified) and agreement (i.e., the fraction of identified sentiments that are
in tune with ground truth). We develop a new method that combines existing
approaches, providing the best coverage results and competitive agreement. We
also present a free Web service called iFeel, which provides an open API for
accessing and comparing results across different sentiment methods for a given
text.
Comment: Proceedings of the first ACM conference on Online social networks
(2013) 27-3
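The two evaluation measures defined in this abstract, coverage and agreement, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code or the iFeel API: a method's output is assumed to be a label per message, with `None` marking messages for which no sentiment was identified, and the function name is hypothetical.

```python
def coverage_and_agreement(predictions, ground_truth):
    """Return (coverage, agreement) for one sentiment method.

    coverage:  fraction of messages whose sentiment was identified
               (prediction is not None, an assumed convention).
    agreement: fraction of the identified sentiments that match
               the ground-truth label.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align")
    identified = [(p, g) for p, g in zip(predictions, ground_truth)
                  if p is not None]
    coverage = len(identified) / len(predictions) if predictions else 0.0
    agreement = (sum(p == g for p, g in identified) / len(identified)
                 if identified else 0.0)
    return coverage, agreement


# Illustrative labels only: one message is left unclassified.
preds = ["pos", None, "neg", "pos"]
truth = ["pos", "neg", "neg", "neg"]
cov, agr = coverage_and_agreement(preds, truth)  # 3 of 4 covered, 2 of 3 agree
```

Keeping the two numbers separate matters because a method can score high agreement by labelling only the easy messages, which is why the abstract reports both.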
A practical index for approximate dictionary matching with few mismatches
Approximate dictionary matching is a classic string matching problem
(checking if a query string occurs in a collection of strings) with
applications in, e.g., spellchecking, online catalogs, geolocation, and web
searchers. We present a surprisingly simple solution called a split index,
which is based on the Dirichlet principle, for matching a keyword with few
mismatches, and experimentally show that it offers competitive space-time
tradeoffs. Our implementation in the C++ language is focused mostly on data
compaction, which is beneficial for the search speed (e.g., by being cache
friendly). We compare our solution with other algorithms and we show that it
performs better for the Hamming distance. Query times in the order of 1
microsecond were reported for one mismatch for the dictionary size of a few
megabytes on a medium-end PC. We also demonstrate that a basic compression
technique consisting in q-gram substitution can significantly reduce the
index size (up to 50% of the input text size for the DNA), while still keeping
the query time relatively low.
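The idea behind a split index can be sketched for the one-mismatch case. By the Dirichlet (pigeonhole) principle, if two equal-length words differ in at most one position and each is cut into two parts at the same offset, at least one part must match exactly. The following is a minimal Python sketch of that idea only; it is not the authors' C++ implementation, it omits their data compaction and compression, and the class and method names are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions; assumes len(a) == len(b)."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Toy split index for Hamming distance at most 1."""

    def __init__(self, words):
        # Key: (word length, which part, part text) -> words with that part.
        self.buckets = defaultdict(list)
        for w in set(words):
            mid = len(w) // 2
            self.buckets[(len(w), 0, w[:mid])].append(w)
            self.buckets[(len(w), 1, w[mid:])].append(w)

    def search(self, query):
        """Return all indexed words within one mismatch of the query."""
        mid = len(query) // 2
        candidates = set(self.buckets.get((len(query), 0, query[:mid]), []))
        candidates |= set(self.buckets.get((len(query), 1, query[mid:]), []))
        # An exact match on one part is necessary, not sufficient: verify.
        return sorted(w for w in candidates if hamming(w, query) <= 1)


index = SplitIndex(["tree", "free", "true", "three", "tire"])
print(index.search("tree"))  # ['free', 'tree', 'true']
print(index.search("trex"))  # ['tree']
```

Only words sharing an exact half with the query are ever compared, so the verification step touches a small candidate set instead of the whole dictionary. Extending the sketch to k mismatches would mean splitting each word into k + 1 parts.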