7,212 research outputs found
Bridging the Semantic Gap in Multimedia Information Retrieval: Top-down and Bottom-up approaches
Semantic representation of multimedia information is vital for enabling the kind of multimedia search capabilities that professional searchers require. Manual annotation is often not possible because of the sheer scale of the multimedia information that needs indexing. This paper explores the ways in which we are using both top-down, ontologically driven approaches and bottom-up, automatic-annotation approaches to provide retrieval facilities to users. We also discuss many of the current techniques that we are investigating to combine these top-down and bottom-up approaches.
Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
The widespread use of GPS-enabled smartphones along with the popularity of
micro-blogging and social networking applications, e.g., Twitter and Facebook,
has resulted in the generation of huge streams of geo-tagged textual data. Many
applications require real-time processing of these streams. For example,
location-based e-coupon and ad-targeting systems enable advertisers to register
millions of ads to millions of users. The number of users is typically very
high and they are continuously moving, and the ads change frequently as well.
Hence sending the right ad to the matching users is very challenging. Existing
streaming systems are either centralized or are not spatial-keyword aware, and
cannot efficiently support the processing of rapidly arriving spatial-keyword
data streams. This paper presents Tornado, a distributed spatial-keyword stream
processing system. Tornado features routing units to fairly distribute the
workload, and furthermore, co-locate the data objects and the corresponding
queries at the same processing units. The routing units use the Augmented-Grid,
a novel structure that is equipped with an efficient search algorithm for
distributing the data objects and queries. Tornado uses evaluators to process
the data objects against the queries. The routing units minimize the redundant
communication by not sending data updates for processing when these updates do
not match any query. By applying dynamically evaluated cost formulae that
continuously represent the processing overhead at each evaluator, Tornado is
adaptive to changes in the workload. Extensive experimental evaluation using
spatio-textual range queries over real Twitter data indicates that Tornado
outperforms the non-spatio-textually aware approaches by up to two orders of
magnitude in terms of the overall system throughput.
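The routing idea in this abstract (co-locate data objects with the queries they can match, and skip updates that match no query) can be illustrated with a minimal sketch. It assumes a plain uniform grid rather than the Augmented-Grid, whose structure the abstract does not describe, and it assumes a query matches an object when the object lies inside the query rectangle and carries all of the query's keywords. The names `RangeQuery`, `GridRouter`, `register`, `route`, and `match` are hypothetical and are not taken from Tornado.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """A spatio-textual range query: a rectangle plus required keywords."""
    qid: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    keywords: frozenset  # assumed non-empty


class GridRouter:
    """Uniform-grid router (an assumption, not the Augmented-Grid)."""

    def __init__(self, cell_size, num_evaluators):
        self.cell_size = cell_size
        self.num_evaluators = num_evaluators
        self.cell_queries = defaultdict(list)   # cell -> queries overlapping it
        self.cell_keywords = defaultdict(set)   # cell -> union of query keywords

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def register(self, query):
        """Index the query under every grid cell its rectangle overlaps."""
        cx0, cy0 = self._cell(query.x_min, query.y_min)
        cx1, cy1 = self._cell(query.x_max, query.y_max)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self.cell_queries[(cx, cy)].append(query)
                self.cell_keywords[(cx, cy)].update(query.keywords)

    def route(self, x, y, keywords):
        """Return an evaluator id, or None when no query in the cell can match."""
        cell = self._cell(x, y)
        if set(keywords).isdisjoint(self.cell_keywords.get(cell, set())):
            return None  # nothing to match: do not forward the update
        # Objects and queries of the same cell map to the same evaluator.
        return (cell[0] * 31 + cell[1]) % self.num_evaluators

    def match(self, x, y, keywords):
        """What an evaluator would do: test the object against the cell's queries."""
        kws = set(keywords)
        return [
            q.qid
            for q in self.cell_queries.get(self._cell(x, y), [])
            if q.x_min <= x <= q.x_max
            and q.y_min <= y <= q.y_max
            and q.keywords <= kws
        ]
```

In this sketch the keyword check in `route` is only a conservative filter: it never drops an object that some query in the cell could match, and the exact test is left to `match`. Load-aware rebalancing based on cost formulae, which the abstract mentions, is not modeled here.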
What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries
We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering platform (CQA) as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to differ from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.
Enriching videos with light semantics
This paper describes an ongoing prototypical framework to annotate and retrieve web videos with light semantics. The proposed framework reuses many existing vocabularies along with a video model. The knowledge is captured from three different information spaces (media content, context, document). We also describe ways to extract the semantic content descriptions from the existing user-generated content using multiple approaches of linguistic processing and Named Entity Recognition, which are later identified with DBpedia resources to establish meanings for the tags. Finally, the implemented prototype is described with multiple search interfaces and retrieval processes. Evaluation on semantic enrichment shows a considerable (50% of videos) improvement in content description.
No-But-Semantic-Match: Computing Semantically Matched XML Keyword Search Results
Users are rarely familiar with the content of a data source they are
querying, and therefore cannot avoid using keywords that do not exist in the
data source. Traditional systems may respond with an empty result, causing
dissatisfaction, while the data source in effect holds semantically related
content. In this paper we study this no-but-semantic-match problem on XML
keyword search and propose a solution which enables us to present the top-k
semantically related results to the user. Our solution involves two steps: (a)
extracting semantically related candidate queries from the original query and
(b) processing candidate queries and retrieving the top-k semantically related
results. Candidate queries are generated by replacement of non-mapped keywords
with candidate keywords obtained from an ontological knowledge base. Candidate
results are scored using their cohesiveness and their similarity to the
original query. Since the number of queries to process can be large, with each
result having to be analyzed, we propose pruning techniques to retrieve the
top-k results efficiently. We develop two query processing algorithms based
on our pruning techniques. Further, we exploit a property of the candidate
queries to propose a technique for processing multiple queries in batch, which
improves the performance substantially. Extensive experiments on two real
datasets verify the effectiveness and efficiency of the proposed approaches.
Comment: 24 pages, 21 figures, 6 tables, submitted to The VLDB Journal for
possible publication
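Step (a) of the approach in this abstract, generating candidate queries by replacing non-mapped keywords, might look roughly like the sketch below. It assumes the ontological knowledge base can be reduced to a dictionary from a keyword to `(related term, similarity)` pairs, and it ranks a candidate query by the product of its substitution similarities. That score is a stand-in chosen for illustration, not the cohesiveness-based scoring of the paper, and the function name `candidate_queries` is hypothetical.

```python
import heapq
from itertools import product


def candidate_queries(query, vocabulary, related, k=3):
    """Return up to k (score, keywords) candidate queries, best first.

    query      -- list of keywords typed by the user
    vocabulary -- set of keywords that occur in the data source
    related    -- dict: keyword -> list of (related term, similarity in [0, 1])
                  (a stand-in for an ontological knowledge base)
    """
    options = []
    for kw in query:
        if kw in vocabulary:
            # Mapped keyword: keep it unchanged with full similarity.
            options.append([(kw, 1.0)])
        else:
            # Non-mapped keyword: substitute related terms found in the data.
            subs = [(t, s) for t, s in related.get(kw, []) if t in vocabulary]
            if not subs:
                return []  # no usable replacement, so no candidate query
            options.append(subs)

    scored = []
    for combo in product(*options):
        score = 1.0
        for _, similarity in combo:
            score *= similarity
        scored.append((score, tuple(term for term, _ in combo)))
    return heapq.nlargest(k, scored)
```

Enumerating the full product is exponential in the number of non-mapped keywords, which is the cost that the pruning and batch-processing techniques mentioned in the abstract are meant to reduce. The sketch does no pruning.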
Data production methods for harmonized patent statistics: patentee name harmonization
Patent documents are one of the most comprehensive data sources on technology development. As such, they provide a unique source of information to analyze and monitor technological performance. Patent indicators are now used by companies and by policy and government agencies alike to assess technological progress on the level of regions, countries, domains, and even specific entities such as companies, universities and individual inventors. In this paper, we develop a comprehensive method to achieve harmonization of patentee names in an automated way so that analysis at the level of patentees can be facilitated. The method has been applied to an extensive set of all patentee names found for all EPO patent applications published between 1978 and 2004 and all granted USPTO patents published between 1991 and 2003. As completeness (the extent to which the name-harmonization procedure is able to capture all name variants of the same patentee) and accuracy (the extent to which the name-harmonization procedure correctly allocates name variants to a single, harmonized patentee name) do not go hand in hand, priority has been given to accuracy. Before discussing in detail the methodology and its effects as applied to the EPO and USPTO patentee name list, we will first clarify the difference between patentee name harmonization and legal entity identification. In addition, we will briefly expand on the methods and approaches previously developed to address the issue of patentee name harmonization, in order to shed light on our specific contribution. Finally, future refinements and extensions are discussed.
Keywords: Agency; Applications; EPO; USPTO; Name harmonization; Information
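The abstract does not list the harmonization rules themselves, so the following is only a rough sketch of what automated name harmonization could involve: normalizing case and punctuation and stripping trailing legal-form designations, then grouping variants that share the same normalized key. The rule set, the `LEGAL_FORMS` list, and the function names are assumptions made for illustration, and the normalization handles ASCII letters and digits only.

```python
import re
from collections import defaultdict

# Assumed, deliberately short list of legal-form tokens to strip.
LEGAL_FORMS = {
    "INC", "CORP", "CORPORATION", "LTD", "LIMITED",
    "GMBH", "AG", "NV", "SA", "CO", "COMPANY",
}


def harmonize(name):
    """Map a raw patentee name to a normalized key."""
    text = name.upper().replace(".", "")      # "N.V." -> "NV", "Inc." -> "INC"
    text = re.sub(r"[^A-Z0-9 ]", " ", text)   # other punctuation becomes a space
    tokens = text.split()
    # Strip trailing legal forms, but never reduce the name to nothing.
    while len(tokens) > 1 and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Group raw names by harmonized key; returns {key: [raw names]}."""
    groups = defaultdict(list)
    for raw in names:
        groups[harmonize(raw)].append(raw)
    return dict(groups)
```

Conservative rules of this kind lean toward accuracy over completeness, in the sense used in the abstract: "Siemens AG" and "SIEMENS A.G." are merged, while a variant such as "Siemens Aktiengesellschaft" stays separate unless a further rule is added.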