Diversified Ranking for Large Graphs with Context-Aware Considerations
This work pertains to the diversified ranking of web resources and
interconnected documents that rely on a network-like structure, e.g. web pages.
A practical example is a query for the k most relevant web pages that are, at
the same time, as dissimilar from each other as possible. Relevance and
dissimilarity are quantified using an aggregation of network distance and
context similarity. For example, for a specific configuration of the problem,
we might be interested in web pages that are similar to the query in terms of
their textual description but distant from each other in terms of the web
graph, e.g. many clicks away. Little prior work in the literature addresses
this problem while taking into consideration the network structure formed by
the document links.
In this work, we propose a hill-climbing approach that is seeded with a
document collection generated by greedy heuristics to provide initial
diversity. More importantly, we tackle the problem in the context of web
pages, where an underlying network structure connects the available documents
and resources. This is a significant difference from the majority of works,
which tackle the problem in terms of either content definitions or the graph
structure of the data, but never address both aspects simultaneously. To the
best of our knowledge, this is the first effort to combine both aspects of
this important problem in an elegant fashion, while also allowing a great
degree of flexibility in configuring the trade-offs of (i) document relevance
versus result items' dissimilarity, and (ii) network distance versus content
relevance or dissimilarity. Last but not least, we present an extensive
evaluation of our methods that demonstrates their effectiveness and
efficiency.
Comment: 12 pages of unpublished work
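The greedy seeding step described above can be sketched as a marginal-gain selection that trades relevance against distance to already-selected items. This is a minimal, self-contained illustration, assuming scalar `relevance` and pairwise `dissimilarity` functions and a trade-off weight `lam`; the names and the exact scoring are illustrative, not the paper's formulation:

```python
def greedy_diversify(candidates, relevance, dissimilarity, k, lam=0.5):
    """Greedily build a seed set of k items, trading off relevance against
    dissimilarity to the items picked so far (marginal-relevance style)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # distance to the closest already-selected item; 1.0 if none yet
            div = min((dissimilarity(c, s) for s in selected), default=1.0)
            return lam * relevance(c) + (1 - lam) * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The resulting seed set would then be refined by the hill-climbing stage, swapping items while the combined objective improves.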
Joint Neural Entity Disambiguation with Output Space Search
In this paper, we present a novel model for entity disambiguation that
combines both local contextual information and global evidence through Limited
Discrepancy Search (LDS). Given an input document, we start from a complete
solution constructed by a local model and conduct a search in the space of
possible corrections to improve the local solution from a global viewpoint.
Our search utilizes a heuristic function to focus on the least confident local
decisions and a pruning function to score the global solutions based on their
local fitness and the global coherence among the predicted entities.
Experimental results on the CoNLL 2003 and TAC 2010 benchmarks verify the
effectiveness of our model.
Comment: Accepted as a long paper at COLING 2018, 11 pages
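The search over corrections can be illustrated with a generic Limited Discrepancy Search: keep the locally best assignment, then try all solutions that deviate from it in at most a few positions, scoring each with a global objective. A sketch under stated assumptions; `alternatives`, `objective`, and the discrepancy budget are illustrative stand-ins, not the paper's heuristic or pruning functions:

```python
from itertools import combinations, product

def lds_correct(local_best, alternatives, objective, max_disc=2):
    """Starting from the locally best assignment, evaluate every candidate
    that differs in at most max_disc positions; keep the global best."""
    best = list(local_best)
    best_score = objective(best)
    n = len(local_best)
    for d in range(1, max_disc + 1):
        for pos in combinations(range(n), d):
            # at each deviating slot, consider choices other than the local one
            choices = [[a for a in alternatives[p] if a != local_best[p]]
                       for p in pos]
            for combo in product(*choices):
                cand = list(local_best)
                for p, a in zip(pos, combo):
                    cand[p] = a
                score = objective(cand)
                if score > best_score:
                    best, best_score = cand, score
    return best
```

In the paper's setting, positions would be mentions, alternatives candidate entities, and the objective a mix of local fitness and entity coherence.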
Ranking XPaths for extracting search result records
Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm to automatically determine XPath expressions for extracting SRRs from webpages. Based on a single search result page, an XPath expression is determined that can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
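The template-generalization idea can be sketched as follows: build the absolute XPath of one sample record, then drop its final positional predicate so the expression matches every sibling record produced by the same template. A minimal sketch using Python's stdlib ElementTree; the paper's algorithm works on full HTML and ranks multiple candidate XPaths, so the helper names and the toy markup here are illustrative:

```python
import xml.etree.ElementTree as ET

def xpath_to(root, target):
    """Build an absolute XPath (with positional predicates) from root to target."""
    def walk(node, path):
        if node is target:
            return path
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, "%s/%s[%d]" % (path, child.tag, counts[child.tag]))
            if found:
                return found
        return None
    return walk(root, ".")

def generalize(xpath):
    """Drop the last positional predicate so the expression matches all
    records produced by the same template."""
    head, _, last = xpath.rpartition("/")
    return head + "/" + last.split("[")[0]

page = ET.fromstring("<div><ul><li>r1</li><li>r2</li><li>r3</li></ul></div>")
sample = page.find("./ul")[0]              # one hand-picked sample record
xp = xpath_to(page, sample)                # "./ul[1]/li[1]"
records = page.findall(generalize(xp))     # all <li> records from the template
```

Reusing the generalized expression on another page built from the same template is what makes the approach portable.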
Personalized Query Auto-Completion Through a Lightweight Representation of the User Context
Query Auto-Completion (QAC) is a widely used feature in many domains,
including web and eCommerce search, suggesting full queries based on a prefix
typed by the user. QAC has been extensively studied in the literature in
recent years, and it has been consistently shown that adding personalization
features can significantly improve the performance of QAC. In this work we
propose a novel method for personalized QAC that uses lightweight embeddings
learnt through fastText. We construct an embedding for the user context
queries, which are the last few queries issued by the user. We also use the
same model to get the embedding for the candidate queries to be ranked. We
introduce ranking features that compute the distance between the candidate
queries and the context queries in the embedding space. These features are then
combined with other commonly used QAC ranking features to learn a ranking
model. We apply our method to a large eCommerce search engine (eBay) and show
that the ranker with our proposed feature significantly outperforms the
baselines on all of the offline metrics measured, which include Mean
Reciprocal Rank (MRR), Success Rate (SR), Mean Average Precision (MAP), and
Normalized Discounted Cumulative Gain (NDCG). Our baselines include the Most
Popular Completion (MPC) model as well as a ranking model without our proposed
features. The ranking model with the proposed features improves over the MPC
model on all metrics, and improves over the baseline ranking model across all
sessions, with still larger gains when we restrict to sessions that contain
the user context. Moreover, our proposed features also significantly
outperform text-based personalization features previously studied in the
literature, and adding text-based features on top of our proposed
embedding-based features yields only minor improvements.
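The core ranking feature described above, the distance between a candidate query and the user's recent context queries in embedding space, can be sketched as follows. The paper uses fastText embeddings trained on query logs; since those are not reproducible here, `embed` is a self-contained hashed character-n-gram stand-in, and all names are illustrative:

```python
import hashlib
import math

DIM = 256

def embed(text, n=3):
    """Toy stand-in for a fastText-style embedding: L2-normalized counts of
    hashed character n-grams (illustrative only, not the paper's model)."""
    vec = [0.0] * DIM
    padded = "<%s>" % text.lower()
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)] or [padded]
    for g in grams:
        vec[int(hashlib.md5(g.encode()).hexdigest(), 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def context_similarity(candidate, context_queries):
    """Ranking feature: cosine similarity between a candidate completion and
    the mean embedding of the user's last few queries."""
    ctx = [embed(q) for q in context_queries]
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    cand = embed(candidate)
    dot = sum(a * b for a, b in zip(cand, mean))
    return dot / (math.sqrt(sum(m * m for m in mean)) or 1.0)
```

Such a feature would then be fed, alongside standard QAC features like popularity, into a learned ranker.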
Implementing Recommendation Algorithms in a Large-Scale Biomedical Science Knowledge Base
The number of biomedical research articles published has doubled in the past
20 years. Search engine based systems naturally center around searching, but
researchers may not have a clear goal in mind, or the goal may be expressed in
a query that a literature search engine cannot easily answer, such as
identifying the most prominent authors in a given field of research. The
discovery process can be improved by providing researchers with recommendations
for relevant papers or for researchers who are dealing with related bodies of
work. In this paper we describe several recommendation algorithms that were
implemented in the Meta platform. The Meta platform contains over 27 million
articles and continues to grow daily. It provides an online map of science that
organizes, in real time, all published biomedical research. The ultimate goal
is to make it quicker and easier for researchers to filter through scientific
papers, find the most important work, and keep up with emerging research
results. Meta generates and maintains a semantic knowledge network consisting
of these core entities: authors, papers, journals, institutions, and concepts.
We implemented several recommendation algorithms and evaluated their
efficiency in this large-scale biomedical knowledge base. We selected
recommendation algorithms that could take advantage of the unique environment
of the Meta platform: algorithms that make use of diverse datasets, such as
citation networks, text content, semantic tag content, and co-authorship
information, and algorithms that can scale to very large datasets. In this
paper, we describe the recommendation algorithms that were implemented and
report on their relative efficiency and the challenges associated with
developing and deploying a production recommendation engine system.
Comment: 21 pages; 5 figures
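One family of citation-network recommenders the abstract alludes to can be illustrated with a co-citation heuristic: papers that frequently appear in the reference lists of papers citing the seed paper are likely related. A sketch only; the `cited_by`/`cites` structures and the counting scheme are assumptions, not the Meta platform's actual algorithms:

```python
from collections import Counter

def cocitation_recommend(paper, cited_by, cites, top_n=3):
    """Recommend papers co-cited with `paper`: score each paper by how many
    of `paper`'s citers also cite it."""
    scores = Counter()
    for citing in cited_by.get(paper, []):
        for ref in cites.get(citing, []):
            if ref != paper:
                scores[ref] += 1
    return [p for p, _ in scores.most_common(top_n)]
```

At the scale described (27+ million articles), the same counting would be done with an inverted citation index rather than in-memory dictionaries.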
Exploiting Lists of Names for Named Entity Identification of Financial Institutions from Unstructured Documents
There is a wealth of information about financial systems that is embedded in
document collections. In this paper, we focus on a specialized text extraction
task for this domain. The objective is to extract mentions of names of
financial institutions, or FI names, from financial prospectus documents, and
to identify the corresponding real world entities, e.g., by matching against a
corpus of such entities. The tasks are Named Entity Recognition (NER) and
Entity Resolution (ER); both are well studied in the literature. Our
contribution is to develop a rule-based approach that will exploit lists of FI
names for both tasks; our solution is labeled Dict-based NER and Rank-based ER.
Since the FI names are typically represented by a root, and a suffix that
modifies the root, we use these lists of FI names to create specialized root
and suffix dictionaries. To evaluate the effectiveness of our specialized
solution for extracting FI names, we compare Dict-based NER with a general
purpose rule-based NER solution, ORG NER. Our evaluation highlights the
benefits and limitations of specialized versus general purpose approaches, and
presents additional suggestions for tuning and customization for FI name
extraction. To our knowledge, our proposed solutions, Dict-based NER and
Rank-based ER, together with the root and suffix dictionaries, are the first
attempt to exploit specialized knowledge, i.e., lists of FI names, for
rule-based NER and ER.
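The root-plus-suffix structure of FI names described above can be sketched as a longest-match scan: find a known root, then absorb any trailing known suffixes into the mention span. The `ROOTS` and `SUFFIXES` entries below are invented toy examples, not the paper's dictionaries:

```python
ROOTS = {"goldman sachs", "barclays", "deutsche"}      # illustrative entries
SUFFIXES = {"bank", "capital", "group", "plc", "ag"}   # illustrative entries

def dict_based_ner(tokens):
    """Emit (start, end) token spans for candidate FI-name mentions:
    a dictionary root optionally followed by dictionary suffixes."""
    spans, i, n = [], 0, len(tokens)
    while i < n:
        matched = None
        # try the longest root first (up to 3 tokens here)
        for j in range(min(n, i + 3), i, -1):
            if " ".join(t.lower() for t in tokens[i:j]) in ROOTS:
                matched = j
                break
        if matched is None:
            i += 1
            continue
        end = matched
        while end < n and tokens[end].lower() in SUFFIXES:
            end += 1
        spans.append((i, end))
        i = end
    return spans
```

Rank-based ER would then match each extracted span against the corpus of known entities, scoring candidates by root and suffix agreement.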
Learning Representations using Spectral-Biased Random Walks on Graphs
Several state-of-the-art neural graph embedding methods are based on short
random walks (stochastic processes) because of their ease of computation,
simplicity in capturing complex local graph properties, scalability, and
interpretability. In this work, we are interested in studying how much a
probabilistic bias in this stochastic process affects the quality of the nodes
picked by the process. In particular, our biased walk, with a certain
probability, favors movement towards nodes whose neighborhoods bear a
structural resemblance to the current node's neighborhood. We succinctly
capture this neighborhood as a probability measure based on the spectrum of the
node's neighborhood subgraph represented as a normalized Laplacian matrix. We
propose the use of a paragraph vector model with a novel Wasserstein
regularization term. We empirically evaluate our approach against several
state-of-the-art node embedding techniques on a wide variety of real-world
datasets and demonstrate that our proposed method significantly improves upon
existing methods on both link prediction and node classification tasks.
Comment: Accepted at IJCNN 2020: International Joint Conference on Neural Networks
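The spectral bias can be sketched as follows, assuming each node's spectrum (the eigenvalues of its neighborhood subgraph's normalized Laplacian) has been precomputed. Comparing two spectra as sorted point sets reduces the Wasserstein distance to a 1-D computation; the exponential weighting and the `beta` parameter are illustrative choices, not necessarily the paper's exact bias:

```python
import math
import random

def wasserstein_1d(p, q):
    """W1 distance between two spectra viewed as sorted point sets of equal
    weight (the shorter one is padded with zeros)."""
    a, b = sorted(p), sorted(q)
    m = max(len(a), len(b))
    a = [0.0] * (m - len(a)) + a
    b = [0.0] * (m - len(b)) + b
    return sum(abs(x - y) for x, y in zip(a, b)) / m

def biased_step(node, graph, spectra, beta=5.0, rng=random):
    """One step of the biased walk: neighbors whose neighborhood spectrum is
    close to the current node's get exponentially larger transition weight."""
    nbrs = graph[node]
    weights = [math.exp(-beta * wasserstein_1d(spectra[node], spectra[v]))
               for v in nbrs]
    r = rng.random() * sum(weights)
    acc = 0.0
    for v, w in zip(nbrs, weights):
        acc += w
        if r <= acc:
            return v
    return nbrs[-1]
```

Walks sampled this way would then feed a paragraph-vector-style embedding model, as the abstract describes.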
TribeFlow: Mining & Predicting User Trajectories
Which song will Smith listen to next? Which restaurant will Alice go to
tomorrow? Which product will John click next? These applications have in common
the prediction of user trajectories that are in a constant state of flux over a
hidden network (e.g. website links, geographic location). What users are doing
now may be unrelated to what they will be doing an hour from now. Mindful of
these challenges, we propose TribeFlow, a method designed to cope with the
complex challenges of learning personalized predictive models of
non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow
is a general method that can perform next product recommendation, next song
recommendation, next location prediction, and general arbitrary-length user
trajectory prediction without domain-specific knowledge. TribeFlow is more
accurate and up to 413x faster than top competitors.
Comment: To Appear at WWW 201
Treatment of Semantic Heterogeneity in Information Retrieval
The first step to handle semantic heterogeneity should be the attempt to
enrich the semantic information about documents, i.e. to fill the gaps in the
documents' meta-data automatically. Section 2 describes a set of cascading
deductive and heuristic extraction rules, which were developed in the project
CARMEN for the domain of Social Sciences. The mapping between different
terminologies can be done by using intellectual, statistical and/or neural
network transfer modules. Intellectual transfers use cross-concordances between
different classification schemes or thesauri. Section 3 describes the creation,
storage and handling of such transfers.
Comment: Technical Report (Arbeitsbericht), GESIS - Leibniz Institute for the Social Sciences
Structured Learning of Two-Level Dynamic Rankings
For ambiguous queries, conventional retrieval systems are bound by two
conflicting goals. On the one hand, they should diversify and strive to present
results for as many query intents as possible. On the other hand, they should
provide depth for each intent by displaying more than a single result. Since
both diversity and depth cannot be achieved simultaneously in the conventional
static retrieval model, we propose a new dynamic ranking approach. Dynamic
ranking models allow users to adapt the ranking through interaction, thus
overcoming the constraints of presenting a one-size-fits-all static ranking. In
particular, we propose a new two-level dynamic ranking model for presenting
search results to the user. In this model, a user's interactions with the
first-level ranking are used to infer this user's intent, so that second-level
rankings can be inserted to provide more results relevant for this intent.
Unlike previous dynamic ranking models, we provide an algorithm to
efficiently compute dynamic rankings with provable approximation guarantees for
a large family of performance measures. We also propose the first principled
algorithm for learning dynamic ranking functions from training data. In
addition to the theoretical results, we provide empirical evidence
demonstrating the gains in retrieval quality that our method achieves over
conventional approaches.
Comment: 10 pages (longer version of the CIKM 2011 paper containing more details and experiments)
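The two-level interaction model can be sketched as follows: the first level shows one head result per intent (diversity); when the user interacts with a head, same-intent results are spliced in below it (depth). A toy illustration assuming results are already grouped by intent; the paper's actual contribution, computing these rankings with approximation guarantees and learning them from data, is omitted here:

```python
def two_level_ranking(results_by_intent, k_heads=3, k_tail=2):
    """First level: one head per intent. Returns the heads plus an expand()
    function that splices a second-level tail under a clicked head."""
    intents = list(results_by_intent)[:k_heads]
    heads = [results_by_intent[i][0] for i in intents]

    def expand(clicked_head):
        # the clicked head reveals the user's intent; insert its tail below it
        ranking = []
        for intent, head in zip(intents, heads):
            ranking.append(head)
            if head == clicked_head:
                ranking.extend(results_by_intent[intent][1:1 + k_tail])
        return ranking

    return heads, expand
```

For an ambiguous query like "jaguar", the static first level covers both the car and the animal, while expansion gives depth only for the intent the user actually reveals.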