
    Diversified Ranking for Large Graphs with Context-Aware Considerations

    This work pertains to the diversified ranking of web resources and interconnected documents that rely on a network-like structure, e.g. web pages. A practical example is a query for the k most relevant web pages that are, at the same time, as dissimilar to each other as possible. Relevance and dissimilarity are quantified using an aggregation of network distance and context similarity. For example, for a specific configuration of the problem, we might be interested in web pages that are similar to the query in terms of their textual description but distant from each other in terms of the web graph, e.g. many clicks away. Little work in the literature addresses this problem while taking the network structure formed by the document links into consideration. In this work, we propose a hill-climbing approach seeded with a document collection that is generated using greedy heuristics for initial diversification. More importantly, we tackle the problem in the context of web pages, where an underlying network structure connects the available documents and resources. This is a significant difference from the majority of works, which tackle the problem in terms of either content definitions or the graph structure of the data, but never address both aspects simultaneously. To the best of our knowledge, this is the first effort to combine both aspects of this important problem in an elegant fashion, while also allowing a great degree of flexibility in configuring the trade-offs of (i) document relevance versus result items' dissimilarity, and (ii) network distance versus content relevance or dissimilarity. Last but not least, we present an extensive evaluation of our methods that demonstrates their effectiveness and efficiency. Comment: 12 pages of unpublished work.
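The seed-then-refine structure this abstract describes can be sketched as a greedy-seeded hill climb. This is a minimal illustration, not the paper's actual algorithm: the objective (a lambda-weighted sum of total relevance and minimum pairwise distance), the toy `rel`/`dist` inputs, and all function names are assumptions.

```python
import itertools

def score(selected, rel, dist, lam=0.5):
    """Objective: lam * total relevance + (1 - lam) * min pairwise distance."""
    r = sum(rel[i] for i in selected)
    if len(selected) < 2:
        return lam * r
    d = min(dist[i][j] for i, j in itertools.combinations(selected, 2))
    return lam * r + (1 - lam) * d

def greedy_seed(k, rel, dist, lam=0.5):
    """Greedily pick k items, each time adding the item that maximizes the objective."""
    selected = [max(rel, key=rel.get)]
    while len(selected) < k:
        best = max((i for i in rel if i not in selected),
                   key=lambda i: score(selected + [i], rel, dist, lam))
        selected.append(best)
    return selected

def hill_climb(selected, rel, dist, lam=0.5):
    """Swap one selected item for one unselected item while the objective improves."""
    improved = True
    while improved:
        improved = False
        for out in list(selected):
            for inn in rel:
                if inn in selected:
                    continue
                cand = [i for i in selected if i != out] + [inn]
                if score(cand, rel, dist, lam) > score(selected, rel, dist, lam):
                    selected, improved = cand, True
                    break
            if improved:
                break
    return selected
```

The greedy pass builds a reasonable starting set; the hill climb then swaps single items while the objective improves, mirroring the seed-then-refine strategy of the abstract.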

    Joint Neural Entity Disambiguation with Output Space Search

    In this paper, we present a novel model for entity disambiguation that combines local contextual information and global evidence through Limited Discrepancy Search (LDS). Given an input document, we start from a complete solution constructed by a local model and search the space of possible corrections to improve the local solution from a global viewpoint. Our search uses a heuristic function to focus on the least confident local decisions and a pruning function to score global solutions based on their local fitness and the global coherence among the predicted entities. Experimental results on the CoNLL 2003 and TAC 2010 benchmarks verify the effectiveness of our model. Comment: Accepted as a long paper at COLING 2018, 11 pages.
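The core search strategy can be sketched independently of the disambiguation model: start from the locally best assignment and explore solutions that deviate from it in at most d positions. The `global_score` callback and `alternatives` structure below are assumptions for illustration; the paper's heuristic and pruning functions are not reproduced here.

```python
from itertools import combinations

def lds(local_best, alternatives, global_score, max_discrepancies=2):
    """Limited Discrepancy Search: keep the globally best-scoring assignment
    among those differing from `local_best` in at most `max_discrepancies`
    positions."""
    best = list(local_best)
    best_score = global_score(best)
    n = len(local_best)
    for d in range(1, max_discrepancies + 1):
        for positions in combinations(range(n), d):
            # try every combination of alternatives at the chosen positions
            def explore(i, current):
                nonlocal best, best_score
                if i == len(positions):
                    s = global_score(current)
                    if s > best_score:
                        best, best_score = list(current), s
                    return
                p = positions[i]
                for alt in alternatives[p]:
                    current[p] = alt
                    explore(i + 1, current)
                current[p] = local_best[p]
            explore(0, list(local_best))
    return best, best_score
```

Because solutions with few discrepancies are visited first, the search concentrates its budget on small corrections to the local model's output, which is the behavior the abstract describes.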

    Ranking XPaths for extracting search result records

    Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open-source software.
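The reuse step, applying a previously determined XPath to a new page built from the same template, can be illustrated with the limited XPath subset in Python's standard library. The toy page, the XPath, and the record layout below are invented for this sketch; the paper's implementation is a browser plugin.

```python
import xml.etree.ElementTree as ET

# A toy search result page; in practice this would be fetched live.
PAGE = """<html><body>
  <div class="result"><a href="u1">First hit</a></div>
  <div class="result"><a href="u2">Second hit</a></div>
  <div class="ad">Sponsored</div>
</body></html>"""

# An XPath determined once from a sample page of the same template.
SRR_XPATH = ".//div[@class='result']"

def extract_srrs(page, xpath):
    """Re-apply a stored XPath to pull search result records from any page
    generated from the same template."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(xpath):
        link = node.find("a")
        records.append({"title": link.text, "url": link.get("href")})
    return records
```

Note that the ad block is skipped automatically: portability comes from the XPath selecting only template nodes that hold result records.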

    Personalized Query Auto-Completion Through a Lightweight Representation of the User Context

    Query Auto-Completion (QAC) is a widely used feature in many domains, including web and eCommerce search, suggesting full queries based on a prefix typed by the user. QAC has been extensively studied in the literature in recent years, and it has been consistently shown that adding personalization features can significantly improve the performance of QAC. In this work we propose a novel method for personalized QAC that uses lightweight embeddings learnt through fastText. We construct an embedding for the user context queries, which are the last few queries issued by the user. We also use the same model to get the embedding for the candidate queries to be ranked. We introduce ranking features that compute the distance between the candidate queries and the context queries in the embedding space. These features are then combined with other commonly used QAC ranking features to learn a ranking model. We apply our method to a large eCommerce search engine (eBay) and show that the ranker with our proposed features significantly outperforms the baselines on all of the offline metrics measured, including Mean Reciprocal Rank (MRR), Success Rate (SR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Our baselines include the Most Popular Completion (MPC) model as well as a ranking model without our proposed features. The ranking model with the proposed features yields a 20-30% improvement over the MPC model on all metrics. We obtain up to a 5% improvement over the baseline ranking model across all sessions, which rises to about 10% when we restrict to sessions that contain user context. Moreover, our proposed features also significantly outperform the text-based personalization features previously studied in the literature, and adding text-based features on top of our proposed embedding-based features yields only minor improvements.
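The embedding-distance feature can be sketched without a trained fastText model by averaging word vectors from a toy lookup table. The vector table, its dimensionality, and the function names are assumptions for illustration; a real system would use trained subword embeddings.

```python
import numpy as np

def embed(query, vectors, dim=4):
    """Average word vectors to embed a query (fastText-style averaging; here
    the word vectors come from a toy lookup table, not a trained model)."""
    words = [vectors[w] for w in query.split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim)

def context_similarity(candidate, context_queries, vectors):
    """Ranking feature: cosine similarity between a candidate completion and
    the mean embedding of the user's recent queries."""
    ctx = np.mean([embed(q, vectors) for q in context_queries], axis=0)
    cand = embed(candidate, vectors)
    denom = np.linalg.norm(ctx) * np.linalg.norm(cand)
    return float(ctx @ cand / denom) if denom else 0.0
```

The feature value is then fed, alongside popularity and other standard QAC signals, into a learned ranker; candidates close to the session context in embedding space get boosted.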

    Implementing Recommendation Algorithms in a Large-Scale Biomedical Science Knowledge Base

    The number of biomedical research articles published has doubled in the past 20 years. Search engine based systems naturally center around searching, but researchers may not have a clear goal in mind, or the goal may be expressed in a query that a literature search engine cannot easily answer, such as identifying the most prominent authors in a given field of research. The discovery process can be improved by providing researchers with recommendations for relevant papers or for researchers who are dealing with related bodies of work. In this paper we describe several recommendation algorithms that were implemented in the Meta platform. The Meta platform contains over 27 million articles and continues to grow daily. It provides an online map of science that organizes, in real time, all published biomedical research. The ultimate goal is to make it quicker and easier for researchers to filter through scientific papers, find the most important work, and keep up with emerging research results. Meta generates and maintains a semantic knowledge network consisting of these core entities: authors, papers, journals, institutions, and concepts. We implemented several recommendation algorithms and evaluated their efficiency in this large-scale biomedical knowledge base. We selected recommendation algorithms that could take advantage of the unique environment of the Meta platform: those that make use of diverse datasets such as citation networks, text content, semantic tag content, and co-authorship information, and those that can scale to very large datasets. In this paper, we describe the recommendation algorithms that were implemented and report on their relative efficiency and the challenges associated with developing and deploying a production recommendation engine system. Comment: 21 pages; 5 figures.
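One simple citation-network recommender that fits this setting is plain co-citation counting: papers frequently cited alongside a seed paper are likely related. This sketch is illustrative only and is not claimed to be one of the algorithms deployed in Meta.

```python
from collections import Counter

def cocitation_recommend(citations, seed, k=3):
    """Recommend papers frequently cited together with `seed`.

    `citations` maps each citing paper to the set of papers it references;
    for every citing paper that references the seed, count its other
    references and return the k most frequently co-cited ones."""
    counts = Counter()
    for refs in citations.values():
        if seed in refs:
            counts.update(r for r in refs if r != seed)
    return [p for p, _ in counts.most_common(k)]
```

Co-citation scales well because it only touches the in-neighborhoods of the seed in the citation graph, which matters at the tens-of-millions-of-articles scale the abstract mentions.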

    Exploiting Lists of Names for Named Entity Identification of Financial Institutions from Unstructured Documents

    There is a wealth of information about financial systems embedded in document collections. In this paper, we focus on a specialized text extraction task for this domain. The objective is to extract mentions of names of financial institutions, or FI names, from financial prospectus documents, and to identify the corresponding real-world entities, e.g., by matching against a corpus of such entities. The tasks are Named Entity Recognition (NER) and Entity Resolution (ER); both are well studied in the literature. Our contribution is to develop a rule-based approach that exploits lists of FI names for both tasks; our solutions are labeled Dict-based NER and Rank-based ER. Since FI names are typically represented by a root and a suffix that modifies the root, we use these lists of FI names to create specialized root and suffix dictionaries. To evaluate the effectiveness of our specialized solution for extracting FI names, we compare Dict-based NER with a general purpose rule-based NER solution, ORG NER. Our evaluation highlights the benefits and limitations of specialized versus general purpose approaches, and presents additional suggestions for tuning and customization for FI name extraction. To our knowledge, our proposed solutions, Dict-based NER and Rank-based ER, and the root and suffix dictionaries, are the first attempt to exploit specialized knowledge, i.e., lists of FI names, for rule-based NER and ER.
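A minimal sketch of the root-and-suffix idea, with invented dictionaries and a deliberately simplified two-token matcher; the paper's Dict-based NER is a full rule system, not reproduced here.

```python
import re

# Toy dictionaries built from lists of known FI names; a real system would
# derive these from a reference corpus of institution names.
ROOTS = {"acme", "globex", "initech"}
SUFFIXES = {"bank", "capital", "trust", "holdings"}

def find_fi_names(text):
    """Flag two-token spans whose first token is a known root and whose
    second token is a known institutional suffix."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for a, b in zip(tokens, tokens[1:]):
        if a.lower() in ROOTS and b.lower() in SUFFIXES:
            hits.append(f"{a} {b}")
    return hits
```

Splitting the dictionaries lets the matcher recognize unseen combinations, e.g. a known root paired with a suffix it has never appeared with in the training lists.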

    Learning Representations using Spectral-Biased Random Walks on Graphs

    Several state-of-the-art neural graph embedding methods are based on short random walks (stochastic processes) because of their ease of computation, simplicity in capturing complex local graph properties, scalability, and interpretability. In this work, we are interested in studying how much a probabilistic bias in this stochastic process affects the quality of the nodes picked by the process. In particular, our biased walk, with a certain probability, favors movement towards nodes whose neighborhoods bear a structural resemblance to the current node's neighborhood. We succinctly capture this neighborhood as a probability measure based on the spectrum of the node's neighborhood subgraph represented as a normalized Laplacian matrix. We propose the use of a paragraph vector model with a novel Wasserstein regularization term. We empirically evaluate our approach against several state-of-the-art node embedding techniques on a wide variety of real-world datasets and demonstrate that our proposed method significantly improves upon existing methods on both link prediction and node classification tasks. Comment: Accepted at IJCNN 2020: International Joint Conference on Neural Networks.
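The neighborhood signature the abstract describes can be sketched as the eigenvalue spectrum of the normalized Laplacian of a node's neighborhood subgraph. The Wasserstein comparison of spectra-as-measures is omitted here; the function names and the dense-matrix representation are choices made for this sketch.

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def spectral_signature(adj):
    """Sorted eigenvalues of the normalized Laplacian of a node's neighborhood
    subgraph; structurally similar neighborhoods yield close spectra."""
    return np.sort(np.linalg.eigvalsh(normalized_laplacian(adj)))
```

Isomorphic neighborhood subgraphs produce identical spectra, which is why comparing these signatures can bias the walk towards structurally similar nodes.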

    TribeFlow: Mining & Predicting User Trajectories

    Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing an hour from now. Mindful of these challenges, we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors. Comment: To appear at WWW 2016.

    Treatment of Semantic Heterogeneity in Information Retrieval

    The first step in handling semantic heterogeneity should be the attempt to enrich the semantic information about documents, i.e. to fill the gaps in document metadata automatically. Section 2 describes a set of cascading deductive and heuristic extraction rules, which were developed in the CARMEN project for the domain of the Social Sciences. The mapping between different terminologies can be done by using intellectual, statistical and/or neural-network transfer modules. Intellectual transfers use cross-concordances between different classification schemes or thesauri. Section 3 describes the creation, storage and handling of such transfers. Comment: Technical Report (Arbeitsbericht), GESIS - Leibniz Institute for the Social Sciences.

    Structured Learning of Two-Level Dynamic Rankings

    For ambiguous queries, conventional retrieval systems are bound by two conflicting goals. On the one hand, they should diversify and strive to present results for as many query intents as possible. On the other hand, they should provide depth for each intent by displaying more than a single result. Since both diversity and depth cannot be achieved simultaneously in the conventional static retrieval model, we propose a new dynamic ranking approach. Dynamic ranking models allow users to adapt the ranking through interaction, thus overcoming the constraints of presenting a one-size-fits-all static ranking. In particular, we propose a new two-level dynamic ranking model for presenting search results to the user. In this model, a user's interactions with the first-level ranking are used to infer this user's intent, so that second-level rankings can be inserted to provide more results relevant to this intent. Unlike previous dynamic ranking models, we provide an algorithm to efficiently compute dynamic rankings with provable approximation guarantees for a large family of performance measures. We also propose the first principled algorithm for learning dynamic ranking functions from training data. In addition to the theoretical results, we provide empirical evidence demonstrating the gains in retrieval quality that our method achieves over conventional approaches. Comment: 10 pages (longer version of the CIKM 2011 paper, containing more details and experiments).