10 research outputs found

    Entity Linking for Queries by Searching Wikipedia Sentences

    We present a simple yet effective approach for linking entities in queries. The key idea is to search for sentences similar to a query in Wikipedia articles and directly use the human-annotated entities in those similar sentences as candidate entities for the query. We then employ a rich set of features, such as link probability, context matching, word embeddings, and relatedness among candidate entities as well as their related entities, to rank the candidates under a regression-based framework. The advantages of our approach lie in two aspects, both of which contribute to the ranking process and the final linking result. First, it greatly reduces the number of candidate entities by filtering out irrelevant entities using the words in the query. Second, we obtain a query-sensitive prior probability in addition to the static link probability derived from all Wikipedia articles. We conduct experiments on two benchmark datasets for entity linking in queries, namely the ERD14 dataset and the GERDAQ dataset. Experimental results show that our method outperforms state-of-the-art systems, yielding an F1 of 75.0% on the ERD14 dataset and 56.9% on the GERDAQ dataset.
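The candidate-ranking step described above can be sketched as a weighted feature combination. This is a minimal illustration, not the paper's actual model: the feature names and weights here are assumptions, and the real system learns weights with a regression framework.

```python
# Illustrative sketch: rank candidate entities (gathered from Wikipedia
# sentences similar to the query) by a weighted sum of features such as
# link probability and context match. Weights are toy values; the paper
# learns them with a regression-based ranker.

def rank_candidates(candidates, weights):
    """Rank (entity, features) pairs by a weighted feature sum."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

# Hypothetical candidates for the query "new york hotels":
candidates = [
    ("New_York_City", {"link_prob": 0.8, "context_match": 0.6}),
    ("New_Year",      {"link_prob": 0.3, "context_match": 0.2}),
]
weights = {"link_prob": 1.0, "context_match": 1.5}
ranked = rank_candidates(candidates, weights)
```

A learned model would replace the hand-set weights, but the ranking interface stays the same.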

    Report on the 1st Simulation for Information Retrieval Workshop (Sim4IR 2021) at SIGIR 2021

    Simulation is used as a low-cost and repeatable means of experimentation. As Information Retrieval (IR) researchers, we are no strangers to the idea of using simulation within our own field---such as the traditional means of IR system evaluation manifested through the Cranfield paradigm. While simulation has been used in other areas of IR research (such as the study of user behaviours), we argue that its potential has so far been recognised by relatively few IR researchers. To this end, the Sim4IR workshop was held online on July 15th, 2021 in conjunction with ACM SIGIR 2021. Building on past efforts, the goal of the workshop was to create a forum for researchers and practitioners to promote methodology for, and development of, more widespread use of simulation for IR evaluation. Around 80 participants took part over two sessions. A total of two keynotes, three original paper presentations, and eight 'encore talks' were presented. The main conclusions from the resulting discussion were that simulation has the potential to offer solutions to the limitations of existing evaluation methodologies, but more research is needed toward developing realistic user simulators; and that the development and sharing of simulators, in the form of toolkits and online services, is critical for successful uptake.

    Concept-based short text classification and ranking

    Most existing approaches for text classification represent texts as vectors of words, namely "Bag-of-Words." This representation results in a very high-dimensional feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious, due to their shortness and sparsity. In this paper, we propose using "Bag-of-Concepts" for short text representation, aiming to avoid surface mismatching and to handle the synonymy and polysemy problems. Based on "Bag-of-Concepts," a novel framework is proposed for lightweight short-text classification applications. By leveraging a large taxonomy knowledge base, it learns a concept model for each category and conceptualizes a short text into a set of relevant concepts. A concept-based similarity mechanism is presented to classify a given short text into the most similar category. One advantage of this mechanism is that it facilitates short-text ranking after classification, which is needed in many applications, such as query or ad recommendation. We demonstrate the use of our proposed framework through a real online application: channel-based query recommendation. Experiments show that our framework can map queries to channels with a high degree of precision (avg. precision = 90.3%), which is critical for recommendation applications.
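The concept-based similarity mechanism can be sketched as follows. This is a toy illustration under assumed inputs: the concept vectors are invented, whereas the real framework derives them from a large taxonomy knowledge base.

```python
# Sketch of concept-based classification: each category has a concept
# model (concept -> weight); a short text, conceptualized into weighted
# concepts, is assigned to the category with the highest cosine
# similarity. Concept vectors here are toy values, not from a real
# taxonomy.
import math

def cosine(a, b):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(text_concepts, category_models):
    """Return the category whose concept model is most similar."""
    return max(category_models,
               key=lambda c: cosine(text_concepts, category_models[c]))

models = {
    "tech":   {"device": 0.9, "software": 0.8},
    "sports": {"game": 0.9, "team": 0.7},
}
query = {"device": 0.6, "software": 0.3}  # conceptualized short text
best = classify(query, models)
```

Because the similarity scores are ordered, the same mechanism supports ranking texts within a category after classification.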

    Methods for ranking user-generated text streams: a case study in blog feed retrieval

    User-generated content is one of the main sources of information on the Web today. With the huge amount of this type of data generated every day, an efficient and effective retrieval system is essential. The goal of such a retrieval system is to enable users to search through this data and retrieve documents relevant to their information needs. Among the different retrieval tasks over user-generated content, retrieving and ranking streams is an important one with various applications. The goal of this task is to rank streams, as collections of documents in chronological order, in response to a user query. This differs from traditional retrieval tasks, where the goal is to rank single documents and temporal properties are less important to the ranking. In this thesis we investigate the problem of ranking user-generated streams, with a case study in blog feed retrieval. Blogs, like all other user-generated streams, have specific properties and require new considerations in retrieval methods. Blog feed retrieval can be defined as retrieving blogs with a recurrent interest in the topic of a given query. We define three properties of blog feed retrieval, each of which introduces new challenges in the ranking task: 1) term mismatch in blog retrieval, 2) evolution of topics in blogs, and 3) diversity of blog posts. For each of these properties, we investigate its corresponding challenges and propose solutions to overcome them. We further analyze the effect of our solutions on the performance of a retrieval system, and show that taking the new properties into account when developing the retrieval system helps improve state-of-the-art retrieval methods. In all the proposed methods, we pay particular attention to temporal properties, which we believe are important in any type of stream. We show that, when combined with content-based information, temporal information can be useful in different situations. Although we apply our methods to blog feed retrieval, they are mostly general methods applicable to similar stream-ranking problems, such as ranking experts or ranking Twitter users.
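One common way to combine content-based and temporal evidence when scoring a stream is recency decay over per-post relevance. The sketch below is an assumed illustration of that general idea, not the thesis's actual model; the exponential decay and its rate are hypothetical choices.

```python
# Toy sketch: score a stream (e.g. a blog feed) by summing per-post
# content relevance weighted by exponential recency decay. The decay
# rate is an assumed parameter, not taken from the thesis.
import math

def stream_score(posts, now, decay=0.1):
    """posts: list of (relevance, timestamp_in_days). Higher is better."""
    return sum(rel * math.exp(-decay * (now - t)) for rel, t in posts)

# Two streams with equally relevant posts; one posted recently, one not.
fresh = [(1.0, 9.0), (1.0, 10.0)]
stale = [(1.0, 1.0), (1.0, 2.0)]
```

With equal content relevance, the recently active stream scores higher, capturing the "recurrent interest" intuition.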

    Query refinement for patent prior art search

    A patent is a contract between the inventor and the state, granting the inventor a limited time period in which to exploit the invention. In exchange, the inventor must place a detailed description of the invention in the public domain. Patents can encourage innovation and economic growth, but in times of economic crisis they can also hamper such growth. The long duration of the application process is a major obstacle that needs to be addressed to maximize the benefit of patents to innovation and the economy. This time can be significantly reduced by changing the way we search the patent and non-patent literature. Despite recent advances in general information retrieval and the revolution in Web search engines, there is still a huge gap between the emerging technologies from research labs, as adopted by major Internet search engines, and the systems in use by the patent search communities. In this thesis we investigate the problem of patent prior art search in patent retrieval, with the goal of finding documents that describe the idea of a query patent. A query patent is a full patent application composed of hundreds of terms and does not represent a single focused information need. Other relevance evidence (e.g. classification tags and bibliographical data) provides additional details about the underlying information need of the query patent. The first goal of this thesis is to estimate a uni-gram query model from the textual fields of a query patent. We then improve the initial query representation using noun phrases extracted from the query patent, and show that expansion in a query-dependent manner is useful. The second contribution of this thesis is to address the term mismatch problem from a query formulation point of view by integrating multiple relevance evidences associated with the query patent. To do this, we enhance the initial representation of the query with the term distribution of the community of inventors related to the topic of the query patent. We then build a lexicon using classification tags and show that query expansion using this lexicon, together with proximity information (between query and expansion terms), can improve retrieval performance. We perform an empirical evaluation of our proposed models on two patent datasets. The experimental results show that our proposed models achieve significantly better results than the baseline and other enhanced models.
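Estimating a uni-gram query model from a patent's textual fields can be sketched as weighted maximum-likelihood term counting. The field names and weights below are illustrative assumptions, not the thesis's actual estimation method.

```python
# Sketch: build a uni-gram query model P(term | query patent) by
# counting terms over the patent's textual fields, with per-field
# weights (title terms counted more heavily). Field names and weights
# are assumptions for illustration.
from collections import Counter

def query_model(fields, weights):
    """fields: {field_name: [terms]}; returns a normalized term model."""
    counts = Counter()
    for name, terms in fields.items():
        w = weights.get(name, 1.0)
        for term in terms:
            counts[term] += w
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

fields = {"title": ["battery", "cell"], "abstract": ["battery", "anode"]}
weights = {"title": 2.0, "abstract": 1.0}
model = query_model(fields, weights)
```

The resulting distribution can then be interpolated with expansion terms (e.g. noun phrases or lexicon entries) in a standard language-modeling retrieval setup.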

    Modeling User Transportation Patterns Using Mobile Devices

    Participatory sensing frameworks use humans and their computing devices as a large mobile sensing network. Dramatic gains in accessibility and affordability have turned mobile devices (smartphones and tablet computers) into the most popular computational machines in the world, exceeding laptops. By the end of 2013, more than 1.5 billion people on earth will have a smartphone. Increased coverage and higher speeds of cellular networks have given these devices the power to constantly stream large amounts of data. Most mobile devices are equipped with advanced sensors such as GPS, cameras, and microphones. This expansion in smartphone numbers and power has created a sensing system capable of achieving tasks practically impossible for conventional sensing platforms. One of the advantages of participatory sensing platforms is their mobility, since human users are often in motion. This dissertation presents a set of techniques for modeling and predicting user transportation patterns from cell-phone data and social media check-ins. To study large-scale transportation patterns, I created a mobile phone app, Kpark, for estimating parking lot occupancy on the UCF campus. Kpark aggregates individual user reports on parking space availability to produce, through crowdsourcing, a global picture across all the campus lots. An issue with crowdsourcing is the possibility of receiving inaccurate information from users, whether through error or malicious intent. One method of combating this problem is to model the trustworthiness of individual participants and use that information to selectively include or discard data. This dissertation presents a comprehensive study of the performance of different worker quality and data fusion models on plausible simulated user populations, as well as an evaluation of their performance on real data obtained from a full release of the Kpark app on the UCF Orlando campus. To evaluate individual trust prediction methods, an algorithm selection portfolio was introduced to take advantage of the strengths of each method and maximize overall prediction performance. As with many other crowdsourced applications, user incentivization is an important aspect of creating a successful crowdsourcing workflow. For this project, a form of non-monetized incentivization called gamification was used to create competition among users, with the aim of increasing the quantity and quality of data submitted to the project. This dissertation reports on the performance of Kpark at predicting parking occupancy, increasing user app usage, and predicting worker quality.
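The trust-weighted data fusion idea can be sketched as a weighted vote over conflicting crowd reports. This is a toy illustration under assumed inputs; the dissertation evaluates several worker quality and fusion models, and real systems update trust from observed accuracy rather than fixing it by hand.

```python
# Toy sketch of trust-weighted fusion of crowdsourced parking reports:
# conflicting "full"/"available" reports for a lot are resolved by a
# vote weighted by each worker's trust score. Trust values here are
# illustrative assumptions.

def fuse_reports(reports, trust, default_trust=0.5):
    """reports: list of (worker_id, says_full: bool) for one lot."""
    full = sum(trust.get(w, default_trust) for w, says_full in reports if says_full)
    free = sum(trust.get(w, default_trust) for w, says_full in reports if not says_full)
    return "full" if full > free else "available"

trust = {"alice": 0.9, "bob": 0.2, "carol": 0.8}  # hypothetical workers
reports = [("alice", True), ("bob", False), ("carol", True)]
verdict = fuse_reports(reports, trust)
```

Discarding low-trust workers entirely is the limiting case of this scheme (trust weight zero).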

    Characterizing and Understanding User Perception of System Initiative for Conversational Systems to Support Collaborative Search

    Popular messaging applications such as Slack, Discord, and Microsoft Teams have given rise to thousands of chatbots and in-app integrations that facilitate collaboration. We utilized this design framework to study how searchbots (i.e., chatbots that perform specific searches) can facilitate collaborative search. More specifically, we investigated a design space for searchbots that can engage in mixed-initiative interaction. In this dissertation, we present a Wizard of Oz (WoZ) study to investigate the implications of a conversational search system capable of engaging in mixed-initiative interactions to support collaborative search. The Wizard plays the role of a conversational search system that can search for information, send relevant web results, and message users. We investigated three Wizard conditions: bot_info, bot_dialog, and bot_task, which differ in how the Wizard can intervene in a conversation. The intervention modes follow the mixed-initiative framework of Chu-Carroll and Brown (1998), originally developed from human conversations. Broadly, we report on three investigations: (1) participants' perceptions of the searchbot across the different levels of initiative; (2) the Wizards' motivations for taking the initiative; and (3) the Wizards' characterization of the appropriateness of their interventions. Our results suggest that participants' collaboration can be enhanced when the searchbot takes limited initiative and aligns with the participants' search strategy. Additionally, in characterizing motivations and timings, the Wizards presented a wide array of themes for providing search assistance and promoting collaboration. Finally, while the participants did not prefer the advanced capabilities of the searchbot, our characterization of their motivations and timing helps us understand the complex activities the searchbot can support.

    The role of context in image annotation and recommendation

    With the rise of smart phones, lifelogging devices (e.g. Google Glass) and the popularity of image-sharing websites (e.g. Flickr), users are capturing and sharing every aspect of their lives online, producing a wealth of visual content. Of these uploaded images, the majority are poorly annotated or exist in complete semantic isolation, making the process of building retrieval systems difficult, as one must first understand the meaning of an image in order to retrieve it. To alleviate this problem, many image-sharing websites offer manual annotation tools which allow the user to “tag” their photos; however, these techniques are laborious and as a result have been poorly adopted: Sigurbjörnsson and van Zwol (2008) showed that 64% of images uploaded to Flickr are annotated with < 4 tags. Due to this, an entire body of research has focused on the automatic annotation of images (Hanbury, 2008; Smeulders et al., 2000; Zhang et al., 2012a), attempting to bridge the semantic gap between an image's appearance and its meaning, e.g. the objects present. Despite two decades of research the semantic gap still largely exists, and as a result automatic annotation models often offer unsatisfactory performance for industrial implementation. Further, these techniques can only annotate what they see, ignoring the “bigger picture” surrounding an image (e.g. its location, the event, the people present, etc.). Much work has therefore focused on building photo tag recommendation (PTR) methods which aid the user in the annotation process by suggesting tags related to those already present. These works have mainly focused on computing relationships between tags based on historical images, e.g. that NY and timessquare co-occur in many images and are therefore highly correlated. However, tags are inherently noisy, sparse and ill-defined, often resulting in poor PTR accuracy, e.g. does NY refer to New York or New Year? This thesis proposes exploiting an image's context which, unlike textual evidence, is always present, in order to alleviate this ambiguity in the tag recommendation process. Specifically, we exploit the “what, who, where, when and how” of the image capture process in order to complement textual evidence in various photo tag recommendation and retrieval scenarios. In part II, we combine textual, content-based (e.g. number of faces present) and contextual (e.g. day of the week taken) signals for tag recommendation purposes, achieving up to a 75% improvement in precision@5 over a text-only TF-IDF baseline. We then consider external knowledge sources (i.e. Wikipedia and Twitter) as an alternative to (slower-moving) Flickr on which to build recommendation models, showing that similar accuracy can be achieved on these faster-moving, yet entirely textual, datasets. In part II, we also highlight the merits of diversifying tag recommendation lists, before discussing at length various problems with existing automatic image annotation and photo tag recommendation evaluation collections. In part III, we propose three new image retrieval scenarios, namely “visual event summarisation”, “image popularity prediction” and “lifelog summarisation”. In the first scenario, we attempt to produce a ranking of relevant and diverse images for various news events by (i) removing irrelevant images such as memes and visual duplicates and (ii) semantically clustering images based on the tweets in which they were originally posted. Using this approach, we were able to achieve over 50% precision for images in the top 5 ranks. In the second retrieval scenario, we show that by combining contextual and content-based features from images, we are able to predict whether an image will become “popular” (or not) with 74% accuracy, using an SVM classifier. Finally, in chapter 9 we employ blur detection and perceptual-hash clustering in order to remove noisy images from lifelogs, before combining visual and geo-temporal signals in order to capture a user's “key moments” within their day. We believe that the results of this thesis represent an important step towards building effective image retrieval models when sufficient textual content is lacking (i.e. a cold start).
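The perceptual-hash clustering step for removing near-duplicate images can be sketched with Hamming distance over integer hashes. This is an assumed illustration of the general technique, not the thesis's implementation; the hash values and the distance threshold are hypothetical.

```python
# Sketch of near-duplicate removal via perceptual hashing: images
# whose hashes differ by a small Hamming distance are treated as
# duplicates, keeping one representative each. Hashes are assumed to
# be small integers here; real perceptual hashes are typically 64-bit.

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def dedupe(hashes, threshold=4):
    """hashes: list of (image_name, int_hash). Greedy clustering:
    keep an image only if it is far from everything kept so far."""
    kept = []
    for name, h in hashes:
        if all(hamming(h, kh) > threshold for _, kh in kept):
            kept.append((name, h))
    return [name for name, _ in kept]

images = [("a.jpg", 0b10110010), ("a_dup.jpg", 0b10110011), ("b.jpg", 0b01001100)]
unique = dedupe(images)
```

A blur-detection pass (e.g. thresholding image sharpness) would typically run before this step, so only clear, distinct images survive into the lifelog summary.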

    Identification of re-finding tasks and search difficulty

    We address the problem of identifying whether users are attempting to re-find information and of estimating the level of difficulty of the re-finding task. Identifying re-finding tasks and detecting search difficulties will enable search engines to respond dynamically to the search task being undertaken. To this end, we conduct user studies and query log analysis to gain a better understanding of re-finding tasks and search difficulties. Computing features gathered in our user studies, we generate training sets from query log data, which are used to construct automatic identification (prediction) models. Using machine learning techniques, our re-finding identification model, the first at the task level, significantly outperforms existing query-based identification. While past research assumes that the user's previous search history is available to the prediction model, we examine whether re-finding detection is possible without access to this information. Our evaluation indicates that such detection is possible, but more challenging. We further describe the first predictive model for detecting re-finding difficulty, showing it to be significantly better than existing approaches for detecting general search difficulty. We also analyze the important features for identifying both re-finding and its difficulties. Next, we investigate detailed identification of re-finding tasks and difficulties in terms of the type of vertical document to be re-found. The accuracy of the constructed predictive models indicates that re-finding tasks are indeed distinguishable across verticals and from general search tasks. This illustrates the need to adapt existing general search techniques to the re-finding context by presenting vertical-specific results. Despite the overall reduction in accuracy for predictions made independently of the user's original search, identifying “image re-finding” appears to be less dependent on such past information. Investigating the real-time prediction effectiveness of the models shows that predicting “image” document re-finding obtains the highest accuracy early in the search. Early predictions would enable search engines to adapt search results during re-finding activities. Furthermore, we study the difficulties in re-finding across verticals given some of the established indicators of difficulty in the general web search context. In terms of user effort, re-finding in the “image” vertical appears to take more effort, in terms of number of queries and clicks, than the other investigated verticals, while re-finding “reference” documents seems to be more time-consuming when there is a longer time gap between the re-finding and the corresponding original search. Exploring other features suggests that there may be difficulty indicators particular to the re-finding context and specific to each vertical. To sum up, this research investigates the issue of effectively supporting users with re-finding search tasks. To this end, we have identified features that allow for more accurate distinction between re-finding and general tasks. This will enable search engines to better adapt search results to the re-finding context and improve users' search experience. Moreover, features indicative of similar/different and easy/difficult re-finding tasks can be employed to build balanced test environments, addressing one of the main gaps in the re-finding context.