104,320 research outputs found

    From document to entity retrieval : improving precision and performance of focused text search

    Text retrieval has been an active area of research for decades. Several issues have been studied over the entire period, such as the development of statistical models for the estimation of relevance, or the challenge of keeping retrieval tasks efficient with ever-growing text collections. Especially in the last decade, we have also seen a diversification of retrieval tasks. Passage or XML retrieval systems allow a more focused search. Question answering or expert search systems do not even return a ranked list of text units, but, for instance, persons with expertise on a given topic. This situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking. In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. With this work we address two research questions related to the aim of developing context-aware retrieval systems. First, we show how already available information about the user’s context can be employed effectively to gain highly precise search results. Second, we investigate how such meta-data about the search context can be gathered. The proposed “query profiles” have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints.
The effectiveness of the approach is tested with retrieval experiments on newspaper data. When documents are not regarded as a simple sequence of words, but their content is structured in a machine-readable form, it is attractive to try to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate here whether and how existing query languages support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low-level query plans of physical database operators for the execution of simple query patterns. Furthermore, it is demonstrated how complex queries benefit from higher-level query optimization. New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities instead of documents. By entities we mean unique (named) existences, such as persons, organizations or dates. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity.
A graph-based relevance propagation framework is introduced for this purpose that makes it possible to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.
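As an illustration only, the idea of deriving entity relevance from the relevance of the texts that mention an entity can be sketched as a single propagation step over a bipartite document-entity graph. This is one of the simplest conceivable variants and is not taken from the thesis; the function name `rank_entities`, the input dictionaries, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def rank_entities(doc_scores, doc_entities):
    """One-step relevance propagation (illustrative assumption, not the thesis model).

    doc_scores:   mapping document id -> relevance score for the query
    doc_entities: mapping document id -> entities mentioned in that document
    Returns (entity, score) pairs, highest score first, ties broken by name.
    """
    entity_scores = defaultdict(float)
    for doc_id, score in doc_scores.items():
        # Each document passes its relevance once to every entity it contains.
        for entity in set(doc_entities.get(doc_id, ())):
            entity_scores[entity] += score
    return sorted(entity_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for this sketch.
doc_scores = {"d1": 0.5, "d2": 0.25, "d3": 0.25}
doc_entities = {"d1": ["Alice", "Acme"], "d2": ["Alice"], "d3": ["Bob"]}
ranking = rank_entities(doc_scores, doc_entities)
```

Here an entity mentioned in several relevant documents accumulates more relevance than one mentioned in a single document; other containment-graph models or multi-step propagation schemes would change only the body of the loop.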

    Search Engine Optimisation. PageRank best Practices

    Project carried out in collaboration with RWTH Aachen. Since the explosion of the Internet age, the need to search for information online has grown at lightning speed. As a consequence, new marketing disciplines have arisen in the digital world. This thesis describes, within the search engine marketing framework, how the ranking in the search engine results page (SERP) can be influenced. Wikipedia describes search engine marketing, or SEM, as a form of Internet marketing that seeks to promote websites by increasing their visibility in search engine results pages (SERPs). The importance of being searchable and visible to users therefore reveals a need for improvement on the part of website designers. Different factors are used to produce search rankings. One of them is PageRank. The present thesis focuses on how Google’s PageRank makes use of the linking structure of the Web in order to maximise the relevance of the results in a web search. PageRank used to be a puzzle for webmasters because of the secrecy that surrounded it. The formula that lies behind PageRank enabled the founders of Google to convert a PhD into one of the most successful companies ever. The uniqueness of PageRank in contrast to other Web search engines consists in providing the user with the greatest relevance of the results for a specific query, thus providing the most satisfactory user experience. Google does use PageRank as part of its ranking formula. Although it is not as important as many believe, it is nevertheless a measure of a web page’s popularity, and gives a certain indication of how “important” Google considers a page to be. The goal of search marketing is being visible to the end user. Two different fields within search marketing can be pointed out: Search Engine Optimisation and search engine marketing.
This study focuses on the first one, Search Engine Optimisation, which refers to all types of initiatives and actions taken by website designers in order to increase relevance for the search engines. It is about design, optimising content, linking structure (internal and external) and other page-specific factors. Because of the predominance of Google, this thesis looks at which steps can be taken on a given website to optimise it for Google’s PageRank algorithm. Moreover, other factors which also have an influence are analysed.

    News vertical search using user-generated content

    The thesis investigates how content produced by end-users on the World Wide Web, referred to as user-generated content, can enhance the news vertical aspect of a universal Web search engine, such that news-related queries can be satisfied more accurately, comprehensively and in a more timely manner. We propose a news search framework to describe the news vertical aspect of a universal Web search engine. This framework comprises four components, each providing a different piece of functionality. The Top Events Identification component identifies the most important events that are happening at any given moment using discussion in user-generated content streams. The News Query Classification component classifies incoming queries as news-related or not in real time. The Ranking News-Related Content component finds and ranks relevant content for news-related user queries from multiple streams of news and user-generated content. Finally, the News-Related Content Integration component merges the previously ranked content for the user query into the Web search ranking. In this thesis, we argue that user-generated content can be leveraged in one or more of these components to better satisfy news-related user queries. Potential enhancements include the faster identification of news queries relating to breaking news events, more accurate classification of news-related queries, increased coverage of the events searched for by the user or increased freshness in the results returned. Approaches to tackle each of the four components of the news search framework are proposed, which aim to leverage user-generated content. Together, these approaches form the news vertical component of a universal Web search engine. Each approach proposed for a component is thoroughly evaluated using one or more datasets developed for that component.
Conclusions are derived concerning whether the use of user-generated content enhances the component in question using an appropriate measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the Top Events Identification component; classification accuracy over different types of query for the News Query Classification component; relevance of the documents returned for the Ranking News-Related Content component; and end-user preference for rankings integrating user-generated content in comparison to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches and insights into their behaviour are also discussed. In particular, the evaluation of the Top Events Identification component examines how effectively events, represented by newswire articles, can be ranked by their importance using two different streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed approaches for this component indicates that blog posts are an effective source of evidence to use when ranking events and that these approaches achieve state-of-the-art effectiveness. The same approaches, driven instead by a stream of tweets, provide a story ranking performance that is significantly more effective than random, but is not consistent across all of the datasets and approaches tested. Insights are provided into the reasons for this with regard to the transient nature of discussion in Twitter. Through the evaluation of the News Query Classification component, we show that the use of timely features extracted from different news and user-generated content sources can increase the accuracy of news query classification over relying upon newswire provider streams alone.
Evidence also suggests that the usefulness of the user-generated content sources varies as news events mature, with some sources becoming more influential over time as new content is published, leading to an upward trend in classification accuracy. The Ranking News-Related Content component evaluation investigates how to effectively rank content from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show that learning-to-rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking effectiveness under real-time constraints. Finally, this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated content for news-related queries to rankings containing only Web search results or integrated with only newswire articles. Of the user-generated content sources tested, the most popular source is shown to be Twitter, particularly for queries relating to breaking events. The central contributions of this thesis are the introduction of a news search framework, the approaches to tackle each of the four components of the framework that integrate user-generated content and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad range of experiments spanning the entire search process for news-related queries. The experiments reported in this thesis demonstrate the potential and scope for enhancements that can be brought about by leveraging user-generated content for real-time news search and related applications.

    Sharpening the Search Saw: Lessons from Expert Searchers

    Many students consider themselves to be proficient searchers and yet are disappointed or frustrated when faced with the task of locating relevant scholarly articles for a literature review. This bleak experience is common among higher education students, even for those in library and information science programs who have heightened appreciation for information resources and yet may settle for “good enough Googling” (Plosker, 2004, p. 34). This is in large part due to reliance on web search engines that have evolved relevance ranking into a vastly intelligent business, one in which we are both its customers and product (Vaidhyanathan, 2011). Google’s Hummingbird nest of search algorithms (Sullivan, 2013) provides quick and targeted hits, yet it can trigger blinders-on trust in first-page results. Concern for student search practices ranges from this permissive trust all the way to lost ability to recall facts and formulate questions (Abilock, 2015), lack of confidence in one’s own knowledge (Carr, 2010), and increased dependence on single search boxes that encourage stream-of-consciousness user input (Tucker, 2013); indeed, students may be high in tech savvy but lacking the critical thinking skills needed for information research tasks (Katz, 2007). Students have come to rely on web search engine intelligence—and it is inarguably colossal—to such an extent that they may fail to formulate a question before charging forward to search for its answer. “Google is known as a search engine, yet there is barely any searching involved anymore. The gap between a question crystallizing in your mind and an answer appearing at the top of your screen is shrinking all the time. As a consequence, our ability to ask questions is atrophying” (Leslie, 2015, para. 4). Highly accomplished students often lament their lack of skills for higher-level searching that calls for formulating pointed questions when struggling to develop a solid literature review. 
In addition, many are unaware that search results are filtered based on previous searches, location, and other factors extracted from personal search patterns by the search engine. Two students working side by side and entering the same search terms may receive quite different results on Google, yet the extent to which this ‘filter bubble’ (Pariser, 2011) is personalizing their search results is difficult to assess and to overcome. Just as important, it can be impossible to know what a search might be missing: how to know what’s not there? This portrayal of the information landscape may appear gloomy but, in fact, it could not be a more inspiring environment in which to do research, to find connections in ideas, and to benefit from and generate new ideas. A few lessons from expert searchers, focused on critical concepts and search practices, can sharpen a student’s search saw and move the proficient student-researcher, desiring more relevant and comprehensive search results, into a trajectory toward search expertise. For the lessons involved in this journey, the focus is on two areas: first, the critical concepts, called threshold concepts (Meyer & Land, 2003), found to be necessary for developing search expertise (Tucker et al., 2014); and, second, four strategic areas within search that can have significant and immediate impact on improving search results for research literature. The latter are grounded in the threshold concepts and positioned for application to literature reviews for graduate student studies.

    The use of implicit evidence for relevance feedback in web retrieval

    In this paper we report on the application of two contrasting types of relevance feedback for web retrieval. We compare two systems: one using explicit relevance feedback (where searchers explicitly have to mark documents relevant) and one using implicit relevance feedback (where the system endeavours to estimate relevance by mining the searcher's interaction). The feedback is used to update the display according to the user's interaction. Our research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence. We examine the two variations in terms of both user opinion and search effectiveness.

    Contextualised Browsing in a Digital Library's Living Lab

    Contextualisation has proven to be effective in tailoring search results towards the users' information need. While this is true for a basic query search, the usage of contextual session information during exploratory search, especially on the level of browsing, has so far been underexposed in research. In this paper, we present two approaches that contextualise browsing on the level of structured metadata in a Digital Library (DL): (1) one variant is based on document similarity and (2) one variant utilises implicit session information, such as queries and different document metadata encountered during a user's session. We evaluate our approaches in a living lab environment using a DL in the social sciences and compare our contextualisation approaches against a non-contextualised approach. For a period of more than three months we analysed 47,444 unique retrieval sessions that contain search activities on the level of browsing. Our results show that a contextualisation of browsing significantly outperforms our baseline in terms of the position of the first clicked item in the result set. The mean rank of the first clicked document (measured as mean first relevant, MFR) was 4.52 using a non-contextualised ranking compared to 3.04 when re-ranking the result lists based on similarity to the previously viewed document. Furthermore, we observed that both contextual approaches show a noticeably higher click-through rate. A contextualisation based on document similarity leads to almost twice as many document views compared to the non-contextualised ranking. Comment: 10 pages, 2 figures, paper accepted at JCDL 201
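The mean first relevant (MFR) measure mentioned above is the average rank of the first clicked item per session. A minimal sketch follows, assuming 1-based ranks and assuming that sessions without any click are simply skipped (the abstract does not say how such sessions are handled). The function name and the toy session data are hypothetical and are not the paper's data.

```python
from typing import Iterable, Optional


def mean_first_relevant(first_click_ranks: Iterable[Optional[int]]) -> float:
    """Average 1-based rank of the first clicked item over sessions with a click.

    A value of None marks a session without any click; such sessions are
    skipped here, which is an assumption of this sketch.
    """
    ranks = [rank for rank in first_click_ranks if rank is not None]
    if not ranks:
        raise ValueError("no session contains a click")
    return sum(ranks) / len(ranks)


# Invented toy sessions: rank of the first clicked result, None if no click.
baseline_sessions = [5, 3, None, 6, 4]
contextualised_sessions = [2, 3, 4, None, 3]

baseline_mfr = mean_first_relevant(baseline_sessions)
contextualised_mfr = mean_first_relevant(contextualised_sessions)
```

A lower MFR means that users found the first item worth clicking nearer the top of the list, which is the sense in which the re-ranked lists outperform the baseline.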