104,320 research outputs found
From document to entity retrieval : improving precision and performance of focused text search
Text retrieval has been an active area of research for decades. Several issues have been studied over the entire period, such as the development of statistical models for the estimation of relevance, or the challenge of keeping retrieval tasks efficient with ever-growing text collections. Especially in the last decade, we have also seen a diversification of retrieval tasks. Passage or XML retrieval systems allow a more focused search. Question answering or expert search systems do not even return a ranked list of text units, but instead return, for instance, persons with expertise on a given topic. This situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.
In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. We address two research questions related to the aim of developing context-aware retrieval systems. First, we show how already available information about the user's context can be employed effectively to obtain highly precise search results. Second, we investigate how such meta-data about the search context can be gathered. The proposed "query profiles" have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context-dependent search constraints. The effectiveness of the approach is tested with retrieval experiments on newspaper data.
When documents are not regarded as a simple sequence of words, but their content is structured in a machine-readable form, it is attractive to try to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate whether and how existing query languages support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low-level query plans of physical database operators for the execution of simple query patterns. Furthermore, it is demonstrated how complex queries benefit from higher-level query optimization.
New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities instead of documents. By entities we mean unique (named) existences, such as persons, organizations or dates. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose, which makes it possible to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating the usefulness of the graph-based ranking framework.
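The idea of deriving entity relevance from the relevance of the texts that mention an entity can be illustrated with a minimal sketch. It assumes the simplest possible case: a bipartite containment graph and a single propagation step in which each document passes its full score to every entity it mentions. The function name `rank_entities`, the sum aggregation, and the toy data are assumptions for illustration only; the abstract does not specify which graph models or propagation schemes the thesis actually tests.

```python
from collections import defaultdict


def rank_entities(doc_scores, doc_entities):
    """One-step relevance propagation (illustrative assumption).

    doc_scores:   maps a document id to its relevance score for the query.
    doc_entities: maps a document id to the entities mentioned in it.
    Returns (entity, score) pairs sorted by descending score.
    """
    entity_scores = defaultdict(float)
    for doc_id, score in doc_scores.items():
        for entity in doc_entities.get(doc_id, ()):
            # Each mentioning document contributes its own relevance.
            entity_scores[entity] += score
    # Ties are broken by entity name so the ordering is deterministic.
    return sorted(entity_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical retrieval scores and entity annotations.
doc_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
doc_entities = {"d1": ["Alice"], "d2": ["Alice", "Bob"], "d3": ["Carol", "Bob"]}
ranking = rank_entities(doc_scores, doc_entities)
```

Here an entity mentioned in several highly relevant documents ends up ahead of one mentioned only in weakly relevant documents. Other aggregation choices (maximum, averaging, multi-step propagation) would fit the same interface.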
Search Engine Optimisation. PageRank best Practices
Project carried out in collaboration with RWTH Aachen.
Since the explosion of the Internet age, the need to search for information online
has grown at lightning speed. As a consequence, new marketing
disciplines have arisen in the digital world. This thesis describes, within the search engine
marketing framework, how the ranking in the search engine results page
(SERP) can be influenced.
Wikipedia describes search engine marketing or SEM as a form of Internet
marketing that seeks to promote websites by increasing their visibility in search
engine result pages (SERPs). The importance of being searchable
and visible to users therefore gives website designers reason to improve their sites.
Different factors are used to produce search rankings. One of them is
PageRank. The present thesis focuses on how Google's PageRank makes use
of the linking structure of the Web in order to maximise the relevance of the results
in a web search. PageRank used to be a puzzle for webmasters because of
the secrecy that surrounded it.
The formula that lies behind PageRank enabled the founders of Google to
convert a PhD into one of the most successful companies ever. What sets
PageRank apart from the approaches of other Web search engines is that it aims to provide the
user with the most relevant results for a specific query, and thus
the most satisfactory user experience.
Google does use PageRank as part of its ranking formula. Although it is not
as important as many believe, it is nevertheless a measure of a web page's
popularity, and gives a certain indication of how "important" Google considers a
page to be.
The goal of search marketing is being visible to the end user. Two different
fields within search marketing can be pointed out: Search Engine Optimisation
and search engine marketing. This study focuses on the first one, Search
Engine Optimisation, which refers to all types of initiatives and actions taken by
website designers in order to increase the relevance for the Search Engines. It
is about design, optimising content, linking structure (internal and external) and
other page specific factors.
Because of the predominance of Google, this thesis looks at which steps can be
taken to optimise a given website for Google's PageRank
algorithm. Other factors that also have an influence are analysed as well.
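The way PageRank uses the linking structure of the Web can be sketched with the published formulation, PR(p) = (1 - d)/N + d * sum of PR(q)/L(q) over all pages q that link to p, where L(q) is the number of outgoing links of q. The following is a minimal illustration under stated assumptions, not Google's production ranking, which is not public: the function name `pagerank`, the toy link graph, and the even redistribution of rank from pages without outgoing links are choices made for this example. The damping factor 0.85 is the value commonly quoted from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a small link graph.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without outgoing links
        # is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # A page shares its rank equally among its outgoing links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical three-page site: internal links only.
links = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
ranks = pagerank(links)
```

In this toy graph the page that receives the most link weight ("home") ends up with the highest rank, which is the intuition behind optimising internal and external linking structure.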
News vertical search using user-generated content
The thesis investigates how content produced by end-users on the World Wide Web, referred to
as user-generated content, can enhance the news vertical aspect of a universal Web search engine,
such that news-related queries can be satisfied more accurately, comprehensively and in a more timely
manner. We propose a news search framework to describe the news vertical aspect of a universal web
search engine. This framework is comprised of four components, each providing a different piece of
functionality. The Top Events Identification component identifies the most important events that are
happening at any given moment using discussion in user-generated content streams. The News Query
Classification component classifies incoming queries as news-related or not in real-time. The Ranking
News-Related Content component finds and ranks relevant content for news-related user queries from
multiple streams of news and user-generated content. Finally, the News-Related Content Integration
component merges the previously ranked content for the user query into the Web search ranking. In this
thesis, we argue that user-generated content can be leveraged in one or more of these components to
better satisfy news-related user queries. Potential enhancements include the faster identification of news
queries relating to breaking news events, more accurate classification of news-related queries, increased
coverage of the events searched for by the user or increased freshness in the results returned.
Approaches to tackle each of the four components of the news search framework are proposed,
which aim to leverage user-generated content. Together, these approaches form the news vertical component
of a universal Web search engine. Each approach proposed for a component is thoroughly
evaluated using one or more datasets developed for that component. Conclusions are derived concerning
whether the use of user-generated content enhances the component in question using an appropriate
measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the
Top Events Identification component; classification accuracy over different types of query for the News
Query Classification component; relevance of the documents returned for the Ranking News-Related
Content component; and end-user preference for rankings integrating user-generated content in comparison
to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches
and insights into their behaviour are also discussed.
In particular, the evaluation of the Top Events Identification component examines how effectively
events, represented by newswire articles, can be ranked by their importance using two different
streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed
approaches for this component indicates that blog posts are an effective source of evidence to use when
ranking events and that these approaches achieve state-of-the-art effectiveness. When driven instead
by a stream of tweets, the same approaches provide a story ranking performance that is significantly
more effective than random, but is not consistent across all of the datasets and approaches tested. Insights
are provided into the reasons for this with regard to the transient nature of discussion in Twitter.
Through the evaluation of the News Query Classification component, we show that the use of timely
features extracted from different news and user-generated content sources can increase the accuracy
of news query classification over relying upon newswire provider streams alone. Evidence also suggests
that the usefulness of the user-generated content sources varies as news events mature, with some
sources becoming more influential over time as new content is published, leading to an upward trend in
classification accuracy.
The Ranking News-Related Content component evaluation investigates how to effectively rank content
from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show
that learning to rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking
effectiveness under real-time constraints.
Finally, this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated
content for news-related queries to rankings containing only Web search results or integrated
with only newswire articles. Of the user-generated content sources tested, the most popular source is
shown to be Twitter, particularly for queries relating to breaking events.
The central contributions of this thesis are the introduction of a news search framework, the approaches
to tackle each of the four components of the framework that integrate user-generated content
and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad
range of experiments spanning the entire search process for news-related queries. The experiments reported
in this thesis demonstrate the potential and scope for enhancements that can be brought about by
leveraging user-generated content for real-time news search and related applications.
Clustering Information Retrieval Search Outputs
Users are known to have difficulties in dealing with information retrieval search outputs, especially if the outputs are above a certain size. It has been argued by several researchers that search output clustering can help users in their interaction with IR systems. Clustering may provide users with an overview of the output by exploiting the topicality information that resides in the output but has not been used in the retrieval stage. It can enable them to find the relevant documents more easily and also help them to form an understanding of the different facets of the query that have been provided for their inspection. This project aimed to investigate the viability of using clustering as a way of mediating users' interaction with search outputs and attempted to identify its possible benefits.
Can & Ozkarahan's (90) C3M algorithm was used to test the effectiveness of clustering as a way of search output presentation. C3M is a relatively simple, non-hierarchical method that has been shown to give comparable or superior results to the best-known hierarchical methods.
The method was implemented in TCL and linked to the department's experimental IR system Okapi. Implementation included a procedure of term selection for document representation, which preceded the clustering process, and a procedure involving cluster representation for users' viewing, which followed the clustering process. After some tuning of the implementation parameters for the databases used, several experiments were designed and conducted to assess whether clusters could group documents in useful ways.
One group of experiments aimed to assess the ability of the implementation to bring together topically related documents. It was quite difficult to gather data for such an assessment, but the existence of a set of data generated for the TREC Interactive track (1996) enabled us to design experiments that at least approximately satisfied our objective. TREC provided a set of queries, and groups of relevant documents with facet assignments made by expert users. It was thus possible to make an inference by measuring the correlation between the clusters that relevant documents were assigned to and the facet assignments made for the documents by TREC experts.
The utility of this data set was limited for various reasons discussed in the related chapters. However, it can be concluded that clusters cannot be relied on to bring together relevant documents assigned to a certain facet. While there was some correlation between the cluster and facet assignments of the documents when the clustering was done only on relevant documents, no correlation could be found when the clustering was based on results of queries defined by City participants in the Interactive track.
Another group of experiments was conducted to compare output clustering with relevance ranking as a search output representation method. This comparison was necessary as an immediate consequence of clustering search output would be the loss of relevance ranking. It had to be assessed whether clustering could help users to find the relevant documents more easily than by relevance ranking, before any clustering solution could be proposed as an alternative to relevance ranked output.
For this purpose, two sets of user experiments (n=20 and n=57) were conducted based on the users' own information needs. While changes were made to the implementation between the first and the second set of experiments, the experimental design was almost the same in both runs. Users were first asked to rank clusters formed from the search output (top 50 documents) and then make relevance judgements for the individual documents for the same output. The precision of the cluster(s) marked best by the users was then compared to the precision values that would be attained by relevance ranking at comparable thresholds.
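The comparison described above can be sketched as follows. This is a minimal illustration, assuming that a "comparable threshold" means cutting the ranked list at the same number of documents as the cluster the user marked best; the abstract does not state how thresholds were matched. The function names and the toy data are hypothetical.

```python
def precision(doc_ids, relevant):
    """Fraction of the given documents that were judged relevant."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return 0.0
    return sum(1 for d in doc_ids if d in relevant) / len(doc_ids)


def compare_cluster_to_ranking(ranked, best_cluster, relevant):
    """Return (cluster precision, ranked-list precision at the same cutoff).

    Assumption: the ranked list is cut at len(best_cluster) documents
    so that both methods are judged on the same number of documents.
    """
    cutoff = len(best_cluster)
    return precision(best_cluster, relevant), precision(ranked[:cutoff], relevant)


# Hypothetical search output, user-chosen best cluster, and judgements.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
best_cluster = ["d2", "d5", "d6"]
relevant = {"d1", "d2", "d5", "d6"}
cluster_p, ranked_p = compare_cluster_to_ranking(ranked, best_cluster, relevant)
```

With these made-up judgements the cluster happens to score higher than the ranked list; the experiments reported here found no significant difference between the two.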
The results from the first group of user experiments were not conclusive (in part due to the small size of the data set), but they drew our attention to the importance of the representation of clusters and documents for users' viewing. After some changes to the implementation, mainly related to representation issues, and an intermediate set of 10 experiments to assess two new representation formats, a set of 57 user experiments was conducted to measure and compare precision values attainable by clustering versus relevance ranking.
These experiments revealed no significant precision difference between clustered outputs and ranked lists. The number of cases where one method performed better than the other was slightly higher for the ranked lists at the top cluster level and slightly higher for the clustered representation at the top two clusters level. However, the overall average precision values were higher for the ranked list at both levels.
As such, clustering did not appear to be preferable to ranked lists, especially as it also incurred overheads in both the computing time and resources involved in creating the clusters, and the time and effort taken by the users to inspect them.
An interesting outcome of the user experiments was the ability of the users to identify clusters that do not include relevant information. There were fewer relevant documents among the clusters marked last by the users than among the documents ranked last at similar threshold levels. This raised the possibility of using clusters as an exclusion tool to improve the precision of ranked lists. After exclusion of documents from the last cluster, ranked lists performed significantly better than the clusters at the top cluster level.
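Using clusters as an exclusion tool can be sketched as a filter over the ranked list. This is a minimal illustration with hypothetical names and toy data; it assumes that every document in the cluster the user marked last is simply dropped from the ranking, with the original order of the remaining documents preserved.

```python
def exclude_cluster(ranked, rejected_cluster):
    """Drop documents of the user-rejected cluster, keeping rank order."""
    rejected = set(rejected_cluster)
    return [d for d in ranked if d not in rejected]


def precision_at_k(ranked, relevant, k):
    """Precision over the top k positions of a ranked list."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


# Hypothetical ranked output, rejected cluster, and relevance judgements.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
last_cluster = ["d2", "d3"]
relevant = {"d1", "d4", "d5"}

before = precision_at_k(ranked, relevant, 3)
filtered = exclude_cluster(ranked, last_cluster)
after = precision_at_k(filtered, relevant, 3)
```

In this made-up case, removing the rejected cluster lets relevant documents move up into the top positions, so precision at the same cutoff increases.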
There was also some evidence (consisting of observation of users during the experiments and a few user comments) that clusters could be used to provide the users with a glimpse of the search results, in order to decide whether to inspect the search results or initiate a new query straight away.
In summary, the cumulative experiment results imply that clustering cannot outperform relevance ranking, and seems to deserve only a secondary role in users' interaction with IR systems. However, it should also be noted that the experiment results are not representative of the whole set of possible user types and search situations, and it may be possible to identify search situations where clustering can be more beneficial than relevance ranking.
Sharpening the Search Saw: Lessons from Expert Searchers
Many students consider themselves to be proficient searchers and yet are disappointed or frustrated when faced with the task of locating relevant scholarly articles for a literature review. This bleak experience is common among higher education students, even for those in library and information science programs who have heightened appreciation for information resources and yet may settle for "good enough Googling" (Plosker, 2004, p. 34). This is in large part due to reliance on web search engines that have evolved relevance ranking into a vastly intelligent business, one in which we are both its customers and product (Vaidhyanathan, 2011). Google's Hummingbird nest of search algorithms (Sullivan, 2013) provides quick and targeted hits, yet it can trigger blinders-on trust in first-page results. Concern for student search practices ranges from this permissive trust all the way to lost ability to recall facts and formulate questions (Abilock, 2015), lack of confidence in one's own knowledge (Carr, 2010), and increased dependence on single search boxes that encourage stream-of-consciousness user input (Tucker, 2013); indeed, students may be high in tech savvy but lacking the critical thinking skills needed for information research tasks (Katz, 2007). Students have come to rely on web search engine intelligence (and it is inarguably colossal) to such an extent that they may fail to formulate a question before charging forward to search for its answer. "Google is known as a search engine, yet there is barely any searching involved anymore. The gap between a question crystallizing in your mind and an answer appearing at the top of your screen is shrinking all the time. As a consequence, our ability to ask questions is atrophying" (Leslie, 2015, para. 4). Highly accomplished students often lament their lack of skills for higher-level searching that calls for formulating pointed questions when struggling to develop a solid literature review.
In addition, many are unaware that search results are filtered based on previous searches, location, and other factors extracted from personal search patterns by the search engine. Two students working side by side and entering the same search terms may receive quite different results on Google, yet the extent to which this "filter bubble" (Pariser, 2011) is personalizing their search results is difficult to assess and to overcome. Just as important, it can be impossible to know what a search might be missing: how to know what's not there? This portrayal of the information landscape may appear gloomy but, in fact, it could not be a more inspiring environment in which to do research, to find connections in ideas, and to benefit from and generate new ideas. A few lessons from expert searchers, focused on critical concepts and search practices, can sharpen a student's search saw and move the proficient student-researcher, desiring more relevant and comprehensive search results, into a trajectory toward search expertise. For the lessons involved in this journey, the focus is on two areas: first, the critical concepts, called threshold concepts (Meyer & Land, 2003), found to be necessary for developing search expertise (Tucker et al., 2014); and, second, four strategic areas within search that can have significant and immediate impact on improving search results for research literature. The latter are grounded in the threshold concepts and positioned for application to literature reviews for graduate student studies.
The use of implicit evidence for relevance feedback in web retrieval
In this paper we report on the application of two contrasting types of relevance feedback for web retrieval. We compare two systems: one using explicit relevance feedback (where searchers explicitly have to mark documents relevant) and one using implicit relevance feedback (where the system endeavours to estimate relevance by mining the searcher's interaction). The feedback is used to update the display according to the user's interaction. Our research focuses on the degree to which implicit evidence of document relevance can be substituted for explicit evidence. We examine the two variations in terms of both user opinion and search effectiveness.
Contextualised Browsing in a Digital Library's Living Lab
Contextualisation has proven to be effective in tailoring search
results towards the users' information need. While this is true for a basic
query search, the usage of contextual session information during exploratory
search especially on the level of browsing has so far been underexposed in
research. In this paper, we present two approaches that contextualise browsing
on the level of structured metadata in a Digital Library (DL): (1) one variant
is based on document similarity and (2) one variant utilises implicit session
information, such as queries and different document metadata encountered during
a user's session. We evaluate our approaches in a living lab environment
using a DL in the social sciences and compare our contextualisation approaches
against a non-contextualised approach. For a period of more than three months
we analysed 47,444 unique retrieval sessions that contain search activities on
the level of browsing. Our results show that a contextualisation of browsing
significantly outperforms our baseline in terms of the position of the first
clicked item in the result set. The mean rank of the first clicked document
(measured as mean first relevant - MFR) was 4.52 using a non-contextualised
ranking compared to 3.04 when re-ranking the result lists based on similarity
to the previously viewed document. Furthermore, we observed that both
contextual approaches show a noticeably higher click-through rate. A
contextualisation based on document similarity leads to almost twice as many
document views compared to the non-contextualised ranking.
Comment: 10 pages, 2 figures, paper accepted at JCDL 201
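The mean first relevant (MFR) measure reported above can be sketched as the average, over sessions, of the rank of the first clicked item. This is a minimal illustration under assumptions: clicks are treated as the relevance signal, ranks are 1-based, the first clicked item is taken to be the clicked item nearest the top of the list, and sessions without any click are skipped. The function name and the session data are hypothetical and do not reproduce the paper's figures.

```python
def mean_first_relevant(sessions):
    """Average 1-based rank of the highest-placed clicked item per session.

    sessions: list of lists, each holding the ranks of clicked results
    in one session. Sessions without clicks are ignored (assumption).
    Returns None when no session contains a click.
    """
    first_ranks = [min(clicked) for clicked in sessions if clicked]
    if not first_ranks:
        return None
    return sum(first_ranks) / len(first_ranks)


# Hypothetical click logs: ranks of clicked items per session.
baseline_sessions = [[3, 5], [1], [], [7, 2]]
mfr = mean_first_relevant(baseline_sessions)
```

A lower value means users found something worth clicking nearer the top of the list, which is how the contextualised rankings are compared with the baseline.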