92 research outputs found

    A Luhn-Inspired Vector Re-weighting Approach for Improving Personalized Web Search


    Exploring Semantic Textual Similarity

    Measuring semantic similarity and relatedness between textual items (words, sentences, paragraphs or even documents) is a very important research area in Natural Language Processing (NLP). In fact, it has many practical applications in other NLP tasks, for instance Word Sense Disambiguation, Textual Entailment, Paraphrase Detection, Machine Translation, Summarization, and other related tasks such as Information Retrieval or Question Answering. In this master's thesis we study different approaches to compute the semantic similarity between textual items. In the framework of the European PATHS project, we also evaluate a knowledge-based method on a dataset of cultural item descriptions. Additionally, we describe the work carried out for the Semantic Textual Similarity (STS) shared task of SemEval-2012. This work has involved supporting the creation of datasets for similarity tasks, as well as the organization of the task itself.
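
    As a concrete point of reference for the vector-based approaches mentioned above, the following minimal Python sketch scores two sentences with bag-of-words cosine similarity; it is an illustrative baseline only, not one of the thesis's evaluated methods.

```python
# Minimal sketch of a vector-space baseline for semantic textual similarity:
# represent each sentence as a bag-of-words vector and compare with cosine
# similarity. Illustrative only; richer knowledge-based and corpus-based
# methods are studied in the thesis itself.
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("a cat sits on the mat", "a dog sits on the rug"))
```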

    Providing personalised information based on individual interests and preferences.

    The main aim of personalised Information Retrieval (IR) is to provide an effective IR system whereby relevant information can be presented according to individual users' interests and preferences. In response to their queries, all Web users expect to obtain search results in a rank order with the most relevant items first. Effective IR systems rank the less relevant documents below the relevant documents. However, a commonly stated problem of Web browsers is matching the users' queries to the information base. The key challenge is to return a list of search results containing a low level of non-relevant documents while not missing out the relevant documents.

    To address this problem, keyword-based search using the Vector Space Model is employed as an IR technique to model Web users and build their interest profiles. Semantic-based search through an ontology is further employed to represent documents matching the users' needs without being directly contained in the users' specified keywords. The users' log files are one of the most important sources from which implicit feedback is detected through their profiles. These provide valuable information on the basis of which alternative learning approaches (i.e. dwell-based search) can be incorporated into standard IR measures (i.e. tf-idf), allowing a further improvement of the personalisation of Web document search and thus increasing the performance of IR systems.

    Incorporating such a non-textual data type (i.e. dwell) into the hybridisation of the keyword-based and semantic-based searches entails a complex interaction of information attributes in the index structure. A dwell-based filter called dwell-tf-idf, which allows a standard tokeniser to be converted into a keyword tokeniser, is thus proposed. The proposed filter uses an efficient hybrid indexing technique to bring textual and non-textual data types under one umbrella, thus moving beyond simple keyword matching to improve future retrieval applications for web browsers. Adopting precision and recall, the most common evaluation measures, the hybridisation of these approaches proves superior in pushing relevant documents to the top of the ranked lists, compared to a traditional search system. The results were empirically confirmed through human subjects who conducted several real-life Web searches.
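
    A hedged sketch of the general idea of folding a dwell-time signal into a tf-idf term weight appears below. The boost formula and the names dwell_boost and dwell_tf_idf are illustrative assumptions, not the filter actually defined in the thesis.

```python
# Hedged sketch: weight a term by tf-idf, then boost it by how long the user
# dwelt on the document, with diminishing returns past a saturation point.
# The exact combination used in the thesis may differ.
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # Classic term frequency times inverse document frequency.
    return tf * math.log(n_docs / (1 + df))

def dwell_boost(dwell_seconds: float, saturation: float = 60.0) -> float:
    # Longer dwell implies stronger implicit interest, capped at the saturation time.
    return 1.0 + min(dwell_seconds, saturation) / saturation

def dwell_tf_idf(tf: int, df: int, n_docs: int, dwell_seconds: float) -> float:
    return tf_idf(tf, df, n_docs) * dwell_boost(dwell_seconds)

print(dwell_tf_idf(tf=3, df=120, n_docs=10_000, dwell_seconds=45.0))
```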

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic is far beyond humans' capacity to process properly, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by: i) enabling automatic intelligent feature engineering, ii) enabling flexible and interactive summarisation, and iii) utilising intelligent and personalised summarisation approaches. The experimental results prove the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data. Comment: PhD thesis
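
    The following toy sketch illustrates the flavour of flexible, human-in-the-loop extractive summarisation: term weights are adjusted from user feedback and sentences are re-scored. It is a simplified illustration under assumed scoring rules, not the models proposed in the thesis.

```python
# Hedged sketch of interactive extractive summarisation: sentences are scored
# by average term weight, and user feedback ("keep"/"drop") re-weights terms
# before the next scoring round. Purely illustrative.
def score_sentences(sentences, term_weights):
    scores = []
    for s in sentences:
        terms = s.lower().split()
        scores.append(sum(term_weights.get(t, 1.0) for t in terms) / max(len(terms), 1))
    return scores

def apply_feedback(term_weights, sentence, liked, step=0.5):
    # Raise weights of terms the user kept, lower weights of terms they dropped.
    for t in sentence.lower().split():
        term_weights[t] = term_weights.get(t, 1.0) + (step if liked else -step)
    return term_weights

sentences = ["network traffic grew sharply", "the weather was mild", "traffic anomalies were flagged"]
weights = {}
weights = apply_feedback(weights, sentences[0], liked=True)   # user keeps sentence 0
print(score_sentences(sentences, weights))                    # related sentences now score higher
```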

    Engines of Order

    Over the last decades, and in particular since the widespread adoption of the Internet, encounters with algorithmic procedures for ‘information retrieval’ – the activity of getting some piece of information out of a collection or repository of some kind – have become everyday experiences for most people in large parts of the world.

    Improving Reader Motivation with Machine Learning

    This thesis focuses on the problem of increasing reading motivation with machine learning (ML). The act of reading is central to modern human life, and there is much to be gained by improving the reading experience. For example, the internal reading motivation of students, especially their interest and enjoyment in reading, are important factors in their academic success. There are many topics in natural language processing (NLP) which can be applied to improving the reading experience in terms of readability, comprehension, reading speed, motivation, etc. Such topics include personalized recommendation, headline optimization, text simplification, and many others. However, to the best of our knowledge, this is the first work to explicitly address the broad and meaningful impact that NLP and ML can have on the reading experience. In particular, the aim of this thesis is to explore new approaches to supporting internal reading motivation, which is influenced by readability, situational interest, and personal interest. This is performed by identifying new or existing NLP tasks which can address reader motivation, designing novel machine learning approaches to perform these tasks, and evaluating and examining these approaches to determine what they can teach us about the factors of reader motivation. In executing this research, we make use of concepts from NLP such as textual coherence, interestingness, and summarization. We additionally use techniques from ML including supervised and self-supervised learning, deep neural networks, and sentence embeddings. This thesis, presented in an integrated-article format, contains three core contributions among its three articles. In the first article, we propose a flexible and insightful approach to coherence estimation. This approach uses a new sentence embedding which reflects predicted position distributions. Second, we introduce the new task of pull quote selection, examining a spectrum of approaches in depth. This article identifies several concrete heuristics for finding interesting sentences, both expected and unexpected. Third, we introduce a new interactive summarization task called HARE (Hone as You Read), which is especially suitable for mobile devices. Quantitative and qualitative analyses support the practicality and potential usefulness of this new type of summarization.
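
    To make the notion of coherence estimation concrete, the sketch below scores a text by average lexical overlap between adjacent sentences. This is a deliberately simple proxy; the thesis's position-distribution sentence embeddings are not reproduced here.

```python
# Hedged sketch of a simple coherence proxy: average Jaccard overlap between
# adjacent sentences. Illustrates the general idea of scoring how well
# consecutive sentences hang together; not the thesis's embedding approach.
def adjacent_overlap_coherence(sentences):
    if len(sentences) < 2:
        return 1.0
    scores = []
    for prev, curr in zip(sentences, sentences[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(scores) / len(scores)

print(adjacent_overlap_coherence([
    "The model ranks sentences.",
    "Each sentence ranking feeds the summary.",
    "Bananas are yellow.",
]))
```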

    Congenial Web Search : A Conceptual Framework for Personalized, Collaborative, and Social Peer-to-Peer Retrieval

    Traditional information retrieval methods fail to address the fact that information consumption and production are social activities. Most Web search engines do not consider the social-cultural environment of users' information needs or the collaboration between users. This dissertation addresses a new search paradigm for Web information retrieval, denoted Congenial Web Search. It emphasizes personalization, collaboration, and socialization methods in order to improve effectiveness. The client-server architecture of Web search engines only allows the consumption of information. A peer-to-peer system architecture has been developed in this research to improve information seeking. Each user is involved in an interactive process to produce meta-information. Based on a personalization strategy on each peer, the user is supported in giving explicit feedback for relevant documents. The user's information need is expressed by a query that is stored in a Peer Search Memory. On the one hand, query-document associations are incorporated in a personalized ranking method for repeated information needs; its performance is shown in a known-item retrieval setting. On the other hand, explicit feedback from each user is useful to discover collaborative information needs. A new method for a controlled grouping of query terms, links, and users was developed to maintain Virtual Knowledge Communities. The quality of this grouping represents the effectiveness of grouped terms and links. Both strategies, personalization and collaboration, tackle the problem of missing socialization among searchers. Finally, a concept for integrated information seeking was developed. This incorporates an integrated representation to improve the effectiveness of information retrieval and information filtering. An integrated information retrieval process explores a virtual search network of Peer Search Memories in order to accomplish a reputation-based ranking. In addition, the community structure is considered by an integrated information filtering process. Both concepts have been evaluated and shown to perform better than traditional techniques. The methods presented in this dissertation offer the potential for more transparency and control of Web search.
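
    The following sketch illustrates, under assumed data structures, how stored query-document feedback could be used to re-rank results for repeated information needs; the class name PeerSearchMemory and the boost value are illustrative, not the dissertation's implementation.

```python
# Hedged sketch of re-ranking with stored query-document associations, loosely
# in the spirit of a Peer Search Memory: documents the user previously marked
# relevant for the same query are boosted.
from collections import defaultdict

class PeerSearchMemory:
    def __init__(self):
        self._assoc = defaultdict(set)  # query -> doc ids with explicit feedback

    def record_feedback(self, query: str, doc_id: str) -> None:
        self._assoc[query.lower()].add(doc_id)

    def rerank(self, query: str, ranked_docs: list[tuple[str, float]], boost: float = 1.0):
        seen = self._assoc.get(query.lower(), set())
        return sorted(
            ((doc, score + (boost if doc in seen else 0.0)) for doc, score in ranked_docs),
            key=lambda pair: pair[1],
            reverse=True,
        )

memory = PeerSearchMemory()
memory.record_feedback("peer-to-peer retrieval", "doc42")
print(memory.rerank("peer-to-peer retrieval", [("doc7", 0.9), ("doc42", 0.6)]))
```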

    Using community trained recommender models for enhanced information retrieval

    Research in Information Retrieval (IR) seeks to develop methods which better assist users in finding information relevant to their current information needs. Personalization is a significant focus of research for the development of the next generation of IR systems. Commercial search engines are exploring methods to incorporate models of the user's interests to facilitate personalization in IR and improve retrieval effectiveness. However, in some situations there may be no opportunity to learn about the interests of a specific user on a certain topic. This is a significant challenge for IR researchers attempting to improve search effectiveness by exploiting user search behaviour. We propose a solution to this problem based on recommender systems (RSs): a novel IR model which combines a recommender model with traditional IR methods to improve retrieval results for search tasks where the IR system has no opportunity to acquire prior information about the user's knowledge of a domain for which they have not previously entered a query. We use search behaviour data from other previous users to build topic category models based on topic interests. When a user enters a query on a topic which is new to this user, but related to a topical search category, the appropriate topic category model is selected and used to predict a ranking which this user may find interesting based on previous search behaviour. The recommender outputs are used in combination with the output of a standard IR system to produce the overall output to the user. In this thesis, the IR and recommender components of this integrated model are investigated.
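
    A minimal sketch of combining an IR ranking with a recommender ranking by linear score fusion is shown below; the normalisation and the fusion weight alpha are illustrative assumptions rather than the thesis's exact combination model.

```python
# Hedged sketch of linear score fusion between a standard IR ranking and a
# topic-category recommender ranking. Scores are max-normalised per source,
# then mixed with weight alpha; both choices are illustrative.
def fuse_rankings(ir_scores: dict, rec_scores: dict, alpha: float = 0.7):
    def normalise(scores):
        top = max(scores.values(), default=1.0) or 1.0
        return {doc: s / top for doc, s in scores.items()}
    ir_n, rec_n = normalise(ir_scores), normalise(rec_scores)
    docs = set(ir_n) | set(rec_n)
    fused = {d: alpha * ir_n.get(d, 0.0) + (1 - alpha) * rec_n.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

print(fuse_rankings({"d1": 12.0, "d2": 8.0}, {"d2": 0.9, "d3": 0.4}))
```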

    Annotation persistence over dynamic documents

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2005. Includes bibliographical references (p. 212-216). Annotations, as a routine practice of actively engaging with reading materials, are heavily used in the paper world to augment the usefulness of documents. By annotation, we include a large variety of creative manipulations by which the otherwise passive reader becomes actively involved in a document. Annotations in digital form possess many benefits paper annotations do not enjoy, such as annotation searching, annotation multi-referencing, and annotation sharing. The digital form also introduces challenges to the process of annotation. This study looks at one of them: annotation persistence over dynamic documents. With the development of annotation software, users now have the opportunity to annotate documents which they don't own, or to which they don't have write permission. In annotation software, annotations are normally created and saved independently of the document. The owners of the documents being annotated may have no knowledge of the fact that third parties are annotating their documents' contents. When document contents are modified, annotation software faces a difficult situation where annotations need to be reattached. Reattaching annotations in a revised version of a document is a crucial component in annotation system design. Annotation persistence over document versions is a complicated and challenging problem, as documents can go through various changes between versions. In this thesis, we treat annotation persistence over dynamic documents as a specialized information retrieval problem. We then design a scheme to reposition annotations between versions by three mechanisms: meta-structure information match, keyword match, and content semantics match. Content semantics matching is the determining factor in our annotation persistence scheme design. Latent Semantic Analysis, an innovative information retrieval model, is used to extract and compare document semantics. Two editions of an introductory computer science textbook are used to evaluate the annotation persistence scheme proposed in this study. The evaluation provides substantial evidence that the scheme is able to make the right decisions on repositioning annotations based on their degree of modification, i.e. to reattach annotations if modifications are light, and to orphan annotations if modifications are heavy. by Shaomin Wang. Ph.D.
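
    The sketch below illustrates the content-semantics matching step with an off-the-shelf LSA pipeline (tf-idf followed by truncated SVD): an annotation's anchor text is compared to candidate passages of the revised document and either reattached or orphaned. The threshold and number of components are illustrative, not the values used in the thesis.

```python
# Hedged sketch of LSA-based annotation reattachment: project the annotation's
# anchor text and candidate passages into a low-rank LSA space, pick the most
# similar passage, and orphan the annotation if even the best match is weak.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def reattach(anchor_text, candidate_passages, n_components=2, orphan_threshold=0.3):
    corpus = candidate_passages + [anchor_text]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    lsa = TruncatedSVD(n_components=min(n_components, tfidf.shape[1] - 1)).fit_transform(tfidf)
    sims = cosine_similarity(lsa[-1:], lsa[:-1])[0]
    best = int(sims.argmax())
    # Return the index of the best passage, or None to orphan the annotation.
    return (best, float(sims[best])) if sims[best] >= orphan_threshold else (None, float(sims[best]))

passages = ["Recursion is introduced with factorial examples.",
            "Sorting algorithms are compared by running time.",
            "Hash tables store key value pairs for fast lookup."]
print(reattach("the factorial example illustrates recursion", passages))
```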