
    A Survey of Paraphrasing and Textual Entailment Methods

    Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment, and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
    Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201
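
    The survey frames paraphrase recognition as textual entailment that holds in both directions. As an illustration of that framing only, here is a minimal sketch; the entails predicate is a hypothetical placeholder (a crude word-overlap heuristic), not any recognizer from the survey.

        # Minimal sketch: paraphrase recognition as bidirectional textual entailment.
        # `entails` stands in for any real entailment recognizer; the word-overlap
        # heuristic below is purely illustrative.

        def entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
            """Crude heuristic: most hypothesis words also appear in the premise."""
            p_words = set(premise.lower().split())
            h_words = set(hypothesis.lower().split())
            if not h_words:
                return True
            return len(h_words & p_words) / len(h_words) >= threshold

        def is_paraphrase(a: str, b: str) -> bool:
            """Paraphrase recognition as entailment in both directions."""
            return entails(a, b) and entails(b, a)

        print(is_paraphrase("the cat sat on the mat", "on the mat the cat sat"))  # True
        print(is_paraphrase("the cat sat on the mat", "the cat sat"))             # False: entailment holds in one direction only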

    Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval

    Semantic similarity-based retrieval plays an increasingly important role in many IR systems, such as modern web search, question answering, and similar-document retrieval. Improvements in retrieving semantically similar content matter greatly for applications such as Quora, Stack Overflow, and Siri. We propose a novel unsupervised model for semantic similarity-based content retrieval, where we construct semantic flow graphs for each query, and introduce the concept of "soft seeding" in graph-based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models and produces comparable results to the best supervised models. Our research provides a method to tackle semantic similarity-based retrieval without any training data, and allows seamless extension to different domain QA communities, as well as to other semantic equivalence tasks.
    Comment: Published in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17)
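
    The abstract does not spell out how "soft seeding" is realized, so the following is only one plausible reading: a retrieval graph built from tf-idf cosine similarities, where the query node is injected as a soft (non-binary) seed and relevance scores spread by a standard SSL label-propagation iteration. The graph construction, the propagation schedule, and the seed_weight parameter are assumptions for illustration, not the paper's model.

        # Sketch: graph-based label propagation with a soft-seeded query node.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def soft_seeded_retrieval(query, documents, seed_weight=0.85, n_iter=30):
            texts = [query] + documents
            X = TfidfVectorizer().fit_transform(texts)
            W = cosine_similarity(X)
            np.fill_diagonal(W, 0.0)                                   # no self-loops
            W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)    # row-normalize
            seeds = np.zeros(len(texts))
            seeds[0] = 1.0                                             # soft seed: the query node
            scores = np.zeros(len(texts))
            for _ in range(n_iter):
                # spread neighbors' scores while pulling nodes back toward their soft seed labels
                scores = seed_weight * (W @ scores) + (1 - seed_weight) * seeds
            ranking = sorted(enumerate(scores[1:], start=1), key=lambda kv: -kv[1])
            return [(documents[i - 1], s) for i, s in ranking]

        docs = ["how do I reverse a list in python",
                "reversing a python list with slicing",
                "best pizza places downtown"]
        print(soft_seeded_retrieval("python reverse list", docs))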

    Improving Information Extraction using Knowledge Model


    Discourse oriented summarization

    The meaning of text appears to be tightly related to intentions and circumstances. Context sensitivity of meaning is addressed by theories of discourse structure, yet few attempts have been made to exploit text organization in summarization. This thesis explores what knowledge of discourse structure can do for content selection as a subtask of automatic summarization, and query-based summarization in particular. Query-based summarization is the task of answering an arbitrary user query or question using content from potentially relevant sources.

    This thesis presents a general framework for discourse-oriented summarization, relying on graphs to represent semantic relations in discourse, and on redundancy as a special type of semantic relation. Semantic relations occur on several levels of text analysis (query relevance, coherence, layout, etc.), and a broad range of textual features may be required to detect them. The graph-based framework facilitates combining multiple features into an integrated semantic model of the documents to summarize. Recognizing redundancy and entailment relations between text passages is particularly important when a summary is generated from multiple documents, e.g. to avoid including redundant content in the summary. For this reason, I pay particular attention to recognizing textual entailment.

    Within this framework, a three-fold evaluation is performed. The first is a user study, measuring the effect on user appreciation of using a particular type of knowledge for query-based summarization. Three presentation strategies are compared: summarization using the rhetorical structure of the source, a baseline summarization method which uses the layout of the source, and a baseline presentation method which uses no summarization but just a concise answer to the query. Results show that knowledge of the rhetorical structure not only helps to provide the necessary context for the user to verify that the summary addresses the query adequately, but also increases the amount of relevant content.

    The second evaluation is a comparison of implementations of the graph-based framework which are capable of fully automatic summarization. The two variables in the experiment are the set of textual features used to model the source and the algorithm used to search the graph for relevant content. The features are based on cosine similarity and are realized as graph representations of the source; the graph search algorithms are inspired by existing algorithms in summarization. The quality of summaries is measured using the Rouge evaluation toolkit. The best performer would have ranked first (Rouge-2) or second (Rouge-SU4) had it participated in the DUC 2005 query-based summarization challenge.

    The third study is an evaluation in the context of the DUC 2006 summarization challenge, which includes readability measurements as well as various content-based evaluation metrics. The evaluated automatic discourse-oriented summarization system is similar to the one described above, but uses additional features, i.e. layout and textual entailment. The system performed well on readability, at the cost of content-based scores that were well below those of the highest-ranking DUC 2006 participant. This indicates a trade-off between readable, coherent content and useful content, an issue yet to be explored.

    Previous research implies that theories of text organization generalize well to multimedia. This suggests that the discourse-oriented summarization framework applies to summarizing multimedia as well, provided sufficient knowledge of the organization of the (multimedia) source documents is available. The last study in this thesis investigates the applicability of structural relations in multimedia for generating picture-illustrated summaries, by relating summary content to picture-associated text (i.e. captions or surrounding paragraphs). Results suggest that captions are the more suitable annotation for selecting appropriate pictures. Compared to manual illustration, automatically selected pictures are similar when the manual picture is mainly decorative.
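
    The framework described above combines textual features into a graph over which relevant content is searched. As a rough illustration only, assuming a tf-idf cosine-similarity graph and a greedy relevance-minus-redundancy selection rule (in the spirit of MMR, not the thesis's own graph search algorithms), query-based content selection might look like this:

        # Illustrative sketch: query-based content selection on a cosine-similarity graph.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def select_content(query, sentences, k=3, redundancy_weight=0.5):
            X = TfidfVectorizer().fit_transform([query] + sentences)
            sims = cosine_similarity(X)
            relevance = sims[0, 1:]      # query-to-sentence edges
            sent_sims = sims[1:, 1:]     # sentence-to-sentence edges (redundancy relations)

            selected = []
            while len(selected) < min(k, len(sentences)):
                best, best_score = None, -np.inf
                for i in range(len(sentences)):
                    if i in selected:
                        continue
                    # penalize sentences too similar to already selected content
                    redundancy = max((sent_sims[i, j] for j in selected), default=0.0)
                    score = relevance[i] - redundancy_weight * redundancy
                    if score > best_score:
                        best, best_score = i, score
                selected.append(best)
            return [sentences[i] for i in selected]

        summary = select_content("discourse structure in summarization",
                                 ["Discourse structure helps content selection.",
                                  "Content selection benefits from discourse structure.",
                                  "Pizza is best served hot."],
                                 k=2)
        print(summary)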

    Measuring Semantic Similarity: Representations and Methods

    This dissertation investigates and proposes ways to quantify and measure semantic similarity between texts. The general approach is to rely on linguistic information at various levels, including lexical, lexico-semantic, and syntactic. The approach starts by mapping texts onto structured representations that include lexical, lexico-semantic, and syntactic information. The representation is then used as input to methods designed to measure the semantic similarity between texts based on the available linguistic information. While world knowledge is needed to properly assess the semantic similarity of texts, our approach does not use it, which is a weakness of the approach; we limit ourselves to answering the question of how successfully one can measure the semantic similarity of texts using linguistic information alone. The lexical information in the original texts is retained by using the words in the corresponding representations of the texts. Syntactic information is encoded using dependency trees, which represent explicitly the syntactic relations between words. Word-level semantic information is encoded either implicitly, through semantic similarity measures such as WordNet similarity, or explicitly, using vectorial representations such as Latent Semantic Analysis (LSA). Several methods are studied to compare the representations, ranging from simple lexical overlap to more complex methods such as comparing semantic representations in vector spaces as well as syntactic structures. Furthermore, a few powerful kernel models are proposed for use in combination with Support Vector Machine (SVM) classifiers for the case in which the semantic similarity problem is modeled as a classification task.
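
    As a concrete baseline for the kinds of measures discussed, the sketch below computes plain lexical overlap and a cosine similarity in an LSA-style reduced vector space. The dimensionality, the tf-idf weighting, and the helper names are illustrative assumptions, not the dissertation's actual models.

        # Sketch: two simple text-to-text similarity measures.
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def lexical_overlap(a, b):
            """Jaccard overlap of the word sets of two texts."""
            wa, wb = set(a.lower().split()), set(b.lower().split())
            return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

        def lsa_similarity(a, b, corpus, dims=100):
            """Cosine similarity of two texts in a truncated-SVD (LSA-like) space
            learned from a background corpus."""
            vec = TfidfVectorizer().fit(corpus + [a, b])
            X = vec.transform(corpus)
            k = max(1, min(dims, X.shape[0] - 1, X.shape[1] - 1))
            svd = TruncatedSVD(n_components=k).fit(X)
            va = svd.transform(vec.transform([a]))
            vb = svd.transform(vec.transform([b]))
            return float(cosine_similarity(va, vb)[0, 0])

        corpus = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
        print(lexical_overlap("a cat sat", "the cat sat on a mat"))
        print(lsa_similarity("a cat sat", "dogs chasing a cat", corpus))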

    Modeling Search Behaviors Based on User Interactions

    Users of information systems normally divide tasks into a sequence of multiple steps to solve them. In particular, users divide search tasks into sequences of queries, interacting with search systems to carry out the information-seeking process. User interactions are registered in search query logs, enabling the development of models that automatically learn search patterns from the users' interactions with search systems. These models underpin multiple user-assisting applications that help search systems be more interactive, user-friendly, and coherent. User-assisting applications include query suggestion, the ranking of search results based on tasks, query reformulation analysis, e-commerce applications, retrieval of advertisements, query-term prediction, mapping of queries to search tasks, and so on.
    Consequently, we propose the following contributions: a neural model for learning to detect search task boundaries in query logs; a recurrent deep clustering architecture that simultaneously learns query representations through self-training and clusters queries into groups of search tasks; Multilingual Graph-Based Clustering, an unsupervised, user-agnostic model for search task identification supporting queries in sixteen languages; and the Language-agnostic Search Task Model, an unsupervised approach that simultaneously models user search intent and search tasks. The proposed models improve on existing methods for modeling user interactions, taking into account user privacy, real-time response, and language accessibility. User privacy is a major concern in the ethics of intelligent systems, while fast responses are critical for search systems interacting with users in real time, particularly in conversational search. At the same time, language accessibility is essential to assist users worldwide, who interact with search systems in many languages. The proposed contributions can benefit many user-assisting applications, helping users to better solve their search tasks when accessing search systems to fulfill their information needs.
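
    The boundary-detection problem at the heart of these contributions can be made concrete with a simple heuristic baseline (not the thesis's neural or clustering models): start a new search task whenever consecutive queries share too little vocabulary or are separated by a long pause. The Jaccard measure, the thresholds, and the 30-minute gap below are assumptions chosen only for illustration.

        # Sketch: a heuristic baseline for segmenting a query log into search tasks.
        from datetime import datetime, timedelta

        def jaccard(a, b):
            wa, wb = set(a.lower().split()), set(b.lower().split())
            return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

        def segment_tasks(log, sim_threshold=0.2, max_gap=timedelta(minutes=30)):
            """log: list of (timestamp, query) pairs sorted by timestamp."""
            tasks = []
            for ts, query in log:
                if tasks:
                    prev_ts, prev_query = tasks[-1][-1]
                    same_task = (jaccard(prev_query, query) >= sim_threshold
                                 and ts - prev_ts <= max_gap)
                    if same_task:
                        tasks[-1].append((ts, query))
                        continue
                tasks.append([(ts, query)])
            return tasks

        log = [(datetime(2023, 1, 1, 9, 0), "python reverse list"),
               (datetime(2023, 1, 1, 9, 2), "reverse a list in python slicing"),
               (datetime(2023, 1, 1, 13, 0), "weather paris")]
        print(len(segment_tasks(log)))  # 2 search tasks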