2,461 research outputs found

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications, which can be described as recall-oriented IR tasks, have received increased attention in the IR research domain. Prominent among these applications are patent search and legal search, where users are typically prepared to check hundreds or possibly thousands of documents in order to find every possibly relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. Mean average precision continues to be used as the primary evaluation metric for almost all IR applications, yet for recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, differs from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task: IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied, and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general, and patent search in particular, and toward improving the efficiency of multilingual search for this kind of task.
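The contrast between precision- and recall-oriented evaluation can be made concrete with a toy sketch. The metric definitions below are the standard textbook ones; the document IDs and run are invented for illustration. Average precision punishes relevant documents that appear late in the ranking, while an examiner who is willing to scan the whole top-N cares mainly about recall within that budget:

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def recall_at(ranked, relevant, n):
    """Fraction of the relevant documents found within the top-n results."""
    return len(set(ranked[:n]) & relevant) / len(relevant) if relevant else 0.0

# Toy run that buries all three relevant docs deep in the list
ranked = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4"]
relevant = {"d2", "d4", "d5"}

ap = average_precision(ranked, relevant)  # low: every hit arrives late
r8 = recall_at(ranked, relevant, 8)       # 1.0: all hits are inside the top-8 budget
```

A system like this looks poor under average precision but perfect to a user who inspects the full list, which is exactly the mismatch the abstract describes.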

    Investigating the effects of controlled language on the reading and comprehension of machine translated texts: A mixed-methods approach

    This study investigates whether the use of controlled language (CL) improves the readability and comprehension of technical support documentation produced by a statistical machine translation system. Readability is operationalised here as the extent to which a text can be easily read in terms of formal linguistic elements, while comprehensibility is defined as how easily a text's content can be understood by the reader. A biphasic mixed-methods triangulation approach is taken, in which a number of quantitative and qualitative evaluation methods are combined. These include: eye tracking, automatic evaluation metrics (AEMs), retrospective interviews, human evaluations, memory recall testing, and readability indices. A further aim of the research is to investigate what, if any, correlations exist between the various metrics used, and to explore the cognitive framework of the evaluation process. The research finds that the use of CL input results in significantly higher scores for items recalled by participants, and for several of the eye tracking metrics: fixation count, fixation length, and regressions. However, the findings show slight but statistically insignificant increases for readability indices and human evaluations, and slight but statistically insignificant decreases for AEMs. Several significant correlations between the above metrics are identified, as well as predictors of readability and comprehensibility.
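Readability indices of the kind used in the study reduce surface features of a text to a single score. As a minimal sketch, here is the well-known Flesch Reading Ease formula with a deliberately crude vowel-group syllable heuristic (production implementations use dictionaries or better rules; the sample strings are invented):

```python
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

simple = "The cat sat. The dog ran."
dense = "Comprehensibility operationalisation necessitates multidimensional triangulation."
# Short sentences of short words score far higher than dense polysyllabic prose.
```

The index is purely formal, which is why the study pairs such scores with eye tracking and recall testing rather than relying on them alone.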

    Thematic Annotation: extracting concepts out of documents

    Contrary to standard approaches to topic annotation, the technique used in this work does not centrally rely on some form of -- possibly statistical -- keyword extraction. Instead, the proposed annotation algorithm uses a large-scale semantic database -- the EDR Electronic Dictionary -- that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to generate a synthetic representation of the document by aggregating the words present in topically homogeneous document segments into a set of concepts that best preserves the document's content. This new extraction technique takes an unexplored approach to topic selection: instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. This conceptual hierarchy is then searched to extract the most relevant set of concepts to represent the topics discussed in the document. Notably, this algorithm is able to extract generic concepts that are not directly present in the document. Comment: Technical report EPFL/LIA, 81 pages, 16 figures.
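The aggregation step -- climbing hyponym/hypernym links to find a concept that subsumes a segment's words -- can be illustrated with a toy hierarchy. The EDR dictionary itself is far larger; the words, links, and helper names below are invented for illustration:

```python
# Toy hyponym -> hypernym links standing in for the EDR concept hierarchy
hypernym = {
    "poodle": "dog", "beagle": "dog", "siamese": "cat",
    "dog": "animal", "cat": "animal", "animal": "entity",
}

def ancestors(concept):
    """All concepts on the path from `concept` up to the root, inclusive."""
    chain = [concept]
    while concept in hypernym:
        concept = hypernym[concept]
        chain.append(concept)
    return chain

def best_covering_concept(words):
    """Most specific concept subsuming every word: a crude aggregation step."""
    common = set(ancestors(words[0]))
    for w in words[1:]:
        common &= set(ancestors(w))
    # The deepest shared ancestor (longest path to the root) is the most specific.
    return max(common, key=lambda c: len(ancestors(c)))
```

For example, `best_covering_concept(["poodle", "beagle"])` yields "dog", while mixing in "siamese" forces the more generic "animal" -- a concept that, as the abstract notes, need not appear in the document at all.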

    Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents

    Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 alone. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently under-researched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (thus making it hard to draw conclusions), and that even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstractive, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain, with limited success. For the automatic text simplification task, we noticed a lack of simplified texts and parallel corpora. We fill this gap by defining a method to automatically generate a silver standard for patent simplification.
Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.
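To give a flavour of the extractive baselines such a benchmark might include, here is a hedged sketch of a lead-k summarizer and a simplified unigram-recall score in the spirit of ROUGE-1 (real evaluations use the full ROUGE toolkit with stemming and multiple variants; the sentences are invented):

```python
def lead_k(sentences, k=3):
    """Lead-k extractive baseline: the summary is simply the first k sentences."""
    return sentences[:k]

def rouge1_recall(summary_sentences, reference):
    """Simplified ROUGE-1 recall: fraction of reference unigrams the summary covers."""
    summary_tokens = set(" ".join(summary_sentences).lower().split())
    ref_tokens = reference.lower().split()
    return sum(tok in summary_tokens for tok in ref_tokens) / len(ref_tokens)

doc = ["patents are long", "and hard to read",
       "claims use legal jargon", "filler text"]
reference = "patents are long and hard to read"
score = rouge1_recall(lead_k(doc, 2), reference)  # first two sentences cover the reference
```

Such baselines matter precisely because, as the abstract argues, approaches that are never run on a shared benchmark cannot be meaningfully compared.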

    Knowledge Selection and Ranking Methods Leveraging Conversational Characteristics in Knowledge-Grounded Conversation

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Electrical and Computer Engineering, August 2022. Advisor: Sang-goo Lee. A knowledge-grounded conversation (KGC) model aims to generate informative responses relevant to both the conversation history and external knowledge. One of the most important parts of a KGC model is finding the knowledge that provides the basis on which the responses are grounded. If the model selects inappropriate knowledge, it may produce responses that are irrelevant or lack knowledge. In this dissertation, we study methods of leveraging conversational characteristics to select or rank the knowledge for knowledge-grounded conversation. In particular, this dissertation provides two novel methods: one focuses on the sequential structure of multi-turn conversation, and the other on utilizing the local context and topic of a long conversation. We first propose two knowledge selection strategies, of which one preserves the sequential matching features and the other encodes the sequential nature of the conversation. Second, we propose a novel knowledge ranking model that composes an appropriate range of relevant documents by exploiting both the topic keywords and the local context of a conversation. In addition, we apply the knowledge ranking model to quote recommendation with our new quote recommendation framework, which provides hard negative samples to the model. Our experimental results show that KGC models based on our proposed knowledge selection and ranking methods outperform competitive models in terms of groundedness and relevance.
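The dual-matching idea described in the abstract above -- scoring candidate knowledge against both the local context (the last utterance) and the conversation's topic keywords -- might be sketched as a simple lexical blend. This is an illustrative re-ranker, not the dissertation's neural model; `alpha` and the token-overlap scoring are assumptions:

```python
def rerank(candidates, local_context, topic_keywords, alpha=0.5):
    """Sort candidate knowledge snippets by a blend of two overlap scores:
    one against the last utterance, one against the topic keywords."""
    def overlap(text, ref_tokens):
        return len(set(text.lower().split()) & ref_tokens) / (len(ref_tokens) or 1)
    ctx = set(local_context.lower().split())
    topics = set(t.lower() for t in topic_keywords)
    scored = [(alpha * overlap(c, ctx) + (1 - alpha) * overlap(c, topics), c)
              for c in candidates]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]

candidates = ["jazz began in new orleans", "quantum computing uses qubits"]
top = rerank(candidates, "tell me about jazz history", ["jazz", "music"])
```

The point of combining both signals is that the last utterance alone can be topically ambiguous, while topic keywords alone ignore the immediate conversational need.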

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
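The role of sub-word information can be illustrated with fastText-style character n-grams, where morphological variants share units and thus share statistical strength. This is a sketch of the general technique, not the paper's exact setup:

```python
def char_ngrams(word, n=3):
    """fastText-style subword units: character n-grams over a boundary-marked word."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Morphological variants share most of their n-grams, so a rare inflected form
# inherits information from related words instead of getting a poorly trained vector.
shared = set(char_ngrams("translated")) & set(char_ngrams("translating"))
```

When only a million sentences per language are available, many word forms are rare, which is one plausible reason sub-word sharing matters so much in this low-resource setting.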