109 research outputs found

    Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents

    Get PDF
    Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts

    PersoNER: Persian named-entity recognition

    Full text link
    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network

    Abstractive multi-document summarization - paraphrasing and compressing with neural networks

    Get PDF
    This thesis presents studies in neural text summarization for single and multiple documents.The focus is on using sentence paraphrasing and compression for generating fluent summaries, especially in multi-document summarization where there is data paucity. A novel solution is to use transfer-learning from downstream tasks with an abundance of data. For this purpose, we pre-train three models for each of extractive summarization, paraphrase generation and sentence compression. We find that summarization datasets – CNN/DM and NEWSROOM – contain a number of noisy samples. Hence, we present a method for automatically filtering out this noise. We combine the representational power of the GRU-RNN and TRANSFORMER encoders in our paraphrase generation model. In training our sentence compression model, we investigate the impact of using different early-stopping criteria, such as embedding-based cosine similarity and F1. We utilize the pre-trained models (ours, GPT2 and T5) in different settings for single and multi-document summarization.SGS Tuition Award Alberta Innovates Technology Futures (AITF

    Generative Transformers for Design Concept Generation

    Full text link
    Generating novel and useful concepts is essential during the early design stage to explore a large variety of design opportunities, which usually requires advanced design thinking ability and a wide range of knowledge from designers. Growing works on computer-aided tools have explored the retrieval of knowledge and heuristics from design data. However, they only provide stimuli to inspire designers from limited aspects. This study explores the recent advance of the natural language generation (NLG) technique in the artificial intelligence (AI) field to automate the early-stage design concept generation. Specifically, a novel approach utilizing the generative pre-trained transformer (GPT) is proposed to leverage the knowledge and reasoning from textual data and transform them into new concepts in understandable language. Three concept generation tasks are defined to leverage different knowledge and reasoning: domain knowledge synthesis, problem-driven synthesis, and analogy-driven synthesis. The experiments with both human and data-driven evaluation show good performance in generating novel and useful concepts.Comment: Accepted by J. Comput. Inf. Sci. En

    A logic model of the European patent system

    Get PDF
    In patent law most of the crucial legal questions such as patentability and infringement are linked to the patent claims. The European Patent Office regards patent claims as a set of independent features which are examined separately in a more or less formal way. The author has found that this approach allows for developing a simple logic model which treats patent claim features as logical statements and patent claims as compound statements wherein the individual logical statements are connected by logical connectives. The proposed logic model provides a uniform system for examining various legal questions that are dealt with separately under current case-law, moreover, it allows for examining the logical coherence between the different case-law decisions as well as detecting any hidden logical inconsistencies. The present paper offers an overview of the different legal questions linked to the patent claim and demonstrates the practical application of the proposed model
    • …
    corecore