    D-TERMINE : data-driven term extraction methodologies investigated

    Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation.

    One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, they will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well. Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state of the art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams.

    Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns), and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system.

    Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
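
    To make the traditional hybrid pipeline described in this abstract concrete, here is a minimal, illustrative sketch: a linguistic filter keeps token sequences matching a few part-of-speech patterns, and a simple frequency-based ("weirdness"-style) score ranks the resulting candidate terms against a reference corpus. The toy corpus, the pattern set and the scoring formula are assumptions for illustration only, not the dissertation's or HAMLET's actual implementation.

```python
from collections import Counter
from math import log

# Toy pre-tagged sentences as (token, POS) pairs. In practice a tagger
# (e.g. spaCy or Stanza) would supply these tags -- an assumption here.
domain_corpus = [
    [("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
     ("identifies", "VERB"), ("terminology", "NOUN"), ("in", "ADP"),
     ("specialised", "ADJ"), ("corpora", "NOUN")],
]
# Toy word frequencies from a general-language reference corpus.
reference_freq = Counter({"term": 2, "extraction": 1, "corpora": 1,
                          "terminology": 1, "automatic": 5})

# Step 1: linguistic filter -- keep spans matching simple POS patterns
# (an illustrative pattern set, not an exhaustive term grammar).
PATTERNS = [("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN"),
            ("ADJ", "NOUN", "NOUN")]

def extract_candidates(sentences):
    candidates = Counter()
    for sent in sentences:
        for length in {len(p) for p in PATTERNS}:
            for i in range(len(sent) - length + 1):
                window = sent[i:i + length]
                if tuple(tag for _, tag in window) in PATTERNS:
                    candidates[" ".join(tok for tok, _ in window)] += 1
    return candidates

# Step 2: statistical filter -- rank candidates by a domain-vs-reference
# relative-frequency ratio, higher meaning "more term-like".
def rank(candidates, ref_freq, ref_size=10_000):
    dom_size = sum(candidates.values())
    scores = {}
    for cand, f_dom in candidates.items():
        f_ref = sum(ref_freq.get(w, 0) for w in cand.split()) + 1  # smoothing
        scores[cand] = log((f_dom / dom_size) / (f_ref / ref_size))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

candidates = extract_candidates(domain_corpus)
for term, score in rank(candidates, reference_freq)[:5]:
    print(f"{score:6.2f}  {term}")
```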

    Treatment of Semantic Heterogeneity in Information Retrieval

    "Nowadays, users of information services are faced with highly decentralised, heterogeneous document sources with different content analysis. Semantic heterogeneity occurs e.g. when resources using different systems for content description are searched using a single query system. This report describes several approaches of handling semantic heterogeneity used in projects of the German Social Science Information Centre." (author's abstract

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
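
    As a rough illustration of one of the topics listed above, the sketch below performs fuzzy matching against a toy translation memory using Python's difflib similarity ratio. The TM entries, the threshold and the similarity measure are illustrative assumptions; SCATE's improved fuzzy matching is not reproduced here.

```python
from difflib import SequenceMatcher

# Toy translation memory: (source segment, target segment) pairs.
translation_memory = [
    ("The file could not be opened.", "Le fichier n'a pas pu être ouvert."),
    ("Save the document before closing.", "Enregistrez le document avant de le fermer."),
]

def fuzzy_matches(segment, tm, threshold=0.6):
    """Return TM entries whose source is similar enough to the new segment.

    difflib's ratio() is a simple character-based similarity; real CAT
    tools use more elaborate measures, which is one of the aspects the
    SCATE project investigated.
    """
    scored = [(SequenceMatcher(None, segment, src).ratio(), src, tgt)
              for src, tgt in tm]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)

for score, src, tgt in fuzzy_matches("The file cannot be opened.", translation_memory):
    print(f"{score:.2f}  {src}  ->  {tgt}")
```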

    AI-assisted patent prior art searching - feasibility study

    This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve the operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF), which was set up to help address barriers to innovation in the UK economy.

    Lynx: A knowledge-based AI service platform for content processing, enrichment and analysis for the legal domain

    The EU-funded project Lynx focuses on the creation of a knowledge graph for the legal domain (Legal Knowledge Graph, LKG) and its use for the semantic processing, analysis and enrichment of documents from the legal domain. This article describes the use cases covered in the project, the entire platform that was developed, and the semantic analysis services that operate on the documents.

    Enhancing Word Representation Learning with Linguistic Knowledge

    Representation learning, the process whereby representations are modelled from data, has recently become a central part of Natural Language Processing (NLP). Among the most widely used learned representations are word embeddings trained on large corpora of unannotated text, where the learned embeddings are treated as general representations that can be used across multiple NLP tasks. Despite their empirical successes, word embeddings learned entirely from data can only capture patterns of language usage from the particular linguistic domain of the training data. Linguistic knowledge, which does not vary among linguistic domains, can potentially be used to address this limitation. The vast sources of linguistic knowledge that are readily available nowadays can help train more general word embeddings (i.e. less affected by distance between linguistic domains) by providing them with such information as semantic relations, syntactic structure, word morphology, etc.

    In this research, I investigate the different ways in which word embedding models capture and encode words’ semantic and contextual information. To this end, I propose two approaches to integrate linguistic knowledge into the statistical learning of word embeddings. The first approach is based on augmenting the training data for a well-known Skip-gram word embedding model, where synonym information is extracted from a lexical knowledge base and incorporated into the training data in the form of additional training examples. This data augmentation approach seeks to enforce synonym relations in the learned embeddings. The second approach exploits structural information in text by transforming every sentence in the data into its corresponding dependency parse trees and training an autoencoder to recover the original sentence. While learning a mapping from a dependency parse tree to its originating sentence, this novel Structure-to-Sequence (Struct2Seq) model produces word embeddings that contain information about a word’s structural context. Given that the combination of knowledge and statistical methods can often be unpredictable, a central focus of this thesis is on understanding the effects of incorporating linguistic knowledge into word representation learning. Through the use of intrinsic (geometric characteristics) and extrinsic (performance on downstream tasks) evaluation metrics, I aim to measure the specific influence that the injected knowledge can have on different aspects of the informational composition of word embeddings.
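
    A minimal sketch of the data-augmentation idea behind the first approach described above, assuming gensim's Skip-gram implementation and a hand-written synonym map standing in for a lexical knowledge base such as WordNet. The corpus, the synonym pairs, the way extra training examples are generated and all hyperparameters are illustrative assumptions rather than the thesis's actual setup.

```python
from gensim.models import Word2Vec  # assumption: gensim >= 4 is available

# Toy unannotated corpus (lists of tokens).
corpus = [
    ["the", "model", "learns", "word", "embeddings", "from", "text"],
    ["synonym", "information", "comes", "from", "a", "knowledge", "base"],
    ["embeddings", "capture", "patterns", "of", "language", "usage"],
]

# Synonym pairs that would, in practice, be extracted from a lexical
# knowledge base such as WordNet (hand-written here for illustration).
synonyms = {"model": ["system"], "text": ["document"], "usage": ["use"]}

# Data augmentation: for every sentence, emit extra training examples in
# which a word is replaced by one of its synonyms, so the Skip-gram
# objective sees synonyms in identical contexts.
def augment(sentences, syn_map):
    extra = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            for syn in syn_map.get(tok, []):
                extra.append(sent[:i] + [syn] + sent[i + 1:])
    return sentences + extra

augmented = augment(corpus, synonyms)

# Train Skip-gram (sg=1) on the augmented data; toy hyperparameter values.
model = Word2Vec(augmented, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("model", topn=3))
```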

    DCC Digital Curation Manual: Instalment on Ontologies

    Instalment on the role of ontologies within the digital curation life-cycle. It describes the increasingly important role of ontologies in digital curation, some practical applications, the topic’s place within the OAIS reference model, and advice on developing institution-specific selection frameworks.

    Capturing and Measuring Thematic Relatedness

    In this paper we explain the difference between two aspects of semantic relatedness: taxonomic and thematic relations. We note the lack of evaluation tools for measuring thematic relatedness, identify two datasets that can be recommended as thematic benchmarks, and verify them experimentally. In further experiments, we use these datasets to perform a comprehensive analysis of the performance of an extensive sample of computational models of semantic relatedness, classified according to the sources of information they exploit. We report the models that perform best on each of the two dimensions of semantic relatedness and those that achieve a good balance between the two.
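
    The usual way such benchmarks are used is to correlate a model's similarity scores with human relatedness ratings. The sketch below shows that evaluation recipe with toy data: cosine similarity between word vectors compared against human ratings via Spearman rank correlation. The word pairs, ratings and random vectors are placeholders, not the datasets or models examined in the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy benchmark of word pairs with human thematic-relatedness ratings
# (illustrative values only).
benchmark = [("doctor", "hospital", 8.5), ("dog", "leash", 7.9),
             ("car", "banana", 1.2), ("pen", "paper", 8.1)]

# Toy word vectors standing in for any model under evaluation.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for pair in benchmark for w in pair[:2]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in benchmark]
human_scores = [r for _, _, r in benchmark]

# Spearman rank correlation between model and human judgements is the
# standard figure of merit on relatedness benchmarks.
rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```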

    Semantic Text Analysis tool: SeTA

    An ever-growing number and length of documents, an increasing number and depth of topics covered by legislation, and ever new phrases with slowly changing meanings all contribute to making policy analysis more and more complex. As a consequence, human policy analysts and policy developers face an increasing entanglement of both content and semantic levels. To overcome several of these issues, the JRC has developed a central pilot tool called AI-KEAPA to support policy analysis and development in any domain. Recent developments in big data, machine learning and especially natural language processing make it possible to convert the unfathomable complexity of many hundreds of thousands of documents into a normalised high-dimensional vector space while preserving the knowledge they contain. Unstructured text in document corpora and big data sources, until recently considered just an archive, is quickly becoming a core source of analytical information, with text mining methods used to extract qualitative and quantitative data. Semantic analysis allows us to extract better information for policy analysis from metadata titles and abstracts than from the structured human-entered descriptions. This digital assistant allows document search and extraction over many different sources, as well as the discovery of phrase meaning, context and temporal development. It can recommend the most relevant documents, including their semantic and temporal interdependencies. Most importantly, it helps burst knowledge bubbles and speeds up learning in new domains. In this way we hope to mainstream artificial intelligence into policy support. The tool is now fit for purpose. It was thoroughly tested in real-life conditions for about two years, mainly in the area of legislative impact assessments for policy formulation, and in other domains such as large data infrastructure analysis, agri-environmental measures or natural disasters, some of which are detailed in this document. This approach boosts the strategic JRC focus on the application of scientific analysis and development. The service adds to the JRC's competence and central position in semantic reasoning for policy analysis, active information recommendation, and inferred knowledge in policy design and development. JRC.I.3 - Text and Data Mining
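
    As a simplified illustration of the document search and recommendation functionality described above, the sketch below maps a toy document collection into a normalised vector space with TF-IDF and ranks documents against a query by cosine similarity. The documents, the query and the use of scikit-learn's TfidfVectorizer are assumptions for illustration; the actual system uses richer semantic embeddings over far larger corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy policy-document collection; the real system works over hundreds of
# thousands of documents.
docs = [
    "Impact assessment of the proposed data infrastructure regulation.",
    "Agri-environmental measures and their effect on biodiversity.",
    "Natural disaster response coordination across member states.",
    "Legislative impact assessment methodology for policy formulation.",
]

# Map documents into a normalised vector space (TF-IDF here as a simple
# stand-in for the semantic embeddings described in the abstract).
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)

def recommend(query, top_n=2):
    """Return the most relevant documents for a free-text query."""
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, doc_vectors).ravel()
    ranked = sims.argsort()[::-1][:top_n]
    return [(docs[i], float(sims[i])) for i in ranked]

for text, score in recommend("impact assessment for new legislation"):
    print(f"{score:.3f}  {text}")
```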