    EELECTION at SemEval-2017 Task 10: Ensemble of nEural Learners for kEyphrase ClassificaTION

    This paper describes our approach to the SemEval 2017 Task 10: "Extracting Keyphrases and Relations from Scientific Publications", specifically to Subtask (B): "Classification of identified keyphrases". We explored three different deep learning approaches: a character-level convolutional neural network (CNN), a stacked learner with an MLP meta-classifier, and an attention-based Bi-LSTM. From these approaches, we created an ensemble of differently hyper-parameterized systems, achieving a micro-F1 score of 0.63 on the test data. Our approach ranks 2nd (score of the 1st-placed system: 0.64) out of four according to this official score. However, we erroneously trained 2 out of 3 neural nets (the stacker and the CNN) on only roughly 15% of the full data, namely, the original development set. When trained on the full data (training + development), our ensemble has a micro-F1 score of 0.69. Our code is available from https://github.com/UKPLab/semeval2017-scienceie. Comment: In revision, changed to pdfTeX output.
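
    A minimal sketch of the kind of system described above: an attention-based Bi-LSTM keyphrase classifier, combined with several differently hyper-parameterized copies of itself. All dimensions, the toy usage, and the combination rule (averaging class probabilities) are illustrative assumptions, not the authors' implementation, which lives in the linked repository.

        import torch
        import torch.nn as nn

        class AttentionBiLSTM(nn.Module):
            """Attention-based Bi-LSTM keyphrase classifier (illustrative only)."""
            def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=3):
                super().__init__()
                self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
                self.attention = nn.Linear(2 * hidden_dim, 1)
                self.output = nn.Linear(2 * hidden_dim, num_classes)

            def forward(self, token_ids):                                   # (batch, seq_len)
                states, _ = self.lstm(self.embedding(token_ids))            # (batch, seq_len, 2*hidden)
                weights = torch.softmax(self.attention(states).squeeze(-1), dim=1)
                context = torch.bmm(weights.unsqueeze(1), states).squeeze(1)
                return self.output(context)                                 # (batch, num_classes)

        def ensemble_predict(models, token_ids):
            """Combine differently hyper-parameterized models, here by averaging probabilities."""
            with torch.no_grad():
                probs = torch.stack([torch.softmax(m(token_ids), dim=-1) for m in models])
            return probs.mean(dim=0).argmax(dim=-1)

        # Toy usage with untrained models and random token ids, for shapes only.
        models = [AttentionBiLSTM(vocab_size=5000, hidden_dim=h).eval() for h in (64, 128, 256)]
        print(ensemble_predict(models, torch.randint(1, 5000, (2, 12))))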

    SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications

    We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks, and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as for the broader knowledge base population and information extraction communities.

    SEAL: Scientific Keyphrase Extraction and Classification

    Automatic scientific keyphrase extraction is a challenging problem facilitating several downstream scholarly tasks like search, recommendation, and ranking. In this paper, we introduce SEAL, a scholarly tool for automatic keyphrase extraction and classification. The keyphrase extraction module comprises a two-stage neural architecture composed of Bidirectional Long Short-Term Memory cells augmented with Conditional Random Fields. The classification module consists of a Random Forest classifier. We experiment extensively to showcase the robustness of the system. We evaluate multiple state-of-the-art baselines and show a significant improvement. The current system is hosted at http://lingo.iitgn.ac.in:5000/. Comment: Accepted at JCDL 202
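
    As a rough sketch of the second stage only: keyphrases produced by a BiLSTM-CRF extraction stage could be fed to a Random Forest classifier, for example via character n-gram features as below. The features, labels, and hyper-parameters here are hypothetical and not SEAL's actual configuration.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline

        # Hypothetical training data: keyphrases already produced by the
        # extraction stage, paired with gold type labels (illustrative only).
        keyphrases = ["convolutional neural network", "titanium dioxide", "image segmentation"]
        labels = ["Process", "Material", "Task"]

        classifier = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
            RandomForestClassifier(n_estimators=200, random_state=0),
        )
        classifier.fit(keyphrases, labels)
        print(classifier.predict(["recurrent neural network"]))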

    Leveraging structure for learning representations of words, sentences and knowledge bases

    This thesis presents work on learning representations of text and Knowledge Bases by taking into consideration their respective structures. The tasks on which the methods are developed and evaluated are: Short-text classification, Word Sense Induction and Disambiguation, Knowledge Base Completion with linked text corpora, and large-scale Knowledge Base Question Answering. An introductory chapter states the aims and scope of the thesis, followed by a chapter on technical background and definitions. In Chapter 3, the impact of dependency syntax on word representation learning in the context of short-text classification is investigated. A new definition of context in dependency graphs is proposed, which generalizes and extends previous definitions used in word representation learning. The resulting word and dependency feature embeddings are used together to represent dependency graph substructures in text classifiers. In Chapter 4, a probabilistic latent variable model for Word Sense Induction and Disambiguation is presented. The model estimates sense clusters using pretrained continuous feature vectors of multiple context types: syntactic, local lexical, and global lexical, while the number of sense clusters is determined by the Integrated Complete Likelihood criterion. A model for Knowledge Base Completion with linked text corpora is presented in Chapter 5. The proposed model represents potential facts by merging subgraphs of the knowledge base with text through linked entities. The model learns to embed the merged graphs into a lower-dimensional space and to score the plausibility of the fact with a Multilayer Perceptron. Chapter 6 presents a system for Question Answering on Knowledge Bases. The system learns to decompose questions into entity and relation mentions and to score their compatibility with queries on the knowledge base expressed as subgraphs. The model consists of several components trained jointly in order to match parts of the question with parts of a potential query by embedding their corresponding structures in lower-dimensional spaces.
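
    As a small illustration of the dependency-graph contexts discussed in Chapter 3, the snippet below collects, for every word, its syntactic neighbours together with the relations that link them (using spaCy). This is only the basic word-plus-dependency context idea; the thesis's own context definition generalizes it.

        import spacy

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("The committee approved the controversial proposal.")

        # For each word, gather (context word, dependency relation) pairs from its
        # immediate syntactic neighbourhood: its children, plus the inverse relation
        # to its head.
        for token in doc:
            contexts = [(child.text, child.dep_) for child in token.children]
            if token.head is not token:
                contexts.append((token.head.text, token.dep_ + "^-1"))
            print(token.text, contexts)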

    The Role of Linguistics in Probing Task Design

    Over the past decades, natural language processing has evolved from a niche research area into a fast-paced and multi-faceted discipline that attracts thousands of contributions from academia and industry and feeds into real-world applications. Despite the recent successes, natural language processing models still struggle to generalize across domains, suffer from biases, and lack transparency. Aiming to get a better understanding of how and why modern NLP systems make their predictions for complex end tasks, a line of research in probing attempts to interpret the behavior of NLP models using basic probing tasks. Linguistic corpora are a natural source of such tasks, and linguistic phenomena like part of speech, syntax, and role semantics are often used in probing studies. The goal of probing is to find out what information can be easily extracted from a pre-trained NLP model or representation. To ensure that the information is extracted from the NLP model and not learned during the probing study itself, probing models are kept as simple and transparent as possible, exposing and augmenting conceptual inconsistencies between NLP models and linguistic resources. In this thesis we investigate how linguistic conceptualization can affect probing models, setups, and results. In Chapter 2 we investigate the gap between the targets of classical type-level word embedding models like word2vec and the items of lexical resources and similarity benchmarks. We show that the lack of conceptual alignment between word embedding vocabularies and lexical resources penalizes the word embedding models in both a benchmark-based and our novel resource-based evaluation scenario. We demonstrate that simple preprocessing techniques like lemmatization and POS tagging can partially mitigate the issue, leading to a better match between word embeddings and lexicons. Linguistics often has more than one way of describing a certain phenomenon. In Chapter 3 we conduct an extensive study of the effects of linguistic formalism on probing modern pre-trained contextualized encoders like BERT. We use role semantics as an excellent example of a data-rich multi-framework phenomenon. We show that the choice of linguistic formalism can affect the results of probing studies, and we deliver additional insights on the impact of dataset size, domain, and task architecture on probing. Apart from mere labeling choices, linguistic theories might differ in the very way of conceptualizing the task. Whereas mainstream NLP has treated semantic roles as a categorical phenomenon, an alternative, prominence-based view opens new opportunities for probing. In Chapter 4 we investigate prominence-based probing models for role semantics, including semantic proto-roles and our novel regression-based role probe. Our results indicate that pre-trained language models like BERT might encode argument prominence. Finally, we propose an operationalization of the thematic role hierarchy, a widely used linguistic tool to describe the syntactic behavior of verbs, and show that thematic role hierarchies can be extracted from text corpora and transfer cross-lingually. The results of our work demonstrate the importance of linguistic conceptualization for probing studies, and highlight the dangers and the opportunities associated with using linguistics as a meta-language for NLP model interpretation.
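
    To make the general probing recipe concrete, here is a minimal, hypothetical sketch: freeze a pre-trained encoder, extract token representations, and fit a simple linear probe on a linguistic label (a toy part-of-speech tag below). This illustrates the recipe only, not the specific probes, datasets, or encoders used in the thesis.

        import torch
        from transformers import AutoModel, AutoTokenizer
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        encoder = AutoModel.from_pretrained("bert-base-uncased")
        encoder.eval()  # the encoder stays frozen; only the probe is trained

        sentences = ["The cat sleeps .", "A dog barks ."]
        labels = [["DET", "NOUN", "VERB", "PUNCT"], ["DET", "NOUN", "VERB", "PUNCT"]]

        features, targets = [], []
        for sent, tags in zip(sentences, labels):
            enc = tokenizer(sent.split(), is_split_into_words=True, return_tensors="pt")
            with torch.no_grad():
                hidden = encoder(**enc).last_hidden_state[0]
            for i, tag in enumerate(tags):
                # use the first sub-token of each word as its representation
                token_index = enc.word_ids(0).index(i)
                features.append(hidden[token_index].numpy())
                targets.append(tag)

        # A deliberately simple, transparent probe: multinomial logistic regression.
        probe = LogisticRegression(max_iter=1000).fit(features, targets)
        print(probe.score(features, targets))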
