Representation Learning beyond Semantic Similarity: Character-aware and Function-specific Approaches
Representation learning is a research area within machine learning and natural language processing (NLP) concerned with building machine-understandable representations of discrete units of text. Continuous representations are at the core of modern machine learning applications, and representation learning has thereby become one of the central research areas in NLP. The induction of text representations is typically based on the distributional hypothesis and consequently encodes general information about word similarity: words or phrases with similar meaning obtain similar representations in a vector space constructed for this purpose. This established methodology excels for morphologically simple languages such as English and in data-rich settings. However, several useful lexical relations, such as entailment or selectional preference, are not captured or are conflated with other relations. Another challenge is dealing with low-data regimes for morphologically complex and under-resourced languages.
In this thesis we construct novel representation learning methods that go beyond the limitations of the distributional hypothesis and investigate solutions that induce vector spaces with diverse properties. In particular, we look at how the vector space induction process influences the contained information, and how the information manifests in a number of core NLP tasks: semantic similarity, lexical entailment, selectional preference, and language modeling. We contribute novel evaluations of state-of-the-art models highlighting their current capabilities and limitations. An analysis of language modeling in 50 typologically diverse languages demonstrates that representations can indeed pose a performance bottleneck. We introduce a novel approach to leveraging subword-level information in word representations: our solution lifts this bottleneck in low-resource scenarios. Finally, we introduce a novel paradigm of function-specific representation learning that aims to integrate fine-grained semantic relations and real-world knowledge into the word vector spaces. We hope this thesis can serve as a valuable overview on word representations, and inspire future work in modeling semantic similarity and beyond. This work was supported by ERC Consolidator Grant LEXICAL (648909).
An Evaluation of Text Representation Techniques for Fake News Detection Using: TF-IDF, Word Embeddings, Sentence Embeddings with Linear Support Vector Machine.
In a world where anybody can share their views and opinions and make them sound like facts about the current state of the world, fake news poses a serious threat, especially to the reputations of high-profile individuals and organizations. In politics, opposition parties can exploit false stories to gain popularity in elections. In medicine, a fabricated scandal about a medicine's side effects, a hospital treatment gone wrong, or even a false accusation against a practicing doctor can become a menace to everyone involved. In business, a single false story becoming a trending topic can disrupt future earnings. The detection of such false news is therefore very important in today's world, where almost everyone has access to a mobile phone and can cause enough disruption by creating one false statement and making it go viral. The generation of fake news articles gathered particular attention during the 2016 US Presidential Election, leading many scientists and researchers to explore this NLP problem with deep interest and a sense of urgency. This research intends to develop and compare a fake news classifier using a Linear Support Vector Machine built on the traditional text feature representation technique Term Frequency-Inverse Document Frequency (Ahmed, Traore & Saad, 2017), against classifiers built on more recent text feature representations: word embeddings using word2vec and sentence embeddings using the Universal Sentence Encoder.
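The TF-IDF weighting at the core of the baseline can be illustrated compactly. The sketch below is a minimal, standard-library rendition of one common TF-IDF variant (raw term frequency times smoothed inverse document frequency); it is not the thesis's actual pipeline, and the example documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for tokenized documents (one common variant:
    raw term frequency times smoothed inverse document frequency)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]

docs = [
    "fake news spreads fast online".split(),
    "real news verified by reporters".split(),
]
vecs = tf_idf(docs)
# "fake" appears in only one document, so it outweighs the shared "news"
```

In practice such sparse vectors are fed to a linear classifier; scikit-learn's TfidfVectorizer and LinearSVC are the usual tools for this combination.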
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. 
The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. the European Commission's MediSys. This work was generously funded by the DREAM CDT, which was funded by NERC of UKRI.
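To make the literal versus non-literal distinction concrete, here is a deliberately tiny sketch. The gazetteer entries (with approximate centroids), the crude adjective test, and the event-noun list are hypothetical stand-ins for the thesis's LSTM-based metonymy classifier and Map Vector features.

```python
# Toy illustration of the literal/non-literal distinction in geoparsing:
# the same adjectival toponym can name an event location ("Spanish
# earthquake") or a nationality ("Mexican origin").
GAZETTEER = {  # approximate centroids, illustrative only
    "Spanish": ("Spain", 40.46, -3.75),
    "Malaga": ("Malaga", 36.72, -4.42),
    "American": ("United States", 39.78, -100.45),
    "Mexican": ("Mexico", 23.63, -102.55),
}
EVENT_NOUNS = {"earthquake", "flood", "outbreak"}

def resolve_literal(tokens):
    """Ground only literal place mentions, returning resolved names."""
    literal = []
    for i, tok in enumerate(tokens):
        if tok not in GAZETTEER:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Crude adjective test: adjectival toponyms count as literal only
        # when they modify an event noun; bare names are taken as literal.
        if (tok.endswith("n") or tok.endswith("sh")) and nxt not in EVENT_NOUNS:
            continue  # nationality or other non-literal use
        literal.append(GAZETTEER[tok][0])
    return literal

sent = ("The victims of the Spanish earthquake off the coast of Malaga "
        "were of American and Mexican origin").split()
print(resolve_literal(sent))  # ['Spain', 'Malaga']
```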
Learning discrete word embeddings to achieve better interpretability and processing efficiency
The ubiquitous use of word embeddings in Natural Language Processing is proof of their usefulness and adaptability to a multitude of tasks. However, their continuous nature is prohibitive in terms of computation, storage and interpretation. In this work, we propose a method for learning discrete word embeddings directly. The model is an adaptation of a novel database searching method using state-of-the-art natural language processing techniques such as Transformers and LSTMs. On top of obtaining embeddings requiring a fraction of the resources to store and process, our experiments strongly suggest that our representations learn basic units of meaning in latent space akin to lexical morphemes. We call these units sememes, i.e., semantic morphemes. We demonstrate that our model has great generalization potential and outputs representations showing strong semantic and conceptual relations between related words.
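The appeal of discrete codes can be seen with a generic vector-quantization sketch: replace each continuous vector with the index of its nearest codebook entry, so a word is stored as a small integer instead of many floats. The codebook and embedding values below are toy numbers, not the thesis's learned model.

```python
# Generic sketch of "discrete word embeddings" via vector quantization.
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vec, codebook):
    """Return the index of the nearest codebook vector."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

# Toy codebook and continuous embeddings (2-D for readability).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = {"cat": [0.9, 1.1], "dog": [1.1, 0.8], "not": [-0.9, 1.0]}

# Each word collapses to a single small integer code.
codes = {w: quantize(v, codebook) for w, v in embeddings.items()}
```

Related words ("cat", "dog") land on the same code, hinting at why such discrete units can behave like shared components of meaning.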
How Do Multilingual Encoders Learn Cross-lingual Representation?
NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations serving as the foundation. As BERT revolutionized representation learning and NLP, it also revolutionized cross-lingual representations and cross-lingual transfer. Multilingual BERT was released as a replacement for single-language BERT, trained with Wikipedia data in 104 languages.
Surprisingly, without any explicit cross-lingual signal, multilingual BERT learns cross-lingual representations in addition to representations for individual languages. This thesis first demonstrates this surprising cross-lingual effectiveness against prior art on various tasks. Naturally, it raises a set of questions, most notably: how do these multilingual encoders learn cross-lingual representations? In exploring these questions, the thesis analyzes the behavior of multilingual models in a variety of settings on high- and low-resource languages. We also look at how to inject different cross-lingual signals into multilingual encoders, and at the optimization behavior of cross-lingual transfer with these models. Together, these analyses provide a better understanding of multilingual encoders for cross-lingual transfer. Our findings lead to suggested improvements to multilingual encoders and cross-lingual transfer.
Developing a Framework to Identify Professional Skills Required for Banking Sector Employee in UK using Natural Language Processing (NLP) Techniques
The banking sector is changing dramatically, and recent studies reveal that many financial institutions struggle to keep up with technological advancements and face an acute shortage of skilled workers. The industry is becoming a dynamic field where success requires a wide range of talents, and a robust skill-identification process is needed to properly analyse, match, and develop personnel. The objective of this research is to establish a framework for determining the competencies needed by banking industry professionals through data extraction from job postings on UK websites. Data is extracted from job vacancy websites leveraging web-based annotation tools and Natural Language Processing (NLP) techniques. The study begins with a thorough examination of the literature to investigate the theoretical underpinnings of NLP techniques, their applications in talent management and human resources within the banking industry, and their potential for skill identification. Next, textual data from job adverts is processed using NLP techniques to extract and categorize the skills specific to these roles. The NLP-based development process uses advanced algorithms to automatically extract skills from unstructured textual material, ensuring that the skills gathered are accurate and relevant to the needs of the banking industry. To keep the NLP-driven skill identification accurate and up to date, the extracted skills are verified against expert feedback. In the final phase, machine learning models are employed to predict the skills required of banking sector employees. The study examines various machine learning techniques implemented within the framework; after preprocessing and training on skills extracted from job advertisements, the models are evaluated for their effectiveness in skill prediction.
The results offer a detailed analysis of each model's performance, using metrics such as recall, precision, and F1-score. This comprehensive examination underscores the potential of machine learning in skill identification and highlights its relevance to the banking sector. Key words: Machine Learning, Banking Sector, Employability, Data Mining, NLP, Semantic Analysis, Skill Assessment, Skill Recognition, Talent Management
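As a toy illustration of the extraction step, a hand-made skill lexicon can be matched against advert text. The lexicon and the sample advert here are hypothetical; the actual framework relies on web-based annotation tools and trained NLP models rather than fixed keyword lists.

```python
import re

# Hypothetical skill lexicon standing in for the framework's
# expert-verified, NLP-derived skill inventory.
SKILL_LEXICON = {"risk management", "python", "sql", "stakeholder management"}

def extract_skills(ad_text):
    """Return the lexicon skills mentioned in a job advert, sorted."""
    text = ad_text.lower()
    return sorted(s for s in SKILL_LEXICON
                  if re.search(r"\b" + re.escape(s) + r"\b", text))

ad = "Analyst role: SQL and Python required; risk management a plus."
print(extract_skills(ad))  # ['python', 'risk management', 'sql']
```

A real pipeline would add tokenization, lemmatization, and a trained tagger so that paraphrases ("managing risk") are also caught, which simple keyword matching misses.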
Proceedings of the 19th Sound and Music Computing Conference
Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France).
https://smc22.grame.f