405 research outputs found
One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis
When learning a new skill, you take advantage of your preexisting skills and
knowledge. For instance, if you are a skilled violinist, you will likely have
an easier time learning to play cello. Similarly, when learning a new language
you take advantage of the languages you already speak. For instance, if your
native language is Norwegian and you decide to learn Dutch, the lexical overlap
between these two languages will likely benefit your rate of language
acquisition. This thesis deals with the intersection of learning multiple tasks
and learning multiple languages in the context of Natural Language Processing
(NLP), which can be defined as the study of computational processing of human
language. Although these two types of learning may seem different on the
surface, we will see that they share many similarities.
The traditional approach in NLP is to consider a single task for a single
language at a time. However, recent advances allow for broadening this
approach, by considering data for multiple tasks and languages simultaneously.
This is an important approach to explore further as the key to improving the
reliability of NLP, especially for low-resource languages, is to take advantage
of all relevant data whenever possible. In doing so, the hope is that in the
long term, low-resource languages can benefit from the advances made in NLP
which are currently to a large extent reserved for high-resource languages.
This, in turn, may then have positive consequences for, e.g., language
preservation, as speakers of minority languages will have a lower degree of
pressure to using high-resource languages. In the short term, answering the
specific research questions posed should be of use to NLP researchers working
towards the same goal.Comment: PhD thesis, University of Groninge
Sparks of Large Audio Models: A Survey and Outlook
This survey paper provides a comprehensive overview of the recent
advancements and challenges in applying large language models to the field of
audio signal processing. Audio processing, with its diverse signal
representations and a wide range of sources--from human voices to musical
instruments and environmental sounds--poses challenges distinct from those
found in traditional Natural Language Processing scenarios. Nevertheless,
\textit{Large Audio Models}, epitomized by transformer-based architectures,
have shown marked efficacy in this sphere. By leveraging massive amount of
data, these models have demonstrated prowess in a variety of audio tasks,
spanning from Automatic Speech Recognition and Text-To-Speech to Music
Generation, among others. Notably, recently these Foundational Audio Models,
like SeamlessM4T, have started showing abilities to act as universal
translators, supporting multiple speech tasks for up to 100 languages without
any reliance on separate task-specific systems. This paper presents an in-depth
analysis of state-of-the-art methodologies regarding \textit{Foundational Large
Audio Models}, their performance benchmarks, and their applicability to
real-world scenarios. We also highlight current limitations and provide
insights into potential future research directions in the realm of
\textit{Large Audio Models} with the intent to spark further discussion,
thereby fostering innovation in the next generation of audio-processing
systems. Furthermore, to cope with the rapid development in this area, we will
consistently update the relevant repository with relevant recent articles and
their open-source implementations at
https://github.com/EmulationAI/awesome-large-audio-models.Comment: work in progress, Repo URL:
https://github.com/EmulationAI/awesome-large-audio-model
Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey
The emergence of natural language processing has revolutionized the way users
interact with tabular data, enabling a shift from traditional query languages
and manual plotting to more intuitive, language-based interfaces. The rise of
large language models (LLMs) such as ChatGPT and its successors has further
advanced this field, opening new avenues for natural language processing
techniques. This survey presents a comprehensive overview of natural language
interfaces for tabular data querying and visualization, which allow users to
interact with data using natural language queries. We introduce the fundamental
concepts and techniques underlying these interfaces with a particular emphasis
on semantic parsing, the key technology facilitating the translation from
natural language to SQL queries or data visualization commands. We then delve
into the recent advancements in Text-to-SQL and Text-to-Vis problems from the
perspectives of datasets, methodologies, metrics, and system designs. This
includes a deep dive into the influence of LLMs, highlighting their strengths,
limitations, and potential for future improvements. Through this survey, we aim
to provide a roadmap for researchers and practitioners interested in developing
and applying natural language interfaces for data interaction in the era of
large language models.Comment: 20 pages, 4 figures, 5 tables. Submitted to IEEE TKD
Deep learning based semantic textual similarity for applications in translation technology
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.Semantic Textual Similarity (STS) measures the equivalence of meanings
between two textual segments. It is a fundamental task for many natural
language processing applications. In this study, we focus on employing STS in
the context of translation technology. We start by developing models to estimate
STS. We propose a new unsupervised vector aggregation-based STS method
which relies on contextual word embeddings. We also propose a novel Siamese
neural network based on efficient recurrent neural network units. We empirically
evaluate various unsupervised and supervised STS methods, including these
newly proposed methods in three different English STS datasets, two non-
English datasets and a bio-medical STS dataset to list the best supervised and
unsupervised STS methods.
We then embed these STS methods in translation technology applications.
Firstly we experiment with Translation Memory (TM) systems. We propose a
novel TM matching and retrieval method based on STS methods that outperform
current TM systems. We then utilise the developed STS architectures in
translation Quality Estimation (QE). We show that the proposed methods are
simple but outperform complex QE architectures and improve the state-of-theart
results. The implementations of these methods have been released as open
source
Automatic Terminology Coding for the Biomedical Domain
The biomedical sector, rich in unstructured data from sources like clinical notes and health records, presents a prime opportunity for Natural Language Processing (NLP) applications. Especially pivotal is the task of entity linking, wherein textual mentions are mapped to medical concepts within a knowledge base, in this case, represented by the Unified Medical Language System (UMLS) Metathesaurus. Within this realm, the Italian language faces resource constraints (only 4% of UMLS 4M concepts have a label in the Italian language). Current systems like MAPS Group’s Clinika software lean on label matching to link the extracted facts to the corresponding UMLS concepts. This dissertation deals with the design of a new Clinika component aimed at enhancing entity linking for Italian terms against UMLS, even in the absence of direct Italian labels. Employing transformer-based multilingual embeddings, a novel 'concept guesser' architecture was developed to tackle the linking challenge intelligently, maximizing the level of exploitation of the currently available knowledge. This innovation not only enhances Clinika’s effectiveness but also paves the way for advanced multilingual clinical decision support systems
Exploiting Cross-Lingual Representations For Natural Language Processing
Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is di cult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive to annotate resources which are not available for most languages.
In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest
On the Evolution of Knowledge Graphs: A Survey and Perspective
Knowledge graphs (KGs) are structured representations of diversified
knowledge. They are widely used in various intelligent applications. In this
article, we provide a comprehensive survey on the evolution of various types of
knowledge graphs (i.e., static KGs, dynamic KGs, temporal KGs, and event KGs)
and techniques for knowledge extraction and reasoning. Furthermore, we
introduce the practical applications of different types of KGs, including a
case study in financial analysis. Finally, we propose our perspective on the
future directions of knowledge engineering, including the potential of
combining the power of knowledge graphs and large language models (LLMs), and
the evolution of knowledge extraction, reasoning, and representation
14th Conference on DATA ANALYSIS METHODS for Software Systems
DAMSS-2023 is the 14th International Conference on Data Analysis Methods for Software Systems, held in Druskininkai, Lithuania. Every year at the same venue and time. The exception was in 2020, when the world was gripped by the Covid-19 pandemic and the movement of people was severely restricted. After a year’s break, the conference was back on track, and the next conference was successful in achieving its primary goal of lively scientific communication. The conference focuses on live interaction among participants. For better efficiency of communication among participants, most of the presentations are poster presentations.
This format has proven to be highly effective. However, we have several oral sections, too. The history of the conference dates back to 2009 when 16 papers were presented. It began as a workshop and has evolved into a well-known conference. The idea of such a workshop originated at the Institute of Mathematics and Informatics, now the Institute of Data Science and Digital Technologies of Vilnius University. The Lithuanian Academy of Sciences and the Lithuanian Computer Society supported this idea, which gained enthusiastic acceptance from both the Lithuanian and international scientific communities. This year’s conference features 84 presentations, with 137 registered participants from 11 countries. The conference serves as a gathering point for researchers from six Lithuanian universities, making it the main annual meeting for Lithuanian computer scientists. The primary aim of the conference is to showcase research conducted at Lithuanian and foreign universities in the fields of data science and software engineering. The annual organization of the conference facilitates the rapid exchange of new ideas within the scientific community. Seven IT companies supported the conference this year, indicating the relevance of the conference topics to the business sector. In addition, the conference is supported by the Lithuanian Research Council and the National Science and Technology Council (Taiwan, R. O. C.). The conference covers a wide range of topics, including Applied Mathematics, Artificial Intelligence, Big Data, Bioinformatics, Blockchain Technologies, Business Rules, Software Engineering, Cybersecurity, Data Science, Deep Learning, High-Performance Computing, Data Visualization, Machine Learning, Medical Informatics, Modelling Educational Data, Ontological Engineering, Optimization, Quantum Computing, Signal Processing. This book provides an overview of all presentations from the DAMSS-2023 conference
Unifying Large Language Models and Knowledge Graphs: A Roadmap
Large language models (LLMs), such as ChatGPT and GPT4, are making new waves
in the field of natural language processing and artificial intelligence, due to
their emergent ability and generalizability. However, LLMs are black-box
models, which often fall short of capturing and accessing factual knowledge. In
contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are
structured knowledge models that explicitly store rich factual knowledge. KGs
can enhance LLMs by providing external knowledge for inference and
interpretability. Meanwhile, KGs are difficult to construct and evolving by
nature, which challenges the existing methods in KGs to generate new facts and
represent unseen knowledge. Therefore, it is complementary to unify LLMs and
KGs together and simultaneously leverage their advantages. In this article, we
present a forward-looking roadmap for the unification of LLMs and KGs. Our
roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs,
which incorporate KGs during the pre-training and inference phases of LLMs, or
for the purpose of enhancing understanding of the knowledge learned by LLMs; 2)
LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding,
completion, construction, graph-to-text generation, and question answering; and
3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a
mutually beneficial way to enhance both LLMs and KGs for bidirectional
reasoning driven by both data and knowledge. We review and summarize existing
efforts within these three frameworks in our roadmap and pinpoint their future
research directions.Comment: 29 pages, 25 figure
- …