    LLMs Perform Poorly at Concept Extraction in Cyber-security Research Literature

    The cybersecurity landscape evolves rapidly and poses threats to organizations. To enhance resilience, one needs to track the latest developments and trends in the domain. Standard bibliometric approaches have been shown to reach their limits in such a fast-evolving domain. We therefore use large language models (LLMs) to extract relevant knowledge entities from cybersecurity-related texts. We use a subset of arXiv preprints on cybersecurity as our data and compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce good knowledge entities that reflect the cybersecurity context, but they show some potential for noun extractors. For this reason, we developed a noun extractor boosted with some statistical analysis to extract specific and relevant compound nouns from the domain. We then tested our model to identify trends in the LLM domain. We observe some limitations, but it offers promising results for monitoring the evolution of emergent trends. Comment: 24 pages, 9 figures
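The idea of a noun extractor combined with statistical filtering can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe their extractor or their statistics. Here, runs of non-stopword tokens stand in for real noun-phrase detection (an assumption; a part-of-speech tagger would normally be used), and "specificity" is approximated by comparing document frequency in a domain corpus against a background corpus. The names `STOPWORDS`, `candidate_terms` and `rank_terms` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"a", "an", "the", "we", "is", "are", "was", "be", "of", "in",
             "and", "to", "for", "on", "with"}


def candidate_terms(text, max_len=3):
    """Return runs of 2..max_len non-stopword tokens as candidate compound terms."""
    terms = []
    # Split on punctuation first so that a run never crosses a clause boundary.
    for segment in re.split(r"[.,;:!?()]", text.lower()):
        run = []
        for tok in segment.split() + [None]:  # None flushes the final run
            if tok is not None and tok not in STOPWORDS:
                run.append(tok)
                continue
            if 2 <= len(run) <= max_len:
                terms.append(" ".join(run))
            run = []
    return terms


def rank_terms(domain_docs, background_docs, min_df=2):
    """Rank candidate terms by domain document rate over smoothed background rate."""
    domain_df = Counter(t for d in domain_docs for t in set(candidate_terms(d)))
    background_df = Counter(t for d in background_docs for t in set(candidate_terms(d)))
    scores = {}
    for term, df in domain_df.items():
        if df < min_df:
            continue  # drop terms that are too rare in the domain corpus
        domain_rate = df / len(domain_docs)
        # Add-one smoothing keeps the ratio finite for unseen background terms.
        background_rate = (background_df[term] + 1) / (len(background_docs) + 1)
        scores[term] = domain_rate / background_rate
    return sorted(scores, key=scores.get, reverse=True)
```

A term that is frequent in the domain corpus but rare in the background corpus ranks highest, which is one simple way to favour domain-specific compound nouns over generic ones.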

    The SEMAINE API: Towards a Standards-Based Framework for Building Emotion-Oriented Systems

    This paper presents the SEMAINE API, an open source framework for building emotion-oriented systems. By encouraging and simplifying the use of standard representation formats, the framework aims to contribute to interoperability and reuse of system components in the research community. By providing a Java and C++ wrapper around a message-oriented middleware, the API makes it easy to integrate components running on different operating systems and written in different programming languages. The SEMAINE system 1.0 is presented as an example of a full-scale system built on top of the SEMAINE API. Three small example systems are described in detail to illustrate how integration between existing and new components is realised with minimal effort

    LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech

    Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains, including computer vision and natural language processing. Speech processing has benefitted drastically from SSL, as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models, as well as a discussion on the carbon footprint of large-scale model training. Comment: Under submission at Computer Science and Language. Preprint allowe

    Democratizing Information Access through Low Overhead Systems

    Despite its importance, accessing information in storage systems or raw data is challenging or impossible for most people due to the sheer amount and heterogeneity of data as well as the overheads and complexities of existing systems. In this thesis, we propose several approaches to improve on this and thereby democratize information access. Data-driven and AI-based approaches make it possible to provide the necessary information access for many tasks at scale. Unfortunately, most existing approaches can only be built and used by IT experts and data scientists, and the current demand for data scientists is far from being met. Furthermore, their application is expensive. To counter this, approaches with low overhead are needed, i.e., approaches that do not require large amounts of training data, manual annotation or extraction of information, or extensive computation. However, such systems still need to adapt to the special terminology of different domains and to the individual information needs of the users. Moreover, they should be usable without extensive training; we thus aim to create ready-to-use systems that provide intuitive or familiar ways of interaction, e.g., chatbot-like natural language input or graphical user interfaces. In this thesis, we propose a number of contributions to three important subfields of data exploration and processing: Natural Language Interfaces for Data Access & Manipulation, Personalized Summarizations of Text Collections, and Information Extraction & Integration. These approaches allow data scientists, domain experts and end users to access and manipulate information in a quick and easy way. First, we propose two natural language interfaces for data access and manipulation. Natural language is a useful alternative interface for relational databases, since it allows users to formulate complex questions without requiring knowledge of SQL.
We propose an approach based on weak supervision that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. Moreover, we apply the idea to build a training pipeline for conversational agents (i.e., chatbot-like systems that allow users to interact with a database and perform actions like ticket booking). The pipeline uses weak supervision to generate the training data automatically from a relational database and its set of defined transactions. Our approach is data-aware, i.e., it leverages the data characteristics of the DB at runtime to optimize the dialogue flow and reduce necessary interactions. Additionally, we complement this research by presenting a meta-study on the reproducibility and availability of natural language interfaces for databases (NLIDBs) for real-world applications, and a benchmark to evaluate the linguistic robustness of NLIDBs. Second, we work on personalized summarization and its usage for data exploration. The central idea is to produce summaries that exactly cover the current information need of the users. By creating multiple summaries or shifting the focus during the interactive creation process, these summaries can be used to explore the contents of unknown text collections. We propose an approach to create such personalized summaries at interactive speed; this is achieved by carefully sampling from the inputs. As part of our research on multi-document summarization, we noticed that there is a lack of diverse evaluation corpora for this task. We therefore present a framework that can be used to automatically create new summarization corpora, and apply and validate it. Third, we provide ways to democratize information extraction and integration. This becomes relevant when data is scattered across different sources and there is no tabular representation that already contains all information needed.
Therefore, it might be necessary to integrate different structured sources, or even to extract the required pieces of information from text collections first and then organize them. To integrate existing structured data sources, we present and evaluate a novel end-to-end approach for schema matching based on neural embeddings. Finally, we tackle the automatic creation of tables from text for situations where no suitable structured source to answer an information need is available. Our proposed approach can execute SQL-like queries on text collections in an ad-hoc manner, both to directly extract facts from text documents and to produce aggregated tables stating information that is not explicitly mentioned in the documents. Our approach works by generalizing user feedback and therefore does not need domain-specific resources for domain adaptation. It runs at interactive speed even on commodity hardware. Overall, our approaches can provide a quality level comparable to that of state-of-the-art approaches, but often at a fraction of the associated costs. In other areas, such as table extraction, we even provide functionality that is, to our knowledge, not covered by any generic tooling available to end users. There are still many interesting challenges to solve, and the recent rise of large language models has once more shifted what seems possible with regard to dealing with human language. Yet, we hope that our contributions provide a useful step towards democratization of information access
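To make the notion of embedding-based schema matching concrete, the following is a minimal sketch under strong assumptions, not the thesis's end-to-end approach. Each column name is mapped to a vector and paired with the most similar column of the other schema by cosine similarity. A neural encoder is replaced here by character-trigram counts purely to keep the example self-contained; the functions `embed`, `cosine` and `match_schemas`, and the `threshold` parameter, are hypothetical.

```python
import math
from collections import Counter


def embed(name):
    """Toy embedding: character-trigram counts (stand-in for a neural encoder)."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def match_schemas(source_cols, target_cols, threshold=0.3):
    """Map each source column to its most similar target column above the threshold."""
    matches = {}
    for s in source_cols:
        best = max(target_cols, key=lambda t: cosine(embed(s), embed(t)))
        if cosine(embed(s), embed(best)) >= threshold:
            matches[s] = best
    return matches


if __name__ == "__main__":
    print(match_schemas(["customer_name", "zip"],
                        ["name_of_customer", "zipcode", "price"]))
```

Columns with no sufficiently similar counterpart are left unmatched rather than forced into a pairing, which is why the threshold is applied after the best candidate is chosen.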

    Representation Learning: A Review and New Perspectives

    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning

    MuSe 2020 challenge and workshop: multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media: emotional car reviews in-the-wild

    Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first-of-its-kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359. Funding was provided by EPSRC Grant No. 2021037 and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B). We thank the sponsors of the Challenge, BMW Group and audEERING
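The two evaluation measures named above can be written out as a short sketch. The weighted MuSe-Topic score follows the formula quoted in the abstract; the CCC function uses the standard definition of the concordance correlation coefficient with population variances, which is assumed here to match the challenge's usage. How the valence and arousal CCCs are combined for MuSe-Wild is not stated in the abstract and is not modelled. The function names `ccc` and `muse_topic_score` are hypothetical.

```python
from statistics import mean


def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences."""
    mu_t, mu_p = mean(y_true), mean(y_pred)
    # Population (biased) variances and covariance.
    var_t = mean((t - mu_t) ** 2 for t in y_true)
    var_p = mean((p - mu_p) ** 2 for p in y_pred)
    cov = mean((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred))
    # Unlike Pearson correlation, the denominator also penalises a shift in means.
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)


def muse_topic_score(uar, f1):
    """Weighted combination quoted in the abstract: 0.34 * UAR + 0.66 * F1."""
    return 0.34 * uar + 0.66 * f1
```

Because the squared difference of the means appears in the denominator, predictions that track the gold signal perfectly but with a constant offset score below 1.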

    Prototype of a Conversational Assistant for Satellite Mission Operations

    The very first artificial satellite, Sputnik, was launched in 1957, marking a new era. Concurrently, satellite mission operations emerged. These start at launch and finish at the end of mission, when the spacecraft is decommissioned. Running a satellite mission requires the monitoring and control of telemetry data in order to verify and maintain satellite health, reconfigure and command the spacecraft, detect, identify and resolve anomalies, and perform launch and early orbit operations. The very first chatbot, ELIZA, was created in 1966 and also marked a new era, that of Artificial Intelligence systems. Such systems answer users’ questions in the most diverse domains, interpreting human language input and responding in the same manner. Nowadays, these systems are everywhere, and the list of possible applications seems endless. The goal of the present master’s dissertation is to develop a prototype of a chatbot for mission operations. To this end, a Natural Language Processing (NLP) model for satellite missions is implemented and combined with a dialogue flow model. The performance of the conversational assistant is evaluated through its implementation on a mission operated by the European Space Agency (ESA), which implies generating the Knowledge Graph (KG) of the spacecraft’s database. Throughout the years, many tools have been developed and added to the systems used to monitor and control spacecraft, helping Flight Control Teams (FCT) by maintaining a comprehensive overview of the spacecraft’s status and health, speeding up failure investigation, or making it easy to correlate time series of telemetry data. However, despite all the advances that facilitate daily tasks, the teams still need to navigate through thousands of parameters and events spanning years of data, using purpose-built user interfaces and relying on filters and time series plots.
The solution presented in this dissertation and proposed by VisionSpace Technologies focuses on improving operational efficiency whilst dealing with the mission’s complex and extensive databases.
