Benchmarking Top-K Keyword and Top-K Document Processing with TK and TKD
Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on the fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present TK, a top-k keywords and documents benchmark, and its decision-support-oriented evolution TKD. Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
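To make the two weighting schemes concrete, here is a minimal Python sketch that ranks each document's terms by TF-IDF or Okapi BM25 weight. The toy corpus, whitespace tokenizer, and BM25 parameters (k1, b) are illustrative assumptions, not the benchmark's actual implementation:

```python
import math
from collections import Counter

def top_k_keywords(docs, k=5, scheme="bm25", k1=1.2, b=0.75):
    """Rank each document's terms with TF-IDF or Okapi BM25 weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {}
        for term, f in tf.items():
            if scheme == "tfidf":
                scores[term] = (f / len(toks)) * math.log(n_docs / df[term])
            else:  # Okapi BM25 term weight
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                norm = f + k1 * (1 - b + b * len(toks) / avgdl)
                scores[term] = idf * f * (k1 + 1) / norm
        results.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return results

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats play"]
print(top_k_keywords(docs, k=3))
```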
An OWL ontology for ISO-based discourse marker annotation
Purpose: Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. The authors are preparing an annotated corpus in nine languages, and specifically aim to explore the role of Linguistic Linked Open Data (LLOD) technologies in the process, i.e., the application of web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We demonstrate the advantages of this approach.
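As an illustration of the LLOD approach the authors describe, the following sketch uses the rdflib library to declare a discourse-marker class in OWL and attach one annotated example. The namespace, class, and property names are hypothetical, not the paper's published ontology:

```python
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

# Hypothetical namespace; the paper's actual ontology IRI is not reproduced here.
DM = Namespace("http://example.org/discourse-markers#")

g = Graph()
g.bind("dm", DM)

# An OWL class for discourse markers and a property linking a marker
# occurrence to the discourse relation it signals (cf. ISO 24617-8).
g.add((DM.DiscourseMarker, RDF.type, OWL.Class))
g.add((DM.signalsRelation, RDF.type, OWL.ObjectProperty))
g.add((DM.signalsRelation, RDFS.domain, DM.DiscourseMarker))

# Example annotation: "however" signalling a Contrast relation.
g.add((DM.however_01, RDF.type, DM.DiscourseMarker))
g.add((DM.however_01, RDFS.label, Literal("however", lang="en")))
g.add((DM.however_01, DM.signalsRelation, DM.Contrast))

print(g.serialize(format="turtle"))
```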
LLODIA: A Linguistic Linked Open Data Model for Diachronic Analysis
This article proposes a linguistic linked open data model for diachronic analysis (LLODIA) that combines data derived from the diachronic analysis of multilingual corpora with dictionary-based evidence. A humanities use case was devised as a proof of concept that includes examples in five languages (French, Hebrew, Latin, Lithuanian and Romanian) related to various meanings of the term “revolution” considered at different time intervals. The examples were compiled through diachronic word embedding and dictionary alignment.
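A minimal sketch of the diachronic-embedding idea behind the use case: compare a term's vectors from different time periods, assuming per-period models mapped into a shared space. The vectors below are random placeholders, not data from the paper:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder vectors standing in for per-period embeddings of "revolution",
# e.g. from word2vec models trained on time-sliced corpora and mapped into a
# shared space (Procrustes alignment is one common choice, not shown here).
rng = np.random.default_rng(0)
vec_1800, vec_1900, vec_2000 = rng.random((3, 100))

# A drop in self-similarity between periods is a standard signal that the
# tracked term has shifted meaning.
print("1800 vs 1900:", cosine(vec_1800, vec_1900))
print("1900 vs 2000:", cosine(vec_1900, vec_2000))
```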
ISO-based annotated multilingual parallel corpus for discourse markers
Discourse markers carry information about the discourse structure and organization, and also signal local dependencies or the epistemological stance of the speaker. They provide instructions on how to interpret the discourse, and their study is paramount to understanding the mechanisms underlying discourse organization. This paper presents a new language resource, an ISO-based annotated multilingual parallel corpus for discourse markers. The corpus comprises nine languages: Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish, and Macedonian, with English as the pivot language. In order to represent the meaning of the discourse markers, we propose an annotation scheme of discourse relations from ISO 24617-8 with a plug-in to ISO 24617-2 for communicative functions. We describe an experiment in which we applied the annotation scheme to assess its validity. The results reveal that, although some extensions are required to cover all the multilingual data, the scheme provides a proper representation of discourse markers' value. Additionally, we report some relevant contrastive phenomena concerning the interpretation and role of discourse markers in discourse. This first step will allow us to develop deep learning methods to identify and extract discourse relations and communicative functions, and to represent that information as Linguistic Linked Open Data (LLOD).
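For readers implementing against such a resource, here is a minimal sketch of what one annotation record might look like as a Python data structure. The field names and example values are assumptions for illustration, not the corpus's actual serialization:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MarkerAnnotation:
    """One discourse-marker annotation in a parallel corpus (illustrative)."""
    language: str                       # e.g. "pt" for European Portuguese
    marker: str                         # surface form of the marker
    relation: str                       # ISO 24617-8 relation, e.g. "Contrast"
    function: Optional[str] = None      # ISO 24617-2 communicative-function plug-in
    pivot_marker: Optional[str] = None  # aligned marker in the English pivot

ann = MarkerAnnotation("pt", "contudo", "Contrast", pivot_marker="however")
print(ann)
```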
Validation of language agnostic models for discourse marker detection
Using language models to detect or predict the presence of language phenomena in text has become a mainstream research topic. With the rise of generative models, experiments using deep learning and transformer models have triggered intense interest. Aspects such as the precision of predictions, portability to other languages or phenomena, and scale have been central to the research community. Discourse markers, as language phenomena, perform important functions, such as signposting, signalling, and rephrasing, by facilitating discourse organization. Our paper addresses discourse marker detection, a complex task because the phenomenon is manifested by expressions that can occur as content words in some contexts and as discourse markers in others. We have adopted a language-agnostic model trained on English to predict the presence of discourse markers in texts in eight other languages unseen by the model, with the goal of evaluating how well the model performs in languages with different structural and lexical properties. We report on the process of evaluating and validating the model's performance across European Portuguese, Hebrew, German, Polish, Romanian, Bulgarian, Macedonian, and Lithuanian, and on the results of this validation. This research is a key step towards multilingual language processing.
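A minimal sketch of the zero-shot cross-lingual setup this abstract describes, using a multilingual encoder with a token-classification head from the Hugging Face transformers library. The checkpoint and binary label scheme are assumptions rather than the paper's model, and the head shown is freshly initialized, so real use requires fine-tuning on English data first:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Skeleton of zero-shot transfer: a multilingual encoder with a binary
# token-classification head (marker / not marker). Untrained head = arbitrary
# predictions; fine-tune on English annotations before evaluating transfer.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=2)

# After fine-tuning on English, the same weights score unseen languages.
sent = "Contudo, o resultado foi diferente."  # European Portuguese
inputs = tok(sent, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1)[0]  # 1 = discourse-marker token
for token, label in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), pred):
    print(token, int(label))
```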
Cross-Lingual Link Discovery for Under-Resourced Languages
In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges, experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied to language data can play in this context. We define under-resourced languages with a specific focus on languages actively used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language technology. We argue that for languages for which considerable amounts of textual data and (at least) a bilingual word list are available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.
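A minimal sketch of the dictionary-based linking idea argued for above: with only a bilingual word list, candidate cross-lingual links between lexicon entries can be emitted as SKOS-style triples. Real systems add string similarity, ranking, and disambiguation; the URIs and word list below are purely illustrative:

```python
# Bilingual word list for an under-resourced language (Lithuanian -> English).
bilingual_list = {"šuo": "dog", "namas": "house"}

SRC = "http://example.org/lexicon/lt/{}"
TGT = "http://example.org/lexicon/en/{}"

# Emit one candidate link per dictionary pair as a Turtle-style triple.
for src_word, en_word in bilingual_list.items():
    print(f"<{SRC.format(src_word)}> skos:closeMatch <{TGT.format(en_word)}> .")
```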
Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to those in other high-throughput studies: the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Nevertheless, although many statistical and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome", which brings together microbiome researchers and machine learning experts to address current challenges such as the standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, and the improvement or development of existing and new tools and ontologies.
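As one concrete example of the statistical issues mentioned above, microbiome counts are compositional. A common preprocessing step before standard ML models (a general technique, not one specific to this paper) is the centred log-ratio (CLR) transform, sketched here in Python:

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform for compositional microbiome counts.

    Relative abundances sum to one per sample, so taxa are not independent;
    the CLR maps each sample into unconstrained real space. The pseudocount
    handles the zeros that dominate microbiome tables.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Three samples x four taxa, raw read counts (illustrative).
otu_table = [[120, 0, 30, 5], [10, 200, 0, 40], [55, 55, 55, 55]]
print(clr(otu_table))
```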
MisRoBÆRTa: Transformers versus Misinformation
Misinformation is considered a threat to our democratic values and principles. The spread of such content on social media polarizes society and undermines public discourse by distorting public perceptions and generating social unrest while lacking the rigor of traditional journalism. Transformers and transfer learning have proved to be state-of-the-art methods for multiple well-known natural language processing tasks. In this paper, we propose MisRoBÆRTa, a novel transformer-based deep neural ensemble architecture for misinformation detection. MisRoBÆRTa takes advantage of two state-of-the-art transformers, i.e., BART and RoBERTa, to improve the performance of discriminating between real news and different types of fake news. We also benchmarked and evaluated the performances of multiple transformers on the task of misinformation detection. For training and testing, we used a large real-world news articles dataset (i.e., 100,000 records) labeled with 10 classes, thus addressing two shortcomings in the current research: (1) increasing the size of the dataset from small to large, and (2) moving the focus of fake news detection from binary classification to multi-class classification. For this dataset, we manually verified the content of the news articles to ensure that they were correctly labeled. The experimental results show that the accuracy of transformers on the misinformation detection problem was significantly influenced by the method employed to learn the context, the dataset size, and the vocabulary dimension. We observe empirically that the best accuracy among the classification models that use only one transformer is obtained by BART, while DistilRoBERTa obtains the best accuracy in the least amount of time required for fine-tuning and training. However, the proposed MisRoBÆRTa outperforms the other transformer models on the task of misinformation detection. To arrive at this conclusion, we performed ample ablation and sensitivity testing with MisRoBÆRTa on two datasets.
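A minimal sketch of the two-transformer ensembling idea: pool sentence representations from BART and RoBERTa and classify their concatenation. This shows only the core intuition under assumed checkpoints, pooling, and head; it is not the published MisRoBÆRTa architecture:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Feature-level ensemble: concatenate pooled states from two encoders.
# Checkpoints, mean-pooling, and the linear head are illustrative choices.
names = ["facebook/bart-base", "roberta-base"]
toks = [AutoTokenizer.from_pretrained(n) for n in names]
encoders = [AutoModel.from_pretrained(n) for n in names]
head = torch.nn.Linear(2 * 768, 10)  # both base models emit 768-dim states; 10 classes

def embed(text: str) -> torch.Tensor:
    feats = []
    for tok, enc in zip(toks, encoders):
        with torch.no_grad():
            states = enc(**tok(text, return_tensors="pt")).last_hidden_state
        feats.append(states.mean(dim=1))  # mean-pool token states
    return torch.cat(feats, dim=-1)       # (1, 1536) joint representation

logits = head(embed("Breaking: anonymous sources reveal a shocking cover-up."))
print(logits.argmax(-1))  # predicted class index (head untrained in this sketch)
```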