MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach
Entity linking has recently been the subject of a significant body of
research. Currently, the best performing approaches rely on trained
mono-lingual models. Porting these approaches to other languages is
consequently a difficult endeavor as it requires corresponding training data
and retraining of the models. We address this drawback by presenting a novel
multilingual, knowledge-base agnostic and deterministic approach to entity
linking, dubbed MAG. MAG is based on a combination of context-based retrieval
on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data
sets and in 7 languages. Our results show that the best approach trained on
English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse
on datasets in other languages. MAG, on the other hand, achieves
state-of-the-art performance on English datasets and reaches a micro F-measure
that is up to 0.6 higher than that of PBOH on non-English languages.
Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
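For readers unfamiliar with the metric, the micro F-measure reported above pools true positives, false positives and false negatives over all documents before computing precision and recall. The sketch below illustrates that pooling only; the function name `micro_f1` and the sample counts are hypothetical and are not taken from the MAG evaluation.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) counts.

    Counts are pooled across documents first, so large documents
    weigh more than small ones. All names here are illustrative.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0  # no correct links: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two documents: (true pos., false pos., false neg.)
example = [(8, 2, 2), (1, 1, 3)]
score = micro_f1(example)
```

With these made-up counts the pooled totals are 9 true positives, 3 false positives and 5 false negatives, so the score equals 18/26.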
An Adaptive Approach for Interlinking Georeferenced Data
The resources published on the Web of data are often described by spatial references such as coordinates. Common data linking approaches are mainly based on the hypothesis that spatially close resources are more likely to represent the same thing. However, this assumption is valid only when the spatial references being compared have been produced with the same positional accuracy, and when they actually represent the same spatial characteristic of the resources, captured in an unambiguous way. Otherwise, spatial distance-based matching algorithms may produce erroneous links. In this article, we first propose formalizing and acquiring knowledge about the spatial references, namely their positional accuracy, their geometric modeling, their level of detail, and the vagueness of the spatial entities they represent. We then propose an interlinking approach that dynamically adapts the way spatial references are compared, based on this knowledge.
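The idea of adapting the comparison to what is known about each spatial reference can be illustrated with a small sketch. The abstract does not state the actual comparison rule, so the rule used here is an assumption: two points are treated as candidate matches when their distance does not exceed the sum of the positional accuracies declared for the two sources. The field names (`lat`, `lon`, `accuracy_m`) and the function names are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def may_match(a, b):
    """Assumed rule: the distance threshold adapts to the declared accuracies.

    `a` and `b` are dicts with hypothetical keys 'lat', 'lon' and
    'accuracy_m' (positional accuracy of the source, in metres).
    """
    threshold = a["accuracy_m"] + b["accuracy_m"]
    return haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= threshold


# Two points about 111 m apart (0.001 degrees of latitude).
precise_a = {"lat": 48.000, "lon": 2.0, "accuracy_m": 5.0}
precise_b = {"lat": 48.001, "lon": 2.0, "accuracy_m": 5.0}
coarse_b = {"lat": 48.001, "lon": 2.0, "accuracy_m": 150.0}
```

Under this assumed rule the same pair of coordinates is rejected when both sources are precise and kept as a candidate when one source is coarse, which is the kind of behaviour a fixed distance threshold cannot express.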
Forecasting the Spreading of Technologies in Research Communities
Technologies such as algorithms, applications and formats are an important part of the knowledge produced and reused in the research process. Typically, a technology is expected to originate in the context of a research area and then spread and contribute to several other fields. For example, Semantic Web technologies have been successfully adopted by a variety of fields, e.g., Information Retrieval, Human Computer Interaction, Biology, and many others. Unfortunately, the spreading of technologies across research areas may be a slow and inefficient process, since it is easy for researchers to be unaware of potentially relevant solutions produced by other research communities. In this paper, we hypothesise that it is possible to learn typical technology propagation patterns from historical data and to exploit this knowledge i) to anticipate where a technology may be adopted next and ii) to alert relevant stakeholders about emerging and relevant technologies in other fields. To do so, we propose the Technology-Topic Framework, a novel approach which uses a semantically enhanced technology-topic model to forecast the propagation of technologies to research areas. A formal evaluation of the approach on a set of technologies in the Semantic Web and Artificial Intelligence areas has produced excellent results, confirming the validity of our solution
Hierarchical Network with Label Embedding for Contextual Emotion Recognition
Emotion recognition has been widely used in applications such as mental health monitoring and emotional management. Usually, emotion recognition is regarded as a text classification task. However, emotion recognition is a more complex problem, and the relations among the emotions expressed in a text are not negligible. In this paper, a hierarchical model with label embedding is proposed for contextual emotion recognition. Specifically, a hierarchical model is used to learn the emotional representation of a given sentence based on its contextual information. To make recognition aware of emotion correlations, a label embedding matrix is trained by joint learning and contributes to the final prediction. Comparison experiments are conducted on the Chinese emotional corpus RenCECps, and the experimental results indicate that our approach performs satisfactorily on the textual emotion recognition task.
Do Language Models Plagiarize?
Past literature has illustrated that language models (LMs) often memorize
parts of training instances and reproduce them in natural language generation
(NLG) processes. However, it is unclear to what extent LMs "reuse" a training
corpus. For instance, models can generate paraphrased sentences that are
contextually similar to training samples. In this work, therefore, we study
three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2
generated texts, in comparison to its training data, and further analyze the
plagiarism patterns of fine-tuned LMs with domain-specific corpora which are
extensively used in practice. Our results suggest that (1) three types of
plagiarism widely exist in LMs beyond memorization, (2) both size and decoding
methods of LMs are strongly associated with the degrees of plagiarism they
exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus
similarity and homogeneity. Given that a majority of LMs' training data is
scraped from the Web without informing content owners, their reiteration of
words, phrases, and even core ideas from training sets into generated texts has
ethical implications. Their patterns are likely to exacerbate as both the size
of LMs and their training data increase, raising concerns about
indiscriminately pursuing larger models with larger training corpora.
Plagiarized content can also contain individuals' personal and sensitive
information. These findings overall cast doubt on the practicality of current
LMs in mission-critical writing tasks and urge more discussions around the
observed phenomena. Data and source code are available at
https://github.com/Brit7777/LM-plagiarism.
Comment: Accepted to WWW'2
Orchestrating a Network of Mereo(topo)logical Theories
Parthood is used widely in ontologies across subject domains. Some modelling guidance can be gleaned from Ontology, yet it offers multiple mereological theories, and even more when combined with topology, i.e., mereotopology. To complicate the landscape, decidable languages put restrictions on the language features, so that only fragments of the mereo(topo)logical theories can be represented, yet during modelling, those full features may be needed to check correctness. We address these issues by specifying a structured network of theories formulated in multiple logics that are glued together by the various linking constructs of the Distributed Ontology Language, DOL. For the KGEMT mereotopological theory and five sub-theories, together with the DL-based OWL species and first- and second-order logic, this network in DOL orchestrates 28 ontologies. Further, we propose automated steps toward resolution of language feature conflicts when combining modules, availing of the new 'OWL classifier' tool that pinpoints profile violations.
Federated Skewed Label Learning with Logits Fusion
Federated learning (FL) aims to collaboratively train a shared model across
multiple clients without transmitting their local data. Data heterogeneity is a
critical challenge in realistic FL settings, as it causes significant
performance deterioration due to discrepancies in optimization among local
models. In this work, we focus on label distribution skew, a common scenario in
data heterogeneity, where the data label categories are imbalanced on each
client. To address this issue, we propose FedBalance, which corrects the
optimization bias among local models by calibrating their logits. Specifically,
we introduce an extra private weak learner on the client side, which forms an
ensemble model with the local model. By fusing the logits of the two models,
the private weak learner can capture the variance of different data, regardless
of their category. Therefore, the optimization direction of local models can be
improved by increasing the penalty for misclassifying minority classes and
reducing the attention to majority classes, resulting in a better global model.
Extensive experiments show that our method can gain 13% higher average
accuracy compared with state-of-the-art methods.
Comment: 9 pages, 4 figures, 4 tables
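To make the logits-fusion idea concrete, the sketch below combines the logits of a local model with those of a private weak learner before computing the classification loss. The abstract does not give the fusion rule or the loss, so element-wise addition followed by softmax cross-entropy is an assumption made here for illustration, and the function names are hypothetical rather than part of FedBalance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_cross_entropy(local_logits, weak_logits, label):
    """Cross-entropy on fused logits.

    Assumption: fusion is element-wise addition of the two logit
    vectors; the paper's actual calibration may differ.
    """
    fused = [a + b for a, b in zip(local_logits, weak_logits)]
    return -math.log(softmax(fused)[label])


# Toy two-class example where class 1 is the minority class.
local = [2.0, 0.0]          # local model leans towards majority class 0
weak_neutral = [0.0, 0.0]   # weak learner adds no information
weak_minority = [0.0, 1.5]  # weak learner supports minority class 1

loss_plain = fused_cross_entropy(local, weak_neutral, label=1)
loss_fused = fused_cross_entropy(local, weak_minority, label=1)
```

In this toy setting the loss on the minority-class sample is smaller when the weak learner's logits support that class, which shows how a second set of logits can shift the training signal seen by the local model.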
Opinion Mining for Software Development: A Systematic Literature Review
Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies.
SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in
code comments and extracting users’ criticism of mobile apps. Given the large number of relevant studies available, it can take
considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils
these approaches entail.
We conducted a systematic literature review involving 185 papers. More specifically, we present 1) well-defined categories of opinion
mining-related software development activities, 2) available opinion mining approaches, whether they are evaluated when adopted in
other studies, and how their performance is compared, 3) available datasets for performance evaluation and tool customization, and 4)
concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques.
The results of our study serve as references to choose suitable opinion mining tools for software development activities, and provide
critical insights for the further development of opinion mining techniques in the SE domain.