189,640 research outputs found
Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding
Entity alignment is the task of finding entities in two knowledge bases (KBs)
that represent the same real-world object. When facing KBs in different natural
languages, conventional cross-lingual entity alignment methods rely on machine
translation to eliminate the language barriers. These approaches often suffer
from the uneven quality of translations between languages. While recent
embedding-based techniques encode entities and relationships in KBs and do not
need machine translation for cross-lingual entity alignment, a significant
number of attributes remain largely unexplored. In this paper, we propose a
joint attribute-preserving embedding model for cross-lingual entity alignment.
It jointly embeds the structures of two KBs into a unified vector space and
further refines it by leveraging attribute correlations in the KBs. Our
experimental results on real-world datasets show that this approach
significantly outperforms the state-of-the-art embedding approaches for
cross-lingual entity alignment and could be complemented with methods based on
machine translation
GERNERMED++: Transfer Learning in German Medical NLP
We present a statistical model for German medical natural language processing
trained for named entity recognition (NER) as an open, publicly available
model. The work serves as a refined successor to our first GERNERMED model
which is substantially outperformed by our work. We demonstrate the
effectiveness of combining multiple techniques in order to achieve strong
results in entity recognition performance by the means of transfer-learning on
pretrained deep language models (LM), word-alignment and neural machine
translation. Due to the sparse situation on open, public medical entity
recognition models for German texts, this work offers benefits to the German
research community on medical NLP as a baseline model. Since our model is based
on public English data, its weights are provided without legal restrictions on
usage and distribution. The sample code and the statistical model is available
at: https://github.com/frankkramer-lab/GERNERMED-p
Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
We here propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e.,~linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods
DCU's experiments in NTCIR-8 IR4QA task
We describe DCU's participation in the NTCIR-8 IR4QA
task [16]. This task is a cross-language information retrieval(CLIR) task from English to Simplified Chinese which seeks to provide relevant documents for later cross language question answering (CLQA) tasks. For the IR4QA task, we submitted 5 official runs including two monolingual runs and three CLIR runs. For the monolingual retrieval we tested two information retrieval models. The results show that the KL-Divergence language model method performs better than the Okapi BM25 model for the Simplified Chinese retrieval task. This agrees with our previous CLIR experimental results at NTCIR-5. For the CLIR task, we compare query translation and document translation methods. In the query translation based runs, we tested a method for query expansion from external resource (QEE) before query translation. Our result for this run is slightly lower than the run without QEE. Our results show that the document translation method achieves 68.24% MAP performance compared to our best query translation run. For the document translation method, we found that the main issue is the lack of named entity translation in the documents since we do not have a suitable parallel corpus for training data for the statistical machine translation system. Our best CLIR run comes from the combination of query translation using Google translate and the KL-Divergence language model retrieval method. It achieves 79.94% MAP relative to our best monolingual run
Resource description framework triples entity formations using statistical language model
A method in formatting unstructured sentences from the source corpus to a specificknowledge representation such as RDF is needed. A method for RDF entity formations from aparagraph of text using statistical language model based on N-gram is introduced. Theimplementation of RDF entity formation is applied on natural language query for informationretrieval of the Islamic knowledge. 300 concepts from the English translation of Holy Quranwith 350 relationships are used as a knowledge base. We evaluate our approach on collectionof queries from the Islamic Research Foundation website with a total, 82 queries and comparethe performance against previous method used in FREyA. The result shown the proposedmethod improved 17.07% on the accuracy of the natural language formulation analysis, whichtested on search strategy. It shows the increment on recall and precision with 7% and 3%.Keywords: semantic web; N-gram; ontology; statistical mode
Coherence in Machine Translation
Coherence ensures individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In Machine Translation, however, due to reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows and
with no access to extra-sentential context.
In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as
evidenced via discourse relations.
For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. In order to address these issues, I adapted a state-of-the-art syntax
model, which also resulted in improved performance for the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I proposed a new coherence model, exploring the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of the discourse relation by comparison to the source text rather than to a reference translation. I identified patterns of incoherence common across different language pairs, and created a corpus of machine translated output annotated with coherence errors for evaluation purposes. I then examined
lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality and suggest that improvements in general evaluation within machine translation would benefit from having a coherence component that evaluated the translation output with respect to the source text
GERNERMED: an open German medical NER model
The current state of adoption of well-structured electronic health records
and integration of digital methods for storing medical patient data in
structured formats can often considered as inferior compared to the use of
traditional, unstructured text based patient data documentation. Data mining in
the field of medical data analysis often needs to rely solely on processing of
unstructured data to retrieve relevant data. In natural language processing
(NLP), statistical models have been shown successful in various tasks like
part-of-speech tagging, relation extraction (RE) and named entity recognition
(NER). In this work, we present GERNERMED, the first open, neural NLP model for
NER tasks dedicated to detect medical entity types in German text data. Here,
we avoid the conflicting goals of protection of sensitive patient data from
training data extraction and the publication of the statistical model weights
by training our model on a custom dataset that was translated from publicly
available datasets in foreign language by a pretrained neural machine
translation model. The sample code and the statistical model is available at:
https://github.com/frankkramer-lab/GERNERME
CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval
We present the Charles University system for the MRL~2023 Shared Task on
Multi-lingual Multi-task Information Retrieval. The goal of the shared task was
to develop systems for named entity recognition and question answering in
several under-represented languages. Our solutions to both subtasks rely on the
translate-test approach. We first translate the unlabeled examples into English
using a multilingual machine translation model. Then, we run inference on the
translated data using a strong task-specific model. Finally, we project the
labeled data back into the original language. To keep the inferred tags on the
correct positions in the original language, we propose a method based on
scoring the candidate positions using a label-sensitive translation model. In
both settings, we experiment with finetuning the classification models on the
translated data. However, due to a domain mismatch between the development data
and the shared task validation and test sets, the finetuned models could not
outperform our baselines.Comment: 8 pages, 2 figures; System description paper at the MRL 2023 workshop
at EMNLP 202
- …