A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families
The scarcity or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and by enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article uses four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold), with various combinations of heuristics and numbers of symmetry assumption cycles, to obtain the highest F-score. Our proposed methods achieve statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
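To make the Cartesian product baseline and the evaluation metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes each input dictionary maps a word to a set of pivot-language translations; the function names (`invert`, `cartesian_baseline`, `precision_recall_f1`) and the toy tokens (`a1`, `p1`, `b1`, ...) are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def invert(dictionary):
    """Map each pivot word to the set of words that translate to it."""
    inverted = defaultdict(set)
    for word, pivots in dictionary.items():
        for pivot in pivots:
            inverted[pivot].add(word)
    return inverted


def cartesian_baseline(dict_a, dict_b):
    """Pair every A word with every B word sharing at least one pivot word."""
    by_pivot_a, by_pivot_b = invert(dict_a), invert(dict_b)
    pairs = set()
    for pivot in by_pivot_a.keys() & by_pivot_b.keys():
        pairs.update(product(by_pivot_a[pivot], by_pivot_b[pivot]))
    return pairs


def precision_recall_f1(predicted, gold):
    """Score a set of predicted translation pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy input (hypothetical): language A -> pivot and language B -> pivot.
dict_a = {"a1": {"p1"}, "a2": {"p1", "p2"}}
dict_b = {"b1": {"p1"}, "b2": {"p2"}}
gold = {("a1", "b1"), ("a2", "b2")}

predicted = cartesian_baseline(dict_a, dict_b)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(sorted(predicted), precision, recall, f1)
```

In this toy case the baseline over-generates the pair `("a2", "b1")` through the shared pivot `p1`, which is the kind of false positive that constraint-based filtering is meant to reduce.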
Cross-Lingual Knowledge Transfer for Clinical Phenotyping
Clinical phenotyping enables the automatic extraction of clinical conditions
from patient records, which can be beneficial to doctors and clinics worldwide.
However, current state-of-the-art models are mostly applicable to clinical
notes written in English. We therefore investigate cross-lingual knowledge
transfer strategies to execute this task for clinics that do not use the
English language and have a small amount of in-domain data available. We
evaluate these strategies for a Greek and a Spanish clinic leveraging clinical
notes from different clinical domains such as cardiology, oncology and the ICU.
Our results reveal two strategies that outperform the state of the art:
translation-based methods combined with domain-specific encoders, and
cross-lingual encoders plus adapters. We find that these strategies perform
especially well for classifying rare phenotypes and we advise on which method
to prefer in which situation. Our results show that using multilingual data
overall improves clinical phenotyping models and can compensate for data
sparseness. Comment: LREC 2022 submission: January 202
Étiquetage morphosyntaxique de langues non dotées à partir de ressources pour une langue étymologiquement proche
We introduce a generic approach for transferring part-of-speech annotations from a resourced language to a non-resourced but etymologically close language. We do not rely on the existence of any parallel corpora or any linguistic knowledge for the non-resourced language (no lexicons, no annotated corpora). Our approach only makes use of cognate pairs that are automatically induced in an unsupervised way, based on character-based statistical machine translation, and of a morphosyntactic lexicon for the resourced language. Frequent and short words are treated differently: we tag them directly, based on a cross-language similarity assessment of their immediate morphosyntactic contexts. Using German as the resourced language, we evaluate our approach on Dutch (in fact a resourced language) and on Palatine German. We reach tagging accuracies of 67.2% on Dutch and 60.7% on Palatine German.
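The idea of transferring tags through cognate pairs can be illustrated with a small sketch. It is only a stand-in under stated assumptions: the abstract induces cognates with character-based statistical machine translation, whereas this example substitutes a plain string-similarity score from Python's `difflib`. The lexicon, the tag labels, the 0.7 threshold, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical morphosyntactic lexicon for the resourced language (German).
SOURCE_LEXICON = {"haus": "NOUN", "trinken": "VERB", "und": "CONJ"}


def similarity(word_a, word_b):
    """String similarity in [0, 1]; a stand-in for a learned cognate model."""
    return SequenceMatcher(None, word_a, word_b).ratio()


def transfer_tag(target_word, lexicon=SOURCE_LEXICON, threshold=0.7):
    """Copy the tag of the most similar source word, or return None
    when no candidate reaches the (assumed) cognate threshold."""
    best_word = max(lexicon, key=lambda source: similarity(target_word, source))
    if similarity(target_word, best_word) < threshold:
        return None
    return lexicon[best_word]


def tagging_accuracy(gold_tags):
    """Fraction of target words whose transferred tag matches the gold tag."""
    correct = sum(transfer_tag(word) == tag for word, tag in gold_tags.items())
    return correct / len(gold_tags)


# Toy target-language words (Dutch) with gold tags; "xyz" has no cognate.
gold = {"huis": "NOUN", "drinken": "VERB", "xyz": "ADJ"}
print({word: transfer_tag(word) for word in gold})
print(tagging_accuracy(gold))
```

Words without a sufficiently similar source entry stay untagged here; the abstract handles frequent and short words separately, through context similarity, which this sketch does not attempt.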