Large-scale Hierarchical Alignment for Data-driven Text Rewriting

Author: Hahnloser Richard H. R.
Nikolov Nikola I.
Publication date: 01/01/2019

We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own. Comment: RANLP 201

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

ZORA

Unsupervised Machine Translation Using Cross-Lingual N-gram Embeddings

Author: Tättar Andre
Publication date: 01/01/2018

The current best machine translation systems achieve excellent results but rely heavily on large parallel corpora. Much work has gone into obtaining good translation results for language pairs with small parallel corpora, but results comparable to those for language pairs with large parallel corpora have not been achieved. In this work, I propose a novel unsupervised machine translation system that uses n-gram (phrase) embeddings to learn cross-lingual phrase embeddings. The solution requires only monolingual corpora; not a single parallel sentence is needed, which is achieved by using unsupervised word translation. I report my findings for the Estonian - English - Estonian language pair. The developed system does not work as well as hoped, but tests suggest that it works better than simple word-by-word translation.

DSpace at Tartu University Library

Modeling relation paths for knowledge base completion via joint adversarial training

Author: Du Linfeng
He Min
Li Chen
Peng Hao
Peng Xutan
Wang Lihong
Yu Philip S.
Zhang Shanghang
Publication venue: 'Elsevier BV'
Publication date: 19/05/2020

Knowledge Base Completion (KBC), which aims at determining the missing relations between entity pairs, has received increasing attention in recent years. Most existing KBC methods focus on either embedding the Knowledge Base (KB) into a specific semantic space or leveraging the joint probability of Random Walks (RWs) on multi-hop paths. Only a few unified models take both semantic and path-related features into consideration with adequacy. In this paper, we propose a novel method to explore the intrinsic relationship between the single relation (i.e. 1-hop path) and multi-hop paths between paired entities. We use Hierarchical Attention Networks (HANs) to select important relations in multi-hop paths and encode them into low-dimensional vectors. By treating relations and multi-hop paths as two different input sources, we use a feature extractor, which is shared by two downstream components (i.e. relation classifier and source discriminator), to capture shared/similar information between them. By joint adversarial training, we encourage our model to extract features from the multi-hop paths which are representative for relation completion. We apply the trained model (except for the source discriminator) to several large-scale KBs for relation completion. Experimental results show that our method outperforms existing path information-based approaches. Since each sub-module of our model can be well interpreted, our model can be applied to a large number of relation learning tasks. Comment: Accepted by Knowledge-Based System

arXiv.org e-Print Archive

White Rose Research Online

Spanish named entity recognition in the biomedical domain

Author: Cotik Viviana
Rodríguez Hontoria Horacio
Vivaldi Palatresi Jorge
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Named Entity Recognition in the clinical domain and in languages other than English is made difficult by the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of agreement on entity boundaries, and the scarcity of corpora and other available resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system. Peer Reviewed. Postprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC