A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to gain the highest F-score. Our proposed methods show a statistically significant improvement in precision and F-score compared to our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
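As a rough illustration of the Cartesian-product baseline and the evaluation metrics named in this abstract, the following minimal sketch pairs every source word with every target word that shares a pivot word, then scores the result against a gold standard. The function names, the dictionary layout (word mapped to a set of pivot translations), and the toy data are assumptions made for illustration; they are not taken from the article, and the sketch does not implement the constraint-based method itself.

```python
def pivot_cartesian_pairs(a_to_pivot, pivot_to_b):
    """Pair each A word with each B word reachable through a shared pivot word.

    Both arguments are hypothetical toy structures: dicts mapping a word
    to the set of its translations.
    """
    pairs = set()
    for a_word, pivots in a_to_pivot.items():
        for pivot in pivots:
            for b_word in pivot_to_b.get(pivot, ()):
                pairs.add((a_word, b_word))
    return pairs


def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall, and F-score (F1)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (invented words): a2 is polysemous in the pivot language,
# so the Cartesian product over-generates one wrong pair, (a2, b1).
a_to_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
pivot_to_b = {"p1": {"b1"}, "p2": {"b2"}}
gold = {("a1", "b1"), ("a2", "b2")}

candidates = pivot_cartesian_pairs(a_to_pivot, pivot_to_b)
precision, recall, f1 = precision_recall_f1(candidates, gold)
```

In this toy case the baseline reaches full recall but loses precision on the polysemous pivot word, which is the kind of over-generation that constraint-based filtering is meant to reduce.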
Plan Optimization to Bilingual Dictionary Induction for Low-Resource Language Families
Creating a bilingual dictionary is the first crucial step in enriching
low-resource languages. Especially for the closely-related ones, it has been
shown that the constraint-based approach is useful for inducing bilingual
lexicons from two bilingual dictionaries via the pivot language. However, if
there are no available machine-readable dictionaries as input, we need to
consider manual creation by bilingual native speakers. To reach the goal of
comprehensively creating multiple bilingual dictionaries, even if we already have
several existing machine-readable bilingual dictionaries, it is still difficult
to determine the execution order of the constraint-based approach that minimizes
the total cost. Plan optimization is crucial in composing the order of
bilingual dictionary creation with the consideration of the methods and their
costs. We formalize the plan optimization for creating bilingual dictionaries
by utilizing a Markov Decision Process (MDP) with the goal of obtaining a more accurate
estimation of the most feasible optimal plan with the least total cost before
fully implementing the constraint-based bilingual lexicon induction. We model a
prior beta distribution of bilingual lexicon induction precision with language
similarity and polysemy of the topology as $\alpha$ and $\beta$ parameters. It
is further used to model the cost function and state transition probability. We
estimated the cost of the all-investment plan as a baseline for evaluating the
proposed MDP-based approach with total cost as an evaluation metric. After
utilizing the posterior beta distribution in the first batch of experiments to
construct the prior beta distribution in the second batch of experiments, the
result shows a 61.5% cost reduction compared to the estimated all-investment
plan and a 39.4% cost reduction compared to the estimated MDP optimal plan.
The MDP-based proposal outperformed the baseline on the total cost.
Comment: 29 pages, 16 figures, 9 tables, accepted for publication in ACM
TALLI
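The step of reusing a posterior beta distribution as the next prior can be sketched with the standard conjugate update for a beta distribution: after observing some correct and some incorrect induced translation pairs, Beta(alpha, beta) becomes Beta(alpha + correct, beta + incorrect). The class name, the starting parameters, and the counts below are hypothetical; how the paper derives alpha and beta from language similarity and polysemy is not specified in this abstract, so they are treated here as given numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about induction precision, modelled as Beta(alpha, beta)."""
    alpha: float
    beta: float

    def mean(self):
        # Expected precision under the current belief.
        return self.alpha / (self.alpha + self.beta)

    def update(self, correct, incorrect):
        # Conjugate update: evaluated pairs are treated as Bernoulli trials.
        return BetaBelief(self.alpha + correct, self.beta + incorrect)


# Hypothetical numbers: a weak prior, then a first batch in which
# 30 of 40 sampled induced pairs were judged correct.
prior = BetaBelief(alpha=2.0, beta=2.0)
posterior = prior.update(correct=30, incorrect=10)

# The first-batch posterior serves as the prior for the second batch.
second_batch_prior = posterior
```

The expected precision moves from the prior mean of 0.5 toward the observed rate, and the second batch starts from that sharper estimate.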
Plan Optimization for Creating Bilingual Dictionaries of Low-Resource Languages
The constraint-based approach has been proven useful for inducing bilingual lexicons for closely related low-resource languages. When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by bilingual language experts if no machine-readable dictionaries are available as input. Because planning the creation of bilingual dictionaries requires weighing various methods and their costs, plan optimization is essential. We adopt the Markov Decision Process (MDP) in formalizing plan optimization for creating bilingual dictionaries; the goal is to better predict the most feasible optimal plan with the least total cost before fully implementing the constraint-based bilingual dictionary induction framework. We define heuristics based on input language characteristics to devise a baseline plan for evaluating our MDP-based approach with total cost as an evaluation metric. The MDP-based proposal outperformed heuristic planning on the total cost for all datasets examined.
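To make the planning problem concrete, here is a minimal sketch of a toy MDP in which each missing dictionary can either be created manually at a fixed cost or induced through a pivot language at a lower cost with some probability of success; a failed induction must be followed by manual creation. All names, costs, the single success probability, and the failure rule are assumptions made for illustration and are not the formalization used in the paper. Because every action either adds a dictionary or marks a pair as failed, the state graph is acyclic and the minimal expected cost can be computed exactly by memoized recursion.

```python
from functools import lru_cache
from itertools import combinations


def optimal_expected_cost(languages, existing, manual_cost, induce_cost, success_prob):
    """Minimal expected cost of obtaining a dictionary for every language pair.

    Hypothetical toy model: state = (dictionaries we have, pairs whose
    induction already failed); actions = manual creation or pivot induction.
    """
    all_pairs = frozenset(frozenset(p) for p in combinations(languages, 2))
    start = frozenset(frozenset(p) for p in existing)

    @lru_cache(maxsize=None)
    def value(have, failed):
        missing = all_pairs - have
        if not missing:
            return 0.0
        best = float("inf")
        for pair in missing:
            # Action 1: manual creation always succeeds.
            best = min(best, manual_cost + value(have | {pair}, failed))
            if pair in failed:
                continue  # induction was already tried for this pair
            x, y = tuple(pair)
            has_pivot = any(
                frozenset((x, z)) in have and frozenset((z, y)) in have
                for z in languages
                if z not in pair
            )
            if has_pivot:
                # Action 2: induction succeeds with probability success_prob.
                expected = (
                    induce_cost
                    + success_prob * value(have | {pair}, failed)
                    + (1.0 - success_prob) * value(have, failed | {pair})
                )
                best = min(best, expected)
        return best

    return value(start, frozenset())


# With A-B and B-C available, inducing A-C via B costs about 3 in expectation
# (1 for the attempt plus a 20% chance of paying 10 for manual creation),
# which beats creating A-C manually for 10.
cost_with_inputs = optimal_expected_cost(
    ("A", "B", "C"), [("A", "B"), ("B", "C")],
    manual_cost=10.0, induce_cost=1.0, success_prob=0.8,
)
```

Starting with no dictionaries at all, the same toy model has to pay for two manual dictionaries before any induction becomes possible, which is the kind of ordering decision that plan optimization addresses.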
An Inductive Bilingual Dictionary Generation Framework for Closely Related Languages
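Kyoto University; 0048; new system, course-based doctorate; Doctor of Informatics; Kō No. 21395; Jōhaku No. 681; 新制||情||117 (University Library); Department of Social Informatics, Graduate School of Informatics, Kyoto University; (Chief examiner) Professor Toru Ishida, Professor Masatoshi Yoshikawa, Professor Tatsuya Kawahara; conferred under Article 4, Paragraph 1 of the Degree Regulations; Doctor of Informatics; Kyoto University; DFA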
- …