750 research outputs found
A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. We use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of input dictionaries as baselines. We evaluate our results using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to gain the highest F-score. Our proposed methods achieve a statistically significant improvement in precision and F-score compared to our previous constraint-based methods. The results show that our method demonstrates the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while handling low-resource languages well.
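To make the Cartesian-product baseline and the evaluation metrics named in this abstract concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout (word mapped to a set of pivot-language words), and the placeholder tokens `a1`, `p1`, `c1` are all assumptions made for illustration.

```python
def pivot_candidates(a_to_pivot, pivot_to_c):
    """Pair every A word with every C word reachable through a shared pivot word.

    Both arguments are hypothetical toy dictionaries: word -> set of translations.
    """
    pairs = set()
    for a_word, pivots in a_to_pivot.items():
        for pivot_word in pivots:
            for c_word in pivot_to_c.get(pivot_word, ()):
                pairs.add((a_word, c_word))
    return pairs


def precision_recall_f1(predicted, gold):
    """Score predicted translation pairs against a gold set of pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder tokens, not real lexical data.
a_to_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
pivot_to_c = {"p1": {"c1"}, "p2": {"c2", "c3"}}
gold = {("a1", "c1"), ("a2", "c2")}

candidates = pivot_candidates(a_to_pivot, pivot_to_c)
precision, recall, f1 = precision_recall_f1(candidates, gold)
```

On this toy input the baseline over-generates (four candidates for two gold pairs), which is the kind of noise that constraint-based filtering is meant to reduce.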
Plan Optimization to Bilingual Dictionary Induction for Low-Resource Language Families
Creating a bilingual dictionary is the first crucial step in enriching
low-resource languages. Especially for closely related ones, it has been
shown that the constraint-based approach is useful for inducing bilingual
lexicons from two bilingual dictionaries via a pivot language. However, if
no machine-readable dictionaries are available as input, we need to
consider manual creation by bilingual native speakers. To reach the goal of
comprehensively creating multiple bilingual dictionaries, even if we already
have several existing machine-readable bilingual dictionaries, it is still
difficult to determine an execution order of the constraint-based approach
that reduces the total cost. Plan optimization is crucial in composing the
order of bilingual dictionary creation with consideration of the methods and
their costs. We formalize plan optimization for creating bilingual dictionaries
as a Markov Decision Process (MDP), with the goal of obtaining a more accurate
estimate of the most feasible optimal plan with the least total cost before
fully implementing the constraint-based bilingual lexicon induction. We model a
prior beta distribution of bilingual lexicon induction precision with language
similarity and polysemy of the topology as the α and β parameters. It
is further used to model the cost function and state transition probability. We
estimated the cost of the all-investment plan as a baseline for evaluating the
proposed MDP-based approach, with total cost as the evaluation metric. After
utilizing the posterior beta distribution from the first batch of experiments to
construct the prior beta distribution for the second batch of experiments, the
results show a 61.5% cost reduction compared to the estimated all-investment
plan and a 39.4% cost reduction compared to the estimated MDP optimal plan.
The MDP-based proposal outperformed the baseline on total cost.
Comment: 29 pages, 16 figures, 9 tables, accepted for publication in ACM
TALLI
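The step of turning a first-batch posterior into a second-batch prior can be illustrated with the standard Beta-Binomial conjugate update. This is only a sketch under stated assumptions: the class name `BetaBelief`, the prior values, and the counts of correct and incorrect induced pairs are hypothetical, and the abstract does not say how language similarity and polysemy are mapped to the two parameters, so that mapping is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief over induction precision (hypothetical helper)."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected precision under the current belief.
        return self.alpha / (self.alpha + self.beta)

    def update(self, correct: int, incorrect: int) -> "BetaBelief":
        # Conjugate update: add observed successes and failures to the parameters.
        return BetaBelief(self.alpha + correct, self.beta + incorrect)


# Illustrative numbers only, not results from the paper.
first_batch_prior = BetaBelief(alpha=2.0, beta=2.0)
first_batch_posterior = first_batch_prior.update(correct=30, incorrect=10)

# The posterior of batch one is reused as the prior of batch two.
second_batch_prior = first_batch_posterior
```

Reusing the posterior this way means the second batch starts from a precision estimate already informed by observed induction results rather than from the initial prior alone.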
Plan Optimization for Creating Bilingual Dictionaries of Low-Resource Languages
The constraint-based approach has been proven useful for inducing bilingual lexicons for closely related low-resource languages. When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by bilingual language experts if no machine-readable dictionaries are available as input. Planning the creation of bilingual dictionaries is difficult because various methods and their costs must be considered, so plan optimization is essential. We adopt the Markov Decision Process (MDP) in formalizing plan optimization for creating bilingual dictionaries; the goal is to better predict the most feasible optimal plan with the least total cost before fully implementing the constraint-based bilingual dictionary induction framework. We define heuristics based on input language characteristics to devise a baseline plan for evaluating our MDP-based approach, with total cost as the evaluation metric. The MDP-based proposal outperformed heuristic planning on total cost for all datasets examined.
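As a rough illustration of planning dictionary creation by minimizing expected total cost, the sketch below runs a Bellman-style recursion over sets of already available dictionaries. It is a toy stand-in, not the paper's model: the three languages, the cost constants, the success probability, and the rule that a failed induction falls back to manual creation are all assumptions chosen to keep the example small.

```python
from functools import lru_cache
from itertools import combinations

LANGS = ("A", "B", "C")              # hypothetical languages
ALL_PAIRS = frozenset(frozenset(p) for p in combinations(LANGS, 2))
MANUAL_COST = 100.0                  # assumed cost of manual creation
INDUCE_COST = 10.0                   # assumed cost of running induction
INDUCE_SUCCESS = 0.8                 # assumed probability induction is usable


def can_induce(pair, have):
    """A dictionary can be induced if both legs through some pivot exist."""
    x, y = tuple(pair)
    return any(
        frozenset((x, pivot)) in have and frozenset((pivot, y)) in have
        for pivot in LANGS
        if pivot not in pair
    )


@lru_cache(maxsize=None)
def expected_cost(have):
    """Minimum expected cost to obtain every dictionary, starting from `have`."""
    missing = ALL_PAIRS - have
    if not missing:
        return 0.0
    best = float("inf")
    for pair in missing:
        after = have | {pair}
        # Action 1: create the dictionary manually.
        best = min(best, MANUAL_COST + expected_cost(after))
        # Action 2: induce it; on failure, pay for manual creation as a fallback.
        if can_induce(pair, have):
            step = INDUCE_COST + (1 - INDUCE_SUCCESS) * MANUAL_COST
            best = min(best, step + expected_cost(after))
    return best


optimal = expected_cost(frozenset())
all_manual = MANUAL_COST * len(ALL_PAIRS)
```

With these assumed numbers the cheapest plan creates two dictionaries manually and induces the third, giving an expected cost of about 230 against 300 for creating all three manually.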
Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages
The constraint-based approach has been proven useful for inducing bilingual dictionaries for closely related low-resource languages.
When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by a native
speaker if no machine-readable dictionaries are available as input. Planning the creation of bilingual dictionaries is difficult
because various methods and their costs must be considered, so plan optimization is essential. Utilizing both the constraint-based
approach and a plan optimizer, we design a collaborative process for creating 10 bilingual dictionaries from every combination of 5
languages, i.e., Indonesian, Malay, Minangkabau, Javanese, and Sundanese. We further design an online collaborative dictionary
generation tool to bridge the spatial gap between native speakers. We define a heuristic plan that only utilizes manual investment by the native
speaker to evaluate our optimal plan, with total cost as the evaluation metric. The optimal plan outperformed the heuristic plan with a
63.3% cost reduction.
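The count of 10 dictionaries follows from choosing unordered pairs out of the 5 languages listed. The short sketch below enumerates them; the `cost_reduction` helper is a hypothetical name showing how a relative cost reduction is usually computed, and no cost figures from the paper are assumed.

```python
from itertools import combinations

languages = ["Indonesian", "Malay", "Minangkabau", "Javanese", "Sundanese"]

# Every unordered language pair corresponds to one bilingual dictionary.
dictionary_pairs = list(combinations(languages, 2))


def cost_reduction(baseline_cost: float, plan_cost: float) -> float:
    """Relative saving of a plan against a baseline, as a fraction of the baseline."""
    return (baseline_cost - plan_cost) / baseline_cost
```

For 5 languages this yields 5 * 4 / 2 = 10 pairs, matching the number of dictionaries stated in the abstract.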