Memory-Based Lexical Acquisition and Processing
Current approaches to computational lexicology in language technology are
knowledge-based (competence-oriented) and try to abstract away from specific
formalisms, domains, and applications. This results in severe complexity,
acquisition and reusability bottlenecks. As an alternative, we propose a
particular performance-oriented approach to Natural Language Processing based
on automatic memory-based learning of linguistic (lexical) tasks. The
consequences of the approach for computational lexicology are discussed, and
the application of the approach to a number of lexical acquisition and
disambiguation tasks in phonology, morphology and syntax is described.
Comment: 18 pages
A syntactic skeleton for statistical machine translation
We present a method for improving statistical machine translation performance by using linguistically motivated syntactic information. Our algorithm recursively decomposes source language sentences into syntactically simpler and shorter chunks, and recomposes their translation to form target language sentences. This improves both the word order and lexical selection of the translation. We report statistically significant relative improvements of 3.3% BLEU score in an experiment (English→Spanish) carried out on an 800-sentence test set extracted from the Europarl corpus.
Lost in translation: the problems of using mainstream MT evaluation metrics for sign language translation
In this paper we consider the problems of applying corpus-based techniques to minority languages that are neither politically recognised nor have a formally accepted writing system, namely sign languages. We discuss the adoption of an annotated form of sign language data as a suitable corpus for the development of a data-driven machine translation (MT) system, and deal with issues that arise from its use. Useful software tools that facilitate easy annotation of video data are also discussed. Furthermore, we address the problems of using traditional MT evaluation metrics for sign language translation. Based on the candidate translations produced from our example-based machine translation system, we discuss why standard metrics fall short of providing an accurate evaluation and suggest more suitable evaluation methods
Does BLEU Score Work for Code Migration?
Statistical machine translation (SMT) is a fast-growing sub-field of
computational linguistics. Until now, the most popular automatic metric to
measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score.
Lately, SMT along with the BLEU metric has been applied to a Software
Engineering task named code migration. (In)Validating the use of BLEU score
could advance the research and development of SMT-based code migration tools.
Unfortunately, no study has yet confirmed or refuted the suitability of BLEU score
for source code. In this paper, we conducted an empirical study on BLEU score
to (in)validate its suitability for the code migration task due to its
inability to reflect the semantics of source code. In our work, we use human
judgment as the ground truth to measure the semantic correctness of the
migrated code. Our empirical study demonstrates that BLEU does not reflect
translation quality due to its weak correlation with the semantic correctness
of translated code. We provided counter-examples to show that BLEU is
ineffective in comparing the translation quality between SMT-based models. Due
to BLEU's ineffectiveness for the code migration task, we propose an alternative
metric RUBY, which considers lexical, syntactical, and semantic representations
of source code. We verified that RUBY achieves a higher correlation coefficient
with the semantic correctness of migrated code, 0.775 in comparison with 0.583
of BLEU score. We also confirmed the effectiveness of RUBY in reflecting the
changes in translation quality of SMT-based translation models. With its
advantages, RUBY can be used to evaluate SMT-based code migration models.
Comment: 12 pages, 5 figures, ICPC '19 Proceedings of the 27th International
Conference on Program Comprehension
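The abstract's claim that BLEU cannot reflect the semantics of source code can be illustrated with a toy sketch. The code below is not the paper's experiment and is not full BLEU (it omits the brevity penalty and the geometric mean over n-gram orders); it only computes the clipped n-gram precision at BLEU's core. The function names and the three token sequences are assumptions made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical token sequences, whitespace-tokenized for simplicity.
reference = "int total = a + b ;".split()
renamed = "int sum = x + y ;".split()    # same behavior, different identifiers
broken = "int total = a - b ;".split()   # different behavior, one token changed

for name, cand in [("renamed", renamed), ("broken", broken)]:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.3f} bigram={p2:.3f}")
```

In this toy case the behavior-changing candidate gets the higher lexical overlap, while the candidate that only renames identifiers is penalized, which is the kind of mismatch with semantic correctness the abstract describes.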
Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks
Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained using source code written in different languages but known to implement the same algorithms and/or functionalities. For a preliminary evaluation, we use 3591 Java and 3534 C++ code snippets from 6 algorithms we crawled systematically from GitHub. We obtained over 90% accuracy in the cross-language binary classification task to tell whether any given two code snippets implement the same algorithm. Also, for the algorithm classification task, i.e., to predict which one of the six algorithm labels is implemented by an arbitrary C++ code snippet, we achieved over 80% precision.
Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Understanding how ideas relate to each other is a fundamental question in
many domains, ranging from intellectual history to public communication.
Because ideas are naturally embedded in texts, we propose the first framework
to systematically characterize the relations between ideas based on their
occurrence in a corpus of documents, independent of how these ideas are
represented. Combining two statistics --- cooccurrence within documents and
prevalence correlation over time --- our approach reveals a number of different
ways in which ideas can cooperate and compete. For instance, two ideas can
closely track each other's prevalence over time, and yet rarely cooccur, almost
like a "cold war" scenario. We observe that pairwise cooccurrence and
prevalence correlation exhibit different distributions. We further demonstrate
that our approach is able to uncover intriguing relations between ideas through
in-depth case studies on news articles and research papers.
Comment: 11 pages, 9 figures, to appear in Proceedings of ACL 2017, code and
data available at https://chenhaot.com/pages/idea-relations.html (fixed a
typo)
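The two statistics named in the abstract, cooccurrence within documents and prevalence correlation over time, can be sketched as follows. The abstract does not say how either statistic is computed, so this sketch assumes pointwise mutual information for cooccurrence and Pearson correlation for prevalence; the function names and the toy corpus are hypothetical. Each document is modeled as a `(time_step, set_of_ideas)` pair.

```python
import math
from collections import defaultdict


def cooccurrence_pmi(docs, a, b):
    """Pointwise mutual information of ideas a and b over documents.

    Assumed measure; returns -inf when the ideas never cooccur.
    """
    n = len(docs)
    n_a = sum(1 for _, ideas in docs if a in ideas)
    n_b = sum(1 for _, ideas in docs if b in ideas)
    n_ab = sum(1 for _, ideas in docs if a in ideas and b in ideas)
    if n_ab == 0:
        return float("-inf")
    return math.log((n_ab * n) / (n_a * n_b))


def prevalence_series(docs, idea):
    """Fraction of documents mentioning the idea at each time step."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, ideas in docs:
        totals[t] += 1
        if idea in ideas:
            hits[t] += 1
    return [hits[t] / totals[t] for t in sorted(totals)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Toy corpus: ideas "A" and "B" rise and fall together but never share a document.
docs = (
    [(1, {"A"}), (1, {"B"}), (1, set()), (1, set())]
    + [(2, {"A"}), (2, {"A"}), (2, {"B"}), (2, {"B"})]
    + [(3, {"A"}), (3, {"B"})] + [(3, set())] * 6
)

pmi = cooccurrence_pmi(docs, "A", "B")
corr = pearson(prevalence_series(docs, "A"), prevalence_series(docs, "B"))
print(f"PMI={pmi}, prevalence correlation={corr:.3f}")
```

On this toy corpus the two ideas have strongly correlated prevalence yet never cooccur, which corresponds to the "cold war" scenario mentioned in the abstract.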