A Comparison of Different Machine Transliteration Models
Machine transliteration is a method for automatically converting words in one
language into phonetically equivalent ones in another language. Machine
transliteration plays an important role in natural language applications such
as information retrieval and machine translation, especially for handling
proper nouns and technical terms. Four machine transliteration models --
grapheme-based transliteration model, phoneme-based transliteration model,
hybrid transliteration model, and correspondence-based transliteration model --
have been proposed by several researchers. To date, however, there has been
little research on a framework in which multiple transliteration models can
operate simultaneously. Furthermore, there has been no comparison of the four
models within the same framework and using the same data. We addressed these
problems by 1) modeling the four models within the same framework, 2) comparing
them under the same conditions, and 3) developing a way to improve machine
transliteration through this comparison. Our comparison showed that the hybrid
and correspondence-based models were the most effective and that the four
models can be used in a complementary manner to improve machine transliteration
performance.
Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
Interpretability of a predictive model is a powerful feature that gains the
trust of users in the correctness of the predictions. In word sense
disambiguation (WSD), knowledge-based systems tend to be much more
interpretable than knowledge-free counterparts as they rely on the wealth of
manually-encoded elements representing word senses, such as hypernyms, usage
examples, and images. We present a WSD system that bridges the gap between
these two so far disconnected groups of methods. Namely, our system, providing
access to several state-of-the-art WSD models, aims to be interpretable as a
knowledge-based system while it remains completely unsupervised and
knowledge-free. The presented tool features a Web interface for all-word
disambiguation of texts that makes the sense predictions human readable by
providing interpretable word sense inventories, sense representations, and
disambiguation results. We provide a public API, enabling seamless integration.
Comment: In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2017). 2017. Copenhagen, Denmark. Association for
Computational Linguistics.
DeepAPT: Nation-State APT Attribution Using End-to-End Deep Neural Networks
In recent years, numerous pieces of advanced malware, known as advanced
persistent threats (APTs), have allegedly been developed by nation-states. The
task of attributing an APT to a specific nation-state is extremely challenging
for several reasons. Each nation-state usually has more than a single cyber
unit that develops such advanced malware, rendering traditional authorship
attribution algorithms useless. Furthermore, those APTs use state-of-the-art
evasion techniques, making feature extraction challenging. Finally, the
available dataset of such APTs is extremely small.
In this paper we describe how deep neural networks (DNNs) can be
successfully employed for nation-state APT attribution. We use sandbox reports
(recording the behavior of the APT when run dynamically) as raw input for the
neural network, allowing the DNN itself to learn high-level feature
abstractions of the APTs. Using a test set of 1,000 Chinese- and
Russian-developed APTs, we achieved an accuracy rate of 94.6%.
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
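As a rough illustration of the term-document class of VSMs mentioned in this abstract, the sketch below builds a raw-count term-document matrix and compares two documents by the cosine of their column vectors. It is a minimal sketch, not the survey's method: the function names (`term_document_matrix`, `column`, `cosine`), the whitespace tokenization, and the use of unweighted counts are all assumptions made for the example.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts vocab[i] in docs[j].

    Tokenization is plain lowercased whitespace splitting (an assumption).
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = ["cat sat mat", "cat sat hat", "dog ran far"]
    vocab, m = term_document_matrix(docs)
    print(vocab)
    print(cosine(column(m, 0), column(m, 1)))  # documents sharing two of three terms
    print(cosine(column(m, 0), column(m, 2)))  # documents sharing no terms
```

In practice the raw counts would usually be reweighted (for example with tf-idf) before similarities are computed; that step is omitted here to keep the sketch short.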
Examining Scientific Writing Styles from the Perspective of Linguistic Complexity
Publishing articles in high-impact English journals is difficult for scholars
around the world, especially for non-native English-speaking scholars (NNESs),
most of whom struggle with proficiency in English. In order to uncover the
differences in English scientific writing between native English-speaking
scholars (NESs) and NNESs, we collected a large-scale data set containing more
than 150,000 full-text articles published in PLoS between 2006 and 2015. We
divided these articles into three groups according to the ethnic backgrounds of
the first and corresponding authors, obtained by Ethnea, and examined the
scientific writing styles in English from a two-fold perspective of linguistic
complexity: (1) syntactic complexity, including measurements of sentence length
and sentence complexity; and (2) lexical complexity, including measurements of
lexical diversity, lexical density, and lexical sophistication. The
observations suggest marginal differences between groups in syntactical and
lexical complexity.
Comment: 6 figures
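To make the two kinds of measurement named in this abstract concrete, the sketch below computes one simple syntactic measure (mean sentence length in words) and one simple lexical-diversity measure (type-token ratio). The abstract does not say which formulas the authors used, so these two are stand-ins chosen for illustration; the function names and the regex-based sentence and word splitting are likewise assumptions.

```python
import re


def sentences(text):
    """Split text into sentences at whitespace following '.', '!' or '?' (a crude heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    """Lowercased alphabetic tokens; apostrophes are kept inside words."""
    return re.findall(r"[A-Za-z']+", text.lower())


def mean_sentence_length(text):
    """Average number of words per sentence; 0.0 for empty input."""
    sents = sentences(text)
    return sum(len(words(s)) for s in sents) / len(sents) if sents else 0.0


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty input."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0


if __name__ == "__main__":
    sample = "The cat sat. The dog sat down."
    print(mean_sentence_length(sample))
    print(type_token_ratio(sample))
```

Type-token ratio is sensitive to text length, so comparisons across articles of different sizes would need a length-normalized variant; the plain ratio is used here only because it is the simplest to read.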
A Review of Accent-Based Automatic Speech Recognition Models for E-Learning Environment
The adoption of electronic learning (e-learning) as a method of disseminating knowledge in the global educational system is growing rapidly, and has shifted knowledge acquisition from conventional classrooms and tutors to distributed e-learning, which enables access to various learning resources much more conveniently and flexibly. However, notwithstanding the adaptive advantages of the learner-centric content of e-learning programmes, the distributed e-learning environment has unconsciously adopted a few international languages as the languages of communication among participants, despite the various accents (mother-language influence) among these participants. Adjusting to and accommodating these accents has led to the introduction of accent-based automatic speech recognition (ASR) into e-learning to resolve the effects of accent differences. This paper reviews over 50 research papers to determine the progress made between 2001 and 2021 in the design and implementation of accent-based ASR models for e-learning. The analysis shows that 50% of the models reviewed adopted English, 46.50% adopted the major Chinese and Indian languages, and 3.50% adopted Swedish as the mode of communication. The majority of ASR models are therefore centred on European, American, and Asian accents, while unconsciously excluding the accent peculiarities associated with the less technologically resourced continents.
A Study of English Loanwords in Chinese through Chinese Newswriting
The purpose of the present study, therefore, is to research the signified loanwords found in current newspapers. More specifically, answers to the following questions are sought: 1. How extensive is the standardization of the conventional translation or transliteration of English loanwords in Chinese in terms of explicative hybrid, loan-blend, independent hybrid, word-for-word translation, descriptive translation, and doublet? 2. What proportion of these English loanwords in Chinese exists in selected newswriting in terms of socio-political, technical-scientific, scholarly, sports, and business-economic terminology?