Building basic vocabulary across 40 languages
The paper explores the options for building bilingual dictionaries by automated methods. We define the notion “basic vocabulary” and investigate how well the conceptual units that make up this language-independent vocabulary are covered by language-specific bindings in 40 languages.
Word frequency and trends in the development of French vocabulary in lower intermediate students during Year 12 in English schools
Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs
Binary code analysis allows analyzing binary code without having access to
the corresponding source code. A binary, after disassembly, is expressed in an
assembly language. This inspires us to approach binary analysis by leveraging
ideas and techniques from Natural Language Processing (NLP), a rich area
focused on processing text of various natural languages. We notice that binary
code analysis and NLP share many analogous topics, such as semantics
extraction, summarization, and classification. This work utilizes these ideas
to address two important code similarity comparison problems: (I) given a pair
of basic blocks for different instruction set architectures (ISAs), determining
whether their semantics is similar or not; and (II) given a piece of code of
interest, determining if it is contained in another piece of assembly code for
a different ISA. The solutions to these two problems have many applications,
such as cross-architecture vulnerability discovery and code plagiarism
detection. We implement a prototype system INNEREYE and perform a comprehensive
evaluation. A comparison between our approach and existing approaches to
Problem I shows that our system outperforms them in terms of accuracy,
efficiency, and scalability. Case studies using the system
demonstrate that our solution to Problem II is effective. Moreover, this
research showcases how to apply ideas and techniques from NLP to large-scale
binary code analysis. Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium
201
Evaluation of Croatian Word Embeddings
Croatian is a poorly resourced and highly inflected language from the Slavic
language family. Nowadays, research focuses mostly on English. We created a
new word analogy corpus based on the original English Word2vec word analogy
corpus and added some linguistic aspects specific to the Croatian
language. Next, we created Croatian WordSim353 and RG65 corpora for a basic
evaluation of word similarities. We compared the created corpora on two popular
word representation models, based on the Word2Vec and fastText tools. The models
were trained on a 1.37B-token training corpus and tested on the new
robust Croatian word analogy corpus. Results show that the models are able to
create meaningful word representations. This research has shown that free word
order and the higher morphological complexity of the Croatian language influence
the quality of the resulting word embeddings. Comment: In review process on LREC 2018 conferenc
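The word analogy evaluation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the vector-offset scoring commonly used with the English Word2vec analogy corpus (for a question "a is to b as c is to ?", pick the word whose vector is closest by cosine similarity to b - a + c). The abstract does not state which scoring rule the authors used. The function names, the two-dimensional toy vectors, and the single Croatian question are hypothetical and serve only as illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the vocabulary word (excluding a, b, c) whose vector has the
    highest cosine similarity to b - a + c.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(1 for a, b, c, expected in questions
                  if solve_analogy(vectors, a, b, c) == expected)
    return correct / len(questions)


# Hypothetical toy embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "kralj": [1.0, 1.0],      # king
    "kraljica": [1.0, -1.0],  # queen
    "muškarac": [0.2, 1.0],   # man
    "žena": [0.2, -1.0],      # woman
    "grad": [-1.0, 0.1],      # city (distractor)
}

# Hypothetical question: man is to woman as king is to queen.
toy_questions = [("muškarac", "žena", "kralj", "kraljica")]

if __name__ == "__main__":
    print(analogy_accuracy(toy_vectors, toy_questions))
```

In this sketch the question words are excluded from the candidate set, a common convention in analogy benchmarks. Whether the authors applied it is not stated in the abstract.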