
Building basic vocabulary across 40 languages

Author: Kornai András
Pajkossy Katalin
Ács Judit
Publication venue: Omnipress
Publication date: 01/01/2013

The paper explores the options for building bilingual dictionaries by automated methods. We define the notion of ‘basic vocabulary’ and investigate how well the conceptual units that make up this language-independent vocabulary are covered by language-specific bindings in 40 languages.

CiteSeerX

SZTAKI Publication Repository

Recommended from our members

Word frequency and trends in the development of French vocabulary in lower intermediate students during Year 12 in English schools

Author: Biriotti L.
Brian Richards
Daller H.
David Malvern
Dickinson D.
Gallagher-Brett A.
Gougenheim G.
Graham S.
Graham S.
Harley B.
Harris V.
Macaro E.
MacWhinney B.
Malvern D. D.
Meara P. M.
Milton J.
Milton J.
Nation I. S.P.
Richards B. J.
Snow C. E.
Suzanne Graham
Zhang H.
Zipf G. K.
Publication venue: Informa UK Limited
Publication date: 01/01/2008

Central Archive at the University of Reading

Crossref

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Author: Li Xiaopeng
Luo Lannan
Young Patrick
Zeng Qiang
Zhang Zhexin
Zuo Fei
Publication venue: Internet Society
Publication date: 16/12/2018

Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share many analogous topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems: (I) given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics are similar or not; and (II) given a piece of code of interest, determining whether it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. The case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis. Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium 201

arXiv.org e-Print Archive

Crossref

Developing language in the primary school: literacy and primary languages (National strategies: primary)

Publication venue: Department for Children, Schools and Families (DCSF)
Publication date: 01/01/2009

Digital Education Resource Archive

Language Engineering and the Destiny of Man in Africa

Author: Ogbulogo Charles
Publication date: 01/04/2013

Covenant University Repository

Evaluation of Croatian Word Embeddings

Author: Beliga Slobodan
Svoboda Lukas
Publication date: 07/11/2017

Croatian is a poorly resourced and highly inflected language from the Slavic language family. Nowadays, research focuses mostly on English. We created a new word analogy corpus based on the original English Word2vec word analogy corpus and added some linguistic aspects specific to the Croatian language. Next, we created Croatian WordSim353 and RG65 corpora for a basic evaluation of word similarities. We compared the created corpora on two popular word representation models, based on the Word2Vec tool and the fastText tool. The models were trained on a 1.37B-token training corpus and tested on a new robust Croatian word analogy corpus. Results show that the models are able to create meaningful word representations. This research has shown that free word order and the higher morphological complexity of the Croatian language influence the quality of the resulting word embeddings. Comment: In review process on LREC 2018 conference

arXiv.org e-Print Archive

Repository of the University of Rijeka