
    Learning Language Representations for Typology Prediction

    One central mystery of neural NLP is what neural models "know" about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages? Can this knowledge be extracted from the system to fill holes in human scientific knowledge? Existing typological databases contain relatively full feature specifications for only a few hundred languages. Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one neural machine translation (NMT) system from 1017 languages into English, and use this to predict information missing from typological databases. Experiments show that the proposed method is able to infer not only syntactic but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages' geographic and phylogenetic neighbors. Comment: EMNLP 2017
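    To make the downstream step concrete, the following is a minimal sketch (not the authors' code) of how language embeddings extracted from a trained NMT system could be used to fill gaps in a typological database. The arrays lang_vecs and wals are hypothetical stand-ins for the learned language representations and a known binary typological feature.

        # Minimal sketch: predict a missing typological feature from language
        # embeddings. `lang_vecs` and `wals` are random stand-ins, assumed to
        # come from a trained NMT system and a typological database.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_langs, dim = 1017, 512
        lang_vecs = rng.normal(size=(n_langs, dim))   # stand-in language embeddings
        wals = rng.integers(0, 2, size=n_langs)       # stand-in binary feature value
        known = rng.random(n_langs) < 0.3             # only some languages annotated

        clf = LogisticRegression(max_iter=1000).fit(lang_vecs[known], wals[known])
        pred = clf.predict(lang_vecs[~known])         # fill in the missing entries
        print(f"predicted {len(pred)} missing feature values")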

    Compounds in Portuguese

    In this article, Portuguese compounds are analysed according to different criteria, such as: (i) the morphological, categorial and semantic properties of their basic units; (ii) the grammatical relations linking their constituents; (iii) their syntactic atomicity and lexical opacity; and (iv) the patterns of inflection. The problem of the boundaries of compoundhood, namely those existing between compounds and phrasal nouns, is also addressed, as well as the accuracy of the tests adopted to distinguish compounds (especially phrasal or prepositional compounds) from phrases. We assume that, in conjunction with the criteria mentioned above, the referential identity of the entity, object, event or property denoted by the compound is a crucial dimension for the conceptual integrity of each compound lexeme. Keywords: compounding, word-formation, morphology, Portuguese

    Corpus Linguistics and Translation

    The 21st century has witnessed tremendous technological and organizational advances in the world's economy and societies, which have had a great impact on translation and translation studies. A corpus is a machine-readable, representative collection of naturally occurring language assembled for linguistic analysis, accessible with software such as concordancers that can find, list and source linguistic patterns. It lays the foundation of corpus linguistics, which makes it possible for translators and translation scholars to draw on large quantities of data stored on computers when examining target-language translations. Computer corpora include spoken and written, casual and formal, fiction and non-fiction texts representing various demographic areas and language geographies. The study aims at familiarizing students of translation and translators with the methods and practical applications of computer corpora in various fields of language use. The study reveals that corpus data are essential for accurately describing various samples of language, showing how lexis, grammar and semantics interact to produce appropriate translation output. It also points to the need to build national Arabic corpora of Arab authors for comparison with foreign corpora.
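    As a toy illustration (not from the article) of the concordance lookups described above, the snippet below pulls keyword-in-context lines from NLTK's built-in Gutenberg corpus; any corpus file and search term could be substituted.

        # Toy concordance lookup over a machine-readable corpus with NLTK.
        import nltk
        from nltk.corpus import gutenberg
        from nltk.text import Text

        nltk.download("gutenberg", quiet=True)
        words = gutenberg.words("austen-emma.txt")    # any corpus file would do
        Text(words).concordance("emma", width=60, lines=5)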

    Isomorphic Transfer of Syntactic Structures in Cross-Lingual NLP

    The transfer or sharing of knowledge between languages is a popular solution to resource scarcity in NLP. However, the effectiveness of cross-lingual transfer can be challenged by variation in syntactic structures. Frameworks such as Universal Dependencies (UD) are designed to be cross-lingually consistent, but even in carefully designed resources the trees representing equivalent sentences may not always overlap. In this paper, we measure cross-lingual syntactic variation, or anisomorphism, in the UD treebank collection, considering both morphological and structural properties. We show that reducing the level of anisomorphism yields consistent gains in cross-lingual transfer tasks. We introduce a source language selection procedure that facilitates effective cross-lingual parser transfer, and propose a typologically driven method for syntactic tree processing which reduces anisomorphism. Our results show the effectiveness of this method for both machine translation and cross-lingual sentence similarity, demonstrating the importance of syntactic structure compatibility for boosting cross-lingual transfer in NLP.
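    The paper's own measure is more elaborate, but the following sketch illustrates one simple way to quantify anisomorphism: given a 1:1 word alignment between two translated sentences, compare their unlabeled dependency edge sets. The function and toy data are illustrative assumptions, not the authors' implementation.

        # Rough anisomorphism proxy: Jaccard overlap of dependency edges,
        # mapped through a 1:1 word alignment (token indices start at 1).
        def edge_overlap(heads_src, heads_tgt, align):
            """heads_*: head index per token (0 = root); align: src -> tgt index."""
            src_edges = {(align[d], align[h])
                         for d, h in enumerate(heads_src, start=1)
                         if h != 0 and d in align and h in align}
            tgt_edges = {(d, h) for d, h in enumerate(heads_tgt, start=1) if h != 0}
            if not src_edges or not tgt_edges:
                return 0.0
            return len(src_edges & tgt_edges) / len(src_edges | tgt_edges)

        # Toy 3-token sentence pair, aligned monotonically; identical trees.
        print(edge_overlap([2, 0, 2], [2, 0, 2], {1: 1, 2: 2, 3: 3}))  # 1.0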

    Typological parameters of genericity

    Different languages employ different morphosyntactic devices for expressing genericity. And, of course, they also make use of different morphosyntactic and semantic or pragmatic cues which may contribute to the interpretation of a sentence as generic rather than episodic. [...] We will advance the strong hypothesis that it is a fundamental property of lexical elements in natural language that they are neutral with respect to different modes of reference or non-reference. That is, we reject the idea that a certain use of a lexical element, e.g. a use which allows reference to particular spatio-temporally bounded objects in the world, should be linguistically prior to all other possible uses, e.g. to generic and non-specific uses. From this it follows that we do not consider generic uses as derived from non-generic uses, as is occasionally assumed in the literature. Rather, we regard these two possibilities of use as equivalent alternative uses of lexical elements. The typological differences to be noted therefore concern the formal and semantic relationship of generic and non-generic uses to each other; they do not pertain to the question of whether lexical elements are predetermined for one of these two uses. Even supposing we found a language where generic uses are always zero-marked and identical to lexical stems, we would still not assume that lexical elements in this language primarily have a generic use from which the non-generic uses are derived. (Incidentally, none of the languages examined, not even Vietnamese, meets this criterion.)

    Attaining Fluency in English through Collocations


    Measuring Short Text Semantic Similarity with Deep Learning Models

    Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken, and is a subfield of artificial intelligence (AI). The development of NLP applications is challenging because computers traditionally require humans to "speak" to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. We study the use of deep learning models, the state-of-the-art AI method, for the problem of measuring short-text semantic similarity in NLP. In particular, we propose a novel deep neural network architecture to identify semantic similarity for pairs of question sentences. In the proposed network, multiple channels of knowledge for pairs of question text can be utilized to improve the representation of the text. A dense layer is then used to learn a classifier for classifying duplicated question pairs. Through extensive experiments on the Quora test collection, our proposed approach has shown remarkable and significant improvement over strong baselines, which verifies the effectiveness of deep models as well as the proposed deep multi-channel framework.
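    The abstract describes the architecture only at a high level; the sketch below shows one plausible reading of it in PyTorch, with two weight-sharing question encoders whose outputs feed several interaction "channels" into a dense classifier. The encoder choice and layer sizes are assumptions, not the paper's specification.

        # Minimal sketch of a multi-channel duplicate-question classifier.
        import torch
        import torch.nn as nn

        class QuestionPairClassifier(nn.Module):
            def __init__(self, vocab_size=30000, emb_dim=128, hidden=128):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
                self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
                # Dense layer over the concatenated interaction channels.
                self.classifier = nn.Sequential(
                    nn.Linear(hidden * 4, hidden), nn.ReLU(), nn.Linear(hidden, 2)
                )

            def encode(self, ids):
                _, (h, _) = self.encoder(self.emb(ids))
                return h[-1]                  # final hidden state per question

            def forward(self, q1_ids, q2_ids):
                a, b = self.encode(q1_ids), self.encode(q2_ids)
                # Channels: both encodings, their difference, and their product.
                feats = torch.cat([a, b, torch.abs(a - b), a * b], dim=-1)
                return self.classifier(feats)

        model = QuestionPairClassifier()
        q1 = torch.randint(1, 30000, (4, 12))  # batch of 4 token-id sequences
        q2 = torch.randint(1, 30000, (4, 12))
        print(model(q1, q2).shape)             # torch.Size([4, 2])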