19,041 research outputs found

Limitations of Cross-Lingual Learning from Image Search

Author: Hartmann Mareike
Soegaard Anders
Publication date: 18/09/2017

Cross-lingual representation learning is an important step in making NLP scale to all the world's languages. Recent work on bilingual lexicon induction suggests that it is possible to learn cross-lingual representations of words based on similarities between images associated with these words. However, that work focused on the translation of selected nouns only. In our work, we investigate whether the meaning of other parts of speech, in particular adjectives and verbs, can be learned in the same way. We also experiment with combining the representations learned from visual data with embeddings learned from textual data. Our experiments across five language pairs indicate that previous work does not scale to the problem of learning cross-lingual representations beyond simple nouns.

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Lexical Similarities and Differences in the Mathematics, Science and English Language Textbooks

Author: Menon S. (Sujatha)
Mukundan J. (Jayakaran)
Publication venue: 'Petra Christian University'
Publication date: 01/12/2007

The teaching of Science and Math in English in Malaysia is an area of great concern to educators and students alike. This study looks, in particular, at the common word classes among keywords identified in the Science, Math and English language Form One textbooks used in Malaysia and the differences in language use identified in the Science and Math textbooks.

Neliti

Simulating the Noun-Verb Asymmetry in the Productivity of Children’s Speech

Author: Freudenthal D
Gobet F
Pine J M
Publication venue: 'Energy Psychology Press'
Publication date: 01/01/2007

Several authors propose that children may acquire syntactic categories on the basis of co-occurrence statistics of words in the input. This paper assesses the relative merits of two such accounts by assessing the type and amount of productive language that results from computing co-occurrence statistics over conjoint and independent preceding and following contexts. This is achieved through the implementation of these methods in MOSAIC, a computational model of syntax acquisition that produces utterances that can be directly compared to child speech, and has a developmental component (i.e., it produces increasingly long utterances). It is shown that the computation of co-occurrence statistics over conjoint contexts or frames results in a pattern of productive speech that more closely resembles that displayed by language-learning children. The simulation of the developmental patterning of children’s productive speech furthermore suggests two refinements to this basic mechanism: inclusion of utterance boundaries, and the weighting of frames for their lexical content.

Brunel University Research Archive

AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis

Author: Borth Damian
Elizalde Benjamin
Lane Ian
Raj Bhiksha
Sager Sebastian
Schulze Christian
Publication date: 08/01/2018

Recently, sound recognition has been used to identify sounds, such as car and river. However, sounds have nuances that may be better described by adjective-noun pairs such as slow car, and verb-noun pairs such as flying insects, which are underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun pairs and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with this type of label. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded results of 70% accuracy, hence also providing a performance benchmark. The results and study in this paper encourage further exploration of the nuances in audio and are meant to complement similar research performed on images and text in multimedia analysis. Comment: This paper is a revised version of "AudioSentibank: Large-scale Semantic Ontology of Acoustic Concepts for Audio Content Analysis".

arXiv.org e-Print Archive

Directory of Open Access Journals

Assessing the contribution of shallow and deep knowledge sources for word sense disambiguation

Author: C. Fellbaum
D. Yarowsky
Lucia Specia
M. Stevenson
Maria das Graças Volpe Nunes
Mark Stevenson
S. Muggleton
Y. Wilks
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2010

Corpus-based techniques have proved to be very beneficial in the development of efficient and accurate approaches to word sense disambiguation (WSD) despite the fact that they generally represent relatively shallow knowledge. It has always been thought, however, that WSD could also benefit from deeper knowledge sources. We describe a novel approach to WSD using inductive logic programming to learn theories from first-order logic representations, allowing corpus-based evidence to be combined with any kind of background knowledge. This approach has been shown to be effective over several disambiguation tasks using a combination of deep and shallow knowledge sources. It is important to understand the contribution of the various knowledge sources used in such a system. This paper investigates the contribution of nine knowledge sources to the performance of the disambiguation models produced for the SemEval-2007 English lexical sample task. The outcome of this analysis will assist future work on WSD in concentrating on the most useful knowledge sources.

Crossref

White Rose Research Online

Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

Author: Cocks John
Publication venue: 'University of Waikato'
Publication date: 16/03/2012

This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before developing a part-of-speech tagging system, a suitable tagset is required for that language. In this thesis, we make the following contributions to bridge this gap. First, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Second, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system in Māori, that of a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and was the result of in-depth analysis of Māori grammar.

Research Commons@Waikato