
    Efficient deep processing of Japanese

    We present a broad-coverage Japanese grammar written in the HPSG formalism with MRS semantics. The grammar is built for use in real-world applications, so robustness and performance play an important role. It is connected to a POS tagging and word segmentation tool. The grammar is being developed in a multilingual context, which requires MRS structures that are easily comparable across languages.

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT).

    Exploring the adaptive structure of the mental lexicon

    The mental lexicon is a complex structure organised in terms of phonology, semantics and syntax, among other levels. In this thesis I propose that this structure can be explained in terms of the pressures acting on it: every aspect of the organisation of the lexicon is an adaptation ultimately related to the function of language as a tool for human communication, or to the fact that language has to be learned by subsequent generations of people. A collection of methods, most of which are applied to a Spanish speech corpus, reveals structure at different levels of the lexicon.
    • The patterns of intra-word distribution of phonological information may be a consequence of pressures for optimal representation of the lexicon in the brain, and of the pressure to facilitate speech segmentation.
    • An analysis of perceived phonological similarity between words shows that the sharing of different aspects of phonological similarity is related to different functions. Phonological similarity perception sometimes relates to morphology (the stressed final vowel determines verb tense and person) and at other times shows processing biases (similarity in the word-initial and word-final segments is more readily perceived than in word-internal segments).
    • Another similarity analysis focuses on cooccurrence in speech to create a representation of the lexicon in which the position of a word is determined by the words that tend to occur in its close vicinity. Variations of this context-based lexical space naturally categorise words syntactically and semantically.
    • A higher level of lexicon structure is revealed by examining the relationships between the phonological and the cooccurrence similarity spaces. A study in Spanish supports the universality of the small but significant correlation between these two spaces found in English by Shillcock, Kirby, McDonald and Brew (2001). This systematicity across levels of representation adds an extra layer of structure that may help lexical acquisition and recognition. I apply it in a new paradigm to determine the function of parameters of phonological similarity based on their relationships with the syntactic-semantic level, and find that while some aspects of a language's phonology maintain systematicity, others work against it, perhaps responding to the opposing pressure for word identification.
    This thesis is an exploratory approach to the study of mental lexicon structure that uses existing and new methodology to deepen our understanding of the relationships between language use and language structure.
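    The cross-level correlation described in the last bullet can be illustrated with a small sketch: score every word pair once for phonological similarity and once for cooccurrence similarity, then correlate the two series across pairs. The toy corpus below, the normalized edit distance, and the window size are illustrative assumptions, not the thesis's actual measures.

```python
# Toy illustration of correlating phonological similarity with
# cooccurrence similarity across word pairs (cf. Shillcock et al. 2001).
# The corpus and both similarity measures are illustrative stand-ins.
from itertools import combinations
import numpy as np

corpus = ("el gato come pescado . el perro come carne . "
          "el gato duerme . el perro duerme").split()
words = sorted(set(w for w in corpus if w != "."))
index = {w: i for i, w in enumerate(words)}

# Cooccurrence vectors: count neighbours within a +/-2 word window.
cooc = np.zeros((len(words), len(words)))
for i, w in enumerate(corpus):
    if w == ".":
        continue
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        c = corpus[j]
        if j != i and c != ".":
            cooc[index[w], index[c]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def edit_distance(a, b):
    # Standard Levenshtein dynamic programme.
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

phon_sim, dist_sim = [], []
for a, b in combinations(words, 2):
    phon_sim.append(1 - edit_distance(a, b) / max(len(a), len(b)))
    dist_sim.append(cosine(cooc[index[a]], cooc[index[b]]))

# Pearson correlation across all word pairs.
print("r =", np.corrcoef(phon_sim, dist_sim)[0, 1])
```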

    A Guide to Text Analysis with Latent Semantic Analysis in R with Annotated Code: Studying Online Reviews and the Stack Exchange Community

    In this guide, we introduce researchers in the behavioral sciences in general, and in MIS in particular, to text analysis with latent semantic analysis (LSA). The guide contains hands-on annotated code samples in R that walk the reader through a typical process of acquiring relevant texts, creating a semantic space out of them, and then projecting words, phrases, or documents onto that semantic space to calculate their lexical similarities. R is an open-source, popular programming language with extensive statistical libraries. We introduce LSA as a concept, discuss the process of preparing the data, and note its potential and limitations. We demonstrate this process through a sequence of annotated code examples: we start with a study of online reviews that extracts lexical insight about trust; that R code applies singular value decomposition (SVD). The guide next demonstrates a realistically large data analysis of Stack Exchange, a popular Q&A site for programmers; that R code applies an alternative sparse SVD method. All the code and data are available on github.com.
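    For readers who want the gist of the pipeline before opening the guide, here is a minimal sketch of the same LSA process in Python (the guide's own code is in R): build a term-document matrix, truncate its SVD, and compare items by cosine similarity in the reduced space. The toy documents and the choice of k = 2 dimensions are made up for illustration.

```python
# Minimal LSA sketch: term-document matrix -> truncated SVD -> cosine
# similarity in the reduced latent space. Documents are invented.
import numpy as np

docs = ["the seller was honest and fast",
        "fast shipping honest seller",
        "the code compiles but the tests fail",
        "tests fail until the code compiles"]
vocab = sorted(set(w for d in docs for w in d.split()))
tdm = np.array([[d.split().count(w) for d in docs] for w in vocab],
               dtype=float)  # rows: terms, columns: documents

# Truncated SVD: keep k latent dimensions.
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
k = 2
term_space = U[:, :k] * s[:k]      # word coordinates
doc_space = Vt[:k, :].T * s[:k]    # document coordinates

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Documents 0 and 1 share a review topic; 2 and 3 a programming topic,
# so the first similarity should exceed the second.
print(cosine(doc_space[0], doc_space[1]), cosine(doc_space[0], doc_space[2]))
```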

    Design of a Controlled Language for Critical Infrastructures Protection

    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). The project originates from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is analogous work done during the 1960s in the field of nuclear science, known as the Euratom Thesaurus.
    JRC.G.6 - Security technology assessment
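    As a rough illustration of the library-science machinery involved, the sketch below models the classic thesaurus relations (broader term, narrower term, related term, and USE references from non-preferred to preferred terms). The CIP terms are hypothetical examples, not taken from the project or the Euratom Thesaurus.

```python
# Minimal sketch of a controlled-vocabulary thesaurus with the classic
# BT/NT/RT/USE relations used in library science. Terms are hypothetical.
from collections import defaultdict

class Thesaurus:
    def __init__(self):
        self.broader = defaultdict(set)   # term -> broader terms (BT)
        self.narrower = defaultdict(set)  # term -> narrower terms (NT)
        self.related = defaultdict(set)   # term -> related terms (RT)
        self.use = {}                     # non-preferred -> preferred term

    def add_bt(self, term, broader_term):
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_rt(self, a, b):
        self.related[a].add(b)
        self.related[b].add(a)

    def prefer(self, non_preferred, preferred):
        self.use[non_preferred] = preferred

    def resolve(self, term):
        """Map a free-text term onto the controlled vocabulary."""
        return self.use.get(term, term)

t = Thesaurus()
t.add_bt("power grid", "critical infrastructure")
t.add_bt("water supply", "critical infrastructure")
t.add_rt("power grid", "blackout")
t.prefer("electricity network", "power grid")
print(t.resolve("electricity network"))   # -> power grid
print(t.narrower["critical infrastructure"])
```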

    Modelling first and second language acquisition and processing with temporal self-organizing maps

    Starting from the evidence provided by researchers at the ComPhys Lab of the Institute for Computational Linguistics, Italian National Research Council (Pisa, ILC-CNR), the main goal of my thesis was to extend the application of computational modelling of language acquisition in monolingual and bilingual contexts to Spanish, which had not yet been treated within this research framework. First, I briefly outline some of the most prominent psycholinguistic approaches to the study of language acquisition. Second, three major models of lexical representation and morphological processing are presented, following the classification proposed by Bybee (1995): the dual-processing model, the connectionist model, and the network model. These models differ in whether they distinguish between regular and irregular verbs and their processing routes, and in whether the type/token frequency of verbal morphological patterns plays any role at all. The experimental part of the study focuses on the first and second language acquisition of Spanish verbs, contrasted with parallel datasets for Italian and German. To compile the dataset, I extracted the 50 most frequent verb paradigms from the European Spanish Web Corpus (2011), available in Sketch Engine, for a total of 750 inflected forms (infinitive, present and past participle, singular and plural simple present, and singular and plural simple past). The frequency distribution was provided for each inflected form. For an analysis and evaluation of the emergent organization of paradigmatic relations, I annotated each form with morpho-syntactic information (stem and affix length, paradigm cell, formal (ir)regularity, paradigm). Specific difficulties arose during the segmentation of Spanish verbs, due to the peculiarities of some irregular patterns. The processing of Spanish verb forms was simulated with Temporal Self-Organizing Maps (TSOMs), based on Kohonen's Self-Organizing Maps (2001) augmented with a temporal layer. This computational model reproduces the dynamics of lexical learning and processing by imitating the emergence of neural self-organization through the incremental adaptation of topologically and temporally aligned synaptic connections. I conclude that adaptive self-organization during learning is conducive to the emergence of relations between word forms, which are stored in the mental lexicon in a concurrent and competitive dynamic. In particular, from a bilingual perspective, monitoring the acquisitional trajectories of more than one lexicon (in both L1+L2 and L1/L1 contexts) showed how recycled memory resources and weaker connections affect L2 acquisition and processing, with smaller specialization for context-specific input chunks, depending on the exposure conditions.
    Belik, P. (2017). Modelación computacional del aprendizaje y procesamiento de primera y segunda lengua con los mapas temporales auto-organizados. http://hdl.handle.net/10251/86383 (TFG)
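    A minimal sketch of the TSOM idea, under stated assumptions: a standard self-organizing map over one-hot letter codes, plus a Hebbian temporal layer that strengthens the connection from the previous best-matching unit (BMU) to the current one. Map size, learning rates, and the toy Spanish verb forms are illustrative; this is not the ILC-CNR implementation.

```python
# Toy Temporal Self-Organizing Map update: a SOM over one-hot letter
# codes plus a Hebbian temporal layer linking consecutive BMUs.
import numpy as np

rng = np.random.default_rng(0)
side, dim = 8, 27                         # 8x8 map; 'a'-'z' plus boundary
W = rng.random((side * side, dim))        # input connections
T = np.zeros((side * side, side * side))  # temporal connections

coords = np.array([(i, j) for i in range(side) for j in range(side)])

def one_hot(ch):
    v = np.zeros(dim)
    v[26 if ch == "#" else ord(ch) - ord("a")] = 1.0
    return v

def train_word(word, lr=0.3, radius=2.0, beta=0.1):
    prev_bmu = None
    for ch in "#" + word + "#":           # '#' marks word boundaries
        x = one_hot(ch)
        bmu = int(np.argmin(((W - x) ** 2).sum(axis=1)))
        # Neighbourhood update pulls nodes near the BMU toward the input.
        dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2 * radius ** 2))
        W += lr * h[:, None] * (x - W)
        # Hebbian strengthening of the prev-BMU -> BMU expectation.
        if prev_bmu is not None:
            T[prev_bmu] *= (1 - beta)
            T[prev_bmu, bmu] += beta
        prev_bmu = bmu

for _ in range(50):
    for w in ["canto", "cantas", "canta", "como", "comes", "come"]:
        train_word(w)
# Forms sharing a stem come to recycle the same BMU chain, while T
# encodes serial expectations over letter sequences.
```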

    Dynamic language modeling for European Portuguese

    Doctoral thesis in Informatics Engineering. Most of today's methods for the transcription and indexation of broadcast audio data are manual. Broadcasters process thousands of hours of audio and video data on a daily basis in order to transcribe that data, extract semantic information, and interpret and summarize the content of those documents. Developing automatic and efficient support for these manual tasks has been a great challenge, and over the last decade there has been growing interest in using automatic speech recognition to provide automatic transcription and indexation of broadcast news, and relevant random access to large broadcast news databases. However, because topics change over time in this kind of task, the appearance of new events leads to high out-of-vocabulary (OOV) word rates and consequently to degraded recognition performance. This is especially true for highly inflected languages like European Portuguese. Several techniques can be exploited to reduce these errors: news-show-specific information, such as topic-based lexicons and the pivot's working script, as well as other sources such as the written news made available daily on the Internet, can be added to the information sources employed by the automatic speech recognizer.
    In this thesis we explore the use of additional sources of information for vocabulary optimization and language model adaptation of a European Portuguese broadcast news transcription system. The thesis makes three main contributions: a novel approach for vocabulary selection that uses Part-Of-Speech (POS) tags to compensate for word usage differences across the various training corpora; language model adaptation frameworks applied on a daily basis for single-stage and multi-stage recognition approaches; and a new method for including new words in the system vocabulary without the need for additional adaptation data or language model retraining.
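    A small sketch of the problem the thesis attacks: a recognizer's fixed vocabulary misses the words introduced by a new day's events, and refreshing the vocabulary from daily sources (scripts, online news) lowers the out-of-vocabulary rate. Plain weighted-frequency selection stands in here for the thesis's POS-aware selection; all tokens and weights are made up.

```python
# Sketch of the OOV problem behind dynamic vocabulary adaptation:
# measure the out-of-vocabulary rate of a fixed recognizer vocabulary
# on new text, then rebuild the vocabulary from pooled, weighted sources.
from collections import Counter

def oov_rate(tokens, vocab):
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / max(1, len(tokens))

def select_vocab(corpora, size):
    """Pick the `size` most frequent words across weighted corpora."""
    counts = Counter()
    for tokens, weight in corpora:
        for t in tokens:
            counts[t] += weight
    return {w for w, _ in counts.most_common(size)}

background = "o governo anunciou hoje novas medidas".split() * 100
todays_news = "o furacao atingiu hoje os acores".split()
scripts = "furacao nos acores governo reage".split()

vocab = select_vocab([(background, 1.0)], size=5)
print("static OOV:", oov_rate(todays_news, vocab))

# Daily refresh: give today's scripts and online news extra weight.
vocab = select_vocab([(background, 1.0), (scripts, 50.0)], size=8)
print("adapted OOV:", oov_rate(todays_news, vocab))
```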