
    Revising the Annotation of a Broadcast News Corpus: a Linguistic Approach

    This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news, focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The main focus of the revision process was annotating and revising structural metadata events, such as disfluencies and punctuation marks. The revised data is now extensively used and proved crucial for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system and all subsequent modules. The revised data has also recently been used in disfluency studies across domains.

    Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation

    This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, such as lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluation of a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches the previous output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also recently been used for studying the characterization of disfluencies in speech. It has already been applied to several domains of Portuguese corpora, and also to English and Spanish Broadcast News corpora.
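    The two-stage chain described above (manual annotations merged into the ASR output, then prosodic enrichment) can be sketched as follows. The `Token` structure, its field names, and the mean-F0 feature are illustrative assumptions, not the framework's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    word: str                 # word form from the automatic transcript
    start: float              # start time in seconds (ASR alignment)
    end: float                # end time in seconds
    punct: str = ""           # punctuation carried over from the manual transcript
    prosody: dict = field(default_factory=dict)  # e.g. pitch features

def enrich(asr_tokens, manual_punct, pitch_track):
    """Stage 1: merge manual punctuation into the ASR tokens.
    Stage 2: attach a mean-F0 value computed over each token's time span."""
    for i, tok in enumerate(asr_tokens):
        tok.punct = manual_punct.get(i, "")                            # stage 1
        frames = [f0 for t, f0 in pitch_track if tok.start <= t < tok.end]
        tok.prosody["mean_f0"] = sum(frames) / len(frames) if frames else None
    return asr_tokens
```

    Every layer ends up in one self-contained record per token, which is what makes such a representation easy to reuse for punctuation detection, capitalization recovery, or disfluency studies.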

    SEA_AP: una herramienta de segmentación y etiquetado para el análisis prosódico

    This paper introduces a tool that performs segmentation and labelling of sound chains into phone units, syllables, and/or words, starting from a sound signal and its corresponding orthographic transcription. In addition, it integrates acoustic analysis scripts run on the Praat programme, with the aim of reducing the time spent on tasks related to analysis, correction, smoothing, and generation of graphics of the melodic curve. The tool is implemented for Galician, Spanish, and Brazilian Portuguese. Our goal is to contribute, by means of this application, to automating some of the tasks of segmentation, labelling, and prosodic analysis, since these tasks require a large investment of time and human resources. This work would not have been possible without the help of the Spanish Government (project “SpeechTech4All”, TEC2012-38939-C03-01), the European Regional Development Fund (ERDF), the Government of the Autonomous Community of Galicia (GRC2014/024; “Consolidación de Unidades de Investigación: Proyecto AtlantTIC”, CN2012/160), and the “Red de Investigación TecAnDAli” of the Council of Culture, Education and University Planning, Xunta de Galicia.
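    As a minimal illustration of the kind of melodic-curve post-processing such a tool automates, the sketch below smooths an F0 contour with a moving average while skipping unvoiced frames. It is a stand-in for the Praat-based scripts, not SEA_AP's actual implementation.

```python
def smooth_f0(f0, window=3):
    """Moving-average smoothing of an F0 contour (Hz).
    Unvoiced frames are marked None: they stay None in the output
    and are excluded from their neighbours' averages."""
    half = window // 2
    out = []
    for i, value in enumerate(f0):
        if value is None:
            out.append(None)
            continue
        voiced = [x for x in f0[max(0, i - half):i + half + 1] if x is not None]
        out.append(sum(voiced) / len(voiced))
    return out
```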

    Syntactic REAP.PT: Exercises on Clitic Pronouning

    The emerging interdisciplinary field of Intelligent Computer-Assisted Language Learning (ICALL) aims to integrate knowledge from computational linguistics into computer-assisted language learning (CALL). REAP.PT is a project from this new field, aiming to teach Portuguese in an innovative and appealing way, adapted to each student. In this paper, we present a new improvement of the REAP.PT system, consisting of new, automatically generated syntactic exercises. These exercises deal with the complex phenomenon of pronominalization, that is, the substitution of a syntactic constituent by an adequate pronominal form. Though the transformation may seem simple, it involves complex lexical, syntactic, and semantic constraints. The issues surrounding pronominalization in Portuguese make it a particularly difficult aspect of language learning for non-native speakers. Even native speakers are often uncertain about correct clitic positioning, due to the complexity and interaction of the competing factors governing this phenomenon. A new architecture for automatic syntactic exercise generation is proposed. It proved invaluable in easing the development of this complex exercise, and is expected to be a relevant step forward in the development of future syntactic exercises, with the potential to become a syntactic exercise generation framework. A pioneering feedback system with detailed, automatically generated explanations for each answer is also presented, improving the learning experience, as stated in user comments. Positive results from expert evaluation and crowd-sourced testing demonstrate the validity of the present approach.
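    To make the exercise type concrete, here is a toy sketch of a generation step for accusative third-person clitics (o/a/os/as). The default-enclisis rule and the distractor patterns are simplifying assumptions; real clitic placement depends on proclisis triggers, mesoclisis, and the other constraints the system models in full.

```python
# Accusative third-person clitics by gender and number (toy lexicon).
CLITICS = {("masc", "sg"): "o", ("fem", "sg"): "a",
           ("masc", "pl"): "os", ("fem", "pl"): "as"}

def make_exercise(subject, verb, obj, gender, number):
    """Return (stem, correct answer, distractors) for one pronominalization item."""
    clitic = CLITICS[(gender, number)]
    stem = f"{subject} {verb} {obj}."
    answer = f"{subject} {verb}-{clitic}."       # default enclisis
    distractors = [
        f"{subject} {clitic} {verb}.",           # proclisis without a trigger
        f"{subject} {verb}-lhe.",                # wrong case (dative clitic)
    ]
    return stem, answer, distractors
```

    For instance, make_exercise("A Maria", "leu", "o livro", "masc", "sg") yields the stem "A Maria leu o livro." with the answer "A Maria leu-o."; the feedback system described above would attach an explanation to each option.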

    In search of effective training models for Mozambican translators and interpreters

    Although Eduardo Mondlane University (UEM) has the longest history of BA Honours level translator and interpreter training in Mozambique, the university still lacks an effective model for the development of translation and interpreting competence in students. To address this problem, the present study seeks to find a practical model for the training of Mozambican professional translators and interpreters at BA Honours level that can guide the design of conducive curricula. The critical question the study attempts to answer is: What model for developing translation and interpreting competence could lead to an effective curriculum design that best meets the employment needs of Mozambican students? To this end, the study has been designed as action-research because this design enables better understanding and improvement of training processes (Cravo & Neves 2007). Three data collection tools are used to generate both qualitative and quantitative data from over 120 participants, namely: (i) a survey, (ii) an English translation test and (iii) a sample of archived Portuguese translations produced by former students. The survey findings suggest the need for a model whereby translators and interpreters are trained simultaneously within the same programme. Moreover, the results of macro- and micro-textual analysis show that, overall, the translation competence of former students is poor, suggesting that the current curriculum at UEM is failing to produce BA Honours translation/interpreting professionals. 
The proposed solution would be a curriculum based on a new integrated translation and interpreting competence development model with the following four pillars: communicative competence, general knowledge, strategic competence, and service provision. Linguistics and Modern Languages. D. Phil. (Languages, Linguistics and Literature)

    FreeLing: From a multilingual open-source analyzer suite to an EBMT platform.

    FreeLing is an open-source library providing a wide range of language analysis utilities for several different languages. It is intended to provide NLP application developers with the text processing and language annotation tools they may need, in order to simplify their development task. Moreover, FreeLing is customizable and extensible: developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend them, adapt them to particular domains, or even develop new resources for specific languages. Being open source has enabled FreeLing to grow far beyond its original capabilities, especially with regard to linguistic data: contributions from its community of users include, for instance, morphological dictionaries and PoS-tagger training data for Galician, Italian, Portuguese, Asturian, and Welsh. In this paper we present the basic architecture and the main services in FreeLing, outline how developers might use it to build competitive NLP systems, and indicate how it might be extended to support the development of Example-Based Machine Translation systems.
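    The customizability described above boils down to a pluggable-resource pattern: analyzers consult data files that users can replace or extend. The sketch below illustrates that idea in miniature; the class, method names, and dictionary entries are invented for illustration and are not FreeLing's actual API.

```python
class MorphoDict:
    """A toy morphological dictionary: word form -> (lemma, PoS tag)."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def extend(self, extra):
        # Domain adaptation: add new entries or override default ones.
        self.entries.update(extra)

    def analyze(self, word):
        # Unknown forms fall through with a placeholder tag.
        return self.entries.get(word.lower(), (word, "UNK"))

# Start from a default resource, then extend it for a specific domain.
galician = MorphoDict({"cans": ("can", "NCMP000")})      # toy entry
galician.extend({"cadelas": ("cadela", "NCFP000")})
```

    Community-contributed resources such as the Galician or Welsh dictionaries mentioned above plug into the analyzers in essentially this way: same analysis code, swapped data.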

    FreeLing 3.0: Towards Wider Multilinguality

    FreeLing is an open-source multilingual language processing library providing a wide range of analyzers for several languages. It offers text processing and language annotation facilities to NLP application developers, lowering the cost of building those applications. FreeLing is customizable, extensible, and has a strong orientation to real-world applications in terms of speed and robustness. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend or adapt them to specific domains, or, since the library is open source, develop new ones for specific languages or special application needs. This paper describes the general architecture of the library, presents the major changes and improvements included in FreeLing version 3.0, and summarizes some relevant industrial projects in which it has been used.

    Doing Business Across Cultures

    This textbook was developed in accordance with the curriculum of the course “English for Professional Purposes” for the speciality “International Economics”. The system of tasks proposed in the textbook is based on authentic materials and is intended to raise the level of students’ sociocultural competence, to foster the development of their communicative skills in English-language business communication, and to shape appropriate verbal and non-verbal behaviour in various ethnospecific contexts of business communication. Understanding how core values in particular business environments vary from culture to culture, being aware of different cultural patterns, applying intercultural insights, and behaving appropriately while interacting in various social and business-related situations all appear to be a very important part of learning a foreign language. If you make a grammar mistake, it may be “wrong”, but very often people will understand you anyway. But if you do not know what to say, or how to recognize gestures, maintain eye contact, observe personal space, or use appropriate body language in each situation, it may turn out frustrating for both you and the person you are talking to. “Doing Business Across Cultures” will come in quite handy, since it has been designed to revise and consolidate learners’ knowledge of the variety of business corporate cultures, as well as to develop skills in putting this knowledge into practice.

    Automatic extraction of definitions

    Doctoral thesis, Informática (Engenharia Informática), Universidade de Lisboa, Faculdade de Ciências, 2014. This doctoral research provides a set of methods and heuristics for building a definition extractor, or for fine-tuning an existing one. In order to develop and test the architecture, a generic definition extractor for the Portuguese language was built. Furthermore, the methods were tested in the construction of extractors for two languages other than Portuguese: English and, less extensively, Dutch. The approach presented in this work makes the proposed extractor completely different in nature from the other works in the field: most systems that automatically extract definitions have been constructed with a specific corpus on a specific topic in mind, and are based on the manual construction of a set of rules or patterns capable of identifying a definition in a text. This research focused on three types of definitions, characterized by the connector between the defined term and its description. The strategy adopted can be seen as a "divide and conquer" approach. Differently from the other works representing the state of the art, specific heuristics were developed to deal with the different types of definitions, namely copula, verbal, and punctuation definitions. A different methodology is used for each type: rule-based methods to extract punctuation definitions, machine learning with sampling algorithms for copula definitions, and machine learning with a method to increase the number of positive examples for verbal definitions. This architecture is justified by the increasing linguistic complexity that characterizes the different types of definitions. Numerous experiments have led to the conclusion that punctuation definitions are easily described using a set of rules. These rules can be easily adapted to the relevant context and translated into other languages. However, for the other two definition types, the exclusive use of rules is not enough to achieve good performance, and more advanced methods are required, in particular a machine learning based approach. Unlike other similar systems, which were built with a specific corpus or a specific domain in mind, the one reported here is meant to obtain good results regardless of the domain or context. All the decisions made in the construction of the definition extractor take this central objective into consideration. Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/36732/2007.
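    As a minimal illustration of the rule-based branch, the sketch below extracts punctuation definitions (a term joined to its description by a colon or dash) with a single regular expression. The pattern is invented for illustration and is far simpler than the thesis's actual rule set.

```python
import re

# A punctuation definition joins a capitalized term to a lower-case
# description via a punctuation connector (here, a colon or a dash).
PUNCT_DEF = re.compile(
    r"^(?P<term>[A-Z][\w -]{0,40}?)\s*[:–-]\s*(?P<gloss>[a-z].{10,})$"
)

def extract_punct_definitions(lines):
    """Return (term, description) pairs for lines matching the rule."""
    found = []
    for line in lines:
        match = PUNCT_DEF.match(line.strip())
        if match:
            found.append((match.group("term").strip(),
                          match.group("gloss").strip()))
    return found
```

    Copula ("X is a Y") and verbal definitions resist this kind of shallow treatment, which is why the work moves to machine learning methods for those two types.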

    TectoMT – a deep-linguistic core of the combined Chimera MT system

    Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All the components run on the Unix/Linux platform and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).