18 research outputs found

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last ten years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available through the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods introduced by Chinese researchers allowed syntactic information to be incorporated into translation modeling. Furthermore, advances in the related field of computational linguistics, which made off-the-shelf taggers and parsers readily available, gave MT an additional boost. Yet there is still progress to be made. For example, MT will be greatly enhanced when both syntax and semantics are on board; this remains a major challenge, though many advanced research groups are pursuing ways to meet it head-on. The next generation of MT will consist of a collection of hybrid systems. The outlook is also good for the mobile environment, as more advanced speech recognition and speech synthesis technologies will enable speech-to-speech machine translation on hand-held devices. We review all of these developments and, in the final section, point out some of the most promising research avenues for the future of MT.

    Improving Neural Question Answering with Retrieval and Generation

    Get PDF
    Text-based Question Answering (QA) is a subject of interest both for its practical applications and as a test-bed to measure the key Artificial Intelligence competencies of Natural Language Processing (NLP) and the representation and application of knowledge. QA has progressed a great deal in recent years through the adoption of neural networks, the construction of large training datasets, and unsupervised pretraining. Despite these successes, QA models require large amounts of hand-annotated data, struggle to apply supplied knowledge effectively, and can be computationally expensive to operate. In this thesis, we employ natural language generation and information retrieval techniques in order to explore and address these three issues. We first approach the task of Reading Comprehension (RC), with the aim of lifting the requirement for in-domain hand-annotated training data. We describe a method for inducing RC capabilities without requiring hand-annotated RC instances, and demonstrate performance on par with early supervised approaches. We then explore multi-lingual RC, and develop a dataset to evaluate methods which enable training RC models in one language and testing them in another. Second, we explore open-domain QA (ODQA), and consider how to build models which best leverage the knowledge contained in a Wikipedia text corpus. We demonstrate that retrieval augmentation greatly improves the factual predictions of large pretrained language models in unsupervised settings. We then introduce a class of retrieval-augmented generator models, and demonstrate their strength and flexibility across a range of knowledge-intensive NLP tasks, including ODQA. Lastly, we study the relationship between memorisation and generalisation in ODQA, developing a behavioural framework based on memorisation to contextualise the performance of ODQA models. Based on these insights, we introduce a class of ODQA models based on the concept of representing knowledge as question-answer pairs, and demonstrate how, by using question generation, such models can achieve high accuracy, fast inference, and well-calibrated predictions.
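
    To make the retrieval-augmentation idea concrete, below is a minimal sketch of the pattern: score corpus passages against the question and condition a generator on the top hits. The TF-IDF scorer, toy corpus, and `generate` stub are illustrative assumptions, not the thesis's actual retriever or pretrained generator.

```python
# Minimal sketch of retrieval-augmented QA: rank passages by TF-IDF
# overlap with the question, then hand the evidence to a generator stub.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(passages):
    """Build a simple TF-IDF vector (dict of term -> weight) per passage."""
    df = Counter()
    for p in passages:
        df.update(set(tokenize(p)))
    n = len(passages)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(tokenize(p)).items()}
            for p in passages]

def retrieve(question, passages, vectors, k=2):
    """Rank passages by the TF-IDF mass of the question's terms."""
    q_terms = set(tokenize(question))
    scores = [sum(vec.get(t, 0.0) for t in q_terms) for vec in vectors]
    order = sorted(range(len(passages)), key=lambda i: -scores[i])
    return [passages[i] for i in order[:k]]

def generate(question, evidence):
    """Stub for a pretrained generator conditioned on retrieved evidence."""
    return "context: " + " ".join(evidence) + " question: " + question

corpus = ["Marie Curie won Nobel Prizes in physics and chemistry.",
          "The Eiffel Tower is located in Paris."]
evidence = retrieve("Where is the Eiffel Tower?", corpus, tfidf_vectors(corpus), k=1)
print(generate("Where is the Eiffel Tower?", evidence))
```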

    Negative vaccine voices in Swedish social media

    Get PDF
    Vaccinations are one of the most significant public health interventions, but vaccine hesitancy remains a concern among a portion of the population in many countries, including Sweden. Since discussions on vaccine hesitancy often take place on social networking sites, data from Swedish social media are used to study and quantify the sentiment among discussants on the vaccination-or-not topic during phases of the COVID-19 pandemic. A majority of the posts analyzed showed a predominantly negative sentiment that prevailed throughout the examined period, with spikes attributable to specific vaccine-related events distinguishable in the results. Sentiment analysis can be a valuable tool to track public opinion regarding the use, efficacy, safety, and importance of vaccination.
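
    As an illustration of this kind of analysis, here is a minimal lexicon-based sketch that averages post sentiment per month. The tiny English lexicon and example posts are hypothetical stand-ins for the Swedish data and for whatever sentiment model the study actually used.

```python
# Minimal sketch: lexicon-based sentiment scoring aggregated per month,
# to track how discussion sentiment moves over phases of a pandemic.
from collections import defaultdict

LEXICON = {"safe": 1, "effective": 1, "protect": 1,
           "dangerous": -1, "unsafe": -1, "refuse": -1}

def score(post):
    """Sum the lexicon polarities of the tokens in one post."""
    return sum(LEXICON.get(tok, 0) for tok in post.lower().split())

def monthly_sentiment(posts):
    """Average sentiment per month for an iterable of (month, text) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for month, text in posts:
        totals[month] += score(text)
        counts[month] += 1
    return {m: totals[m] / counts[m] for m in totals}

posts = [("2021-01", "the vaccine is safe and effective"),
         ("2021-01", "i refuse this dangerous vaccine"),
         ("2021-02", "vaccines protect the vulnerable")]
print(monthly_sentiment(posts))  # dips flag hesitancy spikes around events
```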

    Semantic approaches to domain template construction and opinion mining from natural language

    Get PDF
    Most of the text mining algorithms in use today are based on lexical representations of input texts, for example bag-of-words. A possible alternative is to first convert text into a semantic representation, one that captures the text content in a structured way using only a set of pre-agreed labels. This thesis explores the feasibility of such an approach to two tasks on collections of documents: identifying common structure in input documents (»domain template construction«), and helping users find differing opinions in input documents (»opinion mining«). We first discuss ways of converting natural text to a semantic representation. We propose and compare two new methods with varying degrees of target representation complexity. The first method, which shows more promise, is based on dependency parser output, which it converts to lightweight semantic frames with role fillers aligned to WordNet. The second method structures text using Semantic Role Labeling techniques and aligns the output to the Cyc ontology. Based on the first of these representations, we next propose and evaluate two methods for constructing frame-based templates for documents from a given domain (e.g. bombing attack news reports). A template is the set of all salient attributes (e.g. attacker, number of casualties, etc.). The idea of both methods is to construct abstract frames for which more specific instances (according to the WordNet hierarchy) can be found in the input documents. Fragments of these abstract frames represent the sought-for attributes. We achieve state-of-the-art performance and additionally provide detailed type constraints for the attributes, something not possible with competing methods. Finally, we propose a software system for exposing differing opinions in the news. For any given event, we present the user with all known articles on the topic and allow navigation by three semantic properties simultaneously: sentiment, topical focus, and geography of origin. The result is a dynamically reranked set of relevant articles and a near-real-time focused summary of those articles. The summary, too, is computed from the semantic text representation discussed above. We conducted a user study of the whole system with very positive results.
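
    As a rough illustration of the first method, the sketch below extracts subject-verb-object frames from a dependency parse and aligns the fillers to WordNet synsets. spaCy and NLTK are stand-ins here, not necessarily the parser and WordNet interface the thesis used, and the most-frequent-sense alignment is deliberately crude.

```python
# Minimal sketch: dependency parse -> lightweight semantic frames, with
# role fillers mapped to WordNet synsets via a first-sense heuristic.
import spacy                           # pip install spacy; python -m spacy download en_core_web_sm
from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download('wordnet')

nlp = spacy.load("en_core_web_sm")

def first_synset(lemma):
    """Crude WordNet alignment: take the most frequent noun sense."""
    synsets = wn.synsets(lemma, pos=wn.NOUN)
    return synsets[0].name() if synsets else None

def frames(sentence):
    """Extract (agent, predicate, patient) frames from one sentence."""
    out = []
    for tok in nlp(sentence):
        if tok.pos_ == "VERB":
            subj = [c for c in tok.children if c.dep_ == "nsubj"]
            obj = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            if subj and obj:
                out.append({"predicate": tok.lemma_,
                            "agent": (subj[0].text, first_synset(subj[0].lemma_)),
                            "patient": (obj[0].text, first_synset(obj[0].lemma_))})
    return out

print(frames("The attacker detonated a bomb near the embassy."))
```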

    A Grammatically Structured Noun Phrase Extractor for Vietnamese

    No full text

    The acquisition of lexical bundles in english for academic purposes: a multidisciplinary study of novice authors

    Get PDF
    Lexical bundles, also known as clusters or n-grams, are defined as sequences of more than two words that occur in a given register more frequently than would be expected by chance (Biber et al., 1999). Although structurally incomplete and semantically transparent, these highly recurrent phrasal sequences perform important functions in the construction of discourse (Biber et al., 1999; Biber, 2006). It has been shown that proficient speakers store formulaic phrases and reproduce them as wholes, without analysing their components (Wray, 2002; Schmitt & Underwood, 2004), and that these lexico-grammatical units play an important role in constructing coherent and fluent discourse (Nattinger & DeCarrico, 1992; Pawley & Syder, 1983). Although lexical bundles are more frequent in conversation than in other registers, these phrasal sequences have been shown to perform important communicative functions in academic discourse as well (Biber, 2006). Each genre possesses a characteristic set of lexical bundles that contribute to the creation of meaning and are a distinctive feature of the genre (Hyland, 2012). Numerous studies have examined lexical bundles in written and spoken academic discourse (Biber & Barbieri, 2007; Biber & Gray, 2010; Nesi & Basturkmen, 2006; Sánchez, 2013), contrasting texts by native and non-native authors (Ädel & Erman, 2012; Chen & Baker, 2010; Salazar, 2014) as well as by expert and novice academics (Biber et al., 2004; Chen & Baker, 2010; Cortés, 2004; Hyland, 2008b; Staples et al., 2013). According to these studies, there are significant differences between native and non-native authors' use of lexical bundles (Ädel & Erman, 2012; Byrd & Coxhead, 2010; Pan, Reppen & Biber, 2016). In addition, contrastive studies of lexical bundles across disciplines have been carried out (Allen, 2001; Biber, 2005; Hyland, 2008a), as well as studies focused on specific disciplines or academic areas (Ädel & Erman, 2012; Chen & Baker, 2010; Cortés, 2004; Eriksson, 2012; Farvardin, Afghari & Koosha, 2012; Sánchez, 2014; Salazar, 2011). According to this research, each field of knowledge has its own set of lexical bundles for constructing knowledge, and these recurrent phrasal sequences constitute a crucial part of the discursive practices and generic conventions of each field (Hyland, 2008a; Durrant, 2009; Eriksson, 2012). Novice researchers have been shown to overuse some lexical bundles and to make little or no use of others, producing texts that are not prototypical for their field and genre (Ädel & Erman, 2012; Chen & Baker, 2010; Salazar, 2014). Several pedagogically oriented studies have produced lists of lexical bundles for novice researchers (Ackermann, 2013; Simpson-Vlach & Ellis, 2010) or have incorporated the teaching of lexical bundles into English for Specific Purposes or English as a Foreign Language classes (Cortés, 2006; Erman et al., 2013; Jones & Haywood, 2005; Peters & Pauwels, 2015). In most of these studies on teaching lexical bundles, the methodology adopted is based on vocabulary-teaching techniques (Alali & Schmitt, 2012; Cortés, 2006; Erman et al., 2013; Jones & Haywood, 2005), which do not seem entirely adequate for teaching lexico-grammatical patterns. The result of such methodologies is that learners acquire theoretical knowledge about lexical bundles and score well on final tests, but their written production does not seem to improve substantially after the training. The present work studies lexical bundles from a pedagogical perspective, as a resource for improving the texts written by non-native novice academics. The study covers three disciplines, namely psychology, linguistics, and literary studies, in order to analyse the frequency, structure, and use of lexical bundles in each of them. The second part of the study proposes a method for teaching lexical bundles to advanced users of English and provides them with tools and a methodology for independent learning. The study therefore comprises two phases: the analysis of lexical bundles in the three disciplines mentioned above, and the training of non-native novice academics, aimed at offering them a methodology for building their own mini-corpora and extracting lexical bundles from prototypical texts characteristic of their field, discipline, and object of study. During the first phase, lexical bundles were extracted from a corpus of approximately 2.1 million words using AntConc (Anthony, 2010). The bundles were then classified according to the functional taxonomy proposed by Biber, Conrad and Cortés (2003) and Hyland (2008b). The second phase consisted of a brief introductory session and a workshop for learners at the Universidad de La Laguna: final-year students of the Department of English and German Philology and doctoral students of the Faculty of Psychology. The present study shows that each discipline possesses a set of lexical bundles that form part of its established discursive practices; consequently, novice academics should acquire these practices in order to be admitted into the discourse community of their discipline. It also shows that the teaching of lexical bundles should be grounded in the learners' specific disciplines in order to meet their needs. Although the texts produced by the workshop participants after the training are not voluminous enough to assert the viability of this methodology with confidence, the results and the feedback obtained from the participants indicate that this type of training increases learners' awareness of lexical bundles and is well received by them.
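
    The extraction step is essentially n-gram counting with frequency and dispersion thresholds. The sketch below shows that logic in miniature; the thresholds and toy texts are illustrative, and the study itself used AntConc over a corpus of roughly 2.1 million words.

```python
# Minimal sketch of lexical-bundle extraction: count word 4-grams and keep
# those above a normalized frequency cutoff that also appear in enough texts.
from collections import Counter

def lexical_bundles(texts, n=4, min_per_million=20, min_texts=3):
    """Return n-grams above min_per_million words and spread over min_texts texts."""
    counts, spread = Counter(), Counter()
    total_words = 0
    for text in texts:
        words = text.lower().split()
        total_words += len(words)
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        counts.update(grams)
        spread.update(set(grams))  # dispersion: one hit per text
    cutoff = min_per_million * total_words / 1_000_000
    return sorted(g for g, c in counts.items()
                  if c >= cutoff and spread[g] >= min_texts)

docs = ["as a result of the experiment",
        "as a result of the analysis",
        "as a result of the survey"]
print(lexical_bundles(docs, n=4, min_per_million=1, min_texts=3))
```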

    Speech verification for computer assisted pronunciation training

    Get PDF
    Computer assisted pronunciation training (CAPT) is an approach that uses computer technology and computer-based resources in teaching and learning pronunciation. It is part of computer assisted language learning (CALL) technology, which has been widely applied to online learning platforms in recent years. This thesis deals with one of the central tasks in CAPT, i.e. speech verification. The goal is to provide a framework that identifies pronunciation errors in the speech data of second language (L2) learners and generates feedback with information and instruction for error correction. Furthermore, the framework is intended to support adaptation to new L1-L2 language pairs with minimal adjustment and modification. The central result is a novel approach to L2 speech verification which combines modern language technologies with linguistic expertise. For pronunciation verification, we select a set of L2 speech data, create alias phonemes from the errors annotated by linguists, then train an acoustic model with mixed L2 and gold-standard data and perform HTK (Hidden Markov Toolkit) phoneme recognition to identify the error phonemes. For prosody verification, FD-PSOLA (frequency-domain pitch-synchronous overlap-add) and dynamic time warping are both applied to verify differences in duration, pitch, and stress. Feedback is generated for both verifications. Our feedback is presented to learners not only visually, as in other existing CAPT systems, but also perceptually, by synthesizing the learner's own audio; for prosody verification, for example, the gold-standard prosody is transplanted onto the learner's own voice. The framework is self-adaptable under semi-supervision, and requires only a certain amount of mixed gold-standard and annotated L2 speech data for bootstrapping. Verified speech data is validated by linguists, annotated in case of wrong verification, and used in the next iteration of training. The Mary Annotation Tool (MAT) is developed as an open-source component of MARYTTS for both annotating and validating. To deal with uncertain pauses and interruptions in L2 speech, the silence model in HTK is also adapted, and used in all components of the framework where forced alignment is required. Various evaluations are conducted that give us insight into the applicability and potential of our CAPT system. The pronunciation verification shows high accuracy in both precision and recall, and encourages us to acquire more error-annotated L2 speech data to enhance the trained acoustic model. To test the effect of feedback, a progressive evaluation is carried out; it shows that our perceptual feedback helps learners realize errors which they could not otherwise observe from visual feedback and textual instructions alone. In order to improve the user interface, a questionnaire is also designed to collect the learners' experiences and suggestions.
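
    Prosody verification rests on comparing the learner's pitch contour with the gold standard. Below is a minimal dynamic-time-warping sketch of that comparison; the contours are invented numbers, and a real system would extract F0 tracks from audio rather than hard-code them.

```python
# Minimal sketch of dynamic time warping (DTW) between two pitch contours:
# the cumulative-cost table gives the best alignment of the two sequences.
def dtw(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[len(a)][len(b)]

learner_f0 = [110, 118, 130, 128, 120]  # hypothetical pitch values (Hz)
gold_f0 = [112, 125, 140, 135, 122]
print(dtw(learner_f0, gold_f0))  # larger distance = larger prosodic deviation
```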

    Tune your brown clustering, please

    Get PDF
    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration, and the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering, in the form of a theoretical model of Brown clustering utility, in order to assist hyper-parameter tuning. This model is then evaluated empirically in two sequence-labelling tasks over two text types. We explore the dynamic between input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
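
    The quantity at stake in such tuning is the average mutual information between adjacent class bigrams, which Brown clustering greedily maximizes. The sketch below computes it for a given word-to-class assignment; the toy corpus and clustering are hypothetical.

```python
# Minimal sketch: average mutual information (AMI) of adjacent class bigrams,
# the objective that Brown clustering maximizes for a fixed number of classes.
import math
from collections import Counter

def class_ami(tokens, cls):
    """AMI of adjacent class pairs, given tokens and a word -> class map."""
    pairs = [(cls[a], cls[b]) for a, b in zip(tokens, tokens[1:])]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(c1 for c1, _ in pairs)
    right = Counter(c2 for _, c2 in pairs)
    ami = 0.0
    for (c1, c2), c in joint.items():
        p = c / n
        ami += p * math.log2(p / ((left[c1] / n) * (right[c2] / n)))
    return ami

tokens = "the cat sat on the mat the dog sat on the rug".split()
clusters = {"the": 0, "cat": 1, "dog": 1, "mat": 1, "rug": 1, "sat": 2, "on": 3}
print(class_ami(tokens, clusters))  # higher AMI = more predictive classes
```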

    CLARIN. The infrastructure for language resources

    Get PDF
    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals, and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors, representing fields from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).