37 research outputs found

    Evaluation in natural language processing

    Get PDF
    European Summer School in Logic, Language and Information (ESSLLI 2007), Trinity College Dublin, Ireland, 6–17 August 2007

    Low-resource speech translation

    Get PDF
    We explore the task of speech-to-text translation (ST), where speech in one language (source) is converted to text in a different language (target). Traditional ST systems go through an intermediate step in which the source language speech is first converted to source language text using an automatic speech recognition (ASR) system, and that text is then converted to target language text using a machine translation (MT) system. However, this pipeline-based approach is impractical for unwritten languages spoken by millions of people around the world, leaving them without access to free and automated translation services such as Google Translate. The lack of such translation services can have important real-world consequences. For example, in the aftermath of a disaster, easily available translation services can help better coordinate relief efforts. How can we expand the coverage of automated ST systems to include scenarios which lack source language text? In this thesis we investigate one possible solution: we build ST systems to directly translate source language speech into target language text, thereby forgoing the dependency on source language text. To build such a system, we use only speech data paired with text translations as training data. We also specifically focus on low-resource settings, where we expect at most tens of hours of training data to be available for unwritten or endangered languages. Our work can be broadly divided into three parts. First, we explore how we can leverage prior work to build ST systems. We find that neural sequence-to-sequence models are an effective and convenient method for ST, but produce poor-quality translations when trained in low-resource settings. In the second part of this thesis, we explore methods to improve the translation performance of our neural ST systems which do not require labeling additional speech data in the low-resource language, a potentially tedious and expensive process.
Instead we exploit labeled speech data for high-resource languages, which is widely available and relatively easy to obtain. We show that pretraining a neural model with ASR data from a high-resource language, different from both the source and target ST languages, improves ST performance. In the final part of the thesis, we study whether ST systems can be used to build applications which have traditionally relied on the availability of ASR systems, such as information retrieval, clustering audio documents, or question answering. We build proof-of-concept systems for two downstream applications: topic prediction for speech and cross-lingual keyword spotting. Our results indicate that low-resource ST systems can still outperform simple baselines for these tasks, leaving the door open for further exploratory work. This thesis provides, for the first time, an in-depth study of neural models for the task of direct ST across a range of training data settings on a realistic multi-speaker speech corpus. Our contributions include a set of open-source tools to encourage further research.
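The transfer recipe described in this abstract (pretrain a speech encoder on high-resource ASR data, then reuse it to warm-start a low-resource ST model whose decoder is trained from scratch) can be sketched schematically. The toy below only illustrates the parameter-transfer idea; the model, function names, and shapes are invented stand-ins, not the thesis's actual neural architecture.

```python
# Toy illustration of cross-lingual encoder transfer for direct speech
# translation (ST): an encoder "pretrained" on high-resource ASR data
# initialises a low-resource ST model. All names/shapes are illustrative.

import random

def make_params(n, seed):
    """Stand-in for a randomly initialised parameter vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def pretrain_on_asr(encoder_params):
    """Pretend ASR training: nudge parameters toward 'trained' values."""
    return [p * 0.5 for p in encoder_params]

def build_st_model(pretrained_encoder=None):
    """ST model = encoder + decoder; the encoder may be warm-started,
    while the decoder is always trained from scratch."""
    if pretrained_encoder is None:
        pretrained_encoder = make_params(4, seed=1)
    decoder = make_params(4, seed=2)
    return {"encoder": pretrained_encoder, "decoder": decoder}

# Pretrain on a high-resource ASR language, then transfer to low-resource ST.
asr_encoder = pretrain_on_asr(make_params(4, seed=1))
st_model = build_st_model(pretrained_encoder=asr_encoder)
```

In a real system the same pattern applies at the level of network weight tensors: the ASR encoder checkpoint is loaded into the ST encoder before fine-tuning on the small speech-translation corpus.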

    Special Libraries, April 1977

    Get PDF
    Volume 68, Issue 4

    Beyond topic-based representations for text mining

    Get PDF
    A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While all kinds of online information are useful in general, textual information is especially important: it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification or information retrieval, since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are of a topical nature that can be captured by words alone. For example, determining the native language of an English essay writer depends on more than just word choice. Methods that compete with topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions to generate and analyze a large space of non-topical features. First, structural parse tree features are based solely on the structural properties of a parse tree, ignoring all of the syntactic categories in the tree. An important advantage of these "skeletons" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences.
These edit features are especially useful since they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis (CCLA) is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features.
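The comparative-feature idea, scoring a document by how its word usage deviates from a reference corpus, can be sketched in a few lines. The weighting scheme here (a smoothed log-frequency ratio) is an illustrative stand-in, not the thesis's actual SyntacticDiff formulation.

```python
# Minimal sketch of comparative, non-topical features: score each word
# by how much its relative frequency in a document deviates from a
# reference corpus. The log-ratio weighting is illustrative only.

import math
from collections import Counter

def relative_freqs(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def diff_features(doc_tokens, ref_tokens):
    """Weighted word-difference features: log of smoothed frequency ratio."""
    doc_f = relative_freqs(doc_tokens)
    ref_f = relative_freqs(ref_tokens)
    eps = 1e-6  # smoothing so words unseen on one side don't divide by zero
    vocab = set(doc_f) | set(ref_f)
    return {w: math.log((doc_f.get(w, 0.0) + eps) / (ref_f.get(w, 0.0) + eps))
            for w in vocab}

doc = "the very very big dog".split()
ref = "the big dog sat on the mat".split()
feats = diff_features(doc, ref)
# Words over-represented in the document get positive weights ('very');
# words present only in the reference get negative weights ('mat').
```

Because the weights are tied to individual words, each feature can be read off directly as "this document uses word w more (or less) than the reference does", which is what makes such features easy to explain.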

    Slava Ukraini: a psychobiographical case study of Volodymyr Zelenskyy’s public diplomacy discourse

    Get PDF
    This dissertation examined Volodymyr Zelenskyy's public diplomacy during the Russo-Ukrainian conflict. Zelenskyy's discourse emphasized his action-oriented traits, Ukrainian identity, and nationalism. The study employed LTA and LIWC-22 for natural language processing analyses of Zelenskyy's public speeches and diplomatic discourse. According to the data, Zelenskyy demonstrated agency, adaptability, collaboration, and positive language patterns, suggesting confidence and optimism. In addition, the research emphasizes how domestic and international factors influence state behavior, as well as how political demands and cultural, historical, and political factors influence Zelenskyy's decision-making. This dissertation sheds light on a global leader's psychobiographical characteristics, beliefs, and motivations during a crisis, thereby advancing leadership and conflict resolution research. By incorporating transformational leadership theory into LTA, researchers can gain a better understanding of effective leadership and how it develops strong connections with followers. LTA, LIWC-22, and qualitative coding were used to identify themes and trends in Zelenskyy's speeches. The findings show Zelenskyy's linguistic and leadership traits in public diplomacy, emphasizing the importance of understanding leaders' traits in foreign policy decision-making. Psychobiographical profiles aid scholars in understanding a leader's political views on conflict, their ability to influence events, and how they accomplish their objectives. As a result, perceptions of the state as an actor, as well as foreign policy decisions, must consider the effect of individual leaders. Conclusions include the Brittain-Hale Foreign Policy Analysis Model, based on a heuristic qualitative coding framework.

    Evaluating automated and hybrid neural disambiguation for African historical named entities

    Get PDF
    Documents detailing South African history contain ambiguous names. Names may be ambiguous because different people share the same name or because the same person is referred to by multiple different names. Thus, when searching for or attempting to extract information about a particular person, the name used may affect the results. This problem may be alleviated by using a Named Entity Disambiguation (NED) system to disambiguate names by linking them to a knowledge base. In recent years, transformer-based language models have led to improvements in NED systems. Furthermore, multilingual language models have shown the ability to learn concepts across languages, reducing the amount of training data required in low-resource languages. Thus, a multilingual language model-based NED system was developed to disambiguate people's names within a historical South African context using documents written in English and isiZulu from the Five Hundred Year Archive (FHYA). The multilingual language model-based system substantially improved on a probability-based baseline and achieved a micro F1-score of 0.726, while the entity linking component was able to link 81.9% of the mentions to the correct entity. However, the system's performance on documents written in isiZulu was significantly lower than on documents written in English. The system was therefore augmented with handcrafted rules, which yielded a small but significant improvement in performance compared to the unaugmented NED system.
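The probability-based baseline that the neural system is compared against is typically of the following shape: link each surface name to the entity it most often referred to in the training data. The sketch below shows one plausible version of such a baseline; the training pairs and entity identifiers are invented for illustration, not taken from the FHYA data.

```python
# Sketch of a probability-based NED baseline: link each name to the
# entity it most frequently co-occurred with in training. Example data
# (names and entity IDs) is invented for illustration.

from collections import Counter, defaultdict

def train_prior(linked_mentions):
    """Count (surface name -> knowledge-base entity) pairs."""
    counts = defaultdict(Counter)
    for name, entity in linked_mentions:
        counts[name][entity] += 1
    return counts

def link(name, counts):
    """Predict the most frequent entity for a name; None if unseen."""
    if name not in counts:
        return None
    return counts[name].most_common(1)[0][0]

train = [
    ("Shaka", "Shaka_kaSenzangakhona"),
    ("Shaka", "Shaka_kaSenzangakhona"),
    ("Shaka", "Shaka_Zulu_film"),       # same name, different entity
    ("Dingane", "Dingane_kaSenzangakhona"),
]
prior = train_prior(train)
```

A baseline like this captures how far raw mention statistics alone go; the gap between it and the multilingual language model-based system is what the reported micro F1 improvement measures.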

    The Training Of Lay Persons For Careproviding Ministries Among Hispanics

    Get PDF
    This study attempted to develop a program that would provide Hispanic laypersons with skills and techniques for careproviding. Laypersons should be able to identify their gifts and particular ministries and use behavioral methods to minister to the members of the church and their respective community. Laypersons have been well trained to do traditional evangelism, but no one has developed a program to train Hispanic laypersons in the careproviding ministry. Many pastors have multichurch districts, so their time to counsel and care for the flock is limited; in addition, pastors are not usually accessible. Once trained, laypersons who live in the immediate area could be a significant help to the pastor and to those who need someone to minister to them.

    Meaning in Distributions: A Study on Computational Methods in Lexical Semantics

    Get PDF
    This study investigates the connection between lexical items' distributions and their meanings from the perspective of computational distributional operations. When applying computational methods in meaning-related research, it is customary to refer to the so-called distributional hypothesis, according to which differences in distributions and meanings are mutually correlated. However, making use of such a hypothesis requires critical explication of the concept of distribution and plausible arguments for why any particular distributional structure is connected to a particular meaning-related phenomenon. In broad strokes, the present study seeks to chart the major differences in how the concept of distribution is conceived in the structuralist/autonomous and usage-based/functionalist theoretical families of contemporary linguistics. The two theoretical positions on distributions are studied to identify how meanings could enter them as enabling or constraining factors. The empirical part of the study comprises two case studies. In the first, three pairs of antonymous adjectives (köyhä/rikas 'poor/rich', sairas/terve 'sick/healthy' and vanha/nuori 'old/young') are studied distributionally. Very narrow bag-of-words vector representations of distributions show how the dimensions on which relevant distributional similarities are based already conflate an unexpectedly varied range of linguistic phenomena, spanning from syntax-oriented conceptual constraint to connotations, pragmatic patterns and affectivity. The results thus simultaneously corroborate the distributional hypothesis and challenge its over-generalized, uncritical applicability: for the study of meaning, distributional and semantic spaces cannot be treated as analogous by default. In the second case study, a distributional operation is purposefully built to answer a research question about the historical development of Finnish social law terminology in the period 1860–1910.
Using a method based on interlinked collocation networks, the study shows how the term vaivainen ('pauper, beggar, measly') receded from the prestigious legal and administrative registers during the studied period. Corroborating some of the findings of the previous parts of this dissertation, the case study shows how structures found in distributional representations cannot be satisfactorily explained without relying on semantic, pragmatic and discoursal interpretations. The analysis confirms the timeline of the studied word use in the given register. It also shows how distributional methods based on networked patterns of co-occurrence highlight incomparable structures of very different natures and skew towards the frequent occurrence types prevalent in the data.
    Using statistical models compiled from large text corpora, modern computational methods perform almost flawlessly many tasks that require understanding the meanings of words. From the standpoint of linguistic methodology, it is therefore interesting how well such methods suit the linguistic study of the meanings of linguistic structures. This dissertation approaches the question from the perspective of lexical semantics and seeks to describe, both theoretically and empirically, what kinds of meaning computational methods based purely on word sequences are able to capture. The dissertation consists of two case studies. In the first, three antonymous adjective pairs are studied by means of a vector space model compiled from the Suomi24 corpus. The results show how even very narrow sequential contexts contain information not only about conceptual meanings but also, among other things, about their connotations and affectivity. The picture of meaning produced by the sequential context is, however, unpredictable in its coverage, and the usage patterns that are frequent in the research data clearly affect which features of meaning become visible. The second case study traces the history of a social law term, vaivainen, in the late nineteenth century using the historical digital newspaper collection of the National Library of Finland. Co-occurrence networks are used to investigate how the word disappeared from legal language, by identifying in the data a structure corresponding to the administrative-legal register and following the position of vaivainen within it. The co-occurrence networks used as a method do not, however, purely represent any single register, but blend features from the various categories with which language use has been described, for example, in text linguistics. The densest networks are formed by the combined effect of registers, genres, text types and lexical cohesion. The results of this case study suggest that this is a general property of many similar methods, including common topic models.
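The narrow bag-of-words representations used in the first case study can be illustrated concretely: each target word is represented by counts of the words occurring within a small window around it, and distributional similarity is measured by cosine. The corpus, window size, and word choices below are invented stand-ins (the actual study used Finnish adjectives and the Suomi24 corpus).

```python
# Sketch of narrow bag-of-words context vectors and cosine similarity.
# Corpus and window size are illustrative only.

import math
from collections import Counter

def context_vector(target, tokens, window=2):
    """Counts of words within +/-window positions of each target occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ("the rich man lives well . the poor man lives badly . "
          "the old man lives well").split()
rich, poor, old = (context_vector(w, corpus) for w in ("rich", "poor", "old"))
# Antonyms such as rich/poor share most of their contexts, so their
# vectors come out highly similar despite the opposed meanings.
```

This is also why such representations conflate heterogeneous phenomena: whatever happens to co-occur in the window (syntactic frames, connotations, affective markers) ends up on the same dimensions of the vector.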

    Text complexity and text simplification in the crisis management domain

    Get PDF
    Because emergency situations can lead to substantial losses, both financial and in terms of human lives, it is essential that texts used in a crisis situation be clearly understandable. This thesis is concerned with the study of the complexity of the crisis management sub-language and with methods to produce new, clear texts and to rewrite pre-existing crisis management documents which are too complex to be understood. In doing so, this interdisciplinary study makes several contributions to the crisis management field. First, it contributes to knowledge of the complexity of the texts used in the domain by analysing, in a novel corpus of crisis management documents, the presence of a set of written language complexity issues derived from the psycholinguistic literature. Second, since the text complexity analysis shows that crisis management documents indeed exhibit many text complexity issues, the thesis adapts to English a set of controlled language writing guidelines which, when applied to the crisis management language, reduce its complexity and ambiguity, leading to clear text documents. Third, since low quality of communication can have fatal consequences in emergency situations, the proposed controlled language guidelines and a set of texts which were rewritten according to them are evaluated from multiple points of view. To achieve that, the thesis both applies existing evaluation approaches and develops new methods which are more appropriate for the task. These are used in two evaluation experiments: evaluation on extrinsic tasks and evaluation of users' acceptability.
The evaluations on extrinsic tasks (evaluating the impact of the controlled language on text complexity, reading comprehension under stress, manual translation, and machine translation tasks) show a positive impact of the controlled language on simplified documents and thus ensure the quality of the resource. The evaluation of users' acceptability contributes additional findings about manual simplification and helps to determine directions for future implementation. The thesis also gives insight into reading comprehension, machine translation, and cross-language adaptability, and provides original contributions to machine translation, controlled languages, and natural language generation evaluation techniques, which make it valuable for several scientific fields, including Linguistics, Psycholinguistics, and a number of different sub-fields of NLP.
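Controlled-language writing guidelines of the kind developed in such work lend themselves to mechanical checking. The sketch below shows the general shape of such a checker; the two rules (a sentence-length cap and a crude passive-voice heuristic) are illustrative stand-ins, not the thesis's actual guideline set, and the threshold is invented.

```python
# Sketch of a controlled-language rule checker. Both rules and the
# MAX_WORDS threshold are illustrative, not the thesis's guidelines.

import re

MAX_WORDS = 20  # illustrative cap on sentence length

def check_sentence(sentence):
    """Return a list of controlled-language rule violations for one sentence."""
    violations = []
    if len(sentence.split()) > MAX_WORDS:
        violations.append("too-long")
    # Crude passive-voice marker: a form of 'be' followed by an -ed word.
    if re.search(r"\b(is|are|was|were|be|been)\s+\w+ed\b", sentence.lower()):
        violations.append("passive-voice")
    return violations

ok = check_sentence("Close the door.")
bad = check_sentence("The door must be closed by the last person.")
```

Rewriting a document to conform to the guidelines then amounts to iterating until `check_sentence` reports no violations, which is what makes such guidelines evaluable on downstream tasks like machine translation.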

    Use and Evaluation of Controlled Languages in Industrial Environments and Feasibility Study for the Implementation of Machine Translation

    Get PDF
    This research is part of the doctoral studies programme "La traducción y la sociedad del conocimiento" at the University of Valencia, and in particular of its research line in translation technology, terminology and localisation. The dissertation arises from the need to establish a research methodology and to provide empirical results on the development, implementation and evaluation of controlled languages in technical documentation and their effect on both the original texts and the translations of these documents. Thus, the aim has been to develop a methodology to assess the impact of controlled languages on the production of technical documentation in industrial contexts, and more specifically in technical documentation for the vehicle. The impact has resulted in improved automatic translatability, a concept discussed at length in Chapter 4, as well as in improved quality of the target texts.