
    Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities

    Breaking domain names such as "openresearch" into their component words "open" and "research" is important for applications like text-to-speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) that takes characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names from a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
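    As a minimal sketch of the character-tagging view of segmentation (the B/I tagging scheme and the pre-training data construction below are assumptions for illustration, not the paper's exact setup), per-character boundary tags can be decoded into words, and pre-training examples can be built by concatenating entity names:

```python
# Decode per-character boundary tags into words: 'B' starts a new word.
def segment(domain, tags):
    words, current = [], ""
    for ch, tag in zip(domain, tags):
        if tag == "B" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

# Assumed pre-training construction: concatenate entity names and derive
# the gold boundary tags from the known name boundaries.
def make_pretraining_example(entity_names):
    text = "".join(entity_names)
    tags = []
    for name in entity_names:
        tags.extend(["B"] + ["I"] * (len(name) - 1))
    return text, tags

print(segment("openresearch", list("BIIIBIIIIIII")))  # ['open', 'research']
```

    A learned tagger would predict the tags; the decoding step above is independent of how they were produced.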

    Enhancing Sequence-to-Sequence Text-to-Speech with Morphology


    Augmenting Poetry Composition with Verse by Verse

    We describe Verse by Verse, our experiment in augmenting the creative process of writing poetry with an AI. We have created a group of AI poets, styled after various classic American poets, that can offer generated lines of verse as suggestions while a user is composing a poem. In this paper, we describe the underlying system behind these suggestions: a generative model tasked with generating a large corpus of lines of verse offline, which are then stored in an index, and a dual-encoder model tasked with recommending the next possible set of verses from that index given the previous line of verse.
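    The offline-index-plus-retrieval pipeline can be sketched as follows; the toy bag-of-words encoder here stands in for the learned dual encoders and is an assumption for illustration, not the system's actual model:

```python
from collections import Counter
import math

def encode(text):
    """Toy stand-in for a learned encoder: a unit-normalized
    bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def score(query_vec, cand_vec):
    """Dot product between two sparse vectors."""
    return sum(query_vec[w] * cand_vec.get(w, 0.0) for w in query_vec)

# Offline: encode and index a corpus of pre-generated verses.
corpus = [
    "the sea was calm beneath the silver moon",
    "a raven croaked upon the chamber door",
    "and miles to go before I sleep",
]
index = [(v, encode(v)) for v in corpus]

# Online: recommend the candidate most similar to the user's previous line.
prev = encode("I heard a tapping at my door")
best = max(index, key=lambda item: score(prev, item[1]))
print(best[0])  # "a raven croaked upon the chamber door"
```

    The key property of the dual-encoder setup is that candidate encodings are computed once offline, so serving reduces to a nearest-neighbor lookup.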

    Assessing people with visual impairments’ access to information, awareness and satisfaction with high-tech assistive technology

    Assistive technology (AT) devices are designed to help people with visual impairments (PVIs) perform activities that would otherwise be difficult or impossible. Devices specifically designed to assist PVIs by attempting to restore sight or substitute it with another sense have a very low uptake rate. This study, conducted in England, aimed to investigate why this is the case by assessing access to knowledge of, awareness of, and satisfaction with AT in general and with sensory restoration and substitution devices in particular. From a sample of 25 PVIs, aged 21 to 68, results showed that participants knew where to find AT information; however, health care providers were not the main source of this information. Participants reported good awareness of different ATs, and of technologies they would not use, but poor awareness of specific sensory substitution and restoration devices. Only three participants reported using AT, each with a different device and varying levels of satisfaction. The results suggest a possible breakdown in communication between health care providers and PVIs, and a dissociation between reported AT awareness and reported access to AT information. Moreover, awareness of sensory restoration and substitution devices is poor, which may explain the limited use of such technology.

    Normalization of Lithuanian Text Using Regular Expressions

    Text normalization is an integral part of any text-to-speech synthesis system. A natural-language text contains elements such as numbers, dates, and abbreviations that belong to other semiotic classes. These are called non-standard words (NSWs) and need to be expanded into ordinary words; for this purpose, the semiotic class of each NSW must be identified. This work presents a taxonomy of semiotic classes adapted to the Lithuanian language, together with sets of regular-expression rules for detecting and expanding NSWs. Experiments with three completely different data sets were performed and accuracy was assessed. Causes of errors are explained and recommendations are given for the development of text normalization rules.
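    The detect-then-expand mechanism can be sketched with a pair of toy rules; the patterns and expansions below are English-language placeholders for illustration, not the paper's actual Lithuanian rule sets or semiotic classes:

```python
import re

# Each rule pairs a detector pattern (identifying the semiotic class)
# with an expander that rewrites the NSW as ordinary words.
RULES = [
    # cardinal numbers (toy expansion, single digits only)
    (re.compile(r"\b(\d)\b"),
     lambda m: ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine"][int(m.group(1))]),
    # abbreviations
    (re.compile(r"\betc\."), lambda m: "et cetera"),
]

def normalize(text):
    """Apply every rule in order; re.sub accepts a function replacement."""
    for pattern, expand in RULES:
        text = pattern.sub(expand, text)
    return text

print(normalize("Chapter 3 covers dates, numbers, etc."))
# Chapter three covers dates, numbers, et cetera
```

    A full system would order rules by semiotic class and handle inflection, which matters for a morphologically rich language like Lithuanian.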

    Automatic Normalization of Finnish Social Media Text

    Social media provides huge amounts of potential data for natural language processing, but using this data can be challenging. Finnish social media text differs greatly from standard Finnish, and models trained on standard data may not handle the differences adequately. Text normalization is the process of converting non-standard language into its standardized form. It provides a way both to process non-standard data with standard natural language processing tools and to obtain more data for training new tools for different tasks. In this thesis I experiment with bidirectional recurrent neural network (BRNN) models and models based on the ByT5 foundation model, as well as the Murre normalizer, to see whether existing tools are suitable for normalizing Finnish social media text. I manually normalize a small set of data from the Ylilauta and Suomi24 corpora to use as a test set. For training the models I use the Samples of Spoken Finnish corpus and Wikipedia data with added synthetic noise. The results of this thesis show that no existing tools are suitable for normalizing Finnish written on social media, and that there is a lack of suitable data for training models for this task. The ByT5-based models perform better than the BRNN models.
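    Generating (noisy, clean) training pairs by corrupting standard text can be sketched as below; the specific noise operations (character drops and duplications) are assumptions for illustration, not the thesis's exact noise model:

```python
import random

def add_noise(text, rng, p=0.1):
    """Corrupt clean text character by character: with probability p/2
    drop the character, with probability p/2 duplicate it."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p / 2:
            continue              # drop
        elif r < p:
            out.append(ch + ch)   # duplicate
        else:
            out.append(ch)
    return "".join(out)

# A (noisy, clean) pair: the normalization model learns the reverse mapping.
rng = random.Random(0)
clean = "standard text from Wikipedia"
pair = (add_noise(clean, rng), clean)
print(pair)
```

    Training on such pairs teaches a model to map corrupted input back to standard text, supplementing scarce manually normalized data.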