12 research outputs found
Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Breaking domain names such as "openresearch" into the component words "open"
and "research" is important for applications like Text-to-Speech synthesis and
web
search. We link this problem to the classic problem of Chinese word
segmentation and show the effectiveness of a tagging model based on Recurrent
Neural Networks (RNNs) using characters as input. To compensate for the lack of
training data, we propose a pre-training method on concatenated entity names in
a large knowledge database. Pre-training improves the model by 33% and brings
the sequence accuracy to 85%.
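The tagging approach can be sketched at the decoding stage: a character-level tagger emits one boundary label per character, and those labels are decoded into words. The B/I scheme below is an assumed illustration rather than the paper's exact tag set, and the trained RNN is replaced here by hand-written labels.

```python
def decode_segments(text: str, tags: list[str]) -> list[str]:
    """Split `text` into words using per-character tags:
    'B' marks the first character of a word, 'I' a continuation."""
    assert len(text) == len(tags)
    words, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B" and current:
            words.append(current)  # close the previous word
            current = ch
        else:
            current += ch
    if current:
        words.append(current)
    return words

# Labels a trained tagger might emit for "openresearch":
print(decode_segments("openresearch", list("BIIIBIIIIIII")))
# → ['open', 'research']
```

In the paper's setup, an RNN predicts these per-character labels; pre-training on concatenated knowledge-graph entity names gives the tagger plausible word boundaries before it ever sees URL data.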
Augmenting Poetry Composition with Verse by Verse
We describe Verse by Verse, our experiment in augmenting the creative process
of writing poetry with an AI. We have created a group of AI poets, styled after
various classic American poets, that can suggest generated lines of verse while
a user is composing a poem. In this paper, we describe the underlying system
that offers these suggestions. It includes a generative model, tasked with
generating a large corpus of verse lines offline that are then stored in an
index, and a dual-encoder model, tasked with recommending the next possible
set of verses from our index given the previous line of verse.
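The retrieval step of such a dual-encoder setup can be sketched as follows. The real system scores pre-computed verse embeddings produced by a learned neural encoder; here a toy bag-of-words embedding and a tiny index stand in, so the vocabulary, vectors, and verses are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy vocabulary standing in for a learned encoder's feature space.
VOCAB = ["night", "sea", "light", "heart", "star"]

def embed(line: str) -> list[float]:
    """Bag-of-words stand-in for a neural text encoder."""
    counts = Counter(line.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(prev_line: str, index: list[str], k: int = 2) -> list[str]:
    """Return the k index verses most similar to the previous line."""
    q = embed(prev_line)
    return sorted(index, key=lambda v: cosine(q, embed(v)), reverse=True)[:k]

index = ["the sea at night", "a light upon the hill", "my heart is a star"]
print(recommend("dark night upon the sea", index, k=1))
# → ['the sea at night']
```

The design choice the abstract describes is the same split: generation and indexing happen offline, so at composition time only this cheap nearest-neighbor lookup runs.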
Assessing people with visual impairments' access to information, awareness and satisfaction with high-tech assistive technology
Assistive technology (AT) devices are designed to help people with visual impairments (PVIs) perform activities that would otherwise be difficult or impossible. Devices specifically designed to assist PVIs by attempting to restore sight or substitute it with another sense have a very low uptake rate. This study, conducted in England, aimed to investigate why this is the case by assessing accessibility to knowledge, awareness, and satisfaction with AT in general and with sensory restoration and substitution devices in particular. From a sample of 25 PVIs, ranging from 21 to 68 years old, results showed that participants knew where to find AT information; however, health care providers were not the main source of this information. Participants reported good awareness of different ATs, and of technologies they would not use, but reported poor awareness of specific sensory substitution and restoration devices. Only three participants reported using AT, each with different devices and varying levels of satisfaction. The results from this study suggest a possible breakdown in communication between health care providers and PVIs, and a dissociation between reported AT awareness and reported access to AT information. Moreover, awareness of sensory restoration and substitution devices is poor, which may explain the limited use of such technology.
Normalization of Lithuanian Text Using Regular Expressions
Text Normalization is an integral part of any text-to-speech synthesis
system. In a natural language text, there are elements such as numbers, dates,
abbreviations, etc. that belong to other semiotic classes. They are called
non-standard words (NSW) and need to be expanded into ordinary words. For this
purpose, it is necessary to identify the semiotic class of each NSW. This work
presents a taxonomy of semiotic classes adapted to the Lithuanian language.
Sets of rules based on regular expressions are created for detecting and
expanding NSWs. Experiments with three completely different data sets were
performed and the accuracy was assessed. Causes of errors are explained and
recommendations are given for the development of text normalization rules.
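The rule mechanism, pairing a regex detection pattern with an expansion function, can be sketched in miniature. The paper's rule sets target Lithuanian; the toy rule below reads out digit strings in English purely to illustrate the pattern-plus-expander structure, and is not one of the paper's rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_digit_string(m: re.Match) -> str:
    # Read each digit aloud; a real normalizer would apply full
    # cardinal/ordinal grammars for the target language.
    return " ".join(ONES[int(d)] for d in m.group())

# Each rule is (detection pattern, expansion function); a real system
# would hold one such list per semiotic class (dates, abbreviations, ...).
RULES = [
    (re.compile(r"\d+"), expand_digit_string),
]

def normalize(text: str) -> str:
    for pattern, expand in RULES:
        text = pattern.sub(expand, text)
    return text

print(normalize("Room 42 is open"))  # → "Room four two is open"
```

Ordering matters in such pipelines: more specific patterns (e.g. dates) must run before generic digit rules, or the generic rule consumes their input first.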
Automatic Normalization of Finnish Social Media Text (Suomenkielisen sosiaalisen median tekstin automaattinen normalisointi)
Social media provides huge amounts of potential data for natural language processing but using this data may be challenging. Finnish social media text differs greatly from standard Finnish and models trained on standard data may not be able to adequately handle the differences.
Text normalization is the process of processing non-standard language into its standardized form. It provides a way to both process non-standard data with standard natural language processing tools and to get more data for training new tools for different tasks.
In this thesis I experiment with bidirectional recurrent neural network models and models based on the ByT5 foundation model, as well as the Murre normalizer to see if existing tools are suitable for normalizing Finnish social media text. I manually normalize a small set of data from the Ylilauta and Suomi24 corpora to use as a test set. For training the models I use the Samples of Spoken Finnish corpus and Wikipedia data with added synthetic noise.
The results of this thesis show that there are no existing tools suitable for normalizing Finnish written on social media, and that there is a lack of suitable data for training models for this task. The ByT5-based models perform better than the BRNN models.
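The "added synthetic noise" idea mentioned above can be sketched as corrupting standard text to create (noisy, clean) training pairs for a normalizer. The specific operations below, a colloquial word substitution and random word-final character drops, are assumed examples for illustration, not the thesis's actual noise model.

```python
import random

# A couple of common Finnish colloquial shortenings (illustrative only).
SUBSTITUTIONS = {"minä": "mä", "sinä": "sä"}

def add_noise(sentence: str, drop_prob: float = 0.1, seed: int = 0) -> str:
    """Corrupt standard text so (noisy, clean) pairs can be used to
    train a normalization model in the clean → noisy-input direction."""
    rng = random.Random(seed)  # seeded for reproducible corpora
    words = []
    for w in sentence.split():
        w = SUBSTITUTIONS.get(w, w)
        # Randomly drop a word-final character, mimicking informal spelling.
        if len(w) > 3 and rng.random() < drop_prob:
            w = w[:-1]
        words.append(w)
    return " ".join(words)

clean = "minä olen kotona"
noisy = add_noise(clean)
pair = (noisy, clean)  # training example: model learns to map noisy → clean
print(pair)
```

Character-level models like ByT5 are a natural fit for data built this way, since the corruptions operate below the word level and no fixed subword vocabulary has to cover the noisy forms.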