10 research outputs found
Speech and hand transcribed retrieval
This paper describes the issues and preliminary work involved
in the creation of an information retrieval system that will
manage the retrieval from collections composed of both speech
recognised and ordinary text documents. In previous work, it
has been shown that because of recognition errors, ordinary
documents are generally retrieved in preference to recognised
ones. Means of correcting or eliminating this observed bias are
the subject of this paper. Initial ideas and some preliminary
results are presented.
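The paper's own correction method is not detailed in the abstract. As a purely illustrative sketch, one common remedy for collection-dependent score bias is to normalise retrieval scores per collection (e.g. z-score normalisation) before merging the ranked lists; the function names and data below are hypothetical:

```python
def z_normalise(scores):
    """Z-score normalise a list of retrieval scores so that scores from
    differently-biased collections become directly comparable."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5 or 1.0  # guard against a zero standard deviation
    return [(s - mean) / std for s in scores]

def merge_runs(text_run, asr_run):
    """Merge two ranked runs of (doc_id, score) pairs after normalising
    each collection's scores independently, then re-rank the union."""
    merged = []
    for run in (text_run, asr_run):
        docs = [d for d, _ in run]
        norm = z_normalise([s for _, s in run])
        merged.extend(zip(docs, norm))
    return sorted(merged, key=lambda x: x[1], reverse=True)
```

After normalisation a recognised document can outrank ordinary-text documents whose raw scores were inflated, which is the effect a bias correction of this kind aims for.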
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Integrating Medical Ontology and Pseudo Relevance Feedback For Medical Document Retrieval
The purpose of this thesis is to improve the accuracy of locating relevant documents within large amounts of Electronic Medical Data (EMD). The goal of this research is to propose a new way of using medical ontology to provide an easier and more reliable approach for patients to better understand their diseases, and to help doctors find and further improve possible methods of diagnosis and treatment. The empirical studies were based on the dataset provided by CLEF, focused on health care data. In this research, I used Information Retrieval to find and obtain relevant information within the large data sets provided by CLEF. I then used the ranking functionality of the Terrier platform to calculate and evaluate the matching documents in the collection. BM25 was used as the base ranking model to retrieve the results, together with a Pseudo Relevance Feedback weighting model to retrieve information regarding patients' health histories and medical records in order to find more accurate results. I then used the Unified Medical Language System (UMLS) to index the queries when searching for health-related documents. The UMLS software links the computer system with health and biomedical terms and vocabularies, acting as a set of classification tools; it works as a dictionary for patients by translating medical terms. In future work, I would like to use medical ontology to create a relationship between the medical documents and my retrieved results.
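For context on the retrieval pipeline described above, here is a minimal sketch of the standard BM25 weighting formula. This is the textbook formula, not Terrier's implementation; the parameter defaults k1 = 1.2 and b = 0.75 are conventional choices, and the data below are illustrative:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_dl,
               k1=1.2, b=0.75):
    """Score one document for a query using the standard BM25 formula.
    doc_freq maps each term to the number of documents containing it;
    avg_dl is the average document length in the collection."""
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))
    return score
```

In a setup like the one the abstract describes, Pseudo Relevance Feedback would then expand the query with salient terms drawn from the top-ranked documents before a second retrieval pass.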
Accessing spoken interaction through dialogue processing [online]
Abstract
Written language is one of our primary means for documenting our
lives, achievements, and environment. Our capabilities to
record, store and retrieve audio, still pictures, and video are
undergoing a revolution and may support, supplement or even
replace written documentation. This technology enables us to
record information that would otherwise be lost, lower the cost
of documentation and enhance high-quality documents with
original audiovisual material.
The indexing of the audio material is the key technology to
realize those benefits. This work presents effective
alternatives to keyword-based indices; these alternatives restrict
the search space and may in part be calculated with very limited resources.
Indexing speech documents can be done at various levels:
Stylistically a document belongs to a certain database which can
be determined automatically with high accuracy using very simple
features. The resulting factor in search space reduction is in
the order of 410 while topic classification yielded a factor
of 18 in a news domain.
Since documents can be very long, they need to be segmented into
topical regions. A new probabilistic segmentation framework as
well as new features (speaker initiative and style) prove to be
very effective compared to traditional keyword-based methods. At
the topical segment level activities (storytelling, discussing,
planning, ...) can be detected using a machine learning approach
with limited accuracy; however, even human annotators do not
annotate them very reliably. A maximum search space reduction
factor of 6 is theoretically possible on the databases used. A
topical classification of these regions has been attempted
on one database; the detection accuracy for that index, however,
was very low.
At the utterance level dialogue acts such as statements,
questions, backchannels (aha, yeah, ...), etc. are being
recognized using a novel discriminatively trained HMM procedure.
The procedure can be extended to recognize short sequences such
as question/answer pairs, so-called dialogue games.
Dialogue acts and games are useful for building classifiers for
speaking style. Similarly, a user may remember a certain dialogue
act sequence and may search for it in a graphical
representation.
In a study with very pessimistic assumptions, users were able to
pick one out of four similar and equiprobable meetings correctly
with an accuracy of ~43% using graphical activity information.
Dialogue acts may be useful in this situation as well, but the
sample size did not allow final conclusions to be drawn. However,
the user study failed to show any effect for detailed basic
features such as formality or speaker identity.
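The abstract recognises dialogue acts with a discriminatively trained HMM; that training procedure cannot be reconstructed from the abstract, but the decoding step can be illustrated with plain Viterbi search over dialogue-act sequences. The act inventory, observation scores, and transition scores below are invented for illustration (all scores are in the log domain):

```python
def viterbi(observation_scores, transition, acts):
    """Find the most likely dialogue-act sequence for a conversation.
    observation_scores: one dict per utterance, act -> log P(utt | act).
    transition: dict (prev_act, act) -> log transition score."""
    # best[a] = (log score of best path ending in act a, that path)
    best = {a: (observation_scores[0][a], [a]) for a in acts}
    for obs in observation_scores[1:]:
        new_best = {}
        for a in acts:
            # pick the predecessor maximising path score + transition
            prev, (score, path) = max(
                ((p, best[p]) for p in acts),
                key=lambda x: x[1][0] + transition[(x[0], a)])
            new_best[a] = (score + transition[(prev, a)] + obs[a],
                           path + [a])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```

A sequence model of this shape is also what allows the extension to short act sequences such as question/answer games: the transition table simply encodes which acts tend to follow which.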
Answer extraction for simple and complex questions
When a user is served a ranked list of relevant documents by a standard document
search engine, the search task is usually not over. The user has to go through the entire
document contents to find the precise piece of information being sought. Question
answering, which is the retrieving of answers to natural language questions from a document
collection, tries to remove the onus on the end-user by providing direct access to
relevant information. This thesis is concerned with open-domain question answering. We
have considered both simple and complex questions. Simple questions (i.e. factoid and
list) are easier to answer than questions that have complex information needs and require
inferencing and synthesizing information from multiple documents.
Our question answering system for simple questions is based on question classification
and document tagging. Question classification extracts useful information (i.e. answer
type) about how to answer the question and document tagging extracts useful information
from the documents, which is used in finding the answer to the question.
For complex questions, we experimented with both empirical and machine learning approaches.
We extracted several features of different types (i.e. lexical, lexical semantic,
syntactic and semantic) for each of the sentences in the document collection in order to
measure its relevance to the user query. A hill-climbing local search strategy is used
to fine-tune the feature weights. We also experimented with two unsupervised machine
learning techniques, the k-means and Expectation Maximization (EM) algorithms, and evaluated
their performance. For all these methods, we have shown the effects of different kinds
of features.
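The hill-climbing weight tuning mentioned above can be sketched generically. The evaluation function, step size, and iteration budget here are assumptions for illustration, not the thesis's actual settings:

```python
import random

def hill_climb(weights, evaluate, step=0.1, iterations=200, seed=0):
    """Greedy local search over feature weights: perturb one weight at a
    time by +/-step and keep the change whenever evaluate() improves."""
    rng = random.Random(seed)
    best = list(weights)
    best_score = evaluate(best)
    for _ in range(iterations):
        i = rng.randrange(len(best))          # pick a weight at random
        candidate = list(best)
        candidate[i] += rng.choice([-step, step])
        score = evaluate(candidate)
        if score > best_score:                # accept only improvements
            best, best_score = candidate, score
    return best, best_score
```

In the setting the abstract describes, evaluate() would score a ranking of sentences against relevance judgements; here any objective function works, which is what makes hill climbing attractive when the evaluation measure is not differentiable.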
Writing for mobile media: The influences of text, digital design and psychological characteristics on the cognitive load of the mobile user
Text elements on the mobile smartphone interface make a significant contribution to the user’s interaction experience. In combination with other visual design features, these words curate the path of the mobile user on a journey through the information to satisfy a specific task. This study analyses the elements that influence the interpretation process and optimum presentation of information on mobile media. I argue that effective digital writing contributes to reducing the cognitive load experienced by the mobile user. The central discussion focuses on the writing of text for this medium, which I suggest forges an entirely unique narrative. The optimum writing approach is based on the multi-dimensional characteristics of hypertext, which allow the writer to facilitate the journey without the user losing control of the interpretation process. This study examines the relationship between the writer, the reader and the text, with a unique perspective on the mobile media writer, who is tasked with achieving balance between the functionality and humanity of digital interaction. To explore influences on the development of the relevant writing techniques, I present insights into the distinctive characteristics of the mobile smartphone device, with specific focus on the screen and keyboard. I also discuss the unique characteristics of the mobile user and show how the visual design of the interface is integral to the writing of text for this medium. Furthermore, this study explores the role, skills, and processes of the current and future digital writer, against the backdrop of incessant technological advancement and revolutionary changes in human-computer behaviour.
Toward summarization of communicative activities in spoken conversation
This thesis is an inquiry into the nature and structure of face-to-face conversation, with a
special focus on group meetings in the workplace. I argue that conversations are composed
of episodes, each of which corresponds to an identifiable communicative activity such as
giving instructions or telling a story. These activities are important because they are part
of participants’ commonsense understanding of what happens in a conversation. They
appear in natural summaries of conversations such as meeting minutes, and participants
talk about them within the conversation itself. Episodic communicative activities therefore
represent an essential component of practical, commonsense descriptions of conversations.
The thesis objective is to provide a deeper understanding of how such activities may be
recognized and differentiated from one another, and to develop a computational method
for doing so automatically. The experiments are thus intended as initial steps toward future
applications that will require analysis of such activities, such as an automatic minute-taker
for workplace meetings, a browser for broadcast news archives, or an automatic decision
mapper for planning interactions.
My main theoretical contribution is to propose a novel analytical framework called participant
relational analysis. The proposal argues that communicative activities are principally
indicated through participant-relational features, i.e., expressions of relationships between
participants and the dialogue. Participant-relational features, such as subjective language,
verbal reference to the participants, and the distribution of speech activity amongst
the participants, are therefore argued to be a principal means for analyzing the nature and
structure of communicative activities.
I then apply the proposed framework to two computational problems: automatic discourse
segmentation and automatic discourse segment labeling. The first set of experiments
test whether participant-relational features can serve as a basis for automatically
segmenting conversations into discourse segments, e.g., activity episodes. Results show
that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly used approach based on semantic links
between content words, i.e., lexical cohesion. They also show that feature performance is
highly dependent on segment type, suggesting that human-annotated “topic segments” are
in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units.
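The lexical-cohesion baseline referred to above is typically implemented in the style of TextTiling: compare word distributions in adjacent windows of sentences and hypothesise a boundary where their similarity dips. A minimal sketch, with illustrative window size and threshold rather than any values from the thesis:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cohesion_boundaries(sentences, window=2, threshold=0.1):
    """Place a topic boundary before sentence i whenever the lexical
    similarity between the preceding and following windows of sentences
    drops below the threshold (a crude TextTiling-style criterion)."""
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[i - window:i] for w in s)
        right = Counter(w for s in sentences[i:i + window] for w in s)
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```

Participant-relational features, by contrast, would replace the word counts above with signals such as speaker initiative or subjective language, which is what the experiments compare against.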
Analysis of commonly used evaluation measures, performed in conjunction with the
segmentation experiments, reveals that they fail to penalize substantially defective results
due to inherent biases in the measures. I therefore preface the experiments with a comprehensive
analysis of these biases and a proposal for a novel evaluation measure. A reevaluation
of state-of-the-art segmentation algorithms using the novel measure produces
substantially different results from previous studies. This raises serious questions about the
effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate
ones to employ in the subsequent experiments.
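For concreteness, the standard Pk measure that such critiques commonly target slides a window of width k across the text and counts positions where the reference and hypothesis disagree about whether the window straddles a segment boundary. This sketch is the conventional measure, not the novel one proposed in the thesis:

```python
def to_positions(boundary_bits):
    """Map a 0/1 boundary-indicator list (boundary after unit i) to a
    per-unit segment index."""
    seg, out = 0, []
    for b in boundary_bits:
        out.append(seg)
        if b:
            seg += 1
    return out

def pk(ref_bits, hyp_bits, k=None):
    """Pk segmentation error: fraction of width-k windows on which the
    reference and hypothesis disagree about same-segment membership."""
    n = len(ref_bits)
    if k is None:
        # conventional default: half the average reference segment length
        k = max(1, round(n / (2 * (sum(ref_bits) + 1))))
    ref_seg = to_positions(ref_bits)
    hyp_seg = to_positions(hyp_bits)
    errors = 0
    for i in range(n - k):
        ref_same = ref_seg[i] == ref_seg[i + k]
        hyp_same = hyp_seg[i] == hyp_seg[i + k]
        errors += ref_same != hyp_same
    return errors / (n - k)
```

One bias the thesis's critique points at is visible even here: a degenerate hypothesis with no boundaries at all is only lightly penalised, since most windows in a typical reference fall inside a single segment anyway.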
I also preface the experiments with an investigation of participant reference, an important
type of participant-relational feature. I propose an annotation scheme with novel distinctions
for vagueness, discourse function, and addressing-based referent inclusion, each
of which are assessed for inter-coder reliability. The produced dataset includes annotations
of 11,000 occasions of person-referring.
The second set of experiments concern the use of participant-relational features to
automatically identify labels for discourse segments. In contrast to assigning semantic topic
labels, such as topical headlines, the proposed algorithm automatically labels segments
according to activity type, e.g., presentation, discussion, and evaluation. The method is
unsupervised and does not learn from annotated ground truth labels. Rather, it induces the
labels through correlations between discourse segment boundaries and the occurrence of
bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what
has just occurred or what is about to occur. Results show that bracketing meta-discourse
is an effective basis for identifying some labels automatically, but that its use is limited if
global correlations to segment features are not employed.
This thesis addresses important prerequisites to the automatic summarization of conversation.
What I provide is a novel activity-oriented perspective on how summarization
should be approached, and a novel participant-relational approach to conversational analysis.
The experimental results show that analysis of participant-relational features is an
effective basis for both segmenting conversations and labeling their communicative activities.
Conversión de texto en habla multidominio basada en selección de unidades con ajuste subjetivo de pesos y marcado robusto de pitch
The final purpose of any text-to-speech (TTS) system is the generation of perfectly natural synthetic speech from any input text. Historically, two strategies have been followed in pursuit of this goal: general-purpose TTS synthesis (GP-TTS), which favours the flexibility of the application at the expense of synthetic speech quality; and limited-domain TTS synthesis (LD-TTS), which prioritizes high quality by restricting the scope of the input text. At present, the most widely used strategy for developing TTS systems is so-called corpus-based or unit-selection TTS (US-TTS) synthesis. Although the quality of US-TTS systems is generally quite good, several open issues remain under investigation.
This PhD thesis introduces different contributions to US-TTS in order to improve, on the one hand, the naturalness of GP-TTS systems and, on the other, the flexibility of LD-TTS systems. To address the former problem, a new technique is introduced for efficiently incorporating human perception into the unit selection process by subjectively tuning the weights of the cost function that guides unit selection, while controlling user fatigue and consistency. Moreover, a new method for improving the reliability of automatic speech corpus labelling is described; in particular, a generic pitch-mark filtering algorithm is introduced, an essential issue in corpus-based TTS systems. The latter problem is addressed by multi-domain TTS (MD-TTS) synthesis which, following the LD-TTS approach, aims to achieve synthetic speech quality equivalent to that of LD-TTS systems while improving flexibility by considering different domains (speaking styles, emotions, topics, etc.) for speech synthesis. In this context, the MD-TTS system needs to know, at run time, which domain or domains are most suitable for synthesizing the input text with the highest synthetic speech quality. To that end, the MD-TTS system adds a text classification module, adapted to the particularities of MD-TTS, to the classic TTS synthesis architecture.
Finally, all the proposals are evaluated by objective experiments, using classic and newly proposed measures, and/or subjective perceptual tests, in order to validate the improvements achieved by the methods developed in the US-TTS framework, as a step further in our research towards developing high-quality and flexible text-to-speech synthesis systems.
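The subjective weight tuning described above operates on the unit-selection cost function: a weighted sum of target costs (how well each candidate unit matches the desired specification) and concatenation costs (how smoothly adjacent units join), minimised by dynamic programming. A minimal sketch in which the cost functions, weights, and numeric "units" are placeholders, not the thesis's actual formulation:

```python
def sequence_cost(units, targets, w_target, w_concat,
                  target_cost, concat_cost):
    """Total weighted cost of a candidate unit sequence."""
    total = sum(w_target * target_cost(u, t) for u, t in zip(units, targets))
    total += sum(w_concat * concat_cost(a, b)
                 for a, b in zip(units, units[1:]))
    return total

def select_units(candidates, targets, w_target, w_concat,
                 target_cost, concat_cost):
    """Dynamic-programming unit selection: for each target position pick,
    per candidate, the predecessor minimising cumulative weighted cost,
    then return the overall cheapest sequence."""
    best = [(w_target * target_cost(u, targets[0]), [u])
            for u in candidates[0]]
    for t, cands in zip(targets[1:], candidates[1:]):
        new = []
        for u in cands:
            score, path = min(
                best,
                key=lambda x: x[0] + w_concat * concat_cost(x[1][-1], u))
            new.append((score + w_concat * concat_cost(path[-1], u)
                        + w_target * target_cost(u, t), path + [u]))
        best = new
    return min(best, key=lambda x: x[0])[1]
```

Subjective weight tuning as described in the abstract would adjust w_target and w_concat (and any sub-weights inside the cost functions) according to listeners' preferences rather than an objective criterion.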