Building a hybrid: chatterbot-dialog system
Generic conversational agents often use hard-coded stimulus-response data to generate responses, with little to no effort spent on actually understanding the input. The limitation of such systems is obvious: the general and linguistic knowledge of the system is limited to what its developer explicitly defined. A system that analyses user input at a deeper level of abstraction and backs its knowledge with common-sense information should therefore be capable of providing more adequate responses, which in turn results in a better overall user experience. From this premise, a framework was proposed, and a working prototype was implemented upon it. The prototype makes use of various natural language processing tools, online and offline knowledge bases, and other information sources to comprehend input and construct relevant responses.
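The gap this abstract describes can be sketched in a few lines. Everything below is a hypothetical illustration, not the prototype's actual data or design: a hard-coded stimulus-response lookup of the kind being criticized, plus a shallow common-sense fallback of the kind the framework argues for.

```python
# Minimal contrast between hard-coded stimulus-response rules and a
# knowledge-backed fallback. All rules and facts here are invented examples.

RULES = {
    "hello": "Hi there!",
    "bye": "Goodbye!",
}

FACTS = {
    "dog": "a dog is a common domestic animal",
    "rain": "rain is water falling from clouds",
}

def respond(utterance: str) -> str:
    """Try an exact stimulus match first; fall back to common-sense facts."""
    text = utterance.lower().strip("?!. ")
    if text in RULES:                 # stimulus-response: exact match only
        return RULES[text]
    for word in text.split():         # shallow 'comprehension' fallback
        if word in FACTS:
            return f"I know that {FACTS[word]}."
    return "I don't understand."      # knowledge limited to what the developer defined
```

The point of the sketch is the failure mode: anything outside `RULES` and `FACTS` falls through to the default response, which is exactly the limitation the proposed framework addresses with deeper analysis and external knowledge bases.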
Increased recall in annotation variance detection in treebanks
Automatic inconsistency detection in parsed corpora is significantly helpful for building more and larger corpora of annotated texts. Inconsistencies are inevitable and originate from variance in annotation caused by different factors, for instance a lapse of attention or the absence of clear annotation guidelines. In this paper, some results involving the automatic detection of annotation variance in parsed corpora are presented. In particular, it is shown that a generalization procedure substantially increases the recall of the variant-detection algorithm proposed in [1]. (18th International Conference on Text, Speech and Dialogue, TSD 2015, Pilsen, Czech Republic, September 2015.)
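As a rough illustration of variance detection (this is not the algorithm of [1]; the real method compares annotations of identical material in context, and the paper's generalization procedure abstracts those contexts to raise recall), one can flag word types that receive conflicting labels across a corpus:

```python
from collections import defaultdict

def annotation_variants(corpus):
    """Return word types that receive more than one distinct label.

    `corpus` is a list of (token, label) pairs. Identical material with
    differing annotation is a candidate inconsistency; a real detector
    would compare larger contexts (variation nuclei) rather than bare tokens.
    """
    labels = defaultdict(set)
    for token, label in corpus:
        labels[token.lower()].add(label)
    return {tok: sorted(tags) for tok, tags in labels.items() if len(tags) > 1}
```

Requiring longer identical contexts makes such a detector more precise but lowers recall, which is why generalizing the contexts (as the paper does) recovers more true variants.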
Key Phrase Extraction of Lightly Filtered Broadcast News
This paper explores the impact of light filtering on automatic key phrase
extraction (AKE) applied to Broadcast News (BN). Key phrases are words and
expressions that best characterize the content of a document. Key phrases are
often used to index the document or as features in further processing. This
makes improvements in AKE accuracy particularly important. We hypothesized that
filtering out marginally relevant sentences from a document would improve AKE
accuracy. Our experiments confirmed this hypothesis. Eliminating as little
as 10% of the document sentences led to a 2% improvement in AKE precision and
recall. Our AKE method is built on top of the MAUI toolkit, which follows a
supervised learning approach. We trained and tested it on a gold standard made of 8 BN
programs containing 110 manually annotated news stories. The experiments were
conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio
news/programs, running daily and monitoring 12 TV and 4 radio channels.
Comment: In the 15th International Conference on Text, Speech and Dialogue (TSD 2012).
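The filter-then-extract pipeline can be sketched as follows. This is a deliberately crude stand-in under stated assumptions: the paper's AKE is the supervised MAUI toolkit and its light filter scores marginal relevance properly, whereas here both steps use nothing but content-word counts.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "on"}

def content_words(sentence):
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

def light_filter(sentences, drop_ratio=0.10):
    """Drop the least 'relevant' sentences before key phrase extraction.

    Relevance here is just the content-word count, a toy stand-in for the
    paper's marginal-relevance scoring of sentences.
    """
    by_relevance = sorted(sentences, key=lambda s: len(content_words(s)))
    dropped = set(by_relevance[:int(len(sentences) * drop_ratio)])
    return [s for s in sentences if s not in dropped]  # keep document order

def key_phrases(sentences, k=3):
    """Rank candidate key phrases (single words here) by frequency."""
    counts = Counter(w for s in sentences for w in content_words(s))
    return [w for w, _ in counts.most_common(k)]
```

The intuition the paper tests is visible even in this toy: removing marginal sentences changes the frequency profile the extractor sees, so phrases characteristic of the core content rank higher.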
A Dataset and Strong Baselines for Classification of Czech News Texts
Pre-trained models for Czech Natural Language Processing are often evaluated
on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple
classification tasks such as sentiment classification or article classification
from a single news source. As an alternative, we present
CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech
classification datasets, composed of news articles from various sources
spanning over twenty years, which allows a more rigorous evaluation of such
models. We define four classification tasks: news source, news category,
inferred author's gender, and day of the week. To verify the task difficulty,
we conducted a human evaluation, which revealed that human performance lags
behind strong machine-learning baselines built upon pre-trained transformer
models. Furthermore, we show that language-specific pre-trained encoder
models outperform selected commercially available large-scale generative
language models.
Comment: 12 pages. Accepted to Text, Speech and Dialogue (TSD) 202
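To make the text-classification task format concrete, here is a tiny word-count Naive Bayes baseline. It is deliberately weak, far below the pre-trained transformer baselines the paper uses, and all example texts and labels below are invented, not CZE-NEC data.

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.label_totals = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        def log_prob(label):
            total = sum(self.counts[label].values()) + len(self.vocab)
            lp = math.log(self.label_totals[label])
            for w in text.lower().split():
                lp += math.log((self.counts[label][w] + 1) / total)
            return lp
        return max(self.label_totals, key=log_prob)
```

Each of the four CZE-NEC tasks (source, category, inferred author's gender, day of the week) has this same shape: text in, one label out; only the label set changes.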
Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0
Prosodic boundaries in speech are of great relevance to both speech synthesis
and audio annotation. In this paper, we apply the wav2vec 2.0 framework to the
task of detecting these boundaries in speech signal, using only acoustic
information. We test the approach on a set of recordings of Czech broadcast
news, labeled by phonetic experts, and compare it to an existing text-based
predictor, which uses the transcripts of the same data. Despite using a
relatively small amount of labeled data, the wav2vec2 model achieves an
accuracy of 94% and F1 measure of 83% on within-sentence prosodic boundaries
(or 95% and 89% on all prosodic boundaries), outperforming the text-based
approach. However, by combining the outputs of the two different models we can
improve the results even further.
Comment: This preprint is a pre-review version of the paper and does not
contain any post-submission improvements or corrections. The Version of
Record of this contribution is published in the proceedings of the
International Conference on Text, Speech, and Dialogue (TSD 2022), LNAI
volume 13502, and is available online at
https://doi.org/10.1007/978-3-031-16270-1_3
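The combination step mentioned at the end can be illustrated as simple late fusion of per-position boundary probabilities from the two predictors. The weighted average and threshold below are hypothetical choices for illustration, not the paper's actual fusion method.

```python
def fuse_boundaries(acoustic_probs, text_probs, weight=0.7, threshold=0.5):
    """Late fusion of an acoustic and a text-based boundary predictor.

    Both inputs are per-position boundary probabilities; the output is a
    boolean boundary decision per position. `weight` favours the acoustic
    model, mirroring the paper's finding that it is the stronger of the two.
    """
    assert len(acoustic_probs) == len(text_probs)
    fused = [weight * a + (1 - weight) * t
             for a, t in zip(acoustic_probs, text_probs)]
    return [p >= threshold for p in fused]
```

Fusion helps when the models err differently: a boundary the text model misses (e.g. one not marked by punctuation) can still be recovered from the acoustic signal, and vice versa.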
Czech Text Document Corpus v 2.0
This paper introduces "Czech Text Document Corpus v 2.0", a collection of
text documents for automatic document classification in the Czech language. It is
composed of the text documents provided by the Czech News Agency and is freely
available for research purposes at http://ctdc.kiv.zcu.cz/. This corpus was
created in order to facilitate a straightforward comparison of the document
classification approaches on Czech data. It is particularly dedicated to the
evaluation of multi-label document classification approaches, because one
document is usually labelled with more than one label. Besides the information
about the document classes, the corpus is also annotated at the morphological
layer. This paper further shows the results of selected state-of-the-art
methods on this corpus to allow easy comparison with these approaches.
Comment: Accepted for LREC 201
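Multi-label evaluation on a corpus like this is commonly reported with micro-averaged F1, which pools true positives across documents rather than averaging per-document scores. A minimal sketch:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label classification.

    `gold` and `pred` are lists of label sets, one per document. Counts of
    true positives, false positives, and false negatives are pooled over
    all documents before precision and recall are computed.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Micro averaging weights frequent labels more heavily; macro averaging (mean of per-label F1) is the usual complement when rare classes matter.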