Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora
This thesis describes the development and in-depth empirical investigation of a
method, called BootMark, for bootstrapping the marking up of named entities
in textual documents. The reason for working with documents, as opposed to
for instance sentences or phrases, is that the BootMark method is concerned
with the creation of corpora. The claim made in the thesis is that BootMark
requires a human annotator to manually annotate fewer documents in order to
produce a named entity recognizer with a given performance, than would be
needed if the documents forming the basis for the recognizer were randomly
drawn from the same corpus. The intention is then to use the created named
entity recognizer as a pre-tagger and thus eventually turn the manual annotation
process into one in which the annotator reviews system-suggested annotations
rather than creating new ones from scratch. The BootMark method consists of
three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping
– active machine learning for the purpose of selecting which document to
annotate next; (3) The remaining unannotated documents of the original corpus
are marked up using pre-tagging with revision.
Five emerging issues are identified, described and empirically investigated
in the thesis. Their common denominator is that they all depend on the
realization of the named entity recognition task, and as such, require the context
of a practical setting in order to be properly addressed. The emerging issues
are related to: (1) the characteristics of the named entity recognition task and
the base learners used in conjunction with it; (2) the constitution of the set of
documents annotated by the human annotator in phase one in order to start the
bootstrapping process; (3) the active selection of the documents to annotate in
phase two; (4) the monitoring and termination of the active learning carried out
in phase two, including a new intrinsic stopping criterion for committee-based
active learning; and (5) the applicability of the named entity recognizer created
during phase two as a pre-tagger in phase three.
The outcomes of the empirical investigations concerning the emerging
issues support the claim made in the thesis. The results also suggest that while
the recognizer produced in phases one and two is as useful for pre-tagging as
a recognizer created from randomly selected documents, the applicability of
the recognizer as a pre-tagger is best investigated by conducting a user study
involving real annotators working on a real named entity recognition task.
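As a rough illustration of the document selection in phase two, the following Python sketch ranks unannotated documents by committee disagreement, measured here with vote entropy averaged over tokens. It is not the BootMark implementation: the choice of disagreement metric, the function names and the toy data are all assumptions made for illustration, and the thesis may use a different measure.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for a single token."""
    counts = Counter(votes)
    k = len(votes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def document_disagreement(committee_labels):
    """Average per-token vote entropy for one document.

    committee_labels holds one label sequence per committee member,
    all of the same length (hypothetical input format).
    """
    n_tokens = len(committee_labels[0])
    if n_tokens == 0:
        return 0.0
    total = sum(
        vote_entropy([member[i] for member in committee_labels])
        for i in range(n_tokens)
    )
    return total / n_tokens


def select_next_document(pool):
    """Return the id of the document the committee disagrees on most.

    pool maps a document id to its committee_labels.
    """
    return max(pool, key=lambda doc_id: document_disagreement(pool[doc_id]))


# Toy pool: two committee members label two short documents.
pool = {
    "doc_a": [["O", "O"], ["O", "O"]],      # full agreement
    "doc_b": [["PER", "O"], ["O", "O"]],    # disagreement on the first token
}
next_doc = select_next_document(pool)
```

In this toy pool the committee agrees on every token of `doc_a` and disagrees on one token of `doc_b`, so `doc_b` would be handed to the annotator next.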
ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications
Personal assistants, automatic speech recognizers and dialogue understanding
systems are becoming more critical in our interconnected digital world. A clear
example is air traffic control (ATC) communications. ATC aims at guiding
aircraft and controlling the airspace in a safe and optimal manner. These
voice-based dialogues are carried between an air traffic controller (ATCO) and
pilots via very-high frequency radio channels. In order to incorporate these
novel technologies into ATC (low-resource domain), large-scale annotated
datasets are required to develop the data-driven AI systems. Two examples are
automatic speech recognition (ASR) and natural language understanding (NLU). In
this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering
research on the challenging ATC field, which has lagged behind due to lack of
annotated data. The ATCO2 corpus covers 1) data collection and pre-processing,
2) pseudo-annotations of speech data, and 3) extraction of ATC-related named
entities. The ATCO2 corpus is split into three subsets. 1) ATCO2-test-set
corpus contains 4 hours of ATC speech with manual transcripts and a subset with
gold annotations for named-entity recognition (callsign, command, value). 2)
The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched
with automatic transcripts from an in-domain speech recognizer, contextual
information, speaker turn information, signal-to-noise ratio estimate and
English language detection score per sample. Both are available for purchase
through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3)
The ATCO2-test-set-1h corpus is a one-hour subset from the original test set
corpus, that we are offering for free at https://www.atco2.org/data. We expect
the ATCO2 corpus will foster research on robust ASR and NLU not only in the
field of ATC communications but also in the general research community.
Comment: Manuscript under review; the code will be available at
https://github.com/idiap/atco2-corpu
On the voice-activated question answering
Question answering (QA) is probably one of the most challenging tasks in the field of natural language processing. It requires search engines that are capable of extracting concise, precise fragments of text that contain an answer to a question posed by the user. The incorporation of voice interfaces into QA systems adds a more natural and very appealing perspective for these systems. This paper provides a comprehensive description of current state-of-the-art voice-activated QA systems. Finally, the scenarios that will emerge from the introduction of speech recognition in QA will be discussed. © 2006 IEEE.
This work was supported in part by Research Projects TIN2009-13391-C04-03 and TIN2008-06856-C05-02. This paper was recommended by Associate Editor V. Marik.
Rosso, P.; Hurtado Oliver, LF.; Segarra Soriano, E.; Sanchís Arnal, E. (2012). On the voice-activated question answering. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 42(1):75-85. https://doi.org/10.1109/TSMCC.2010.2089620
What Should Streamers Communicate in Livestream E-Commerce? The Effects of Social Interactions on Live Streaming Performance
Compared with traditional e-commerce, livestreaming e-commerce is characterized by direct and intimate communication between streamers and consumers that stimulates instant social interactions. This study focuses on streamers’ three types of information exchange (i.e., product information, social conversation, and social solicitation) and examines their roles in driving both short-term and long-term livestreaming performance (i.e., sales and customer base growth). We find that the informational role of product information (nonpromotional and promotional) is beneficial not only to sales performance, but also to the growth of the customer base. We also find that social conversation has a relationship-building effect that positively impacts both sales and customer base growth, whereas social solicitation has both a relationship-building and a relationship-straining effect that positively affects customer base growth but can hurt sales. Furthermore, our results show that streamers’ social interactions with consumers can stimulate consumer engagement in different ways, leading to different effects on livestreaming performance.