CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions: technological issues, user-centred issues and use cases, and socio-economic and legal aspects. These were assessed by two central studies: first, a concerted vision of the functional breakdown of a generic multimedia search engine, and second, a set of representative use-case descriptions with the related discussion of the requirements they place on technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to coordinators of EU projects as well as coordinators of national initiatives. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and "enablers", which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.
MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension
[EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. Of these, we highlight the system c2-streaming_600ms_t which, following a configuration similar to that of the primary system but with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.
The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 761758 (X5Gon) and 952215 (TAILOR), and Erasmus+ Education programme under grant agreement no. 20-226-093604-SCH (EXPERT); the Government of Spain's grant RTI2018-094879-B-I00 (Multisub) funded by MCIN/AEI/10.13039/501100011033 & "ERDF A way of making Europe", and FPU scholarships FPU14/03981 and FPU18/04135; the Generalitat Valenciana's research project Classroom Activity Recognition (ref. PROMETEO/2019/111), and predoctoral research scholarship ACIF/2017/055; and the Universitat Politècnica de València's PAID-01-17 R&D support programme.
Baquero-Arnal, P.; Jorge-Cano, J.; Giménez Pastor, A.; Iranzo-Sánchez, J.; Pérez-González De Martos, AM.; Garcés Díaz-Munío, G.; Silvestre Cerdà, JA.... (2022). MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension. Applied Sciences. 12(2):1-14. https://doi.org/10.3390/app12020804
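The WER figures quoted in the abstract above can be related with a little arithmetic. The following is a minimal sketch, not the authors' scoring tool: it assumes the usual definition of word error rate as word-level edit distance divided by reference length, and the function names `word_error_rate` and `relative_change` are hypothetical. It shows that going from 16.0% to 16.9% WER is a relative degradation of 0.9 / 16.0 = 5.625%, which the abstract rounds to 6%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Figures taken from the abstract: primary system vs. c2-streaming_600ms_t.
degradation = relative_change(16.0, 16.9)
print(f"relative WER degradation: {degradation:.1%}")
```

The toy transcripts used to exercise `word_error_rate` are up to the reader; only the two WER values in the last lines come from the abstract.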
Transformer Models for Machine Translation and Streaming Automatic Speech Recognition
[EN] Natural language processing (NLP) is a set of fundamental computing problems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architecture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications.
ASR and MT can be used to produce cost-effective, high-quality multilingual texts for a wide array of media. The particular applications pursued in this thesis are news translation and automatic live captioning of television broadcasts. ASR and MT can also be combined with each other, for instance to generate automatically translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create automatic translated dubbing. These other applications fall outside the scope of this thesis, but can profit from its contributions, insofar as these help to improve the performance of the automatic systems on which they depend.
This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar-language translation. In successive chapters, it covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. Afterwards, it describes how we applied the developed technology to a specific use case in television captioning by participating in a public RTVE challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years, including the Transformer language model adapted for streaming, and that the contribution of the data was minor.
Baquero Arnal, P. (2023). Transformer Models for Machine Translation and Streaming Automatic Speech Recognition [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19368
Interrelation and Consistency in the Thematic Stages of the Publicistic Style
In this article, we analyze the combination of thematic and key vocabulary, that is, the words with the highest usage rates. In today's world, publicistic communication has gained global significance. Summarizing what has been said about the thematic chains of the publicistic text, we note their structural and content certainty: as the main nomination they consistently use the basic one; among the non-main nominations, the most significant are folded transforms, followed by substitutes, which are significantly less frequent. Semantically and stylistically, the thematic chains of a publicistic text are uniform and have a bookish character, since these chains are defined by a term and implemented on its basis. A comparison of the use of the terms cyclization and cycle in this text shows that the second, while no less frequent, is used mainly as a means of segmentation in the middle of the article, especially where research material is discussed. The differences in the composition of the nomination chains of the main theme are insignificant. To the structural types of nominations already named, among which the main nominations of the text chain can be singled out, only grammatical transformations need to be added. Thus, in this article we can note the status of the base unit as the main one in the text. The nature of the information introduced and the compositional role of the bundles are equally relevant for a publicistic text. A certain uniformity is also observed in the combinatorics of the linguistic components of the publicistic thematic chain.
Value Creation in a QoE Environment
User behavior in multimedia services is currently undergoing strong changes. This is reflected in several recent trends, e.g. the increase in rich media content consumption, preferences for more individual and personalized services, and the higher sensitivity of end users to quality issues. These changes will eventually lead to strong changes in network traffic characteristics: rising congestion at peak times and less bandwidth available to the individual user. As a result, the quality perceived by the end user will decrease if network operators and service providers do not anticipate the required changes to the network. Measurable network requirements such as available video and speech quality, security and reliability are addressed by technologies that are commonly summed up in the Quality of Service (QoS) concept. However, the end users' perception of quality is only reflected in the wider concept of Quality of Experience (QoE), which takes into account the measurable network requirements as well as customer needs, wants and preferences. To implement QoE technologies, several network components need to be added or changed, resulting in high capital expenditures. Yet it is not clear whether these costs can be compensated by efficiency increases. Thus, new revenue streams for the network operator are necessary to incentivize investments in QoE technologies. In this paper we address four new value creation models that can serve as a basis for more elaborate business models for network operators and other actors. We show how the interest in QoE of the user, the content provider, the service provider and the advertiser induces new revenue streams. These models are embedded in five possible future QoE scenarios that reveal regulation, end-user quality sensitivity and end-to-end support as major issues for the future.
Keywords: Business Models, Quality of Experience (QoE), Quality of Service (QoS), Value Creation
On the Use of Hybrid Heuristics for Providing Service to Select the Return Channel in an Interactive Digital TV Environment
The technologies used to link the end user to a telecommunication infrastructure have been changing over time due to the consolidation of new access technologies. Moreover, the emergence of new tools for information dissemination, such as interactive digital TV, makes the selection of the access technology a factor of fundamental importance. One of the greatest advantages of using digital TV as a means of disseminating information is the ability to install applications. In this chapter, a load characterization of a typical application embedded in a digital TV is performed to determine its behavior. It is important to note, however, that applications send information through an access technology. Therefore, based on the load characterization study, this chapter develops a methodology that combines Bayesian networks with the technique for order preference by similarity to ideal solution (TOPSIS) to help service providers choose a technology (power line communication (PLC), wireless, wired, etc.) for the return channel.
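To make the TOPSIS step mentioned above more concrete, here is a minimal sketch of the generic TOPSIS ranking procedure. It is not the chapter's methodology: the Bayesian-network stage is omitted, and the criteria, weights and all numeric values below are hypothetical placeholders chosen only for illustration. The function name `topsis` is also an assumption.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per alternative (higher is better).

    matrix  -- rows are alternatives, columns are criteria
    weights -- one weight per criterion
    benefit -- True if larger is better for that criterion, False if it is a cost
    """
    n_crit = len(weights)
    # 1. Vector-normalize each criterion column (guarding against all-zero columns).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    # 2. Apply the criterion weights.
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # 3. Build the ideal and anti-ideal solutions criterion by criterion.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    # 4. Closeness = distance to anti-ideal / (distance to ideal + distance to anti-ideal).
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti)
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.5)  # identical alternatives tie
    return scores


# Hypothetical decision matrix: throughput (Mbit/s), latency (ms), monthly cost.
technologies = ["PLC", "wireless", "wired"]
matrix = [
    [20.0, 40.0, 15.0],
    [30.0, 60.0, 25.0],
    [50.0, 20.0, 40.0],
]
weights = [0.4, 0.3, 0.3]        # assumed weights, not taken from the chapter
benefit = [True, False, False]   # throughput is a benefit; latency and cost are costs

for name, score in sorted(zip(technologies, topsis(matrix, weights, benefit)),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In a pipeline like the one the chapter describes, the entries of the decision matrix would presumably come from the load characterization and the Bayesian-network inference rather than from fixed constants as in this sketch.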
Accessing spoken interaction through dialogue processing [online]
Abstract
Written language is one of our primary means of documenting our lives, achievements, and environment. Our capabilities to record, store and retrieve audio, still pictures, and video are undergoing a revolution and may support, supplement or even replace written documentation of human communication such as meetings. This technology enables us to record information that would otherwise be lost, lower the cost of documentation and enhance high-quality documents with original audiovisual material.
The indexing of the audio material is the key technology for realizing those benefits. This work presents effective alternatives to keyword-based indices which restrict the search space and may in part be calculated with very limited resources.
Indexing speech documents can be done at various levels. Stylistically, a document belongs to a certain database, which can be determined automatically with high accuracy using very simple features. The resulting factor in search space reduction is in the order of 410, while topic classification yielded a factor of 18 in a news domain.
Since documents can be very long, they need to be segmented into topical regions. A new probabilistic segmentation framework as well as new features (speaker initiative and style) yield results comparable to or better than traditional keyword-based methods. At the topical segment level, activities (storytelling, discussing, planning, ...) can be detected using a neural network, although only with limited accuracy; however, even human annotators do not annotate them very reliably. A maximum search space reduction factor of 6 is theoretically possible on the databases used. A topical classification of these regions has been attempted on one database; the detection accuracy for that index, however, was very low.
At the utterance level, dialogue acts such as statements, questions, backchannels (aha, yeah, ...), etc. are recognized using a novel discriminatively trained HMM procedure. The procedure can be extended to recognize short sequences such as question/answer pairs, so-called dialogue games.
Dialogue acts and games are useful for building classifiers for speaking style. Similarly, a user may remember a certain dialogue act sequence and may search for it in a graphical representation.
In a study with very pessimistic assumptions, users were able to pick one out of four similar and equiprobable meetings correctly with an accuracy of about 43% using graphical activity information. Dialogue acts may be useful in this situation as well, but the sample size did not allow final conclusions to be drawn. However, the user study failed to show any effect for detailed basic features such as formality or speaker identity.
A new business model?
The paper delivers an analysis of the "New Economy", focussing on the roles of new business models, the capital market and venture capital. The capital market created a double standard in the 1990s: a high return on capital was required from old economy firms, whereas money was thrown at new economy firms that had a business idea which stimulated the fantasies of financial investors but no earnings. With the gradual bursting of the tech stock bubble since spring 2000, it has become apparent to the public that many new economy start-ups were unable to recover their costs. This paper shows that business models related to the internet can only work under certain conditions. The sectoral distribution of power, for example, determines the prospects of individual firms to realise e-commerce profitably. Digital technologies do not necessarily enhance profitability. On the contrary, they can increase competition and lead to lower profit rates. The limitation of competition appears to be a central condition of successful cost recovery. The venture capital cycle has been an important driving force of the new economy boom, but it can also give momentum to a longer crisis. Enormous amounts of money have been channeled to new economy start-ups in the hope that successful IPOs will one day give venture capitalists a high return. But the bursting of the bubble has brought down IPO activity and interrupted the valorisation cycle of venture capital. Financial investors have reacted to the crisis by shifting their capital to even riskier investments, as the comeback of hedge funds indicates.