70 research outputs found
Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
Recent advances in artificial general intelligence (AGI), particularly large
language models and creative image generation systems, have demonstrated
impressive capabilities on diverse tasks spanning the arts and humanities.
However, the swift evolution of AGI has also raised critical questions about
its responsible deployment in these culturally significant domains
traditionally seen as profoundly human. This paper provides a comprehensive
analysis of the applications and implications of AGI for text, graphics, audio,
and video pertaining to arts and the humanities. We survey cutting-edge systems
and their usage in areas ranging from poetry to history, marketing to film, and
communication to classical art. We outline substantial concerns pertaining to
factuality, toxicity, biases, and public safety in AGI systems, and propose
mitigation strategies. The paper argues for multi-stakeholder collaboration to
ensure AGI promotes creativity, knowledge, and cultural values without
undermining truth or human dignity. Our timely contribution summarizes a
rapidly developing field, highlighting promising directions while advocating
for responsible progress centering on human flourishing. The analysis lays the
groundwork for further research on aligning AGI's technological capacities with
enduring social goods.
Investigating Social Interactions Using Multi-Modal Nonverbal Features
Every day, humans are involved in social situations and interactions, with the goal of
sharing emotions and thoughts, establishing relationships with, or acting on, other
human beings. These interactions are possible thanks to what is called social intelligence,
which is the ability to express and recognize social signals produced during
the interactions. These signals aid the information exchange and are expressed
through verbal and non-verbal behavioral cues, such as facial expressions, gestures,
body pose or prosody. Recently, many works have demonstrated that social signals
can be captured and analyzed by automatic systems, giving birth to a relatively
new research area called social signal processing, which aims at replicating human
social intelligence with machines. In this thesis, we explore the use of behavioral
cues and computational methods for modeling and understanding social interactions.
Concretely, we focus on several behavioral cues in three specific contexts:
first, we analyze the relationship between gaze and leadership in small group interactions.
Second, we expand our analysis to face and head gestures in the context of
deception detection in dyadic interactions. Finally, we analyze the whole body for
group detection in mingling scenarios.
Preference Modeling in Data-Driven Product Design: Application in Visual Aesthetics
Creating a form that is attractive to the intended market audience is one of the greatest challenges in product development, given the subjective nature of preference and heterogeneous market segments with potentially different product preferences. Accordingly, product designers use a variety of qualitative and quantitative research tools to assess product preferences across market segments, such as design theme clinics, focus groups, customer surveys, and design reviews; however, these tools remain limited by their dependence on subjective judgment and by being time- and resource-intensive. In this dissertation, we focus on a key research question: how can we understand and predict more reliably the preference for a future product in heterogeneous markets, so that this understanding can inform designers' decision-making?
We present a number of data-driven approaches to model product preference. Instead of depending on subjective human judgment, the proposed preference models investigate the mathematical patterns behind users' choices and behavior. This allows a more objective translation of customers' perceptions and preferences into analytical relations that can inform design decision-making. Moreover, these models are scalable in that they have the capacity to analyze large-scale data and model customer heterogeneity accurately across market segments. In particular, we use feature representation as an intermediate step in our preference model, so that we can not only increase the predictive accuracy of the model but also capture in-depth insight into customers' preferences.
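As a hedged illustration of such a data-driven preference model, the sketch below fits a Bradley-Terry-style logistic model on pairwise choice data over pre-extracted design features; all data, dimensions, and names are hypothetical stand-ins rather than the dissertation's actual pipeline.

```python
# Minimal sketch of a data-driven pairwise preference model
# (illustrative only; all features and choices are simulated).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pre-extracted design features for two alternatives shown per choice task,
# e.g. learned visual-aesthetic features rather than hand-coded attributes.
X_a = rng.normal(size=(500, 16))          # features of design A
X_b = rng.normal(size=(500, 16))          # features of design B
true_w = rng.normal(size=16)
# Simulated choices: pick A when its utility exceeds B's (plus noise).
y = ((X_a - X_b) @ true_w + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Bradley-Terry-style model: P(choose A) = sigmoid(w . (x_a - x_b)).
model = LogisticRegression().fit(X_a - X_b, y)
print("held-in accuracy:", model.score(X_a - X_b, y))
```

In practice the feature vectors would come from a learned representation over product images rather than random draws, which is the role the feature-learning step plays above.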
We tested our data-driven approaches with applications in visual aesthetic preference. Our results show that the proposed approaches can obtain an objective measurement of aesthetic perception and preference for a given market segment. This measurement enables designers to reliably evaluate and predict the aesthetic appeal of their designs. We also quantify the relative importance of aesthetic attributes when customers consider both aesthetic and functional attributes. This quantification has great utility in helping product designers and executives in design reviews and design selection. Moreover, we visualize the possible factors affecting customers' perception of product aesthetics and how these factors differ across market segments. These visualizations are valuable to designers, as they relate physical design details to psychological customer reactions.
The main contribution of this dissertation is to present purely data-driven approaches that enable designers to quantify and interpret product preference more reliably. Methodological contributions include using modern probabilistic approaches and feature-learning algorithms to quantitatively model the design process involving product aesthetics. These novel approaches can not only increase predictive accuracy but also capture insights to inform design decision-making.
A textual and visual features-jointly driven hybrid intelligent system for digital physical education teaching quality evaluation
The utilization of intelligent computing in digital teaching quality evaluation has become a practical demand in smart cities. Currently, related research can be categorized into two types: textual data-based approaches and visual data-based approaches. Due to the gap between their formats and modalities, it remains very challenging to integrate the two when conducting digital teaching quality evaluation. In fact, both types of information reflect distinct knowledge from their own perspectives. To bridge this gap, this paper proposes a hybrid intelligent system for digital teaching quality evaluation driven jointly by textual and visual features. Visual features are extracted with a multiscale convolutional neural network that introduces receptive fields of different sizes. Textual features serve as auxiliary content for the primary visual features and are extracted using a recurrent neural network. Finally, we implement the proposed method in simulation experiments to evaluate its practical running performance, employing a real-world dataset collected from teaching activities. The experimental results reveal that the proposed hybrid intelligent system improves the efficiency of digital teaching quality evaluation by more than 10%.
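The following PyTorch sketch illustrates the jointly driven architecture described above: parallel convolutions with kernel sizes 3, 5 and 7 stand in for the multiscale receptive fields, and a GRU encodes the auxiliary text. Every layer size, vocabulary size, and name here is an illustrative assumption, not the paper's implementation.

```python
# Sketch of a textual+visual hybrid evaluator: multiscale CNN branch for
# images, GRU branch for text, fused into a quality score. Illustrative only.
import torch
import torch.nn as nn

class HybridEvaluator(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden=64):
        super().__init__()
        # Multiscale visual branch: parallel convolutions with different
        # kernel sizes emulate receptive fields of different sizes.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, k, padding=k // 2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for k in (3, 5, 7)
        ])
        # Auxiliary textual branch: embedding + GRU over token ids.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(16 * 3 + hidden, 1)  # fused quality score

    def forward(self, image, tokens):
        vis = torch.cat([b(image).flatten(1) for b in self.branches], dim=1)
        _, h = self.rnn(self.embed(tokens))
        return self.head(torch.cat([vis, h[-1]], dim=1))

model = HybridEvaluator()
score = model(torch.randn(2, 3, 64, 64), torch.randint(0, 5000, (2, 20)))
print(score.shape)  # torch.Size([2, 1])
```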
Modeling Visual Rhetoric and Semantics in Multimedia
Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g., right vs. left political bias), in contrast to explicit visual semantic concepts such as objects.
Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve the modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
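As a hedged illustration of cross-modal representation learning under weak alignment, the sketch below uses a symmetric contrastive (InfoNCE-style) objective over paired image and text embeddings; this is a common recipe for the task, not the dissertation's specific method, and all tensors and dimensions are stand-ins.

```python
# Sketch of cross-modal representation learning with a symmetric
# contrastive (InfoNCE-style) loss; a common recipe, not the exact method.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Pull matched image-text pairs together, push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature      # pairwise similarities
    targets = torch.arange(len(img))          # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random "encoder outputs" for a batch of 8 pairs.
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```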
Learning Biosignals with Deep Learning
The healthcare system, which is ubiquitously recognized as one of the most influential
systems in society, has been facing new challenges since the start of the decade. The myriad of
physiological data generated by individuals, namely in the healthcare system, is placing
a burden on physicians and reducing the effectiveness of patient data collection. Information
systems and, in particular, novel deep learning (DL) algorithms have been opening a
way to tackle this problem.
This thesis aims to make an impact on biosignal research and industry by
presenting DL solutions that could empower this field. For this purpose, an extensive study
of how to incorporate and implement Convolutional Neural Networks (CNN), Recurrent
Neural Networks (RNN), and Fully Connected Networks in biosignal studies is discussed.
Different architecture configurations were explored for signal processing and decision
making and were implemented in three different scenarios: (1) biosignal learning and
synthesis; (2) electrocardiogram (ECG) biometric systems; and (3) ECG
anomaly detection systems. In (1), an RNN-based architecture was able to replicate
three types of biosignals autonomously with a high degree of confidence. In (2), three
CNN-based architectures and an RNN-based architecture (the same used in (1)) were used
for biometric identification, reaching values above 90% for electrode-based datasets
(Fantasia, ECG-ID and MIT-BIH) and 75% for the off-the-person dataset (CYBHi), and for biometric
authentication, achieving Equal Error Rates (EER) of near 0% for Fantasia and MIT-BIH
and below 4% for CYBHi. In (3), an abstraction of the clean, healthy ECG signal was learned
and deviations from it were detected, tested in two different scenarios: the presence of
noise, using an autoencoder and a fully connected network (reaching 99% accuracy for binary
classification and 71% for multi-class); and arrhythmia events, by adding an RNN to the
previous architecture (57% accuracy and 61% sensitivity).
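As a hedged sketch of the anomaly-detection idea in scenario (3), the snippet below shows the core mechanism only: an autoencoder over fixed-length ECG windows whose reconstruction error serves as the anomaly score. The window length, layer sizes, and threshold policy are assumptions, not the thesis's exact configuration.

```python
# Sketch of scenario (3): an autoencoder over ECG windows flags anomalies
# via reconstruction error. Shapes and thresholds are illustrative.
import torch
import torch.nn as nn

window = 256  # samples per ECG segment (assumed)
autoencoder = nn.Sequential(          # fully connected encoder/decoder
    nn.Linear(window, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),     # compressed representation
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, window),
)
# NOTE: untrained here; in practice, fit on clean, healthy ECG first so
# that deviant windows reconstruct poorly.

def anomaly_score(x):
    """Mean squared reconstruction error per window; high = deviant."""
    with torch.no_grad():
        return ((autoencoder(x) - x) ** 2).mean(dim=1)

segments = torch.randn(4, window)     # stand-ins for ECG windows
print(anomaly_score(segments))        # threshold these scores to detect
```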
In sum, these systems are shown to be capable of producing novel results. The incorporation
of several AI systems into one could prove to be the next generation of
preventive medicine: as the machines have access to different physiological and anatomical
states, they could produce better-informed solutions for the issues one may face in the
future, increasing the performance of autonomous preventive systems that could be used
in everyday life in remote places where access to medicine is limited. These systems will also help the study of signal behaviour and how signals arise in real-life contexts,
as explainable AI could make this perception possible and link the inner states of a network with
biological traits.
Automatic understanding of multimodal content for Web-based learning
Web-based learning has become an integral part of everyday life for all ages and backgrounds. On the one hand, the advantages of this type of learning, such as availability, accessibility, flexibility, and cost, are apparent. On the other hand, the oversupply of content can lead to learners struggling to find optimal resources efficiently. The interdisciplinary research field Search as Learning is concerned with the analysis and improvement of Web-based learning processes, on both the learner side and the computer science side.
So far, automatic approaches that assess and recommend learning resources in Search as Learning (SAL) have focused on textual, resource, and behavioral features. However, these approaches commonly ignore multimodal aspects. This work addresses this research gap by proposing several approaches that address the question of how multimodal retrieval methods can help support learning on the Web. First, we evaluate whether textual metadata of the TIB AV-Portal can be exploited and enriched by semantic word embeddings to generate video recommendations and, in addition, to derive a video summarization technique that improves exploratory search. Then we turn to the challenging task of knowledge gain prediction, which estimates the potential learning success given a specific learning resource. We used data from two user studies for our approaches. The first one observes the knowledge gain when learning with videos in a Massive Open Online Course (MOOC) setting, while the second one provides an informal Web-based learning setting where the subjects have unrestricted access to the Internet. We then extend the purely textual features to include visual, audio, and cross-modal features for a holistic representation of learning resources. By correlating these features with the achieved knowledge gain, we can estimate the impact of a particular learning resource on learning success.
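The knowledge-gain analysis described above boils down to correlating per-resource features with a measured gain; the sketch below shows that step on simulated data, with feature names ('text_readability', 'visual_complexity', 'speech_rate') invented purely for illustration.

```python
# Sketch of the correlation analysis described above: relate multimodal
# resource features to measured knowledge gain. All data are simulated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 120                                   # participants/resources (assumed)
features = {
    "text_readability": rng.normal(size=n),
    "visual_complexity": rng.normal(size=n),
    "speech_rate": rng.normal(size=n),
}
# Simulated gain driven by one feature, to make the correlation visible.
knowledge_gain = 0.4 * features["text_readability"] + rng.normal(size=n)

for name, values in features.items():
    r, p = pearsonr(values, knowledge_gain)
    print(f"{name:18s} r={r:+.2f} p={p:.3f}")
```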
We further investigate the influence of multimodal data on the learning process by examining how the combination of visual and textual content conveys information in general. For this purpose, we draw on work from linguistics and visual communication, which has investigated the relationship between image and text for several decades by means of different metrics and categorizations. We concretize these metrics to make them usable for machine learning purposes. This process includes the derivation of semantic image-text classes from these metrics. We evaluate all proposals with comprehensive experiments and discuss their impacts and limitations at the end of the thesis.
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with the adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented as a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published during my work on several consecutive
research projects, in which I participated both as a member of the
research team and as the investigator or a co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two identification tasks mentioned above are investigated in particular
under the demanding and less explored frame-wise scenario,
which is the only one suitable for processing on-line data streams.
Time- and value-continuous explainable affect estimation in-the-wild
Today, the relevance of Affective Computing, i.e., of making computers recognise and simulate human emotions, cannot be overstated. All technology giants (from manufacturers of laptops to mobile phones to smart speakers) are in a fierce competition to make their devices understand not only what is being said, but also how it is being said, in order to recognise users' emotions. The goals have evolved from predicting the basic emotions (e.g., happy, sad) to the more nuanced affective states (e.g., relaxed, bored) in real time. The databases used in such research have evolved as well, from featuring acted behaviours earlier to spontaneous behaviours now. Lately, there has been a more powerful shift, called in-the-wild affect recognition, i.e., taking the research out of the laboratory and into the uncontrolled real world.
This thesis discusses, for the very first time, affect recognition for two unique in-the-wild audiovisual databases, GRAS2 and SEWA. GRAS2 is, to date, the only database with time- and value-continuous affect annotations for Labov effect-free affective behaviours, i.e., recorded without the participants' awareness of being recorded (which is otherwise known to affect the naturalness of one's affective behaviour). SEWA features participants from six different cultural backgrounds, conversing over a video-calling platform. Thus, SEWA features in-the-wild recordings further corrupted by unpredictable artifacts, such as network-induced delays, frame-freezing and echoes. The two databases present a unique opportunity to study time- and value-continuous affect estimation that is truly in-the-wild.
A novel ‘Evaluator Weighted Estimation’ formulation is proposed to generate a gold-standard sequence from several annotations. An illustration is presented demonstrating that the moving bag-of-words (BoW) representation better preserves the temporal context of the features while remaining more robust to outliers than other statistical summaries, e.g., the moving average. A novel, data-independent randomised codebook is proposed for the BoW representation, which is especially useful for cross-corpus model generalisation testing when the feature spaces of the databases differ drastically. Various deep learning models and support vector regressors are used to predict affect dimensions time- and value-continuously. The better generalisability of the models trained on GRAS2, despite the smaller training size, makes a strong case for the collection and use of Labov effect-free data.
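To make the moving bag-of-words with a data-independent randomised codebook concrete, here is a minimal sketch under assumed settings (10-dimensional frame features, 32 random codewords, a 25-frame window); the thesis's actual codebook construction and window parameters may differ.

```python
# Sketch of a moving bag-of-words over frame-level features using a
# data-independent random codebook; all dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(42)
codebook = rng.normal(size=(32, 10))      # 32 random codewords, 10-dim

def moving_bow(frames, win=25):
    """One normalised codeword histogram per frame over a sliding window."""
    # Assign each frame to its nearest codeword (squared Euclidean distance).
    d = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    bows = []
    for t in range(len(frames)):
        window = words[max(0, t - win + 1):t + 1]
        bows.append(np.bincount(window, minlength=len(codebook)) / len(window))
    return np.array(bows)

features = rng.normal(size=(200, 10))     # stand-in frame-level features
print(moving_bow(features).shape)         # (200, 32)
```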
A further foundational contribution is the discovery of the previously missing many-to-many mapping between the mean square error (MSE) and the concordance correlation coefficient (CCC), i.e., between two of the most popular utility functions to date. The newly invented cost function |MSE_{XY}/σ_{XY}| is evaluated in experiments aimed at demystifying the inner workings of a well-performing, simple, low-cost neural network that effectively utilises the BoW text features. Also proposed herein is the shallowest-possible convolutional neural network (CNN) that uses the facial action unit (FAU) features. The CNN exploits sequential context but, unlike RNNs, also inherently allows data- and process-parallelism. Interestingly, for the most part, these white-box AI models have been shown to utilise the provided features in a manner consistent with the human perception of emotion expression.
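For concreteness, the standard population-level identities below relate the two utility functions; the definitions of CCC and MSE are textbook, and only the |MSE_{XY}/σ_{XY}| cost itself is the thesis's contribution.

```latex
% Standard population-level definitions linking CCC and MSE:
\mathrm{CCC}(X,Y) = \frac{2\sigma_{XY}}{\sigma_X^{2} + \sigma_Y^{2} + (\mu_X - \mu_Y)^{2}},
\qquad
\mathrm{MSE}(X,Y) = (\mu_X - \mu_Y)^{2} + \sigma_X^{2} + \sigma_Y^{2} - 2\sigma_{XY}
\;\Longrightarrow\;
\mathrm{CCC} = \frac{2\sigma_{XY}}{\mathrm{MSE} + 2\sigma_{XY}}.
```

Because sequences with the same MSE can differ in σ_{XY} (and hence in CCC), and sequences with the same CCC can differ in MSE, the mapping between the two measures is many-to-many, as noted above.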