70 research outputs found

    Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities

    Recent advances in artificial general intelligence (AGI), particularly large language models and creative image generation systems, have demonstrated impressive capabilities on diverse tasks spanning the arts and humanities. However, the swift evolution of AGI has also raised critical questions about its responsible deployment in these culturally significant domains traditionally seen as profoundly human. This paper provides a comprehensive analysis of the applications and implications of AGI for text, graphics, audio, and video pertaining to arts and the humanities. We survey cutting-edge systems and their usage in areas ranging from poetry to history, marketing to film, and communication to classical art. We outline substantial concerns pertaining to factuality, toxicity, biases, and public safety in AGI systems, and propose mitigation strategies. The paper argues for multi-stakeholder collaboration to ensure AGI promotes creativity, knowledge, and cultural values without undermining truth or human dignity. Our timely contribution summarizes a rapidly developing field, highlighting promising directions while advocating for responsible progress centering on human flourishing. The analysis lays the groundwork for further research on aligning AGI's technological capacities with enduring social goods.

    Investigating Social Interactions Using Multi-Modal Nonverbal Features

    Every day, humans are involved in social situations and interplays, with the goal of sharing emotions and thoughts, establishing relationships with or acting on other human beings. These interactions are possible thanks to what is called social intelligence, which is the ability to express and recognize social signals produced during the interactions. These signals aid the information exchange and are expressed through verbal and non-verbal behavioral cues, such as facial expressions, gestures, body pose or prosody. Recently, many works have demonstrated that social signals can be captured and analyzed by automatic systems, giving birth to a relatively new research area called social signal processing, which aims at replicating human social intelligence with machines. In this thesis, we explore the use of behavioral cues and computational methods for modeling and understanding social interactions. Concretely, we focus on several behavioral cues in three specific contexts: first, we analyze the relationship between gaze and leadership in small group interactions. Second, we expand our analysis to face and head gestures in the context of deception detection in dyadic interactions. Finally, we analyze the whole body for group detection in mingling scenarios.

    Preference Modeling in Data-Driven Product Design: Application in Visual Aesthetics

    Creating a form that is attractive to the intended market audience is one of the greatest challenges in product development given the subjective nature of preference and heterogeneous market segments with potentially different product preferences. Accordingly, product designers use a variety of qualitative and quantitative research tools to assess product preferences across market segments, such as design theme clinics, focus groups, customer surveys, and design reviews; however, these tools are still limited due to their dependence on subjective judgment and their time and resource intensity. In this dissertation, we focus on a key research question: how can we understand and predict more reliably the preference for a future product in heterogeneous markets, so that this understanding can inform designers' decision-making? We present a number of data-driven approaches to model product preference. Instead of depending on subjective human judgment, the proposed preference models investigate the mathematical patterns behind users’ choice and behavior. This allows a more objective translation of customers' perception and preference into analytical relations that can inform design decision-making. Moreover, these models are scalable in that they have the capacity to analyze large-scale data and model customer heterogeneity accurately across market segments. In particular, we use feature representation as an intermediate step in our preference model, so that we can not only increase the predictive accuracy of the model but also capture in-depth insight into customers' preference. We tested our data-driven approaches with applications in visual aesthetics preference. Our results show that the proposed approaches can obtain an objective measurement of aesthetic perception and preference for a given market segment. This measurement enables designers to reliably evaluate and predict the aesthetic appeal of their designs.
We also quantify the relative importance of aesthetic attributes when both aesthetic attributes and functional attributes are considered by customers. This quantification has great utility in helping product designers and executives in design reviews and selection of designs. Moreover, we visualize the possible factors affecting customers' perception of product aesthetics and how these factors differ across different market segments. Those visualizations are important to designers as they relate physical design details to psychological customer reactions. The main contribution of this dissertation is to present purely data-driven approaches that enable designers to quantify and interpret product preference more reliably. Methodological contributions include using modern probabilistic approaches and feature learning algorithms to quantitatively model the design process involving product aesthetics. These novel approaches can not only increase the predictive accuracy but also capture insights to inform design decision-making. PhD, Design Science, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145987/1/yanxinp_1.pd

    A textual and visual features-jointly driven hybrid intelligent system for digital physical education teaching quality evaluation

    The utilization of intelligent computing in digital teaching quality evaluation has been a practical demand in smart cities. Currently, related research works can be categorized into two types: textual data-based approaches and visual data-based approaches. Due to the gap between their different formats and modalities, it remains very challenging to integrate them when conducting digital teaching quality evaluation. In fact, the two types of information can both reflect distinct knowledge from their own perspectives. To bridge this gap, this paper proposes a textual and visual features-jointly driven hybrid intelligent system for digital teaching quality evaluation. Visual features are extracted with a multiscale convolutional neural network that introduces receptive fields of different sizes. Textual features serve as auxiliary content for the major visual features, and are extracted using a recurrent neural network. Finally, we implement the proposed method in simulation experiments to evaluate its practical running performance, using a real-world dataset collected from teaching activities. The experimental results reveal that the hybrid intelligent system developed in this paper can bring more than 10% improvement in efficiency for digital teaching quality evaluation.

    Modeling Visual Rhetoric and Semantics in Multimedia

    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story.
We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.

    Learning Biosignals with Deep Learning

    The healthcare system, which is widely recognized as one of the most influential systems in society, has been facing new challenges since the start of the decade. The myriad of physiological data generated by individuals, namely in the healthcare system, is placing a burden on physicians and reducing the effectiveness of patient data collection. Information systems and, in particular, novel deep learning (DL) algorithms have been offering a way to tackle this problem. This thesis aims to have an impact on biosignal research and industry by presenting DL solutions that could empower this field. For this purpose, an extensive study of how to incorporate and implement Convolutional Neural Networks (CNN), Recursive Neural Networks (RNN) and Fully Connected Networks in biosignal studies is discussed. Different architecture configurations were explored for signal processing and decision making and were implemented in three different scenarios: (1) biosignal learning and synthesis; (2) electrocardiogram (ECG) biometric systems; and (3) ECG anomaly detection systems. In (1), an RNN-based architecture was able to replicate autonomously three types of biosignals with a high degree of confidence. As for (2), three CNN-based architectures and an RNN-based architecture (the same used in (1)) were used for both biometric identification, reaching values above 90% for electrode-based datasets (Fantasia, ECG-ID and MIT-BIH) and 75% for the off-person dataset (CYBHi), and biometric authentication, achieving Equal Error Rates (EER) of near 0% for Fantasia and MIT-BIH and below 4% for CYBHi.
As for (3), the abstraction of the healthy, clean ECG signal and the detection of its deviation were made and tested in two different scenarios: the presence of noise, using an autoencoder and a fully connected network (reaching 99% accuracy for binary classification and 71% for multi-class); and arrhythmia events, by adding an RNN to the previous architecture (57% accuracy and 61% sensitivity). In sum, these systems are shown to be capable of producing novel results. The incorporation of several AI systems into one could prove to be the next generation of preventive medicine: because the machines have access to different physiological and anatomical states, they could produce more informed solutions for the issues that one may face in the future, increasing the performance of autonomous prevention systems that could be used in everyday life in remote places where access to medicine is limited. These systems will also help the study of signal behaviour and of how signals are produced in real-life contexts, as explainable AI could trigger this perception and link the inner states of a network with the biological traits.
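The Equal Error Rate (EER) reported for biometric authentication in the abstract above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how such a figure can be estimated from match scores; it is not taken from the thesis. The function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from similarity scores (higher = more likely genuine).

    Hypothetical helper: scans every observed score as a candidate threshold
    and returns the point where false acceptance and false rejection rates
    are closest, together with that threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    best_gap, eer, best_thr = np.inf, 1.0, thresholds[0]
    for thr in thresholds:
        far = np.mean(impostor >= thr)  # impostors wrongly accepted
        frr = np.mean(genuine < thr)    # genuine users wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return float(eer), float(best_thr)

# Toy scores (made up for illustration, not data from the thesis).
eer, thr = equal_error_rate([0.2, 0.6, 0.8, 0.9], [0.1, 0.3, 0.4, 0.7])
print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

With finite score sets the two error rates rarely cross exactly, so this sketch averages them at the closest point; interpolating along the ROC curve is a common alternative.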

    Language Grounding in Massive Online Data

    Automatic understanding of multimodal content for Web-based learning

    Web-based learning has become an integral part of everyday life for all ages and backgrounds. On the one hand, the advantages of this learning type, such as availability, accessibility, flexibility, and cost, are apparent. On the other hand, the oversupply of content can lead to learners struggling to find optimal resources efficiently. The interdisciplinary research field Search as Learning is concerned with the analysis and improvement of Web-based learning processes, both on the learner and the computer science side. So far, automatic approaches that assess and recommend learning resources in Search as Learning (SAL) focus on textual, resource, and behavioral features. However, these approaches commonly ignore multimodal aspects. This work addresses this research gap by proposing several approaches that address the question of how multimodal retrieval methods can help support learning on the Web. First, we evaluate whether textual metadata of the TIB AV-Portal can be exploited and enriched by semantic word embeddings to generate video recommendations and, in addition, a video summarization technique to improve exploratory search. Then we turn to the challenging task of knowledge gain prediction that estimates the potential learning success given a specific learning resource. We used data from two user studies for our approaches. The first one observes the knowledge gain when learning with videos in a Massive Open Online Course (MOOC) setting, while the second one provides an informal Web-based learning setting where the subjects have unrestricted access to the Internet. We then extend the purely textual features to include visual, audio, and cross-modal features for a holistic representation of learning resources. By correlating these features with the achieved knowledge gain, we can estimate the impact of a particular learning resource on learning success. 
We further investigate the influence of multimodal data on the learning process by examining how the combination of visual and textual content generally conveys information. For this purpose, we draw on work from linguistics and visual communications, which has investigated the relationship between image and text by means of different metrics and categorizations for several decades. We concretize these metrics to make them usable for machine learning purposes. This process includes the derivation of semantic image-text classes from these metrics. We evaluate all proposals with comprehensive experiments and discuss their impacts and limitations at the end of the thesis.

    Adaptation of speech recognition systems to selected real-world deployment conditions

    This habilitation thesis deals with adaptation of automatic speech recognition (ASR) systems to selected real-world deployment conditions. It is presented in the form of a collection of twelve articles dealing with this task; I am the main author or a co-author of these articles. They were published during my work on several consecutive research projects.
I participated in these projects as a member of the research team as well as the investigator or a co-investigator. These articles can be divided into three main groups according to their topics. They have in common the effort to adapt a particular ASR system to a specific factor or deployment condition that affects its function or accuracy. The first group of articles is focused on an unsupervised speaker adaptation task, where the ASR system adapts its parameters to the specific voice characteristics of one particular speaker. The second part deals with a) methods allowing the system to identify non-speech events on the input, and b) the related task of recognition of speech with non-speech events, particularly music, in the background. Finally, the third part is devoted to the methods that allow the transcription of an audio signal containing multilingual utterances. It includes a) approaches for adapting the existing recognition system to a new language and b) methods for identification of the language from the audio signal. The two identification tasks mentioned are investigated in particular under the demanding and less explored frame-wise scenario, which is the only one suitable for processing on-line data streams.

    Time- and value-continuous explainable affect estimation in-the-wild

    Today, the relevance of Affective Computing, i.e., of making computers recognise and simulate human emotions, cannot be overstated. All technology giants (from manufacturers of laptops to mobile phones to smart speakers) are in fierce competition to make their devices understand not only what is being said, but also how it is being said, in order to recognise the user’s emotions. The goals have evolved from predicting the basic emotions (e.g., happy, sad) to predicting more nuanced affective states (e.g., relaxed, bored) in real time. The databases used in such research have also evolved, from featuring acted behaviours to featuring spontaneous behaviours. A more powerful recent shift is in-the-wild affect recognition, i.e., taking the research out of the laboratory and into the uncontrolled real world. This thesis discusses, for the very first time, affect recognition for two unique in-the-wild audiovisual databases, GRAS2 and SEWA. GRAS2 is the only database to date with time- and value-continuous affect annotations for Labov effect-free affective behaviours, i.e., behaviours recorded without the participant’s awareness of being recorded (an awareness which is otherwise known to affect the naturalness of one’s affective behaviour). SEWA features participants from six different cultural backgrounds, conversing using a video-calling platform. Thus, SEWA features in-the-wild recordings further corrupted by unpredictable artifacts, such as network-induced delays, frame-freezing and echoes. The two databases present a unique opportunity to study time- and value-continuous affect estimation that is truly in-the-wild. A novel ‘Evaluator Weighted Estimation’ formulation is proposed to generate a gold standard sequence from several annotations.
An illustration is presented demonstrating that the moving bag-of-words (BoW) representation better preserves the temporal context of the features, while remaining more robust against outliers than other statistical summaries, e.g., the moving average. A novel, data-independent randomised codebook is proposed for the BoW representation; it is especially useful for cross-corpus model generalisation testing when the feature spaces of the databases differ drastically. Various deep learning models and support vector regressors are used to predict affect dimensions time- and value-continuously. Better generalisability of the models trained on GRAS2, despite the smaller training size, makes a strong case for the collection and use of Labov effect-free data. A further foundational contribution is the discovery of the missing many-to-many mapping between the mean square error (MSE) and the concordance correlation coefficient (CCC), i.e., between two of the most popular utility functions to date. The newly invented cost function |MSE_{XY}/σ_{XY}| has been evaluated in the experiments aimed at demystifying the inner workings of a well-performing, simple, low-cost neural network effectively utilising the BoW text features. Also proposed herein is the shallowest-possible convolutional neural network (CNN) that uses the facial action unit (FAU) features. The CNN exploits sequential context but, unlike RNNs, also inherently allows data- and process-parallelism. Interestingly, for the most part, these white-box AI models have been shown to utilise the provided features consistently with the human perception of emotion expression.
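The relationship between MSE and CCC mentioned in the abstract above can be illustrated with the standard textbook definitions. The sketch below is not the thesis's derivation; it only checks the well-known identity CCC = 2σ_XY / (MSE + 2σ_XY), which holds when population (not sample) variances and covariance are used, and which suggests why a ratio of MSE to σ_XY is a natural quantity to study. The function names and the toy sequences are assumptions made for illustration.

```python
import numpy as np

def mse(x, y):
    """Mean square error between two equally long sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))

def covariance(x, y):
    """Population covariance (divides by N, matching np.var's default)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((x - x.mean()) * (y - y.mean())))

def ccc(x, y):
    """Concordance correlation coefficient from its usual definition."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = x.var() + y.var() + (x.mean() - y.mean()) ** 2
    return float(2.0 * covariance(x, y) / denom)

def ccc_from_mse(x, y):
    """Same quantity written in terms of MSE and the covariance.

    Uses MSE = var(x) + var(y) - 2*cov(x, y) + (mean(x) - mean(y))**2,
    so the CCC denominator equals MSE + 2*cov(x, y).
    """
    cov_xy = covariance(x, y)
    return float(2.0 * cov_xy / (mse(x, y) + 2.0 * cov_xy))

# Toy gold-standard and prediction sequences (made up for illustration).
gold = [0.10, 0.40, 0.35, 0.80, 0.65]
pred = [0.20, 0.30, 0.45, 0.70, 0.60]
print(ccc(gold, pred), ccc_from_mse(gold, pred))
```

Rearranged, the identity reads CCC = 1 / (1 + MSE / (2σ_XY)) whenever σ_XY is non-zero, so for a fixed covariance a lower MSE implies a higher CCC, while the same MSE can correspond to different CCC values when the covariance differs.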