Deep Recurrent Learning for Efficient Image Recognition Using Small Data
Recognition is a fundamental yet open and challenging problem in computer vision. Recognition involves the detection and interpretation of complex shapes of objects or persons from previous encounters or knowledge. Biological systems are considered the most powerful, robust and generalized recognition models. The recent success of learning-based mathematical models known as artificial neural networks, especially deep neural networks, has propelled researchers to utilize such architectures for developing bio-inspired computational recognition models. However, the computational complexity of these models increases proportionally to the challenges posed by the recognition problem, and more importantly, these models require a large amount of data for successful learning. Additionally, the feedforward-based hierarchical models do not exploit another important biological learning paradigm, known as recurrency, which is ubiquitous in the biological visual system and has been shown to be crucial for recognition.
Consequently, this work aims to develop novel biologically relevant deep recurrent learning models for robust recognition using limited training data. First, we design an efficient deep simultaneous recurrent network (DSRN) architecture for solving several challenging image recognition tasks. The use of simultaneous recurrency in the proposed model improves the recognition performance and offers reduced computational complexity compared to the existing hierarchical deep learning models. Moreover, the DSRN architecture inherently learns meaningful representations of data during the training process, which is essential to achieve superior recognition performance. However, probabilistic models such as deep generative models are particularly adept at learning representations directly from unlabeled input data. Accordingly, we show the generalization of the proposed deep simultaneous recurrency concept by developing a probabilistic deep simultaneous recurrent belief network (DSRBN) architecture, which is more efficient in learning the underlying representation of the data compared to the state-of-the-art generative models. Finally, we propose a deep recurrent learning framework for solving the image recognition task using small data. We incorporate Bayesian statistics into the DSRBN generative model to propose a deep recurrent generative Bayesian model that addresses the challenge of learning from a small amount of data. Our findings suggest that the proposed deep recurrent Bayesian framework demonstrates better image recognition performance compared to the state-of-the-art models in a small data learning scenario. In conclusion, this dissertation proposes novel deep recurrent learning pipelines, which not only utilize limited training data to achieve improved image recognition performance but also require significantly fewer trainable parameters.
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision on the development
path of human-machine symbiotic art creation. We propose a classification
of the creative system with a hierarchy of 5 classes, showing the pathway of
creativity evolving from a mimic-human artist (Turing Artists) to a Machine
artist in its own right. We begin with an overview of the limitations of the
Turing Artists, then focus on the top two-level systems, Machine Artists,
emphasizing machine-human communication in art creation. In art creation, it is
necessary for machines to understand humans' mental states, including desires,
appreciation, and emotions; humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environments
and their further evolution into the metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate the novel way for art data collection
to constitute the base of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
AI-based symbiotic art form and community with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed.
Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
Recent advances in artificial general intelligence (AGI), particularly large
language models and creative image generation systems, have demonstrated
impressive capabilities on diverse tasks spanning the arts and humanities.
However, the swift evolution of AGI has also raised critical questions about
its responsible deployment in these culturally significant domains
traditionally seen as profoundly human. This paper provides a comprehensive
analysis of the applications and implications of AGI for text, graphics, audio,
and video pertaining to arts and the humanities. We survey cutting-edge systems
and their usage in areas ranging from poetry to history, marketing to film, and
communication to classical art. We outline substantial concerns pertaining to
factuality, toxicity, biases, and public safety in AGI systems, and propose
mitigation strategies. The paper argues for multi-stakeholder collaboration to
ensure AGI promotes creativity, knowledge, and cultural values without
undermining truth or human dignity. Our timely contribution summarizes a
rapidly developing field, highlighting promising directions while advocating
for responsible progress centering on human flourishing. The analysis lays the
groundwork for further research on aligning AGI's technological capacities with
enduring social goods.
STARNet: Sensor Trustworthiness and Anomaly Recognition via Approximated Likelihood Regret for Robust Edge Autonomy
Complex sensors such as LiDAR, RADAR, and event cameras have proliferated in
autonomous robotics to enhance perception and understanding of the environment.
Meanwhile, these sensors are also vulnerable to diverse failure mechanisms that
can intricately interact with their operating environment. In parallel, the
limited availability of training data on complex sensors also affects the
reliability of their deep learning-based prediction flow, where their
prediction models can fail to generalize to environments not adequately
captured in the training set. To address these reliability concerns, this paper
introduces STARNet, a Sensor Trustworthiness and Anomaly Recognition Network
designed to detect untrustworthy sensor streams that may arise from sensor
malfunctions and/or challenging environments. We specifically benchmark STARNet
on LiDAR and camera data. STARNet employs the concept of approximated
likelihood regret, a gradient-free framework tailored for low-complexity
hardware, especially those with only fixed-point precision capabilities.
Through extensive simulations, we demonstrate the efficacy of STARNet in
detecting untrustworthy sensor streams in unimodal and multimodal settings. In
particular, the network shows superior performance in addressing internal
sensor failures, such as cross-sensor interference and crosstalk. In diverse
test scenarios involving adverse weather and sensor malfunctions, we show that
STARNet enhances prediction accuracy by approximately 10% by filtering out
untrustworthy sensor streams. STARNet is publicly available at
https://github.com/sinatayebati/STARNet
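The filtering step described in this abstract can be illustrated with a minimal sketch. It assumes each sensor stream already has a scalar anomaly score; the score here is only a placeholder and is not the approximated likelihood regret used by STARNet. The function names, the threshold, and the averaging fusion are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional


def filter_trusted_streams(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of streams whose anomaly score does not exceed the threshold.

    The score is a hypothetical stand-in for any trustworthiness measure.
    """
    return sorted(name for name, score in scores.items() if score <= threshold)


def fuse_predictions(predictions: Dict[str, float], trusted: List[str]) -> Optional[float]:
    """Average the predictions of trusted streams only (an assumed fusion rule)."""
    kept = [predictions[name] for name in trusted]
    if not kept:
        return None  # no stream can be trusted, so no prediction is issued
    return sum(kept) / len(kept)


# Hypothetical scores: the camera stream looks anomalous, the LiDAR stream does not.
scores = {"lidar": 0.2, "camera": 3.5}
predictions = {"lidar": 0.9, "camera": 0.1}

trusted = filter_trusted_streams(scores, threshold=1.0)
fused = fuse_predictions(predictions, trusted)
```

With these made-up numbers only the LiDAR stream passes the threshold, so the fused prediction is taken from LiDAR alone.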
Towards Interaction-level Video Action Understanding
A huge number of videos are created, shared, and viewed daily. Among these videos, human actions and activities account for a large part. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction and healthcare. To build a truly intelligent system that is able to interact with humans, video understanding must go beyond simply answering "what is the action in the video" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interactive-level action understanding. This thesis identifies three main challenges to approaching interactive-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task that aims to select informative frames to retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., which parts of an action sequence humans consider essential). For the second challenge, our works on action quality assessment utilize transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, where it yields state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components toward interactive-level action understanding.
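The abstract describes video summarization as selecting informative frames. A minimal sketch of that final selection step follows, assuming per-frame importance scores are already available. In the thesis they would come from the learned model; here they are made-up numbers, and `select_keyframes` is a hypothetical helper, not the author's method.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring frame indices and return them in temporal order."""
    if k <= 0:
        return []
    # Rank frame indices by descending importance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore the original frame order for playback.
    return sorted(ranked[:k])


# Hypothetical importance scores for a five-frame clip.
frame_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
summary = select_keyframes(frame_scores, k=2)
```

Re-sorting the chosen indices keeps the summary in the order in which the frames appear in the video.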
Advancements in Medical Imaging and Diagnostics with Deep Learning Technologies
Medical imaging has long been a cornerstone of diagnostic medicine, providing clinicians with a non-invasive method to visualize internal structures and processes. However, traditional imaging techniques have faced challenges in resolution, safety concerns related to radiation exposure, and the need for invasive procedures for clearer visualization. With the advent of deep learning technologies, significant advancements have been made in the field of medical imaging, addressing many of these challenges and introducing new capabilities. This research delves into the integration of deep learning in enhancing image resolution, leading to clearer and more detailed visualizations. Furthermore, the ability to reconstruct three-dimensional images from traditional two-dimensional scans offers a more comprehensive view of the area under examination. Automated analysis powered by deep learning algorithms not only speeds up the diagnostic process but also detects anomalies that might be overlooked by the human eye. Predictive analysis, based on these enhanced images, can forecast the likelihood of diseases, and real-time analysis during surgeries ensures immediate feedback, enhancing the precision of medical procedures. Safety in medical imaging has also seen improvements. Techniques powered by deep learning require lower radiation doses, minimizing risks to patients. Additionally, the enhanced clarity and detail in images reduce the need for invasive procedures, further ensuring patient safety. The integration of imaging data with Electronic Health Records (EHR) has paved the way for personalized care recommendations, tailoring treatments based on individual patient history and current diagnostics. Lastly, the role of deep learning extends to medical education, where it aids in creating realistic simulations and models, equipping medical professionals with better training tools.
ChatAnything: Facetime Chat with LLM-Enhanced Personas
In this technical report, we target generating anthropomorphized personas for
LLM-based characters in an online manner, including visual appearance,
personality and tones, with only text descriptions. To achieve this, we first
leverage the in-context learning capability of LLMs for personality generation
by carefully designing a set of system prompts. We then propose two novel
concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for
diverse voice and appearance generation. For MoV, we utilize the text-to-speech
(TTS) algorithms with a variety of pre-defined tones and select the most
matching one based on the user-provided text description automatically. For
MoD, we combine the recent popular text-to-image generation techniques and
talking head algorithms to streamline the process of generating talking
objects. We term the whole framework ChatAnything. With it, users can animate
anything with any anthropomorphic persona using just
a few text inputs. However, we have observed that the anthropomorphic objects
produced by current generative models are often undetectable by pre-trained
face landmark detectors, leading to failure of face motion generation even
when these faces have human-like appearances, because such images are rarely
seen during training (i.e., they are OOD samples). To address this issue, we
incorporate pixel-level guidance to infuse human face landmarks during the
image generation phase. To benchmark these metrics, we have built an evaluation
dataset. Based on it, we verify that the face landmark detection rate
increases significantly from 57.0% to 92.5%, thus allowing automatic face
animation based on generated speech content. The code and more results can be
found at https://chatanything.github.io/
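The mixture-of-voices step above selects the pre-defined tone that best matches the user's text description. A minimal sketch of such a selection follows. It assumes each tone carries a short text label and uses plain word overlap (Jaccard similarity) as the matching rule. The tone names, the labels, and the similarity measure are all hypothetical; the report does not state how ChatAnything performs the match.

```python
import re
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tone(description: str, tones: Dict[str, str]) -> str:
    """Return the tone whose label shares the most words with the description."""
    desc_words = tokenize(description)

    def jaccard(label: str) -> float:
        label_words = tokenize(label)
        union = desc_words | label_words
        return len(desc_words & label_words) / len(union) if union else 0.0

    return max(tones, key=lambda name: jaccard(tones[name]))


# Hypothetical pool of pre-defined tones with text labels.
tones = {
    "warm_female": "warm gentle friendly female voice",
    "deep_male": "deep calm male narrator voice",
    "robotic": "flat synthetic robotic voice",
}
chosen = select_tone("a friendly and gentle talking teapot with a warm voice", tones)
```

A practical system would more likely compare learned text embeddings than raw word overlap; the overlap rule is used here only to keep the sketch self-contained.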
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. MMML models, as the next generation of
intelligent design tools, hold a promising future to impact how products are
designed.