Physics-guided adversarial networks for artificial digital image correlation data generation
Digital image correlation (DIC) has become a valuable tool in the evaluation
of mechanical experiments, particularly fatigue crack growth experiments. The
evaluation requires accurate information of the crack path and crack tip
position, which is difficult to obtain due to inherent noise and artefacts.
Machine learning models have been extremely successful in recognizing this
relevant information given labelled DIC displacement data. For the training of
robust models, which generalize well, big data is needed. However, data is
typically scarce in the field of material science and engineering because
experiments are expensive and time-consuming. We present a method to generate
synthetic DIC displacement data using generative adversarial networks with a
physics-guided discriminator. To decide whether data samples are real or fake,
this discriminator additionally receives the derived von Mises equivalent
strain. We show that this physics-guided approach leads to improved results in
terms of visual quality of samples, sliced Wasserstein distance, and geometry
score.
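The extra discriminator input described above, a von Mises equivalent strain derived from the displacement field, can be sketched as follows. This is only an illustration under stated assumptions: small strains, a plane-strain state (out-of-plane strain set to zero), a regular pixel grid, and the deviatoric definition of equivalent strain. The abstract does not say which definition the authors use, and the function name `von_mises_equivalent_strain` is hypothetical.

```python
import numpy as np


def von_mises_equivalent_strain(u, v, dx=1.0, dy=1.0):
    """Equivalent strain from 2D displacement fields (illustrative sketch).

    u, v: 2D arrays of x- and y-displacements, indexed [row (y), column (x)].
    dx, dy: grid spacing in x and y.
    Assumptions (not taken from the source): small strains and plane strain.
    """
    # np.gradient returns derivatives along axis 0 (y) first, then axis 1 (x).
    du_dy, du_dx = np.gradient(u, dy, dx)
    dv_dy, dv_dx = np.gradient(v, dy, dx)

    # Small-strain tensor components.
    e_xx = du_dx
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)

    # Deviatoric part, with e_zz = 0 by the plane-strain assumption.
    mean = (e_xx + e_yy) / 3.0
    d_xx = e_xx - mean
    d_yy = e_yy - mean
    d_zz = -mean

    # eps_eq = sqrt(2/3 * dev : dev); the shear term appears twice in the sum.
    return np.sqrt(2.0 / 3.0 * (d_xx**2 + d_yy**2 + d_zz**2 + 2.0 * e_xy**2))
```

In a setup like the one described, the resulting array could be stacked with the two displacement channels before being passed to the discriminator; how the authors actually combine them is not stated in the abstract.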
Siamese DETR
Recent self-supervised methods are mainly designed for representation
learning with the base model, e.g., ResNets or ViTs. They cannot be easily
transferred to DETR, with task-specific Transformer modules. In this work, we
present Siamese DETR, a Siamese self-supervised pretraining approach for the
Transformer architecture in DETR. We consider learning view-invariant and
detection-oriented representations simultaneously through two complementary
tasks, i.e., localization and discrimination, in a novel multi-view learning
framework. Two self-supervised pretext tasks are designed: (i) Multi-View
Region Detection aims at learning to localize regions-of-interest between
augmented views of the input, and (ii) Multi-View Semantic Discrimination
attempts to improve object-level discrimination for each region. The proposed
Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL
VOC detection using different DETR variants in all setups. Code is available at
https://github.com/Zx55/SiameseDETR.
Comment: 10 pages, 11 figures. Accepted in CVPR 202
DASS Good: Explainable Data Mining of Spatial Cohort Data
Developing applicable clinical machine learning models is a difficult task
when the data includes spatial information, for example, radiation dose
distributions across adjacent organs at risk. We describe the co-design of a
modeling system, DASS, to support the hybrid human-machine development and
validation of predictive models for estimating long-term toxicities related to
radiotherapy doses in head and neck cancer patients. Developed in collaboration
with domain experts in oncology and data mining, DASS incorporates
human-in-the-loop visual steering, spatial data, and explainable AI to augment
domain knowledge with automatic data mining. We demonstrate DASS with the
development of two practical clinical stratification models and report feedback
from domain experts. Finally, we describe the design lessons learned from this
collaborative experience.
Comment: 10 pages, 9 figures
CLIP-S: Language-Guided Self-Supervised Semantic Segmentation
Existing semantic segmentation approaches are often limited by costly
pixel-wise annotations and predefined classes. In this work, we present
CLIP-S that leverages self-supervised pixel representation learning and
vision-language models to enable various semantic segmentation tasks (e.g.,
unsupervised, transfer learning, language-driven segmentation) without any
human annotations or unknown class information. We first learn pixel
embeddings with pixel-segment contrastive learning from different augmented
views of images. To further improve the pixel embeddings and enable
language-driven semantic segmentation, we design two types of consistency
guided by vision-language models: 1) embedding consistency, aligning our pixel
embeddings to the joint feature space of a pre-trained vision-language model,
CLIP; and 2) semantic consistency, forcing our model to make the same
predictions as CLIP over a set of carefully designed target classes with both
known and unknown prototypes. Thus, CLIP-S enables a new task of class-free
semantic segmentation where no unknown class information is needed during
training. As a result, our approach shows consistent and substantial
performance improvement on four popular benchmarks compared with the
state-of-the-art unsupervised and language-driven semantic segmentation
methods. More importantly, our method outperforms these methods on unknown
class recognition by a large margin.
Comment: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 202
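The embedding consistency idea in the abstract above, aligning pixel embeddings with the feature space of a pre-trained vision-language model, can be illustrated with a cosine-distance loss. This is a stand-in chosen for illustration; the abstract does not give the actual loss, and the function name `embedding_consistency_loss` and the array shapes are assumptions.

```python
import numpy as np


def embedding_consistency_loss(pixel_emb, clip_emb, eps=1e-8):
    """Mean cosine distance between paired embeddings (illustrative sketch).

    pixel_emb, clip_emb: arrays of shape (N, D), one row per pixel or segment.
    Assumption: targets are precomputed features from a frozen model such as CLIP.
    """
    # L2-normalise each row so the dot product equals cosine similarity.
    p = pixel_emb / (np.linalg.norm(pixel_emb, axis=-1, keepdims=True) + eps)
    c = clip_emb / (np.linalg.norm(clip_emb, axis=-1, keepdims=True) + eps)
    cosine = np.sum(p * c, axis=-1)
    # 0 when embeddings point the same way, 2 when they point in opposite directions.
    return float(np.mean(1.0 - cosine))
```

Minimising such a term pulls each learned embedding toward its target direction while ignoring magnitude, which is one common way to align two feature spaces.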
Colour technologies for content production and distribution of broadcast content
Colour reproduction has long been a priority driving the development of new colour imaging systems that maximise human perceptual plausibility. This thesis explores machine learning algorithms for colour processing to assist both content production and distribution. First, this research studies colourisation technologies with practical use cases in restoration and processing of archived content. The research targets practical, deployable solutions, developing a cost-effective pipeline which integrates the activity of the producer into the processing workflow. In particular, a fully automatic image colourisation paradigm using Conditional GANs is proposed to improve content generalisation and colourfulness of existing baselines. Moreover, a more conservative solution is considered by providing references to guide the system towards more accurate colour predictions. A fast end-to-end architecture is proposed to improve existing exemplar-based image colourisation methods while decreasing the complexity and runtime. Finally, the proposed image-based methods are integrated into a video colourisation pipeline. A general framework is proposed to reduce the generation of temporal flickering or propagation of errors when such methods are applied frame-to-frame. The proposed model is jointly trained to stabilise the input video and to cluster its frames with the aim of learning scene-specific modes. Second, this research explores colour processing technologies for content distribution with the aim of effectively delivering the processed content to a broad audience. In particular, video compression is tackled by introducing a novel methodology for chroma intra prediction based on attention models. Although the proposed architecture helped to gain control over the reference samples and better understand the prediction process, the complexity of the underlying neural network significantly increased the encoding and decoding time.
Therefore, aiming at efficient deployment within the latest video coding standards, this work also focuses on the simplification of the proposed architecture to obtain a more compact and explainable model.
Security and Privacy Problems in Voice Assistant Applications: A Survey
Voice assistant applications have become ubiquitous nowadays. Two models that
provide the two most important functions for real-life applications (e.g.,
Google Home, Amazon Alexa, and Siri) are Automatic Speech Recognition (ASR)
models and Speaker Identification (SI) models. According to recent studies,
security and privacy threats have also emerged with the rapid development of
the Internet of Things (IoT). The security issues researched include attack
techniques toward machine learning models and other hardware components widely
used in voice assistant applications. The privacy issues include technical
information theft and policy-level privacy breaches. Voice assistant
applications take a steadily growing market share every year, but their privacy
and security issues continue to cause huge economic losses and endanger
users' sensitive personal information. Thus, it is important to have a
comprehensive survey to outline the categorization of the current research
regarding the security and privacy problems of voice assistant applications.
This paper summarizes and assesses five kinds of security attacks and three
types of privacy threats in the papers published in the top-tier conferences of
the cyber security and voice domains.
Comment: 5 figures
The Metaverse: Survey, Trends, Novel Pipeline Ecosystem & Future Directions
The Metaverse offers a second world beyond reality, where boundaries are
non-existent, and possibilities are endless through engagement and immersive
experiences using virtual reality (VR) technology. Many disciplines can
benefit from the advancement of the Metaverse when accurately developed,
including the fields of technology, gaming, education, art, and culture.
Nevertheless, developing the Metaverse environment to its full potential is an
ambiguous task that needs proper guidance and directions. Existing surveys on
the Metaverse focus only on a specific aspect and discipline of the Metaverse
and lack a holistic view of the entire process. To this end, a more holistic,
multi-disciplinary, in-depth, and academic and industry-oriented review is
required to provide a thorough study of the Metaverse development pipeline. To
address these issues, we present in this survey a novel multi-layered pipeline
ecosystem composed of (1) the Metaverse computing, networking, communications
and hardware infrastructure, (2) environment digitization, and (3) user
interactions. For every layer, we discuss the components that detail the steps
of its development. Also, for each of these components, we examine the impact
of a set of enabling technologies and empowering domains (e.g., Artificial
Intelligence, Security & Privacy, Blockchain, Business, Ethics, and Social) on
its advancement. In addition, we explain the importance of these technologies
to support decentralization, interoperability, user experiences, interactions,
and monetization. Our presented study highlights the existing challenges for
each component, followed by research directions and potential solutions. To the
best of our knowledge, this survey is the most comprehensive and allows users,
scholars, and entrepreneurs to get an in-depth understanding of the Metaverse
ecosystem to find their opportunities and potentials for contribution.
One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era
OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is
demonstrated to be one small step for generative AI (GAI), but one giant leap
for artificial general intelligence (AGI). Since its official release in
November 2022, ChatGPT has quickly attracted numerous users with extensive
media coverage. Such unprecedented attention has also motivated numerous
researchers to investigate ChatGPT from various aspects. According to Google
Scholar, there are more than 500 articles with ChatGPT in their titles or
mentioning it in their abstracts. Considering this, a review is urgently
needed, and our work fills this gap. Overall, this work is the first to survey
ChatGPT with a comprehensive review of its underlying technology, applications,
and challenges. Moreover, we present an outlook on how ChatGPT might evolve to
realize general-purpose AIGC (a.k.a. AI-generated content), which will be a
significant milestone for the development of AGI.
Comment: A Survey on ChatGPT and GPT-4, 29 pages. Feedback is appreciated
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2D
images to 3D world understanding is a promising direction that researchers have
only recently started to explore. In this work, we design a novel 3D
pre-training Vision-Language method that helps a model learn semantically
meaningful and transferable 3D scene point cloud representations. We inject the
representational power of the popular CLIP model into our 3D encoder by
aligning the encoded 3D scene features with the corresponding 2D image and text
embeddings produced by CLIP. To assess our model's 3D world reasoning
capability, we evaluate it on the downstream task of 3D Visual Question
Answering. Experimental quantitative and qualitative results show that our
pre-training method outperforms state-of-the-art works in this task and leads
to an interpretable representation of 3D scene features.
Comment: CVPRW 2023. Code will be made publicly available:
https://github.com/AlexDelitzas/3D-VQ
Audio-Visual Automatic Speech Recognition Towards Education for Disabilities
Education is a fundamental right that enriches everyone’s life. However, physically challenged people are often debarred from the general and advanced education system. An Audio-Visual Automatic Speech Recognition (AV-ASR) based system can improve the education of physically challenged people by providing hands-free computing, since they can communicate with the learning system through AV-ASR. However, it is challenging to track the lips correctly for the visual modality. Thus, this paper addresses appearance-based visual features along with co-occurrence statistical measures for visual speech recognition. Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) and Grey-Level Co-occurrence Matrix (GLCM) features are proposed for visual speech information. The experimental results show that the proposed system achieves 76.60 % accuracy for visual speech and 96.00 % accuracy for audio speech recognition.
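As an illustration of the co-occurrence statistics mentioned above, the following sketch builds a grey-level co-occurrence matrix for one pixel offset and derives the standard contrast feature from it. It is not the authors' pipeline: the offset, the number of grey levels, the lip-region extraction, and the LBP-TOP stage are not given in the abstract, and the names `glcm` and `glcm_contrast` are hypothetical.

```python
import numpy as np


def glcm(image, levels, dy=0, dx=1):
    """Count co-occurring grey-level pairs at a non-negative offset (dy, dx).

    image: 2D integer array with values in [0, levels).
    Returns an unnormalised, non-symmetric (levels x levels) count matrix.
    """
    image = np.asarray(image)
    rows, cols = image.shape
    # Reference pixels and their neighbours shifted by (dy, dx).
    ref = image[:rows - dy, :cols - dx]
    nbr = image[dy:, dx:]
    counts = np.zeros((levels, levels), dtype=np.int64)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)
    return counts


def glcm_contrast(counts):
    """Contrast feature: expected squared grey-level difference of a pair."""
    p = counts / counts.sum()
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))
```

Features such as contrast, computed over a mouth region in each frame, are the kind of statistical measure a GLCM provides; which features the paper uses is not stated in the abstract.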