Towards Autonomous Selective Harvesting: A Review of Robot Perception, Robot Design, Motion Planning and Control
This paper provides an overview of the current state-of-the-art in selective
harvesting robots (SHRs) and their potential for addressing the challenges of
global food production. SHRs have the potential to increase productivity,
reduce labour costs, and minimise food waste by selectively harvesting only
ripe fruits and vegetables. The paper discusses the main components of SHRs,
including perception, grasping, cutting, motion planning, and control. It also
highlights the challenges in developing SHR technologies, particularly in the
areas of robot design, motion planning and control. The paper also discusses
the potential benefits of integrating AI and soft robots and data-driven
methods to enhance the performance and robustness of SHR systems. Finally, the
paper identifies several open research questions in the field and highlights
the need for further research and development efforts to advance SHR
technologies to meet the challenges of global food production. Overall, this
paper provides a starting point for researchers and practitioners interested in
developing SHRs and highlights the need for more research in this field.
Comment: Preprint: to be appeared in Journal of Field Robotics
Modularizing and Assembling Cognitive Map Learners via Hyperdimensional Computing
Biological organisms must learn how to control their own bodies to achieve
deliberate locomotion, that is, predict their next body position based on their
current position and selected action. Such learning is goal-agnostic with
respect to maximizing (minimizing) an environmental reward (penalty) signal. A
cognitive map learner (CML) is a collection of three separate yet
collaboratively trained artificial neural networks which learn to construct
representations for the node states and edge actions of an arbitrary
bidirectional graph. In so doing, a CML learns how to traverse the graph nodes;
however, the CML does not learn when and why to move from one node state to
another. This work created CMLs with node states expressed as high dimensional
vectors suitable for hyperdimensional computing (HDC), a form of symbolic
machine learning (ML). In so doing, graph knowledge (CML) was segregated from
target node selection (HDC), allowing each ML approach to be trained
independently. The first approach used HDC to engineer an arbitrary number of
hierarchical CMLs, where each graph node state specified target node states for
the next lower level CMLs to traverse to. Second, an HDC-based
stimulus-response experience model was demonstrated per CML. Because
hypervectors may be in superposition with each other, multiple experience
models were added together and run in parallel without any retraining. Lastly,
a CML-HDC ML unit was modularized: trained with proxy symbols such that
arbitrary, application-specific stimulus symbols could be operated upon without
retraining either CML or HDC model. These methods provide a template for
engineering heterogeneous ML systems.
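The superposition property mentioned in the abstract above can be illustrated with a minimal sketch. It assumes bipolar (+1/-1) hypervectors, element-wise multiplication for binding and element-wise addition for bundling; these are common HDC conventions, not necessarily the encoding used in the paper, and every name below (`random_hv`, `bind`, `bundle`, `recall`, the stimulus and response symbols) is hypothetical.

```python
import random

DIM = 4096  # assumed hypervector dimensionality (illustrative choice)


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def bundle(*vectors):
    """Superpose hypervectors by element-wise addition."""
    return [sum(values) for values in zip(*vectors)]


def similarity(a, b):
    """Normalized dot product; close to 0 for unrelated random vectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM


def recall(memory, stimulus, responses):
    """Unbind the stimulus from memory and return the best-matching response name.

    For bipolar vectors, binding is its own inverse, so multiplying the memory
    by the stimulus leaves the stored response plus noise from the other pairs.
    """
    noisy = bind(memory, stimulus)
    return max(responses, key=lambda name: similarity(noisy, responses[name]))


rng = random.Random(0)
stimuli = {name: random_hv(rng) for name in ("light", "sound", "touch")}
responses = {name: random_hv(rng) for name in ("approach", "freeze", "retreat")}

# Two independently built stimulus-response experience models (toy data).
model_a = bundle(bind(stimuli["light"], responses["approach"]))
model_b = bundle(
    bind(stimuli["sound"], responses["freeze"]),
    bind(stimuli["touch"], responses["retreat"]),
)

# Adding the models puts them in superposition; nothing is retrained.
combined = bundle(model_a, model_b)

for stimulus_name in stimuli:
    print(stimulus_name, "->", recall(combined, stimuli[stimulus_name], responses))
```

Because random high-dimensional vectors are nearly orthogonal, the cross terms from the other stored pairs act as low-amplitude noise, which is why the combined memory can still be queried one stimulus at a time.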
CoRe-Sleep: A Multimodal Fusion Framework for Time Series Robust to Imperfect Modalities
Sleep abnormalities can have severe health consequences. Automated sleep
staging, i.e. labelling the sequence of sleep stages from the patient's
physiological recordings, could simplify the diagnostic process. Previous work
on automated sleep staging has achieved great results, mainly relying on the
EEG signal. However, often multiple sources of information are available beyond
EEG. This can be particularly beneficial when the EEG recordings are noisy or
even missing completely. In this paper, we propose CoRe-Sleep, a Coordinated
Representation multimodal fusion network that is particularly focused on
improving the robustness of signal analysis on imperfect data. We demonstrate
how appropriately handling multimodal information can be the key to achieving
such robustness. CoRe-Sleep tolerates noisy or missing modality segments,
allowing training on incomplete data. Additionally, it shows state-of-the-art
performance when testing on both multimodal and unimodal data using a single
model on SHHS-1, the largest publicly available study that includes sleep stage
labels. The results indicate that training the model on multimodal data does
positively influence performance when tested on unimodal data. This work aims
at bridging the gap between automated analysis tools and their clinical
utility.
Comment: 10 pages, 4 figures, 2 tables, journal
TransFusionOdom: Interpretable Transformer-based LiDAR-Inertial Fusion Odometry Estimation
Multi-modal fusion of sensors is a commonly used approach to enhance the
performance of odometry estimation, which is also a fundamental module for
mobile robots. However, the question of how to perform fusion among
different modalities in a supervised sensor fusion odometry estimation task
remains a challenging issue. Some simple operations, such as
element-wise summation and concatenation, are not capable of assigning adaptive
attentional weights to incorporate different modalities efficiently, which makes
it difficult to achieve competitive odometry results. Recently, the Transformer
architecture has shown potential for multi-modal fusion tasks, particularly in
the domains of vision with language. In this work, we propose an end-to-end
supervised Transformer-based LiDAR-Inertial fusion framework (namely
TransFusionOdom) for odometry estimation. The multi-attention fusion module
demonstrates different fusion approaches for homogeneous and heterogeneous
modalities to address the overfitting problem that can arise from blindly
increasing the complexity of the model. Additionally, to interpret the learning
process of the Transformer-based multi-modal interactions, a general
visualization approach is introduced to illustrate the interactions between
modalities. Moreover, exhaustive ablation studies evaluate different
multi-modal fusion strategies to verify the performance of the proposed fusion
strategy. A synthetic multi-modal dataset is made public to validate the
generalization ability of the proposed fusion strategy, which also works for
other combinations of different modalities. The quantitative and qualitative
odometry evaluations on the KITTI dataset verify that the proposed TransFusionOdom
achieves superior performance compared with other related works.
Comment: Submitted to IEEE Sensors Journal with some modifications. This work
has been submitted to the IEEE for possible publication. Copyright may be
transferred without notice, after which this version may no longer be
accessible
Copy-paste data augmentation for domain transfer on traffic signs
City streets carry a lot of information that can be exploited to improve the quality of the services citizens receive. For example, autonomous vehicles need to react appropriately to all the elements near the vehicle, such as pedestrians, traffic signs and other vehicles. Such information can also be used for smart city applications, for example to predict and analyze traffic or pedestrian flows.
Among all the objects found in a street, traffic signs are very important because of the information they carry. This information can be exploited both for autonomous driving and for smart city applications. However, deep learning and, more generally, machine learning models need huge quantities of data to learn. Even though modern models are very good at generalizing, the more samples a model has, the better it can generalize across different samples.
Creating these datasets organically, namely with real pictures, is a very tedious task because of the wide variety of signs in use around the world and especially because of the many lighting, orientation and other conditions in which they can appear. In addition, it may not be easy to collect enough samples for every traffic sign, because some of them are very rare.
Instead of collecting pictures manually, it is possible to exploit data augmentation techniques to create synthetic datasets containing the signs that are needed. Creating this data synthetically makes it possible to control the distribution and the conditions of the signs in the datasets, improving the quality and quantity of the training data. This thesis is about using copy-paste data augmentation to create synthetic data for the traffic sign recognition task.
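As a rough illustration of the copy-paste idea described above, the following sketch pastes a masked sign patch onto a background image and returns the resulting bounding-box label. Images are plain nested lists of pixel values to keep the example dependency-free; the function names (`copy_paste`, `random_placement`) and the toy data are assumptions made for illustration and are not taken from the thesis.

```python
import random


def copy_paste(background, patch, mask, top, left):
    """Paste `patch` onto a copy of `background` wherever `mask` is truthy.

    Returns the augmented image and the pasted object's bounding box as
    (x_min, y_min, x_max, y_max), with the max coordinates exclusive.
    """
    height, width = len(patch), len(patch[0])
    if top < 0 or left < 0:
        raise ValueError("placement must be non-negative")
    if top + height > len(background) or left + width > len(background[0]):
        raise ValueError("patch does not fit inside the background")

    augmented = [row[:] for row in background]  # leave the input untouched
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                augmented[top + r][left + c] = patch[r][c]
    return augmented, (left, top, left + width, top + height)


def random_placement(background, patch, rng):
    """Pick a random (top, left) position at which the patch fits entirely."""
    max_top = len(background) - len(patch)
    max_left = len(background[0]) - len(patch[0])
    return rng.randint(0, max_top), rng.randint(0, max_left)


# Toy data: a 6x8 "street" of zeros and a 2x2 "sign" with one transparent pixel.
background = [[0] * 8 for _ in range(6)]
sign = [[9, 9], [9, 9]]
sign_mask = [[1, 1], [0, 1]]

rng = random.Random(42)
top, left = random_placement(background, sign, rng)
image, bbox = copy_paste(background, sign, sign_mask, top, left)
print("bounding box:", bbox)
```

Sampling the placement (and, in a fuller pipeline, scale, rotation and lighting) is what gives control over the distribution of signs and conditions in the synthetic dataset.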
Neural Architecture Search: Insights from 1000 Papers
In the past decade, advances in deep learning have resulted in breakthroughs
in a variety of areas, including computer vision, natural language
understanding, speech recognition, and reinforcement learning. Specialized,
high-performing neural architectures are crucial to the success of deep
learning in these areas. Neural architecture search (NAS), the process of
automating the design of neural architectures for a given task, is an
inevitable next step in automating machine learning and has already outpaced
the best human-designed architectures on many tasks. In the past few years,
research in NAS has been progressing rapidly, with over 1000 papers released
since 2020 (Deng and Lindauer, 2021). In this survey, we provide an organized
and comprehensive guide to neural architecture search. We give a taxonomy of
search spaces, algorithms, and speedup techniques, and we discuss resources
such as benchmarks, best practices, other surveys, and open-source libraries.
Open Set Classification of GAN-based Image Manipulations via a ViT-based Hybrid Architecture
Classification of AI-manipulated content is receiving great attention, for
distinguishing different types of manipulations. Most of the methods developed
so far fail in the open-set scenario, that is when the algorithm used for the
manipulation is not represented by the training set. In this paper, we focus on
the classification of synthetic face generation and manipulation in open-set
scenarios, and propose a method for classification with a rejection option. The
proposed method combines the use of Vision Transformers (ViT) with a hybrid
approach for simultaneous classification and localization. Feature map
correlation is exploited by the ViT module, while a localization branch is
employed as an attention mechanism to force the model to learn per-class
discriminative features associated with the forgery when the manipulation is
performed locally in the image. Rejection is performed by considering several
strategies and analyzing the model output layers. The effectiveness of the
proposed method is assessed for the task of classification of facial attribute
editing and GAN attribution.
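The abstract above does not say which rejection strategies are used. As one commonly used baseline, and purely as an assumption for illustration, the sketch below rejects a sample when the maximum softmax probability falls below a threshold. The label names, the threshold value and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(logits, labels, threshold=0.7):
    """Return (label, confidence), or ("rejected", confidence) for low confidence.

    Assumption: a low maximum softmax probability is treated as evidence that
    the sample comes from a manipulation not represented in the training set.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "rejected", probs[best]
    return labels[best], probs[best]


labels = ["generator_a", "generator_b", "generator_c"]  # hypothetical classes
print(classify_with_rejection([4.0, 0.5, 0.2], labels))  # one class dominates
print(classify_with_rejection([1.0, 0.9, 1.1], labels))  # near-uniform output
```

The threshold trades closed-set accuracy against open-set rejection and would normally be tuned on held-out data.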
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Spurious correlations that degrade model generalization or lead the model to
be right for the wrong reasons are one of the main robustness concerns for
real-world deployments. However, mitigating these correlations during
pre-training for large-scale models can be costly and impractical, particularly
for those without access to high-performance computing resources. This paper
proposes a novel approach to address spurious correlations during fine-tuning
for a given domain of interest. With a focus on multi-modal models (e.g.,
CLIP), the proposed method leverages different modalities in these models to
detect and explicitly set apart spurious attributes from the affected class,
achieved through a multi-modal contrastive loss function that expresses
spurious relationships through language. Our experimental results and in-depth
visualizations on CLIP show that such an intervention can effectively i)
improve the model's accuracy when spurious attributes are not present, and ii)
direct the model's activation maps towards the actual class rather than the
spurious attribute when present. In particular, on the Waterbirds dataset, our
algorithm achieved a worst-group accuracy 23% higher than ERM on CLIP with a
ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while
maintaining the same average accuracy as ERM.
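For readers unfamiliar with the metric reported above, worst-group accuracy is usually computed by grouping test samples by (class label, spurious attribute) and taking the minimum per-group accuracy. The sketch below shows that computation on made-up toy records; the function names and the data are illustrative assumptions and do not reproduce the paper's results.

```python
from collections import defaultdict


def group_accuracies(records):
    """Accuracy per (label, spurious_attribute) group.

    Each record is a tuple (true_label, spurious_attribute, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, attribute, prediction in records:
        group = (label, attribute)
        total[group] += 1
        correct[group] += int(prediction == label)
    return {group: correct[group] / total[group] for group in total}


def worst_group_accuracy(records):
    """Minimum accuracy over all groups present in the records."""
    return min(group_accuracies(records).values())


def average_accuracy(records):
    """Plain accuracy over all records, ignoring group structure."""
    return sum(pred == label for label, _, pred in records) / len(records)


# Made-up toy predictions: the model struggles on waterbirds over land.
records = (
    [("waterbird", "water", "waterbird")] * 4
    + [("waterbird", "land", "waterbird"), ("waterbird", "land", "landbird")]
    + [("landbird", "land", "landbird")] * 4
)

print("average accuracy:", average_accuracy(records))
print("worst-group accuracy:", worst_group_accuracy(records))
```

The gap between the two numbers is what makes the metric useful: a model can keep a high average accuracy while failing on the group where the spurious attribute and the class disagree.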
Context Inference Based on an Analysis of Meaning Relations in the Webtoon “SMILE BRUSH: MY OLD PICTURES”
This study analyzes and describes inferences about context, together with a comprehensive understanding of the other linguistic variables in the text and its discourse. The research data are lexical units and phrases that show relations of synonymy and polysemy in the narrative text of the comic "Smile Brush: My Old Pictures" by Waroo, which can be accessed on the Webtoon platform. The data are processed using a descriptive qualitative linguistic approach combined with ethnoscience analysis. The data were analyzed by the distribution method using the BUL/Direct Element Sharing technique and coding. The results state that this inference is a cognitive conclusion based on the context built by involving participants, awareness, and paradigmatic relations as well as other syntagmatic ties. This inference reflects the role of the association of meaning with other linguistic units in understanding the context and arriving at the inference. The process and conclusion of all these factors and variables show the stimulative, systemic, and holistic linguistic correlation of metafunctions and stratification of linguistic domains.
GETT-QA: Graph Embedding based T2T Transformer for Knowledge Graph Question Answering
In this work, we present an end-to-end Knowledge Graph Question Answering
(KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text
pre-trained language model. The model takes a question in natural language as
input and produces a simpler form of the intended SPARQL query. In the simpler
form, the model does not directly produce entity and relation IDs. Instead, it
produces corresponding entity and relation labels. The labels are grounded to
KG entity and relation IDs in a subsequent step. To further improve the
results, we instruct the model to produce a truncated version of the KG
embedding for each entity. The truncated KG embedding enables a finer search
for disambiguation purposes. We find that T5 is able to learn the truncated KG
embeddings without any change of loss function, improving KGQA performance. As
a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata
datasets on end-to-end KGQA over Wikidata.
Comment: 16 pages single column format accepted at ESWC 2023 research track