Video Registration in Egocentric Vision under Day and Night Illumination Changes
With the spread of wearable devices and head mounted cameras, a wide range of
applications requiring precise user localization is now possible. In this paper
we propose to treat the problem of obtaining the user position with respect to
a known environment as a video registration problem. Video registration, i.e.
the task of aligning an input video sequence to a pre-built 3D model, relies on
a matching process of local keypoints extracted on the query sequence to a 3D
point cloud. The overall registration performance is strictly tied to the
quality of this 2D-3D matching and can degrade under environmental changes
such as the steep lighting differences between day and night. To effectively
register an egocentric video sequence under these
conditions, we propose to tackle the source of the problem: the matching
process. To overcome the shortcomings of standard matching techniques, we
introduce a novel embedding space that allows us to obtain robust matches by
jointly taking into account local descriptors, their spatial arrangement and
their temporal robustness. The proposal is evaluated using unconstrained
egocentric video sequences both in terms of matching quality and resulting
registration performance using different 3D models of historical landmarks. The
results show that the proposed method can outperform state-of-the-art
registration algorithms, in particular when dealing with the challenges of
night and day sequences.
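To make the registration pipeline described above concrete, here is a minimal sketch of the conventional 2D-3D matching and pose-estimation step that such systems build on, written with OpenCV. The descriptor type (ORB), thresholds, and data layout are illustrative assumptions; the paper's contribution replaces the plain descriptor matching with its joint embedding, which is not reproduced here.

```python
import numpy as np
import cv2

def register_frame(frame_gray, model_points_3d, model_descriptors, camera_matrix):
    """Baseline 2D-3D registration: match local descriptors of a query frame
    against descriptors attached to a pre-built 3D point cloud, then estimate
    the camera pose with RANSAC-based PnP. Illustrative baseline only."""
    # Extract local keypoints/descriptors on the query frame (ORB as a stand-in).
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(frame_gray, None)

    # Match frame descriptors against the model descriptors (assumed binary/ORB).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, model_descriptors)

    # Build 2D-3D correspondences from the matches.
    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([model_points_3d[m.trainIdx] for m in matches])

    # Robust pose estimation from the correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, camera_matrix, distCoeffs=None,
        iterationsCount=200, reprojectionError=4.0)
    return ok, rvec, tvec, inliers
```

Under day/night changes it is precisely the descriptor-matching step above that degrades, which is what motivates the embedding proposed in the paper.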
On a Fragment of Sallust (Su un frammento di Sallustio)
The fortune (and misfortune) of Lipsius' conjecture to Sallust, fr. II, fr. 77 Maur.,
documents how a fragment or a quotation may sometimes tempt the reader, even against
the evidence, to consider it a complete sentence.
A Data-Driven Approach for Tag Refinement and Localization in Web Videos
Tagging of visual content is becoming more and more widespread as web-based
services and social networks have popularized tagging functionalities among
their users. These user-generated tags are used to ease browsing and
exploration of media collections, e.g. using tag clouds, or to retrieve
multimedia content. However, not all media are equally tagged by users. With
current systems it is easy to tag a single photo, and even tagging a part of a
photo, such as a face, has become common on sites like Flickr and Facebook. On the
other hand, tagging a video sequence is more complicated and time consuming, so
that users just tag the overall content of a video. In this paper we present a
method for automatic video annotation that increases the number of tags
originally provided by users, and localizes them temporally, associating tags
to keyframes. Our approach exploits collective knowledge embedded in
user-generated tags and web sources, and visual similarity of keyframes and
images uploaded to social sites like YouTube and Flickr, as well as web sources
like Google and Bing. Given a keyframe, our method is able to select on the fly
from these visual sources the training exemplars that should be the most
relevant for this test sample, and proceeds to transfer labels across similar
images. Compared to existing video tagging approaches that require training
classifiers for each tag, our system has few parameters, is easy to implement
and can deal with an open vocabulary scenario. We demonstrate the approach on
tag refinement and localization on DUT-WEBV, a large dataset of web videos, and
show state-of-the-art results. Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU).
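As a rough illustration of the label-transfer idea sketched in the abstract, the following assumes precomputed visual features for the keyframe and for the images retrieved from the web sources; the weighting scheme, the neighborhood size, and the variable names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def transfer_tags(keyframe_feat, exemplar_feats, exemplar_tags, k=20):
    """Similarity-weighted label transfer from retrieved web images to a keyframe.
    keyframe_feat: (d,) visual feature of the keyframe.
    exemplar_feats: (n, d) features of the images retrieved for the candidate tags.
    exemplar_tags: list of n tag sets, one per exemplar image.
    Returns candidate tags scored by the visual similarity of their exemplars."""
    # Cosine similarity between the keyframe and every retrieved exemplar.
    q = keyframe_feat / np.linalg.norm(keyframe_feat)
    X = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    sims = X @ q

    # Keep only the k most similar exemplars, selected "on the fly" per keyframe.
    top = np.argsort(-sims)[:k]

    # Accumulate similarity votes per tag.
    scores = {}
    for i in top:
        for tag in exemplar_tags[i]:
            scores[tag] = scores.get(tag, 0.0) + float(sims[i])
    return sorted(scores.items(), key=lambda t: -t[1])
```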
UniUD Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023
In this report, we present the technical details of our submission to the
EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023. To participate in
the challenge, we ensembled two models trained with two different loss
functions on 25% of the training data. Our submission, visible on the public
leaderboard, obtains an average score of 56.81% nDCG and 42.63% mAP.
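The report does not spell out the ensembling step; a common way to combine two retrieval models is to fuse their text-video similarity matrices, as in this illustrative sketch (the normalization and weighting are assumptions, not the submission's exact scheme).

```python
import numpy as np

def ensemble_similarities(sim_a, sim_b, weight=0.5):
    """Combine the text-video similarity matrices of two retrieval models by
    z-normalizing each matrix and taking a weighted average. Rankings for nDCG
    and mAP are then computed on the fused matrix."""
    def znorm(s):
        return (s - s.mean()) / (s.std() + 1e-8)
    return weight * znorm(sim_a) + (1.0 - weight) * znorm(sim_b)
```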
Alfalfa hay digestibility in Sardinian does
A feeding trial was carried out on ten Sardinian does. Five does (group F) were fed
alfalfa hay and the other five (group C+F) alfalfa hay plus 0.6 kg of concentrate, in
order to measure the digestibility of these feeds. Digestibility was measured by the
marker method (lignin) and by the enzymatic method. The marker method showed that the
addition of concentrate to the hay increased the OM digestibility; the enzymatic method
showed similar results. The authors calculated a regression equation between in vivo and
in vitro digestibility of OM in the C+F group. The changes in lignin composition before
and after digestion were also analyzed.
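For reference, here is a minimal sketch of the standard marker-ratio calculation of apparent digestibility with lignin as the internal marker; this is the textbook formula and is only assumed to match the computation used in the trial, and the example numbers are made up.

```python
def apparent_digestibility(marker_feed, marker_feces, nutrient_feed, nutrient_feces):
    """Apparent digestibility (%) of a nutrient via the indigestible-marker method:
    D = 100 - 100 * (marker% in feed / marker% in feces)
              * (nutrient% in feces / nutrient% in feed).
    Standard textbook formula, assumed here; the study may apply corrections."""
    return 100.0 - 100.0 * (marker_feed / marker_feces) * (nutrient_feces / nutrient_feed)

# Hypothetical example: lignin 6% of feed DM and 12% of fecal DM,
# OM 92% of feed DM and 85% of fecal DM:
# apparent_digestibility(6.0, 12.0, 92.0, 85.0) -> ~53.8% OM digestibility.
```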
Egocentric Video Summarization of Cultural Tour based on User Preferences
In this paper, we propose a new method to obtain customized video summaries according to specific user preferences. Our approach is tailored to the Cultural Heritage scenario and is designed to identify candidate shots, select from the original streams only the scenes whose behavior patterns indicate relevant experiences, and further filter them to obtain a summary matching the requested user preferences. Our preliminary results show that the proposed approach is able to leverage user preferences to obtain a customized summary, so that different users may extract different summaries from the same stream.
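A minimal sketch of the preference-based filtering step described above, assuming each candidate shot already carries detected experience labels and a relevance score; the data structure, field names, and threshold are illustrative, not the authors' implementation.

```python
def summarize(shots, user_preferences, min_score=0.5):
    """Keep candidate shots whose detected experience labels overlap with the
    user's preferences, ranked by relevance score.
    `shots` is a list of dicts with 'labels' and 'score' keys (assumed structure)."""
    selected = [s for s in shots
                if s['score'] >= min_score and set(s['labels']) & set(user_preferences)]
    return sorted(selected, key=lambda s: -s['score'])
```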
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Every hour, huge amounts of visual contents are posted on social media and
user-generated content platforms. To find relevant videos by means of a natural
language query, text-video retrieval methods have received increased attention
over the past few years. Data augmentation techniques were introduced to
increase the performance on unseen test examples by creating new training
samples with the application of semantics-preserving techniques, such as color
space or geometric transformations on images. Yet, these techniques are usually
applied to raw data, leading to more resource-demanding solutions and also
requiring that the raw data be shareable, which is not always possible, e.g.
due to copyright issues with clips from movies or TV series. To address this
shortcoming, we propose a multimodal data augmentation technique which works in
the feature space and creates new videos and captions by mixing semantically
similar samples. We evaluate our solution on a large-scale public dataset,
EPIC-Kitchens-100, achieving considerable improvements over a baseline method
and improved state-of-the-art performance, while at the same time performing
multiple ablation studies. We release code and pretrained models on
Github at https://github.com/aranciokov/FSMMDA_VideoRetrieval. Comment: Accepted for presentation at the 30th ACM International Conference on Multimedia (ACM MM).
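A minimal sketch of what mixing semantically similar samples in feature space can look like, using a mixup-style interpolation over paired video and caption embeddings; the Beta-sampled coefficient and the way similar partners are chosen (`pair_index`) are assumptions, not the paper's exact mixing function.

```python
import torch

def feature_space_mix(video_feats, text_feats, pair_index, alpha=0.4):
    """Create augmented (video, caption) feature pairs by interpolating each sample
    with a semantically similar partner whose row index is given by `pair_index`.
    video_feats, text_feats: (n, d) precomputed embeddings; pair_index: (n,) LongTensor."""
    # One interpolation coefficient per sample, shared by both modalities so the
    # mixed video and mixed caption stay aligned.
    lam = torch.distributions.Beta(alpha, alpha).sample((video_feats.size(0), 1))
    lam = lam.to(video_feats.device)
    mixed_video = lam * video_feats + (1 - lam) * video_feats[pair_index]
    mixed_text = lam * text_feats + (1 - lam) * text_feats[pair_index]
    return mixed_video, mixed_text
```

Working on features rather than raw frames keeps the augmentation cheap and avoids redistributing the original clips, which is the shortcoming the paper targets.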
Data augmentation techniques for the Video Question Answering task
Video Question Answering (VideoQA) is a task that requires a model to analyze
and understand both the visual content given by the input video and the textual
part given by the question, and the interaction between them in order to
produce a meaningful answer. In our work we focus on the Egocentric VideoQA
task, which exploits first-person videos, because of the impact such a task
can have on many different fields, such as social assistance and industrial
training. Recently, an Egocentric
VideoQA dataset, called EgoVQA, has been released. Given its small size, models
tend to overfit quickly. To alleviate this problem, we propose several
augmentation techniques which give us a +5.5% improvement on the final accuracy
over the considered baseline. Comment: 16 pages, 5 figures; to be published in Egocentric Perception,
Interaction and Computing (EPIC) Workshop Proceedings, at ECCV 202
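The abstract does not list the augmentation techniques; as one generic example of a feature-level clip augmentation for VideoQA (not necessarily among those proposed in the paper), frames can be randomly subsampled to create new views of the same clip.

```python
import numpy as np

def temporal_jitter(frame_feats, keep_ratio=0.8, rng=None):
    """Randomly drop a fraction of per-frame features to create a new view of the
    same clip for training. `frame_feats` is an (n_frames, d) array of precomputed
    features; this is an illustrative, generic augmentation."""
    rng = rng or np.random.default_rng()
    n = frame_feats.shape[0]
    keep = np.sort(rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False))
    return frame_feats[keep]
```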
FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests
Nowadays, many people frequently have to search for new accommodation
options. Searching for a suitable apartment is a time-consuming process,
especially because visiting the apartments in person is often necessary to
assess the truthfulness of the advertisements found on the Web. While this
process could be alleviated by visiting the apartments in the metaverse,
current Web-based recommendation platforms are not suitable for the task. To
address this shortcoming, in this
paper, we define a new problem called text-to-apartment recommendation, which
requires ranking the apartments based on their relevance to a textual query
expressing the user's interests. To tackle this problem, we introduce FArMARe,
a multi-task approach that supports cross-modal contrastive training with a
furniture-aware objective. Since public datasets related to indoor scenes do
not contain detailed descriptions of the furniture, we collect and annotate a
dataset comprising more than 6000 apartments. A thorough experimentation with
three different methods and two raw feature extraction procedures reveals the
effectiveness of FArMARe in dealing with the problem at hand. Comment: accepted for presentation at the ICCV2023 CV4Metaverse workshop.
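A minimal sketch of a cross-modal contrastive objective with an auxiliary furniture-aware term, in the spirit of the multi-task training described above; the specific losses, weighting, and head shapes are assumptions rather than FArMARe's actual formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(query_emb, apartment_emb, furniture_logits, furniture_labels,
                   temperature=0.07, aux_weight=0.5):
    """Symmetric InfoNCE between text queries and apartment embeddings, plus an
    auxiliary furniture classification loss on the apartment branch.
    query_emb, apartment_emb: (b, d) paired embeddings; furniture_logits: (b, c)."""
    q = F.normalize(query_emb, dim=-1)
    a = F.normalize(apartment_emb, dim=-1)
    logits = q @ a.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Contrastive term: each query should rank its own apartment first, and vice versa.
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Furniture-aware auxiliary term.
    furniture = F.cross_entropy(furniture_logits, furniture_labels)
    return contrastive + aux_weight * furniture
```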
Efficient Keyphrase Generation with GANs
Keyphrase Generation is the task of predicting keyphrases: short text sequences that convey the main semantic meaning of a document. In this paper, we introduce a keyphrase generation approach that makes use of a Generative Adversarial Network (GAN) architecture. In our system, the Generator produces a sequence of keyphrases for an input document. The Discriminator, in turn, tries to distinguish between machine-generated and human-curated keyphrases. We propose a novel Discriminator architecture based on a pretrained BERT model fine-tuned for sequence classification. We train our proposed architecture using only a small subset of the standard available training dataset, amounting to less than 1% of the total, achieving a high level of data efficiency. The resulting model is evaluated on five public datasets, obtaining competitive and promising results with respect to four state-of-the-art generative models.
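As an illustration of the discriminator side, here is a minimal sketch of a BERT model fine-tuned for binary sequence classification over (document, keyphrase sequence) pairs, using the Hugging Face transformers API; the input formatting and the model checkpoint are assumptions, not the paper's configuration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Discriminator: BERT with a binary sequence-classification head
# (human-curated vs. machine-generated keyphrases).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
discriminator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def discriminator_scores(document, keyphrases):
    """Return the probability that the keyphrase sequence is human-curated,
    conditioning on the document via BERT's sentence-pair input."""
    text = "; ".join(keyphrases)
    inputs = tokenizer(document, text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = discriminator(**inputs).logits
    return torch.softmax(logits, dim=-1)[:, 1]
```

In the adversarial setup these scores would serve as the reward/feedback signal for the generator, while the discriminator itself is trained with a standard cross-entropy loss on human vs. generated keyphrase sequences.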
- …