2,056 research outputs found

    Video Registration in Egocentric Vision under Day and Night Illumination Changes

    With the spread of wearable devices and head-mounted cameras, a wide range of applications requiring precise user localization is now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on matching local keypoints extracted from the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the quality of this 2D-3D matching, and can degrade under environmental changes such as the steep lighting differences between day and night. To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated on unconstrained egocentric video sequences, both in terms of matching quality and of resulting registration performance, using different 3D models of historical landmarks. The results show that the proposed method can outperform state-of-the-art registration algorithms, in particular when dealing with the challenges of night and day sequences.

    Su un frammento di Sallustio (On a Fragment of Sallust)

    The fortune (and misfortune) of Lipsius' conjecture on Sallust, fr. II, fr. 77 Maur., shows that a fragment or quotation may sometimes tempt the reader to treat it as a complete sentence, even against the evidence.

    A Data-Driven Approach for Tag Refinement and Localization in Web Videos

    Tagging of visual content is becoming more and more widespread as web-based services and social networks have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia content. However, not all media are equally tagged by users. With current systems it is easy to tag a single photo, and even tagging a part of a photo, like a face, has become common on sites like Flickr and Facebook. On the other hand, tagging a video sequence is more complicated and time-consuming, so users just tag the overall content of a video. In this paper we present a method for automatic video annotation that increases the number of tags originally provided by users, and localizes them temporally, associating tags with keyframes. Our approach exploits collective knowledge embedded in user-generated tags and web sources, and the visual similarity of keyframes and images uploaded to social sites like YouTube and Flickr, as well as web sources like Google and Bing. Given a keyframe, our method is able to select on the fly from these visual sources the training exemplars that should be the most relevant for this test sample, and proceeds to transfer labels across similar images. Compared to existing video tagging approaches that require training classifiers for each tag, our system has few parameters, is easy to implement and can deal with an open vocabulary scenario. We demonstrate the approach on tag refinement and localization on DUT-WEBV, a large dataset of web videos, and show state-of-the-art results. Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU)
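
    The label-transfer idea in the abstract above (pick the most visually similar exemplars for a keyframe, then carry their tags over) can be pictured with a small nearest-neighbour vote. This is only an illustrative sketch, not the authors' implementation: the cosine similarity, the `k` and `min_votes` parameters, and the names `cosine` and `transfer_tags` are all assumptions introduced here, and the feature vectors are toy values.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_tags(
    keyframe: Sequence[float],
    exemplars: List[Tuple[Sequence[float], List[str]]],
    k: int = 3,
    min_votes: int = 2,
) -> List[str]:
    """Hypothetical helper: keep tags shared by at least `min_votes` of the k nearest exemplars."""
    # Rank tagged exemplar images by visual similarity to the keyframe.
    ranked = sorted(exemplars, key=lambda e: cosine(keyframe, e[0]), reverse=True)
    # Each neighbour votes once per distinct tag it carries.
    votes = Counter(tag for _, tags in ranked[:k] for tag in set(tags))
    return sorted(tag for tag, count in votes.items() if count >= min_votes)


# Toy exemplars: (feature vector, user-provided tags).
exemplars = [
    ([1.0, 0.0], ["dog", "park"]),
    ([0.9, 0.1], ["dog"]),
    ([0.0, 1.0], ["car"]),
    ([0.1, 0.9], ["car", "road"]),
]
result = transfer_tags([1.0, 0.05], exemplars, k=2, min_votes=2)
```

    Because no per-tag classifier is trained in this sketch, any tag that appears among the neighbours can be transferred, which is one way to read the "open vocabulary" property the abstract mentions.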

    UniUD Submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023

    In this report, we present the technical details of our submission to the EPIC-Kitchens-100 Multi-Instance Retrieval Challenge 2023. To participate in the challenge, we ensembled two models trained with two different loss functions on 25% of the training data. Our submission, visible on the public leaderboard, obtains an average score of 56.81% nDCG and 42.63% mAP.

    Alfalfa hay digestibility in Sardinian does

    A feeding trial was carried out on ten Sardinian does. Five does (group F) were fed alfalfa hay and the other five (group C+F) alfalfa hay plus 0.6 kg of concentrate, in order to measure the digestibility of these feeds. Digestibility was measured by the marker method (lignin) and by the enzymatic method. The marker method showed that adding concentrate to the hay increased OM digestibility. The enzymatic method showed similar results. The authors calculated a regression equation between "in vivo" and "in vitro" digestibility of OM in the C+F group. The changes in lignin composition before and after digestion were analyzed.

    Egocentric Video Summarization of Cultural Tour based on User Preferences

    In this paper, we propose a new method to obtain customized video summaries according to specific user preferences. Our approach is tailored to the Cultural Heritage scenario and is designed to identify candidate shots, selecting from the original streams only the scenes with behavior patterns related to the presence of relevant experiences, and to further filter them in order to obtain a summary matching the requested user preferences. Our preliminary results show that the proposed approach is able to leverage user preferences to obtain a customized summary, so that different users may extract different summaries from the same stream.

    A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

    Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples by creating new training samples with the application of semantics-preserving techniques, such as color space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions and also requiring the raw data to be shareable, which may not always be the case, e.g. because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, where we achieve considerable improvements over a baseline method and improve on state-of-the-art performance, and we also perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval. Comment: Accepted for presentation at 30th ACM International Conference on Multimedia (ACM MM
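
    To make "mixing semantically similar samples" in feature space more concrete, the sketch below interpolates two precomputed video features and their caption features with the same weight. The abstract does not state the mixing rule, so the linear (mixup-style) interpolation, the weight `lam`, and the names `mix_features` and `augment_pair` are assumptions made for illustration only; the released repository is the reference for the actual technique.

```python
from typing import List, Sequence, Tuple


def mix_features(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Convex combination lam * a + (1 - lam) * b of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment_pair(
    video_a: Sequence[float],
    text_a: Sequence[float],
    video_b: Sequence[float],
    text_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: build one new (video, caption) training pair.

    Both modalities use the same weight so the synthetic video feature
    and the synthetic caption feature stay aligned with each other.
    """
    return mix_features(video_a, video_b, lam), mix_features(text_a, text_b, lam)


# Toy features for two samples assumed to be semantically similar.
new_video, new_text = augment_pair(
    video_a=[1.0, 0.0], text_a=[2.0, 4.0],
    video_b=[0.0, 1.0], text_b=[4.0, 8.0],
    lam=0.25,
)
```

    Operating on stored feature vectors rather than frames means no raw clips need to be decoded or shared, which matches the motivation given in the abstract.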

    Data augmentation techniques for the Video Question Answering task

    Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, as well as the interaction between them, in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, because of its potential impact on many different fields, such as social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline. Comment: 16 pages, 5 figures; to be published in Egocentric Perception, Interaction and Computing (EPIC) Workshop Proceedings, at ECCV 202

    FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests

    Nowadays, many people frequently have to search for new accommodation options. Searching for a suitable apartment is a time-consuming process, especially because visiting apartments is often necessary to assess the truthfulness of the advertisements found on the Web. While this process could be eased by visiting the apartments in the metaverse, Web-based recommendation platforms are not suitable for the task. To address this shortcoming, in this paper we define a new problem called text-to-apartment recommendation, which requires ranking the apartments based on their relevance to a textual query expressing the user's interests. To tackle this problem, we introduce FArMARe, a multi-task approach that supports cross-modal contrastive training with a furniture-aware objective. Since public datasets related to indoor scenes do not contain detailed descriptions of the furniture, we collect and annotate a dataset comprising more than 6000 apartments. Thorough experimentation with three different methods and two raw feature extraction procedures reveals the effectiveness of FArMARe in dealing with the problem at hand. Comment: accepted for presentation at the ICCV2023 CV4Metaverse workshop

    Efficient Keyphrase Generation with GANs

    Keyphrase Generation is the task of predicting keyphrases: short text sequences that convey the main semantic meaning of a document. In this paper, we introduce a keyphrase generation approach that makes use of a Generative Adversarial Network (GAN) architecture. In our system, the Generator produces a sequence of keyphrases for an input document. The Discriminator, in turn, tries to distinguish between machine-generated and human-curated keyphrases. We propose a novel Discriminator architecture based on a pretrained BERT model fine-tuned for Sequence Classification. We train our proposed architecture using only a small subset of the standard available training dataset, amounting to less than 1% of the total, achieving a great level of data efficiency. The resulting model is evaluated on five public datasets, obtaining competitive and promising results with respect to four state-of-the-art generative models.