Accessibility-based reranking in multimedia search engines
Traditional multimedia search engines retrieve results based mostly on the query submitted by the user, or using a log of previous searches to provide personalized results, without considering how accessible the results are for users with vision or other types of impairments. In this paper, a novel approach is presented which incorporates the accessibility of images for users with various vision impairments, such as color blindness, cataracts, and glaucoma, in order to rerank the results of an image search engine. The accessibility of individual images is measured through the use of vision simulation filters. Multi-objective optimization techniques utilizing the image accessibility scores are used to handle users with multiple vision impairments, while the impairment profile of a specific user is used to select one of the Pareto-optimal solutions. The proposed approach has been tested on two image datasets, using both simulated and real impaired users, and the results verify its applicability. Although the proposed method has been used for vision accessibility-based reranking, it can also be extended to other types of personalization context.
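The reranking idea described above can be sketched as follows. Everything here is a hypothetical stand-in, not the paper's actual filters, scores, or datasets: a Pareto front is computed over per-impairment accessibility vectors, and a user's impairment profile breaks ties among the Pareto-optimal images.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of images whose accessibility score vectors are
    not dominated by any other image (higher is better on every axis)."""
    n = scores.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def rerank(relevance, accessibility, profile):
    """Rank Pareto-optimal images first; within the front, the user's
    impairment profile weights the accessibility axes, then relevance."""
    front = set(pareto_front(accessibility))
    personal = accessibility @ profile  # profile-weighted accessibility
    return sorted(
        range(len(relevance)),
        key=lambda i: (i in front, personal[i], relevance[i]),
        reverse=True,
    )

# toy example: 4 images scored for 2 impairments (e.g. color blindness, glaucoma)
acc = np.array([[0.9, 0.2], [0.5, 0.5], [0.3, 0.9], [0.2, 0.1]])
rel = np.array([0.7, 0.9, 0.6, 0.95])
profile = np.array([0.8, 0.2])  # user mostly affected by the first impairment
print(rerank(rel, acc, profile))
```

Image 3 is dominated on both accessibility axes, so it drops to the end of the ranking despite having the highest query relevance.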
Annotating, Understanding, and Predicting Long-term Video Memorability
Memorability can be regarded as a useful metric of video importance to help make a choice between competing videos. Research on the computational understanding of video memorability, however, is in its early stages. There is no available dataset for modelling purposes, and the few previous attempts provided protocols to collect video memorability data that would be difficult to generalize. Furthermore, the computational features needed to build a robust memorability predictor remain largely undiscovered. In this article, we propose a new protocol to collect long-term video memorability annotations. We measure the memory performance of 104 participants from weeks to years after memorization to build a dataset of 660 videos for video memorability prediction. This dataset is made available to the research community. We then analyze the collected data in order to better understand video memorability, in particular the effects of response time, duration of memory retention, and repetition of visualization on video memorability. We finally investigate the use of various types of audio and visual features and build a computational model for video memorability prediction. We conclude that high-level visual semantics help better predict the memorability of videos.
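As a rough illustration of the final modelling step, here is a minimal ridge-regression sketch that maps visual feature vectors to memorability scores. The feature matrix and scores are synthetic stand-ins; the paper's actual features and model are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical precomputed visual feature vectors, one row per video
X = rng.normal(size=(660, 16))
true_w = rng.normal(size=16)
y = 1 / (1 + np.exp(-(X @ true_w)))  # synthetic memorability scores in (0, 1)

# ridge regression in closed form: w = (X^T X + lam*I)^-1 X^T y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(16), X.T @ y)

pred = X @ w
# sanity check: predicted scores should correlate with the targets
corr = np.corrcoef(pred, y)[0, 1]
print(round(corr, 3))
```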
Deep Learning for Dense Interpretation of Video: Survey of Various Approach, Challenges, Datasets and Metrics
Video interpretation has garnered considerable attention in the computer vision and natural language processing fields due to the rapid expansion of video data and the increasing demand for various applications such as intelligent video search, automated video subtitling, and assistance for visually impaired individuals. However, video interpretation presents greater challenges due to the inclusion of both temporal and spatial information within the video. While deep learning models for images, text, and audio have made significant progress, efforts have recently been focused on developing deep networks for video interpretation. A thorough evaluation of current research is necessary to provide insights for future endeavors, considering the myriad techniques, datasets, features, and evaluation criteria available in the video domain. This study offers a survey of recent advancements in deep learning for dense video interpretation, addressing various datasets and the challenges they present, as well as key features in video interpretation. Additionally, it provides a comprehensive overview of the latest deep learning models in video interpretation, which have been instrumental in activity identification and video description or captioning. The paper compares the performance of several deep learning models in this field based on specific metrics. Finally, the study summarizes future trends and directions in video interpretation.
Deep Architectures for Visual Recognition and Description
In recent times, digital media content is inherently multimedia, spanning text, audio, image, and video. Several outstanding Computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the fields of Automatic Image Annotation (AIA), Image Captioning, and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches and attempts their classification and analysis based on different parameters, viz., type of captioning method (generation/retrieval), type of learning model employed, the desired length of the generated output description, etc. This dissertation also critically analyzes the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the generated video descriptions. A detailed study of important existing models, highlighting their comparative advantages and disadvantages, is also included.
In this study, a novel approach to video captioning on the Microsoft Video Description (MSVD) and Microsoft Video-to-Text (MSR-VTT) datasets is proposed, using supervised learning techniques to train a deep combinational framework that achieves better-quality video captioning via predicting semantic tags. We develop simple shallow CNNs (2D and 3D) as feature extractors, Deep Neural Networks (DNNs) and Bidirectional LSTMs (BiLSTMs) as tag prediction models, and a Recurrent Neural Network (LSTM) as the language model. The aim of the work was to provide an alternative narrative for generating captions from videos via semantic tag prediction, and to deploy simpler, shallower deep model architectures with lower memory requirements, so that the solution is not memory-intensive and the developed models remain stable and viable options as the scale of the data increases.
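The tag-driven captioning pipeline can be sketched at a very high level. The linear tag scorer and the template "language model" below are toy stand-ins for the study's CNN/BiLSTM tag predictors and LSTM language model; the tag vocabulary and all data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
TAGS = ["man", "dog", "running", "park", "ball"]  # hypothetical tag vocabulary

def predict_tags(video_feat, W, k=3):
    """Multi-label tag prediction: sigmoid over a linear map of video
    features (a stand-in for the paper's DNN/BiLSTM tag models)."""
    scores = 1 / (1 + np.exp(-(W @ video_feat)))
    top = np.argsort(scores)[::-1][:k]
    return [TAGS[i] for i in top]

def caption_from_tags(tags):
    """Toy stand-in for the LSTM language model: a template filled with
    the predicted semantic tags."""
    return f"a {tags[0]} with a {tags[1]} {tags[2]} in the scene"

feat = rng.normal(size=8)            # stand-in for 2D/3D CNN features
W = rng.normal(size=(len(TAGS), 8))  # stand-in for a trained tag predictor
tags = predict_tags(feat, W)
print(caption_from_tags(tags))
```

The point of the sketch is the data flow (features, then tags, then a tag-conditioned caption), not the quality of the output.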
This study also successfully employed deep architectures such as the Convolutional Neural Network (CNN) to speed up automated hand gesture recognition and classification for the sign language of the Indian classical dance form 'Bharatnatyam'. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single-hand gestures belonging to 27 classes, collected from (i) the Google search engine (Google Images), (ii) YouTube videos (dynamic and with background considered), and (iii) professional artists under staged environment constraints (plain backgrounds); 2) exploring the effectiveness of CNNs for identifying and classifying the single-hand gestures by optimizing the hyperparameters; and 3) evaluating the impacts of transfer learning and double transfer learning, a novel concept explored for achieving higher classification accuracy.
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing distracting activities in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models such as CLIP have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of CLIP's visual representation for the distracted driving detection and classification task and report the results.
Comment: 15 pages, 10 figures
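The zero-shot classification mechanism the abstract relies on can be illustrated without the actual CLIP encoders: classification reduces to cosine similarity between an image embedding and one text embedding per class prompt, followed by a softmax. The random embeddings and labels below are stand-ins for real CLIP encoder outputs.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """CLIP-style zero-shot classification: cosine similarity between the
    image embedding and one text embedding per class prompt, softmaxed."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# hypothetical distraction classes; embeddings are random stand-ins
labels = ["safe driving", "texting", "talking on the phone", "drinking"]
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(4, 32))                   # "encoded" prompts
image_emb = text_embs[1] + 0.1 * rng.normal(size=32)   # near "texting"
pred, probs = zero_shot_classify(image_emb, text_embs, labels)
print(pred)
```

Because no classifier weights are trained, new distraction classes can be added by writing new text prompts, which is the appeal of zero-shot transfer in the low-annotation setting the paper targets.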
ORÁCULO: Detection of Spatiotemporal Hot Spots of Conflict-Related Events Extracted from Online News Sources
Dissertation presented as a partial requirement for obtaining a Master's degree in Geographic Information Systems and Science. Achieving situational awareness in peace operations requires understanding where and when conflict-related activity is most intense. However, the irregular nature of most factions hinders the use of remote sensing, while winning the trust of the host populations to allow the collection of wide-ranging human intelligence is a slow process. Thus, our proposed solution, ORÁCULO, is an information system which detects spatiotemporal hot spots of conflict-related activity by analyzing the patterns of events extracted from online news sources (including social media), allowing immediate situational awareness. To do so, it combines a closed-domain supervised event extractor with emerging hot spot analysis of event space-time cubes. The prototype of ORÁCULO was tested on tweets scraped from the Twitter accounts of local and international news sources covering the Central African Republic Civil War. Its test results show that it achieved near state-of-the-art event extraction performance, significant overlap with a reference event dataset, and strong correlation with the hot spot space-time cube generated from the reference event dataset, proving the viability of the proposed solution. Future work will focus on improving event extraction performance and on testing ORÁCULO in cooperation with peacekeeping organizations.
Keywords: event extraction, natural language understanding, spatiotemporal analysis, peace operations, open-source intelligence.
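The core of emerging hot spot analysis on a space-time cube, as used by ORÁCULO, is binning events into (x, y, t) cells and testing each cell's count series for a trend, conventionally with the Mann-Kendall statistic. Below is a minimal sketch on synthetic events; the grid, bin edges, and data are illustrative assumptions, not the dissertation's actual configuration.

```python
import numpy as np

def mann_kendall_s(series):
    """Mann-Kendall S statistic: positive when the per-cell event counts
    trend upward over time, negative when they trend downward."""
    s = 0
    for i in range(len(series) - 1):
        s += int(np.sign(series[i + 1:] - series[i]).sum())
    return s

def space_time_cube(events, x_edges, y_edges, t_edges):
    """Bin (x, y, t) event records into a 3-D count cube."""
    cube, _ = np.histogramdd(events, bins=(x_edges, y_edges, t_edges))
    return cube

# toy events: activity intensifying over time around location (0.3, 0.7)
rng = np.random.default_rng(0)
events = np.concatenate([
    np.column_stack([
        0.3 + 0.05 * rng.normal(size=5 * t),
        0.7 + 0.05 * rng.normal(size=5 * t),
        np.full(5 * t, t + 0.5),
    ])
    for t in range(1, 6)
])
edges = np.linspace(0, 1, 6)                       # 5x5 spatial grid
cube = space_time_cube(events, edges, edges, np.arange(0, 7))
cell = cube[1, 3]                                  # cell containing (0.3, 0.7)
print(mann_kendall_s(cell))                        # positive: emerging hot spot
```

A full emerging hot spot analysis would additionally compute a Getis-Ord Gi*-style statistic over spatial neighborhoods before the trend test; the sketch keeps only the trend step.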
Discovery of topological constraints on spatial object classes using a refined topological model
In a typical data collection process, a surveyed spatial object is annotated upon creation and classified based on its attributes. This annotation can also be guided by textual definitions of objects. However, interpretations of such definitions may differ among people, and thus result in subjective and inconsistent classification of objects. The problem becomes even more pronounced when cultural and linguistic differences are considered. As a solution, this paper investigates the role of topology as the defining characteristic of a class of spatial objects. We propose a data mining approach based on frequent itemset mining to learn patterns in the topological relations between objects of a given class and other spatial objects. In order to capture topological relations between more than two (linear) objects, this paper further proposes a refinement of the 9-intersection model for topological relations of line geometries. The discovered topological relations form topological constraints of an object class that can be used for spatial object classification. A case study has been carried out on bridges in the OpenStreetMap dataset for the state of Victoria, Australia. The results show that the proposed approach can successfully learn topological constraints for the class "bridge", and that the proposed refined topological model for line geometries outperforms the 9-intersection model in this task.
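Frequent itemset mining over topological relations can be sketched with a plain Apriori pass: each candidate object yields a "transaction" of relation items, and itemsets whose support clears a threshold become topological constraints for the class. The relation labels, transactions, and threshold below are illustrative assumptions, not the paper's actual vocabulary or data.

```python
# each transaction lists the topological relations one candidate object
# (e.g. a way classified as a bridge) holds with nearby objects -- toy data
transactions = [
    {"crosses:waterway", "connects:road", "disjoint:building"},
    {"crosses:waterway", "connects:road"},
    {"crosses:waterway", "connects:road", "crosses:railway"},
    {"connects:road", "disjoint:building"},
    {"crosses:waterway", "connects:road"},
]

def frequent_itemsets(transactions, min_support):
    """Plain Apriori: grow itemsets level by level, keeping only those
    whose support (fraction of transactions containing them) meets the
    threshold; only frequent itemsets are extended at the next level."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        next_level = set()
        for cand in level:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                frequent[cand] = support
                for item in items:
                    if item not in cand:
                        next_level.add(cand | {item})
        level = next_level
    return frequent

patterns = frequent_itemsets([frozenset(t) for t in transactions], 0.8)
# high-support itemsets act as topological constraints for the class
for itemset, sup in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

On this toy data, {crosses:waterway, connects:road} survives the 0.8 threshold, which is the shape of constraint the paper then uses to classify unlabelled objects.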