Distributed and Scalable Video Analysis Architecture for Human Activity Recognition Using Cloud Services
This thesis proposes an open-source, maintainable system for detecting human activity in large video datasets using scalable hardware architectures. The system is validated by detecting the writing and typing activities collected as part of the Advancing Out of School Learning in Mathematics and Engineering (AOLME) project. The implementation of the system on Amazon Web Services (AWS) is shown to be both horizontally and vertically scalable. The software was designed to be robust so as to facilitate reproducibility and extensibility in future research.
Integration of Computer Vision and Natural Language Processing in Multimedia Robotics Application
Computer vision and natural language processing (NLP) are two active machine learning research areas. Their integration gives rise to a new interdisciplinary field that is attracting increasing attention from researchers. Research has been carried out to extract the text associated with an image or a video in order to make computer vision more effective, and researchers have also used computer vision to ground the meaning of words in NLP. This concept is widely used in robotics. Although robots can observe their surroundings through several modes of interaction, natural gestures and spoken language are the most convenient ways for humans to interact with them, which is possible only if robots can understand such interactions. In the present paper, the proposed integrated application is used for guiding vision-impaired people. Since vision is essential to human life, an alternative source of guidance for the blind in their movements is highly important. For this purpose, the current paper uses a smartphone with vision, language, and intelligence capabilities, carried by the blind person, to capture images of the surroundings; the images are sent to a Faster Region Convolutional Neural Network (F-RCNN) based central server that detects the objects in each image so that the person can be informed about them and avoid obstacles in their way. The detection results are passed back to the smartphone, which produces speech output to guide the blind user.
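The client-side guidance step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `Detection` structure, the confidence threshold, and the box-area proximity heuristic are all assumptions standing in for the server's F-RCNN output and any real depth reasoning.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One object reported by the (hypothetical) detection server."""
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def guidance_message(detections, frame_area, conf_thresh=0.7, near_frac=0.25):
    """Turn raw detections into a short phrase for text-to-speech.

    A detection is kept if its confidence exceeds conf_thresh; it is
    treated as a nearby obstacle if its box covers more than near_frac
    of the frame (a crude stand-in for real distance estimation).
    """
    kept = [d for d in detections if d.confidence >= conf_thresh]
    if not kept:
        return "Path looks clear."
    phrases = []
    for d in kept:
        x1, y1, x2, y2 = d.box
        frac = (x2 - x1) * (y2 - y1) / frame_area
        if frac >= near_frac:
            phrases.append(f"warning, {d.label} very close")
        else:
            phrases.append(f"{d.label} ahead")
    return "; ".join(phrases) + "."
```

On the smartphone, the returned string would simply be handed to a text-to-speech engine; the thresholds here are placeholders one would tune against real footage.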
Dynamicity and Durability in Scalable Visual Instance Search
Visual instance search involves retrieving from a collection of images the ones that contain an instance of a visual query. Systems designed for visual instance search face the major challenge of scalability: a collection of a few million images used for instance search typically yields a few billion features that must be indexed. Furthermore, as real image collections grow rapidly, systems must also provide dynamicity, i.e., be able to handle on-line insertions while concurrently serving retrieval operations. Durability, the ability to recover correctly from software and hardware crashes, is the natural complement of dynamicity, yet it has rarely been integrated within scalable and dynamic high-dimensional indexing solutions. This article addresses dynamicity and durability for scalable indexing of very large and rapidly growing collections of local features for instance retrieval. By extending the NV-tree, a scalable disk-based high-dimensional index, we show how to implement the ACID properties of transactions, which ensure both dynamicity and durability. We present a detailed performance evaluation of the transactional NV-tree: (i) we show that the insertion throughput is excellent despite the overhead of enforcing the ACID properties; (ii) we show that this transactional index is truly scalable using a standard image benchmark embedded in collections of up to 28.5 billion high-dimensional vectors; these are the largest single-server evaluations reported in the literature.
Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition with CNNs
In this paper, we revive the use of old-fashioned handcrafted video representations for action recognition and put new life into these techniques via a CNN-based hallucination step. Although the I3D model (amongst others) already uses RGB and optical flow frames, it thrives on combining its output with Improved Dense Trajectory (IDT) features, low-level video descriptors encoded via Bag-of-Words (BoW) and Fisher Vectors (FV). Such a fusion of CNNs and handcrafted representations is time-consuming due to pre-processing, descriptor extraction, encoding, and parameter tuning. Thus, we propose an end-to-end trainable network with streams which learn the IDT-based BoW/FV representations at the training stage and are simple to integrate with the I3D model. Specifically, each stream takes I3D feature maps ahead of the last 1D conv. layer and learns to 'translate' these maps into BoW/FV representations. Our model can therefore hallucinate and use such synthesized BoW/FV representations at the testing stage. We show that even the features of the entire I3D optical flow stream can be hallucinated, thus simplifying the pipeline. Our model saves 20-55 hours of computation and yields state-of-the-art results on four publicly available datasets.

Comment: First two authors contributed equally. This paper is accepted by ICCV'19.
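The 'translation' idea, learning a map from I3D features to BoW/FV targets so that the expensive IDT extraction and encoding can be skipped at test time, can be illustrated with a linear least-squares stand-in for the paper's trainable CNN streams. The function names and the linear model are assumptions for illustration, not the paper's architecture.

```python
import numpy as np


def fit_hallucinator(feats, targets):
    """Learn a linear map W so that feats @ W approximates the
    handcrafted targets (e.g. BoW/FV encodings). The paper trains CNN
    streams with a regression-style objective; ordinary least squares
    is the simplest stand-in for that 'translation' step."""
    W, *_ = np.linalg.lstsq(feats, targets, rcond=None)
    return W


def hallucinate(feats, W):
    """Test time: synthesize BoW/FV-like vectors directly from I3D
    features, skipping the IDT extraction and encoding pipeline."""
    return feats @ W
```

Once `W` is fitted on training clips where both I3D features and real BoW/FV encodings are available, new clips only need the cheap forward pass, which is the source of the reported compute savings.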
Feature Augmentation for Improved Topic Modeling of YouTube Lecture Videos using Latent Dirichlet Allocation
Application of topic models in text mining of educational data, and more specifically of the text obtained from lecture videos, is a largely unexplored area of research which holds great potential. This work seeks empirical evidence for an improvement in topic modeling obtained by pre-extracting bigram tokens and adding them as additional features in the Latent Dirichlet Allocation (LDA) algorithm, a widely recognized topic modeling technique. The dataset considered for analysis is a collection of transcripts of video lectures on Machine Learning scraped from YouTube. Using the cosine similarity distance measure as a metric, the experiment showed a statistically significant improvement in topic model performance over the baseline topic model that did not use extra features, thus confirming the hypothesis. By introducing explainable features before modeling and using deep-learning-based text representation only at the post-modeling evaluation stage, the overall model interpretability is retained. This empowers educators and researchers alike not only to benefit from the LDA model in their own fields but also to play a substantial role in efforts to improve model performance. It also sets the direction for future work, which could use the feature-augmented topic model as the input to other common text mining tasks such as document categorization and information retrieval.
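The bigram pre-extraction step can be sketched in a few lines. This is an illustrative reconstruction: the frequency threshold and the underscore-joined token format are choices made for the example rather than taken from the paper.

```python
from collections import Counter


def augment_with_bigrams(token_docs, min_count=2):
    """Pre-extract frequent bigrams and append them as extra tokens.

    Bigrams occurring at least min_count times across the corpus are
    kept; each document then gets joined tokens such as
    'machine_learning' appended, and the augmented token lists would be
    fed to LDA alongside the original unigrams.
    """
    counts = Counter()
    for doc in token_docs:
        counts.update(zip(doc, doc[1:]))
    keep = {bg for bg, c in counts.items() if c >= min_count}
    out = []
    for doc in token_docs:
        extra = ["_".join(bg) for bg in zip(doc, doc[1:]) if bg in keep]
        out.append(doc + extra)
    return out
```

Because the bigrams are ordinary tokens by the time LDA sees them, no change to the modeling code is needed, which is what keeps the approach interpretable.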