A model for adapting 3D graphics based on scalable coding, real-time simplification and remote rendering
Most current multiplayer 3D games can only be played on dedicated platforms, requiring specifically designed content and communication over a predefined network. To overcome these limitations, the OLGA (On-Line GAming) consortium has devised a framework for developing distributed, multiplayer 3D games. Scalability at the level of content, platforms, and networks is exploited to achieve the best trade-offs between complexity and quality.
Human activity recognition for pervasive interaction
PhD Thesis
This thesis addresses the challenge of computing food preparation context in the kitchen. The automatic
recognition of fine-grained human activities and food ingredients is realized through pervasive sensing
which we achieve by instrumenting kitchen objects such as knives, spoons, and chopping boards with
sensors. Context recognition in the kitchen lies at the heart of a broad range of real-world applications. In
particular, activity and food ingredient recognition in the kitchen is an essential component for situated
services such as automatic prompting services for cognitively impaired kitchen users and digital situated
support for healthier eating interventions. Previous works, however, have addressed the activity
recognition problem by exploring high-level human activities using wearable sensing (i.e., sensors worn
on the human body) or using technologies that raise privacy concerns (i.e., computer vision). Although such
approaches have yielded significant results for a number of activity recognition problems, they are not
applicable to our domain of investigation, for which we argue that the technology itself must be genuinely
“invisible”, thereby allowing users to perform their activities in a completely natural manner.
In this thesis we describe the development of pervasive sensing technologies and algorithms for fine-grained
human activity and food ingredient recognition in the kitchen. After reviewing previous work on
food and activity recognition, we present three systems that constitute increasingly sophisticated
approaches to the challenge of kitchen context recognition. Two of these systems, Slice&Dice and Class-based
Threshold Dynamic Time Warping (CBT-DTW), recognize fine-grained food preparation
activities. Slice&Dice is a proof-of-concept application, whereas CBT-DTW is a real-time application
that also addresses the problem of recognising unknown activities. The final system, KitchenSense, is a
real-time context recognition framework that deals with the recognition of a more complex set of
activities, and includes the recognition of food ingredients and events in the kitchen. For each system, we
describe the prototyping of pervasive sensing technologies and algorithms, as well as the real-world
experiments and empirical evaluations that validate the proposed solutions.
This work was funded by the Vietnamese government's 322 project, executed by the Vietnamese Ministry of
Education and Training.
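To make the CBT-DTW idea concrete, the following is a minimal Python sketch of class-based threshold dynamic time warping with unknown-activity rejection, reconstructed from the name alone: a query sensor sequence is matched against labeled templates by DTW distance, and each class carries its own rejection threshold. The template store, threshold values, and feature dimensions are hypothetical; the thesis's actual segmentation and threshold-learning pipeline may differ.

```python
# Hedged sketch of class-based threshold DTW (CBT-DTW) for activity
# recognition with rejection of unknown activities. Templates and
# per-class thresholds are assumed to come from training data.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW between two (time, features) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def classify(query, templates, thresholds):
    """templates: {label: [sequence, ...]}; thresholds: {label: float}.
    Returns the best-matching label, or 'unknown' when the distance
    exceeds that class's own threshold."""
    best_label, best_dist = None, np.inf
    for label, seqs in templates.items():
        for seq in seqs:
            d = dtw_distance(query, seq)
            if d < best_dist:
                best_label, best_dist = label, d
    if best_label is None or best_dist > thresholds[best_label]:
        return "unknown"
    return best_label
```

Per-class thresholds let frequently confused classes be stricter than well-separated ones, which is one plausible reading of why a single global threshold would not suffice.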
End-to-end people detection in crowded scenes
Current people detectors operate either by scanning an image in a sliding
window fashion or by classifying a discrete set of proposals. We propose a
model that is based on decoding an image into a set of people detections. Our
system takes an image as input and directly outputs a set of distinct detection
hypotheses. Because we generate predictions jointly, common post-processing
steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM
layer for sequence generation and train our model end-to-end with a new loss
function that operates on sets of detections. We demonstrate the effectiveness
of our approach on the challenging task of detecting people in crowded scenes.
Comment: 9 pages, 7 figures. Submitted to NIPS 2015. Supplementary material video: http://www.youtube.com/watch?v=QeWl0h3kQ2
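As an illustration of decoding an image into a set of detections, the PyTorch sketch below pairs an LSTM decoder with a Hungarian-matched set loss. The backbone, feature dimensions, and the simplified loss are assumptions for illustration (`LSTMDetector`, `set_loss`, and all hyperparameters are invented here), not the paper's exact architecture.

```python
# Sketch: CNN features -> LSTM decoder -> fixed-length list of box
# hypotheses with confidences, trained with a set loss.
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class LSTMDetector(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, max_dets=10):
        super().__init__()
        self.encoder = nn.Sequential(          # stand-in for a CNN backbone
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.box_head = nn.Linear(hidden, 4)   # (x, y, w, h) per step
        self.conf_head = nn.Linear(hidden, 1)  # detection confidence per step
        self.max_dets = max_dets

    def forward(self, images):
        feat = self.encoder(images)                           # (B, feat_dim)
        steps = feat.unsqueeze(1).repeat(1, self.max_dets, 1)
        h, _ = self.lstm(steps)                               # (B, T, hidden)
        return self.box_head(h), self.conf_head(h).squeeze(-1)

def set_loss(pred_boxes, pred_conf, gt_boxes):
    """Set loss for one image: Hungarian matching between predicted and
    ground-truth boxes, L1 on matched pairs, BCE on confidences."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)             # (T, G)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows = torch.as_tensor(rows, dtype=torch.long)
    cols = torch.as_tensor(cols, dtype=torch.long)
    target = torch.zeros_like(pred_conf)                      # unmatched -> 0
    target[rows] = 1.0                                        # matched -> 1
    box_loss = cost[rows, cols].mean() if len(rows) else cost.new_zeros(())
    conf_loss = nn.functional.binary_cross_entropy_with_logits(pred_conf, target)
    return box_loss + conf_loss
```

Because each ground-truth box is matched to at most one prediction, duplicate hypotheses are penalized directly by the loss, which is what makes post-processing such as non-maximum suppression unnecessary.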
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision applications have been driven
by fully-supervised methods, primarily using only visual data. These methods
rely on manually annotated surgical videos to predict a fixed set of object
categories, limiting their generalizability to unseen surgical procedures and
downstream tasks. In this work, we put forward the idea that the surgical video
lectures available through open surgical e-learning platforms can provide
effective supervisory signals for multi-modal representation learning without
relying on manual annotations. We address the surgery-specific linguistic
challenges present in surgical video lectures by employing multiple
complementary automatic speech recognition systems to generate text
transcriptions. We then present a novel method, SurgVLP - Surgical Vision
Language Pre-training, for multi-modal representation learning. SurgVLP
constructs a new contrastive learning objective to align video clip embeddings
with the corresponding multiple text embeddings by bringing them together
within a joint latent space. To effectively show the representation capability
of the learned joint latent space, we introduce several vision-and-language
tasks for surgery, such as text-based video retrieval, temporal activity
grounding, and video captioning, as benchmarks for evaluation. We further
demonstrate that without using any labeled ground truth, our approach can be
employed for traditional vision-only surgical downstream tasks, such as
surgical tool, phase, and triplet recognition. The code will be made available
at https://github.com/CAMMA-public/SurgVL
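The contrastive objective can be illustrated with a short sketch: an InfoNCE-style loss that aligns each video clip embedding with several text embeddings (one per complementary ASR transcription) in a shared latent space. This is an approximation under our own assumptions, not the exact SurgVLP loss; the function name, the averaging over transcriptions, and the symmetric formulation are ours.

```python
# Sketch of a clip-text contrastive loss in the spirit of SurgVLP:
# each clip has K positive text embeddings (e.g., from K different
# ASR systems); other clips in the batch serve as negatives.
import torch
import torch.nn.functional as F

def multi_text_contrastive_loss(clip_emb, text_embs, temperature=0.07):
    """clip_emb: (B, D) video clip embeddings.
    text_embs: (B, K, D) text embeddings, K per clip.
    Symmetric InfoNCE, averaged over the K positive texts per clip."""
    B, K, D = text_embs.shape
    clip_emb = F.normalize(clip_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    labels = torch.arange(B, device=clip_emb.device)
    loss = 0.0
    for k in range(K):
        logits = clip_emb @ text_embs[:, k].T / temperature   # (B, B)
        # clip-to-text and text-to-clip cross-entropy, diagonal is positive
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.T, labels))
    return loss / K
```

Averaging over the K transcriptions is one simple way to let multiple noisy ASR outputs jointly supervise the same clip without committing to any single transcription being correct.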
Deep Learning in the Automotive Industry: Applications and Tools
Deep Learning refers to a set of machine learning techniques that utilize
neural networks with many hidden layers for tasks such as image
classification, speech recognition, and language understanding. Deep learning has
been proven to be very effective in these domains and is pervasively used by
many Internet services. In this paper, we describe different automotive use
cases for deep learning, in particular in the domain of computer vision. We
survey the current state-of-the-art in libraries, tools, and infrastructures
(e.g., GPUs and clouds) for implementing, training, and deploying deep neural
networks. We particularly focus on convolutional neural networks and computer
vision use cases, such as the visual inspection process in manufacturing plants
and the analysis of social media data. To train neural networks, curated and
labeled datasets are essential. However, both the availability and scope
of such datasets are typically very limited. A main contribution of this paper
is the creation of an automotive dataset that allows us to learn and
automatically recognize different vehicle properties. We describe an end-to-end
deep learning application utilizing a mobile app for data collection and
process support, and an Amazon-based cloud backend for storage and training.
For training we evaluate the use of cloud and on-premises infrastructures
(including multiple GPUs) in conjunction with different neural network
architectures and frameworks. We assess both the training times and the
accuracy of the classifier. Finally, we demonstrate the effectiveness of the
trained classifier in a real-world setting during the manufacturing process.
Comment: 10 pages
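As a rough illustration of the kind of end-to-end pipeline the paper describes, the sketch below fine-tunes a pretrained CNN on a folder-per-class image dataset in PyTorch. The dataset path, class count, and hyperparameters are invented for illustration; the paper itself evaluates multiple frameworks, architectures, and cloud versus on-premises training rather than this specific setup.

```python
# Sketch: transfer learning a CNN to recognize vehicle properties from
# a hypothetical folder-per-class dataset collected via a mobile app.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_PROPERTIES = 10  # hypothetical number of vehicle property classes

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# "vehicle_dataset/train" is an assumed layout: one subfolder per class
train_set = datasets.ImageFolder("vehicle_dataset/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# start from an ImageNet-pretrained backbone, replace the classifier head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_PROPERTIES)

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:   # one epoch shown for brevity
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```

The same script runs unchanged on a single GPU, a multi-GPU on-premises box, or a cloud instance, which is the axis along which the paper compares training times.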