GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination
Recent progress in deep learning is revolutionizing the healthcare domain,
including providing solutions to medication recommendations, especially
recommending medication combinations for patients with complex health
conditions. Existing approaches either do not customize based on patient health
history, or ignore existing knowledge on drug-drug interactions (DDI) that
might lead to adverse outcomes. To fill this gap, we propose the Graph
Augmented Memory Networks (GAMENet), which integrates the drug-drug
interaction knowledge graph via a memory module implemented as graph
convolutional networks, and models longitudinal patient records as the query.
It is trained end-to-end to provide safe and personalized recommendations of
medication combinations. We demonstrate the effectiveness and safety of GAMENet
by comparing it with several state-of-the-art methods on real EHR data. GAMENet
outperformed all baselines in all effectiveness measures, and also achieved a
3.60% DDI rate reduction relative to existing EHR data.
Comment: AAAI 2019; changed the template and fixed some typos.
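As a rough sketch of the architecture described above (not the authors'
implementation), the following PyTorch snippet shows how a drug memory built
by one graph convolution over a DDI adjacency matrix can be read with an
attention query derived from a GRU summary of the visit sequence; all class,
variable, and dimension names are invented for illustration.

```python
# Hedged sketch of a graph-augmented memory read in the spirit of GAMENet.
# Assumptions (not from the abstract): PyTorch, a single GCN step over the
# DDI graph, a GRU over visit embeddings as the patient query, and a sigmoid
# multi-label head over the drug vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMemoryRecommender(nn.Module):
    def __init__(self, num_drugs, visit_dim, hidden_dim, adj):
        super().__init__()
        # Row-normalized adjacency of the drug graph (num_drugs x num_drugs);
        # clamp avoids division by zero for isolated drugs.
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True).clamp(min=1))
        self.drug_emb = nn.Embedding(num_drugs, hidden_dim)   # memory entries
        self.gcn = nn.Linear(hidden_dim, hidden_dim)           # one graph conv step
        self.visit_rnn = nn.GRU(visit_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_drugs)

    def forward(self, visits):                # visits: (batch, T, visit_dim)
        # Query: summary of the longitudinal patient record.
        _, q = self.visit_rnn(visits)         # (1, batch, hidden_dim)
        q = q.squeeze(0)
        # Memory: drug embeddings propagated over the DDI graph.
        mem = F.relu(self.gcn(self.adj @ self.drug_emb.weight))  # (num_drugs, hidden)
        # Attention read from memory conditioned on the query.
        attn = F.softmax(q @ mem.t(), dim=-1)                    # (batch, num_drugs)
        read = attn @ mem                                        # (batch, hidden)
        # Multi-label scores over the drug set.
        return torch.sigmoid(self.out(torch.cat([q, read], dim=-1)))
```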
Deep Virtual Networks for Memory Efficient Inference of Multiple Tasks
Deep networks consume a large amount of memory by their nature. A natural
question arises: can we reduce that memory requirement whilst maintaining
performance? In particular, in this work we address the problem of memory
efficient learning for multiple tasks. To this end, we propose a novel network
architecture producing multiple networks of different configurations, termed
deep virtual networks (DVNs), for different tasks. Each DVN is specialized for
a single task and structured hierarchically. The hierarchical structure, which
contains multiple levels of hierarchy corresponding to different numbers of
parameters, enables multiple inference for different memory budgets. The
building block of a deep virtual network is based on a disjoint collection of
parameters of a network, which we call a unit. The lowest level of hierarchy in
a deep virtual network is a unit, and higher levels of hierarchy contain lower
levels' units and other additional units. Given a budget on the number of
parameters, a different level of a deep virtual network can be chosen to
perform the task. A unit can be shared by different DVNs, allowing multiple
DVNs in a single network. In addition, shared units provide assistance to the
target task with additional knowledge learned from other tasks. This
cooperative configuration of DVNs makes it possible to handle different tasks
in a memory-aware manner. Our experiments show that the proposed method
outperforms existing approaches for multiple tasks. Notably, ours is more
efficient than others as it allows memory-aware inference for all tasks.
Comment: CVPR 2019.
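A minimal sketch of the unit/level idea, assuming (beyond the abstract) that
each unit is a small convolutional block, features from the active units are
concatenated after pooling, and each hierarchy level has its own classifier
head; names and sizes are hypothetical.

```python
# Hedged sketch of hierarchical "units" selected by a memory budget, in the
# spirit of deep virtual networks; not the paper's implementation.
import torch
import torch.nn as nn


class UnitBackbone(nn.Module):
    def __init__(self, in_ch=3, unit_ch=16, num_units=3, num_classes=10):
        super().__init__()
        # Disjoint parameter groups ("units"); level k uses units 0..k.
        self.units = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, unit_ch, 3, padding=1),
                          nn.BatchNorm2d(unit_ch), nn.ReLU())
            for _ in range(num_units)
        )
        self.heads = nn.ModuleList(
            nn.Linear((k + 1) * unit_ch, num_classes) for k in range(num_units)
        )

    def forward(self, x, level):
        # Run only the units allowed by the budget and pool their features.
        feats = [u(x) for u in self.units[:level + 1]]
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
        return self.heads[level](pooled)


model = UnitBackbone()
x = torch.randn(2, 3, 32, 32)
small_budget = model(x, level=0)   # fewest parameters used
full_budget = model(x, level=2)    # all units active
```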
End-to-End Audiovisual Fusion with LSTMs
Several end-to-end deep learning approaches have been recently presented
which simultaneously extract visual features from the input images and perform
visual speech classification. However, research on jointly extracting audio and
visual features and performing classification is very limited. In this work, we
present an end-to-end audiovisual model based on Bidirectional Long Short-Term
Memory (BLSTM) networks. To the best of our knowledge, this is the first
audiovisual fusion model which simultaneously learns to extract features
directly from the pixels and spectrograms and perform classification of speech
and nonlinguistic vocalisations. The model consists of multiple identical
streams, one for each modality, which extract features directly from mouth
regions and spectrograms. The temporal dynamics in each stream/modality are
modeled by a BLSTM and the fusion of multiple streams/modalities takes place
via another BLSTM. An absolute improvement of 1.9% in the mean F1 of 4
nonlinguistic vocalisations over audio-only classification is reported on the
AVIC database. At the same time, the proposed end-to-end audiovisual fusion
system improves the state-of-the-art performance on the AVIC database leading
to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual
speech recognition experiments on the OuluVS2 database using different views of
the mouth, frontal to profile. The proposed audiovisual system significantly
outperforms the audio-only model for all views when the acoustic noise is high.
Comment: Accepted to AVSP 2017. arXiv admin note: substantial text overlap
with arXiv:1709.00443 and text overlap with arXiv:1701.0584
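The stream-then-fusion structure described above can be sketched as follows,
assuming (not stated in the abstract) precomputed per-frame features for each
modality, temporally aligned audio and visual sequences, and single-layer
BLSTMs; this is an illustrative outline, not the paper's model.

```python
# Hedged sketch of two-stream BLSTM fusion for audiovisual classification.
import torch
import torch.nn as nn


class AVFusionBLSTM(nn.Module):
    def __init__(self, vis_dim, aud_dim, hidden=128, num_classes=5):
        super().__init__()
        self.vis_blstm = nn.LSTM(vis_dim, hidden, batch_first=True, bidirectional=True)
        self.aud_blstm = nn.LSTM(aud_dim, hidden, batch_first=True, bidirectional=True)
        # Fusion BLSTM runs over the concatenated per-frame stream outputs.
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, vis, aud):          # both: (batch, T, feat_dim), same T
        v, _ = self.vis_blstm(vis)        # (batch, T, 2*hidden)
        a, _ = self.aud_blstm(aud)        # (batch, T, 2*hidden)
        fused, _ = self.fusion(torch.cat([v, a], dim=-1))
        # Classify from the last fused time step (a common simplification).
        return self.classifier(fused[:, -1])
```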
Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view
retrieval task, in which queries are abstract and ambiguous sketches while the
retrieval database is formed with natural images. Work in this area mainly
focuses on extracting representative and shared features for sketches and
natural images. However, these can neither cope well with the geometric
distortion between sketches and images nor be feasible for large-scale SBIR due
to the heavy continuous-valued distance computation. In this paper, we speed up
SBIR by introducing a novel binary coding method, named \textbf{Deep Sketch
Hashing} (DSH), where a semi-heterogeneous deep architecture is proposed and
incorporated into an end-to-end binary coding framework. Specifically, three
convolutional neural networks are utilized to encode free-hand sketches,
natural images and, especially, the auxiliary sketch-tokens which are adopted
as bridges to mitigate the sketch-image geometric distortion. The learned DSH
codes can effectively capture the cross-view similarities as well as the
intrinsic semantic correlations between different categories. To the best of
our knowledge, DSH is the first hashing work specifically designed for
category-level SBIR with an end-to-end deep architecture. The proposed DSH is
comprehensively evaluated on two large-scale datasets of TU-Berlin Extension
and Sketchy, and the experiments consistently show DSH's superior SBIR
accuracies over several state-of-the-art methods, while achieving significantly
reduced retrieval time and memory footprint.
Comment: This paper will appear as a spotlight paper in CVPR 2017.
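Only the binary-coding and retrieval side is easy to illustrate compactly.
The sketch below, which is not the DSH training objective, shows a hashing
head with a tanh relaxation and Hamming-distance ranking between a sketch
code and a database of image codes; backbones, dimensions, and names are
assumed for illustration.

```python
# Hedged sketch of a binary-coding head plus Hamming-distance retrieval.
import torch
import torch.nn as nn


class HashHead(nn.Module):
    def __init__(self, feat_dim, code_bits=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, code_bits)

    def forward(self, feats):
        return torch.tanh(self.fc(feats))        # relaxed codes in (-1, 1)


def hamming_rank(query_code, db_codes):
    # Binarize and rank database items by Hamming distance to the query.
    q = torch.sign(query_code)                   # (bits,)
    db = torch.sign(db_codes)                    # (N, bits)
    dists = (q.numel() - db @ q) / 2             # inner product -> Hamming distance
    return torch.argsort(dists)                  # closest items first


head = HashHead(feat_dim=512, code_bits=64)      # features from any CNN backbone
sketch_code = head(torch.randn(512))
image_codes = head(torch.randn(1000, 512))
ranking = hamming_rank(sketch_code, image_codes)
```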
End-to-End Visual Speech Recognition with LSTMs
Traditional visual speech recognition systems consist of two stages, feature
extraction and classification. Recently, several deep learning approaches
have been presented which automatically extract features from the mouth
images and aim to replace the feature extraction stage. However, research on
joint learning of features and classification is very limited. In this work,
we present an end-to-end visual speech recognition system based on Long
Short-Term Memory (LSTM) networks. To the best of our knowledge, this is the
first model which simultaneously learns to extract features directly from
the pixels and perform classification, and also achieves state-of-the-art
performance in visual speech classification. The model consists of two
streams which extract features directly from the mouth and difference
images, respectively. The temporal dynamics in each stream are modelled by
an LSTM and the fusion of the two streams takes place via a Bidirectional
LSTM (BLSTM). An absolute improvement of 9.7% over the baseline is reported
on the OuluVS2 database, and 1.5% on the CUAVE database when compared with
other methods which use a similar visual front-end.
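A compact sketch of the two-stream idea (raw mouth frames plus difference
images, an LSTM per stream, and BLSTM fusion) might look as follows; the
linear encoders stand in for the paper's feature-extraction layers, and all
names and sizes are hypothetical.

```python
# Hedged sketch of a two-stream lipreading model: raw frames and temporal
# difference images, each with its own LSTM, fused by a BLSTM.
import torch
import torch.nn as nn


class TwoStreamLipreader(nn.Module):
    def __init__(self, pixel_dim, hidden=128, num_classes=10):
        super().__init__()
        self.enc_raw = nn.Linear(pixel_dim, hidden)
        self.enc_diff = nn.Linear(pixel_dim, hidden)
        self.lstm_raw = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm_diff = nn.LSTM(hidden, hidden, batch_first=True)
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):                       # (batch, T, pixel_dim)
        diffs = frames[:, 1:] - frames[:, :-1]       # temporal difference images
        r, _ = self.lstm_raw(torch.relu(self.enc_raw(frames[:, 1:])))
        d, _ = self.lstm_diff(torch.relu(self.enc_diff(diffs)))
        fused, _ = self.fusion(torch.cat([r, d], dim=-1))
        return self.classifier(fused[:, -1])         # predict from last time step
```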
Ultrafast Video Attention Prediction with Coupled Knowledge Distillation
Large convolutional neural network models have recently demonstrated
impressive performance on video attention prediction. Conventionally, these
models involve intensive computation and large memory. To address these
issues, we design an extremely light-weight network with ultrafast speed, named
UVA-Net. The network is constructed based on depth-wise convolutions and takes
low-resolution images as input. However, this straightforward acceleration
method will decrease performance dramatically. To this end, we propose a
coupled knowledge distillation strategy to augment and train the network
effectively. With this strategy, the model can further automatically discover
and emphasize implicit useful cues contained in the data. Both spatial and
temporal knowledge learned by the high-resolution complex teacher networks can
also be distilled and transferred into the proposed low-resolution light-weight
spatiotemporal network. Experimental results show that the performance of our
model is comparable to that of ten state-of-the-art models in video attention
prediction, while it requires only a 0.68 MB memory footprint and runs at about
10,106 FPS on GPU and 404 FPS on CPU, which is 206 times faster than previous
models.
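As an illustration of the two ingredients named above, the sketch below shows
a depth-wise separable convolution block and a simple distillation-style loss
that mixes the supervised term with a term matching a frozen teacher's
prediction; this is an assumed formulation, not the paper's coupled
distillation strategy.

```python
# Hedged sketch: depth-wise separable convolution and a distillation-style
# loss for a lightweight attention-prediction student network.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depth-wise conv (one filter per channel) followed by a 1x1 pointwise conv.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return F.relu(self.pointwise(F.relu(self.depthwise(x))))


def distill_loss(student_map, teacher_map, target_map, alpha=0.5):
    # Blend the ground-truth loss with a term pulling the student's
    # prediction toward the (frozen) teacher's prediction.
    sup = F.mse_loss(student_map, target_map)
    kd = F.mse_loss(student_map, teacher_map.detach())
    return alpha * sup + (1 - alpha) * kd
```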