Searching for A Robust Neural Architecture in Four GPU Hours
Conventional neural architecture search (NAS) approaches are based on reinforcement learning or evolution strategies, which take more than 3000 GPU hours to find a good model on CIFAR-10. We propose an efficient NAS approach that learns to search by gradient descent. Our approach represents the search space as a directed acyclic graph (DAG). This DAG contains billions of sub-graphs, each of which represents a candidate neural architecture. To avoid traversing all possible sub-graphs, we develop a differentiable sampler over the DAG. The sampler is learnable and is optimized by the validation loss obtained after training the sampled architecture. In this way, our approach can be trained end-to-end by gradient descent; we name it Gradient-based search using Differentiable Architecture Sampler (GDAS). In experiments, we finish one search procedure in four GPU hours on CIFAR-10, and the discovered model obtains a test error of 2.82% with only 2.5M parameters, which is on par with the state of the art. Code is publicly available on GitHub: https://github.com/D-X-Y/NAS-Projects.
Comment: Minor modifications to the CVPR 2019 camera-ready version (added code link).
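The differentiable sampler is the part of the abstract that a short sketch can make concrete. Below is a minimal PyTorch sketch of one DAG edge that samples a single candidate operation per forward pass with a hard Gumbel-softmax, a standard way to keep a discrete choice differentiable with respect to architecture logits. The class name GumbelEdge, the candidate operations, and the tensor shapes are illustrative assumptions, not the authors' released code; see the linked repository for the actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GumbelEdge(nn.Module):
        """One DAG edge: sample a single candidate op per forward pass with a
        hard Gumbel-softmax, keeping the choice differentiable w.r.t. the logits."""

        def __init__(self, ops):
            super().__init__()
            self.ops = nn.ModuleList(ops)                      # candidate operations
            self.logits = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters

        def forward(self, x, tau=1.0):
            # hard=True yields a one-hot sample; gradients still reach
            # self.logits via the straight-through estimator.
            weights = F.gumbel_softmax(self.logits, tau=tau, hard=True)
            # For clarity this sums over all ops; since `weights` is one-hot,
            # only the sampled op contributes (GDAS evaluates just that op,
            # which is where the speedup comes from).
            return sum(w * op(x) for w, op in zip(weights, self.ops))

    # Toy usage: three candidate ops on one edge of a cell.
    edge = GumbelEdge([
        nn.Conv2d(16, 16, 3, padding=1),
        nn.Conv2d(16, 16, 5, padding=2),
        nn.Identity(),
    ])
    out = edge(torch.randn(2, 16, 32, 32))
    # edge.logits would then be updated by backpropagating the validation loss.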
Bidirectional multirate reconstruction for temporal modeling in videos
© 2017 IEEE. Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding the frames of a clip at different intervals. This learning process makes the learned model more robust to variations in motion speed. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., the present → past transition and the present → future transition, reflecting the temporal information from two views. The proposed method exploits the two transitions simultaneously through a bidirectional reconstruction that consists of a backward reconstruction and a forward reconstruction. We apply the proposed method to two challenging video tasks, i.e., complex event detection and video captioning, on which it achieves state-of-the-art performance. Notably, our method generates the best single feature for event detection, with a relative improvement of 10.4% on the MEDTest-13 dataset, and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
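The bidirectional reconstruction objective can be sketched compactly. Below is a minimal PyTorch sketch that encodes the present clip and regresses the features of its past and future neighbors; the GRU encoder, the MSE reconstruction loss, the feature dimensions, and the stride set {1, 2, 4} standing in for multirate sampling are all illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BidirectionalReconstructor(nn.Module):
        """Encode the present clip at a given frame interval, then regress the
        features of its past and future neighbor clips (backward + forward)."""

        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.to_past = nn.Linear(hidden_dim, feat_dim)    # present -> past
            self.to_future = nn.Linear(hidden_dim, feat_dim)  # present -> future

        def forward(self, present_frames, past_feat, future_feat, rate=1):
            # present_frames: (B, T, feat_dim) per-frame features from a
            # pretrained CNN; subsampling with different strides stands in
            # for encoding the clip at different frame intervals.
            clip = present_frames[:, ::rate, :]
            _, h = self.encoder(clip)        # h: (1, B, hidden_dim)
            h = h.squeeze(0)
            backward_loss = F.mse_loss(self.to_past(h), past_feat)
            forward_loss = F.mse_loss(self.to_future(h), future_feat)
            return backward_loss + forward_loss

    # Toy usage: average the objective over several sampling rates.
    model = BidirectionalReconstructor()
    frames = torch.randn(4, 32, 512)                       # present clip features
    past, future = torch.randn(4, 512), torch.randn(4, 512)
    loss = sum(model(frames, past, future, rate=r) for r in (1, 2, 4))
    loss.backward()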