With the proliferation of soft and hard sensors, data spanning multiple sensor modalities has become commonplace. In this dissertation, we propose a general-purpose agent model that operates through a closed perception-action loop. The agent actively and sequentially samples its environment, driven by sensory prediction error, and learns where and what to sample by minimizing this prediction error, without any reinforcement signal. This end-to-end model is evaluated on three applications: (1) generation and recognition of handwritten numerals and alphabets from images and videos, (2) generation and recognition of human-human interactions from videos, and (3) recognition of emotions from speech via generation. For each application, the model achieves state-of-the-art accuracy on benchmark datasets while remaining efficient in both sample complexity and model size. To validate our model against human performance, we collect mouse-click attention tracking (mcAT) data from 382 participants who attempt to recognize handwritten numerals and alphabets (uppercase and lowercase) from images via sequential sampling. Images from benchmark datasets are presented as stimuli. The collected data consist of the sequence of sample (click) locations, the predicted class label(s) after each sample, and the duration of each sample. We show that, on average, participants observe only 12.8% of an image before recognition. When exposed to the same stimuli and experimental conditions as the participants, our agent model performs handwritten numeral/alphabet recognition more efficiently than both the participants and a highly cited attention-based reinforcement learning model.
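
The prediction-error-driven perception-action loop described above can be sketched, in broad strokes, as follows. This is a minimal illustrative sketch, not the dissertation's implementation: the `predictor` interface, its method names, and the stopping heuristic are assumptions introduced only to make the loop concrete.

```python
# Illustrative sketch (assumed interface, not the dissertation's code): the agent
# repeatedly chooses a glimpse location, predicts what it expects to observe there,
# compares the prediction with the actual patch, and uses the resulting prediction
# error both to update its class belief and to decide when to stop sampling.
import numpy as np

def recognize(image, predictor, max_glimpses=10, error_threshold=0.05):
    """`predictor` is a hypothetical object assumed to expose:
       - initial_belief():                 initial class belief
       - propose_location(belief):         where to sample next
       - predict_patch(location, belief):  expected patch at that location
       - update(location, patch, error):   revised class belief
    """
    belief = predictor.initial_belief()
    for _ in range(max_glimpses):
        loc = predictor.propose_location(belief)           # where to sample
        expected = predictor.predict_patch(loc, belief)    # what the agent expects
        observed = image[loc]                              # what it actually sees
        error = float(np.mean((observed - expected) ** 2)) # sensory prediction error
        belief = predictor.update(loc, observed, error)    # belief update, no reward
        if error < error_threshold:                        # stop once predictions match
            break
    return int(np.argmax(belief))                          # predicted class label
```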