14 research outputs found

    Masked Autoencoder for Unsupervised Video Summarization

    Summarizing a video requires a diverse understanding of the video, from recognizing scenes to judging whether each frame is essential enough to be selected for the summary. Self-supervised learning (SSL) is valued for its robustness and flexibility across downstream tasks, but video SSL has not yet shown its value for dense understanding tasks like video summarization. We claim that an unsupervised autoencoder given sufficient self-supervised training can serve as a video summarization model without any extra downstream architecture design or fine-tuning of weights. The proposed method evaluates the importance score of each frame by exploiting the reconstruction score of the autoencoder's decoder. We evaluate the method on major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
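    As a minimal sketch of this scoring idea (assuming a pretrained masked autoencoder exposed through hypothetical encode/decode methods; the interface names, the MSE-based reconstruction score, and the sign convention are assumptions, not the paper's stated API):

    import torch

    @torch.no_grad()
    def frame_importance(mae, frames: torch.Tensor) -> torch.Tensor:
        """Score frames by the decoder's reconstruction quality.

        frames: (T, C, H, W) tensor of video frames.
        Returns a (T,) tensor of per-frame importance scores.
        """
        recon = mae.decode(mae.encode(frames))             # (T, C, H, W)
        err = ((recon - frames) ** 2).mean(dim=(1, 2, 3))  # per-frame MSE
        # Assumption: frames the decoder reconstructs poorly carry more novel
        # content and deserve higher scores; flip the sign for the opposite reading.
        return err

    # Hypothetical usage, given a pretrained `mae` and a (T, C, H, W) `frames` tensor:
    #   scores = frame_importance(mae, frames)
    #   k = max(1, int(0.15 * scores.numel()))   # keep the top 15% of frames (illustrative)
    #   summary_idx = scores.topk(k).indices.sort().values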

    Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

    Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views of the same video and largely rely on the quality of the generated views. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes undesirable information from the video by filtering out specific frequency components, so that the learned representation captures the essential features of the video for various downstream tasks. Specifically, FreqAug pushes the model to focus on dynamic rather than static features by dropping spatial or temporal low-frequency components. In other words, learning invariance between the remaining frequency components yields a high-frequency-enhanced representation with less static bias. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks alongside standard augmentations. Transferring the improved representations to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over the baselines.
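    As a minimal sketch of this filtering idea (assuming PyTorch clips shaped (C, T, H, W); the cutoff value, helper name, and 50/50 temporal-versus-spatial choice are illustrative, not the authors' implementation):

    import torch

    def drop_low_freq(clip: torch.Tensor, dim: int, cutoff: float = 0.05) -> torch.Tensor:
        """Zero out components with |frequency| < cutoff (cycles/sample) along one axis."""
        spec = torch.fft.fft(clip, dim=dim)
        freqs = torch.fft.fftfreq(clip.shape[dim], device=clip.device)
        keep = (freqs.abs() >= cutoff).to(spec.real.dtype)
        shape = [1] * clip.dim()
        shape[dim] = clip.shape[dim]
        return torch.fft.ifft(spec * keep.view(shape), dim=dim).real

    # Stochastically drop temporal or spatial low frequencies before the usual view generation.
    clip = torch.randn(3, 16, 112, 112)  # (C, T, H, W)
    if torch.rand(1).item() < 0.5:
        aug = drop_low_freq(clip, dim=1)                        # temporal axis
    else:
        aug = drop_low_freq(drop_low_freq(clip, dim=2), dim=3)  # spatial axes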

    Proceedings of the 2003 Winter Simulation Conference

    We present a system called RUBE, which allows a modeler to customize model components and model structure in 2D and 3D. RUBE employs open-source tools to assist in model authoring, allowing the user to visualize models with different metaphors. For example, it is possible to visualize an event graph as a city block, or a Petri net as an organically oriented 3D machine. We suggest that such flexibility in visualization will allow existing model types to take on forms that are more recognizable to modeling subcommunities, while employing notation afforded by inexpensive graphics hardware. It is also possible to create model types using entirely new notations.