78 research outputs found

    Circulant temporal encoding for video retrieval and temporal alignment

    We address the problem of specific video event retrieval. Given a query video of a specific event, e.g., a concert of Madonna, the goal is to retrieve other videos of the same event that temporally overlap with the query. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to efficiently compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. The descriptors can be compressed in the frequency domain with a product quantizer adapted to complex numbers, in which case video retrieval is performed without decompressing the descriptors. We also consider the temporal alignment of a set of videos. We exploit the matching confidence and an estimate of the temporal offset computed for all pairs of videos by our retrieval approach. Our robust algorithm aligns the videos on a global timeline by maximizing the set of temporally consistent matches. The global temporal alignment enables synchronous playback of the videos of a given scene.
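    The frequency-domain trick the abstract relies on can be illustrated with circular cross-correlation: by the convolution theorem, the match score between two descriptor sequences at every temporal shift is obtained with a few FFTs instead of a sliding comparison. Below is a minimal sketch in Python/NumPy, assuming equal-length (zero-padded) sequences; function and variable names are ours, not the authors':

        # Sketch of frequency-domain temporal matching (illustrative, not the paper's code).
        import numpy as np

        def temporal_match_scores(a, b):
            """a, b: (T, d) per-frame descriptors, zero-padded to the same length T.
            Returns score[s] = sum_t <a[t], b[(t + s) mod T]> for every shift s."""
            fa = np.fft.rfft(a, axis=0)              # per-dimension FFT over time
            fb = np.fft.rfft(b, axis=0)
            cross = (np.conj(fa) * fb).sum(axis=1)   # correlation, summed over dimensions
            return np.fft.irfft(cross, n=a.shape[0])

        rng = np.random.default_rng(0)
        q = rng.standard_normal((128, 64))           # query: 128 frames, 64-d descriptors
        db = np.roll(q, 17, axis=0) + 0.1 * rng.standard_normal((128, 64))
        print(int(np.argmax(temporal_match_scores(q, db))))  # best offset, prints 17 here

    The single inverse FFT both scores the pair and localizes the matching parts, which is where the complexity gain over an exhaustive shift-by-shift comparison comes from.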

    Vectors of Locally Aggregated Centers for Compact Video Representation

    We propose a novel vector aggregation technique for compact video representation, with application to accurate similarity detection within large video datasets. The current state of the art in visual search is the vector of locally aggregated descriptors (VLAD) of Jegou et al. VLAD generates compact video representations based on scale-invariant feature transform (SIFT) vectors (extracted per frame) and local feature centers computed over a training set. With the aim of increasing robustness to visual distortions, we propose a new approach that operates at a coarser level in the feature representation. We create vectors of locally aggregated centers (VLAC) by first clustering SIFT features to obtain local feature centers (LFCs) and then encoding the latter with respect to given centers of local feature centers (CLFCs), extracted from a training set. The sums of differences between the LFCs and the CLFCs are aggregated to generate an extremely compact video description used for accurate video segment similarity detection. Experimentation using a video dataset comprising more than 1000 minutes of content from the Open Video Project shows that VLAC obtains substantial gains in terms of mean Average Precision (mAP) over VLAD and the hyper-pooling method of Douze et al., under the same compaction factor and the same set of distortions.
    Comment: Proc. IEEE International Conference on Multimedia and Expo, ICME 2015, Torino, Italy
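    The two-level aggregation can be sketched as follows, assuming scikit-learn's KMeans for the per-video clustering and a CLFC codebook already trained offline; cluster counts and names here are illustrative, not the paper's:

        # Sketch of VLAC-style encoding (illustrative, not the paper's code).
        import numpy as np
        from sklearn.cluster import KMeans

        def vlac_encode(descriptors, clfc, n_lfc=64):
            """descriptors: (N, d) local features of one video, with N >= n_lfc;
            clfc: (K, d) trained codebook of centers of local feature centers."""
            # Step 1: cluster this video's SIFT-like features into LFCs.
            lfc = KMeans(n_clusters=n_lfc, n_init=4).fit(descriptors).cluster_centers_
            # Step 2: VLAD-style residual aggregation of LFCs against the CLFCs.
            assign = np.argmin(((lfc[:, None] - clfc[None]) ** 2).sum(-1), axis=1)
            vlac = np.zeros_like(clfc)
            for k in range(clfc.shape[0]):
                vlac[k] = (lfc[assign == k] - clfc[k]).sum(axis=0)
            v = vlac.ravel()
            return v / (np.linalg.norm(v) + 1e-12)   # L2-normalized compact code

    Encoding cluster centers rather than raw descriptors is what makes the representation coarser, and per the abstract, more robust to visual distortions.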

    LAMV: Learning to align and match videos with kernelized temporal layers

    This paper presents a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. It outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups, and attains the best reported results for particular event search while precisely aligning the videos.
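    One plausible reading of such a kernelized temporal layer, sketched under our own assumptions (a guess at the idea, not the paper's architecture): reweight the cross-spectrum of two frame-feature sequences with learnable per-frequency coefficients, so that alignment scores over all shifts stay cheap to compute and differentiable:

        # Sketch of a learnable Fourier-domain temporal match layer (illustrative).
        import torch

        class TemporalMatchLayer(torch.nn.Module):
            def __init__(self, seq_len):
                super().__init__()
                # one learnable coefficient per rfft frequency bin
                self.weight = torch.nn.Parameter(torch.ones(seq_len // 2 + 1))

            def forward(self, a, b):
                """a, b: (T, d) frame-feature sequences; returns scores over all T shifts."""
                fa = torch.fft.rfft(a, dim=0)
                fb = torch.fft.rfft(b, dim=0)
                cross = (torch.conj(fa) * fb).sum(dim=1)       # complex, (T//2 + 1,)
                return torch.fft.irfft(self.weight * cross, n=a.shape[0])

    A triplet loss as described in the abstract would then push the max-over-shift score of matching pairs above that of non-matching pairs while rewarding the correct argmax shift.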

    Analysis of User Interaction on Social Media to Prevent Hoax Videos, and a High-Level Detection Architecture Model

    The spread of hoax news built on reused video content in social media is a striking phenomenon, and it now reaches not only adult users but every age group. Its most tangible effect is division within society, because footage that has already been broadcast or existed previously serves as seemingly strong evidence validating the content being viewed. It is therefore important to detect hoax news with reused video content and to stop its negative effects on individuals and society. This study introduces a high-level detection architecture model for a system that analyzes hoax news with reused or recurring video content on social media; the architecture is designed using deep-learning video processing, speech-to-text, and several content-based and context-based features. The spread of hoax content with reused video should be preventable if such content can be filtered before it appears on the timeline. We hope this architecture model can serve as a reference for building a real system.
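    A purely hypothetical skeleton of the pipeline the abstract outlines; every component name below is a placeholder, since the paper specifies the architecture only at a high level:

        # Hypothetical high-level pipeline; all components are placeholders.
        def detect_reused_video_hoax(post, fingerprint_index, asr, classifier):
            match = fingerprint_index.lookup(post.video)   # near-duplicate video search
            transcript = asr.transcribe(post.video)        # speech-to-text
            features = {
                "video_reused": match is not None,         # content-based signal
                "text": post.caption + " " + transcript,   # content-based signal
                "author": post.author,                     # context-based signal
                "posted_at": post.timestamp,               # context-based signal
            }
            return classifier.predict(features)            # hoax score, used to filter
                                                           # before timeline display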

    Text-based Localization of Moments in a Video Corpus

    Prior works on text-based video moment localization focus on temporally grounding a textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment only within that video. We relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge, as the system is required to perform: (i) retrieval of the relevant video, where only a segment of the video corresponds to the queried sentence, and (ii) temporal localization of the moment in the relevant video based on the sentence query. To overcome this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions - demonstrate that our method achieves promising performance on the proposed task of localizing moments in a corpus of videos.
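    The corpus-level setting can be made concrete with a small ranking sketch (not HMAN itself); it assumes moment and sentence embeddings in a shared space have already been produced by some encoder:

        # Sketch of corpus-level moment retrieval given a joint embedding space.
        import numpy as np

        def rank_moments(sentence_emb, moment_embs, moment_ids):
            """sentence_emb: (d,) L2-normalized query embedding;
            moment_embs: (M, d) L2-normalized vectors for candidate moments pooled
            from all videos; moment_ids: list of (video, start, end) tuples."""
            sims = moment_embs @ sentence_emb              # cosine similarities
            order = np.argsort(-sims)
            return [(moment_ids[i], float(sims[i])) for i in order]

    The top-ranked entry jointly answers (i) which video is relevant and (ii) which temporal segment grounds the query.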