47 research outputs found

    Improving Global Multi-target Tracking with Local Updates

    Get PDF
    Conference dates: September 6-7 & 12, 2014
    We propose a scheme to explicitly detect and resolve ambiguous situations in multiple target tracking. During periods of uncertainty, our method applies multiple local single-target trackers to hypothesise short-term tracks. These tracks are combined with the tracks obtained by a global multi-target tracker if they reduce the global cost function. Since tracking failures typically arise when targets become occluded, we propose a local data association scheme to maintain target identities in these situations. We demonstrate a reduction of up to 50% in the global cost function, which in turn leads to superior performance on several challenging benchmark sequences. Additionally, we show tracking results on sports videos where poor video quality and frequent, severe occlusions between multiple players pose difficulties for state-of-the-art trackers. Anton Milan, Rikke Gade, Anthony Dick, Thomas B. Moeslund, and Ian Reid
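    A minimal, hedged sketch of the local-update idea described in this abstract: a short-term tracklet from a local single-target tracker is accepted into the global solution only when it lowers the global cost. The track representation, the `merge` helper, and the `global_cost` callable below are illustrative placeholders, not the authors' actual energy or implementation.

```python
# Sketch only: accept a local tracker's short-term tracklet into the global
# solution when (and only when) it lowers a global cost function.
from typing import Callable, List

Track = List[tuple]        # e.g. a list of (frame, x, y) observations
Solution = List[Track]     # the current global set of tracks

def try_local_update(solution: Solution,
                     local_tracklet: Track,
                     target_idx: int,
                     global_cost: Callable[[Solution], float]) -> Solution:
    """Splice `local_tracklet` into track `target_idx`; keep the change only
    if it strictly reduces the (placeholder) global cost."""
    candidate = [list(t) for t in solution]            # copy current solution
    candidate[target_idx] = merge(candidate[target_idx], local_tracklet)
    if global_cost(candidate) < global_cost(solution):
        return candidate
    return solution                                     # otherwise keep old tracks

def merge(track: Track, tracklet: Track) -> Track:
    """Hypothetical merge: replace overlapping frames with the local hypothesis."""
    frames = {obs[0] for obs in tracklet}
    kept = [obs for obs in track if obs[0] not in frames]
    return sorted(kept + list(tracklet), key=lambda obs: obs[0])
```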

    Foley Music: Learning to Generate Music from Videos

    Full text link
    In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage readers to watch the demo video with audio turned on to experience the results. Comment: ECCV 2020. Project page: http://foley-music.csail.mit.ed
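    The abstract's final step, converting predicted MIDI events to audio with an off-the-shelf synthesizer, can be sketched as below. The (pitch, onset, offset, velocity) tuple format is an assumed output layout, not the paper's exact representation, and `pretty_midi` is just one possible synthesizer front-end.

```python
# Hedged sketch: turn a predicted MIDI event sequence into audio with an
# off-the-shelf tool (pretty_midi). The event format is an assumption.
import pretty_midi

def events_to_audio(events, program=0, sample_rate=22050):
    """events: iterable of (pitch, onset_sec, offset_sec, velocity) tuples."""
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=program)      # e.g. 0 = acoustic grand piano
    for pitch, onset, offset, velocity in events:
        inst.notes.append(pretty_midi.Note(velocity=int(velocity),
                                           pitch=int(pitch),
                                           start=float(onset),
                                           end=float(offset)))
    pm.instruments.append(inst)
    pm.write("predicted.mid")                           # save the MIDI file
    # Simple sine-wave rendering; pm.fluidsynth() with a soundfont gives richer audio.
    return pm.synthesize(fs=sample_rate)
```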

    Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World

    Get PDF
    Multi-people tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains importance when scene clutter introduces the challenging problem of occluded targets. To this end, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible locations for joints that are not visible. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time-linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the largest Computer Graphics dataset to date for people tracking in urban scenarios (about 500,000 frames and almost 10 million body poses) by exploiting a photorealistic videogame. Our architecture, trained on virtual data, exhibits good generalization capabilities on public real tracking benchmarks when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.
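    A rough sketch of the four-branch output structure named in this abstract is given below. The backbone, channel counts, and the way temporal information is linked are placeholders for illustration and do not reproduce the paper's actual architecture.

```python
# Illustrative PyTorch sketch of a four-branch head over shared features.
import torch
import torch.nn as nn

class FourBranchHead(nn.Module):
    def __init__(self, in_ch=256, n_joints=14, n_limbs=13):
        super().__init__()
        def head(out_ch):                      # small conv head per branch
            return nn.Sequential(nn.Conv2d(in_ch, 128, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(128, out_ch, 1))
        self.visible_heatmaps  = head(n_joints)      # joints seen in the image
        self.occluded_heatmaps = head(n_joints)      # hallucinated hidden joints
        self.part_affinity     = head(2 * n_limbs)   # spatial limb fields (x, y)
        self.temporal_affinity = head(2 * n_joints)  # links joints across frames

    def forward(self, feats):                  # feats: B x in_ch x H x W from a shared extractor
        return (self.visible_heatmaps(feats),
                self.occluded_heatmaps(feats),
                self.part_affinity(feats),
                self.temporal_affinity(feats))
```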

    Critical literacy as a pedagogical goal in English language teaching

    Get PDF
    In this chapter, the authors provide an overview of the area of critical literacy as it pertains to second language pedagogy (curriculum and instruction). After considering the historical origins of critical literacy (from antiquity onward, including in first language education), they consider how it began to penetrate the field of applied linguistics. They note the geographical and institutional spread of critical literacy practice as documented by published accounts. They then sketch the main features of L2 critical literacy practice, drawing on how practitioners have reported on their practices regarding classroom content and process. The authors also draw attention to the outcomes of these practices, as well as challenges that practitioners have encountered in incorporating critical literacy into their second language classrooms.

    What's Making that Sound?

    No full text
    In this paper, we investigate techniques to localize the sound source in a video recorded with a single microphone. The visual object whose motion generates the sound is located and segmented based on a synchronization analysis of object motion and audio energy. We first apply an effective region tracking algorithm to segment the video into a number of spatio-temporal region tracks, each representing the temporal evolution of an appearance-coherent image structure (i.e., an object). We then extract the motion feature of each object as its average acceleration in each frame. Meanwhile, a short-time Fourier transform is applied to the audio signal to extract an audio energy feature as the audio descriptor. We further impose a nonlinear transformation on both the audio and visual descriptors to obtain audio and visual codes in a common rank correlation space. Finally, the correlation between an object and the audio signal is evaluated simply by computing the Hamming distance between the audio and visual codes generated in the previous steps. We evaluate the proposed method both qualitatively and quantitatively using a number of challenging test videos. In particular, the proposed method is compared with a state-of-the-art audiovisual source localization algorithm. The results demonstrate the superior performance of the proposed algorithm in spatio-temporal localization and segmentation of audio sources in the visual domain.
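    A hedged sketch of the matching step described above: each per-frame descriptor (audio energy, object acceleration) is mapped to a binary rank-order code and codes are compared by Hamming distance. The specific nonlinear transformation used in the paper is not reproduced here; pairwise ordering is just one simple rank-based code.

```python
# Sketch: rank-order binary codes + Hamming distance for audio-visual matching.
import numpy as np

def rank_code(signal: np.ndarray) -> np.ndarray:
    """Binary code: one bit per frame pair, set when signal[i] > signal[j]."""
    s = np.asarray(signal, dtype=float)
    i, j = np.triu_indices(len(s), k=1)
    return (s[i] > s[j]).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.count_nonzero(a != b)) / len(a)

# Toy example: the object whose motion code is closest to the audio code is
# declared the sound source (all numbers below are made up for illustration).
audio_energy = np.array([0.1, 0.9, 0.2, 0.8, 0.1])
object_accels = {"obj_A": np.array([0.0, 1.0, 0.1, 0.9, 0.0]),
                 "obj_B": np.array([0.5, 0.4, 0.6, 0.3, 0.5])}
audio_code = rank_code(audio_energy)
source = min(object_accels,
             key=lambda k: hamming(audio_code, rank_code(object_accels[k])))
print(source)   # -> "obj_A"
```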

    Foiling an Attack: Defeating IPSec Tunnel Fingerprinting

    No full text
    This paper addresses some of the discriminants that make IPSec tunnel fingerprinting possible. Fingerprinting of VPN-tunnel endpoints may be desirable for forensic purposes, but in the hands of individuals of ill intent it undermines an enterprise network's perimeter security. Three ways of preventing the ill-use of this type of fingerprinting are presented. The first two apply to enterprises wishing to make their VPN tunnels immune to fingerprinting. The third delves deeper into the conceptual and is directed at the standards definition process, as used by the Internet Engineering Task Force (IETF), and at authors of security-related RFCs in particular. It addresses aspects of the Internet Key Exchange version 1 (IKEv1) RFC that have led to misinterpretations on the part of IPSec implementers, and describes the use of a form of process algebra known as Communicating Sequential Processes (CSP) in defining security-related standards to overcome RFC-related ambiguities.

    Secure VPNs for Trusted Computing Environments

    No full text

    Self-supervised learning of audio-visual objects from video

    No full text
    Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
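    As a generic illustration of the attention-based localization mentioned in this abstract (not the paper's exact model), one can score every spatial position of a visual feature map against an audio embedding and normalise the scores into a sound-source heatmap. The shapes and the temperature value below are arbitrary choices.

```python
# Generic audio-visual attention sketch: per-pixel cosine similarity to an
# audio embedding, softmax-normalised into a spatial localisation map.
import torch
import torch.nn.functional as F

def av_attention_map(visual_feats: torch.Tensor,   # B x C x H x W
                     audio_emb: torch.Tensor):     # B x C
    B, C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.flatten(2), dim=1)   # B x C x HW, unit-norm per pixel
    a = F.normalize(audio_emb, dim=1).unsqueeze(1)    # B x 1 x C
    scores = torch.bmm(a, v).squeeze(1)               # B x HW cosine similarities
    attn = torch.softmax(scores / 0.07, dim=1)        # temperature is an arbitrary choice
    return attn.view(B, H, W)                         # spatial attention / localisation map

# Usage example with dummy tensors:
# heatmap = av_attention_map(torch.randn(2, 128, 14, 14), torch.randn(2, 128))
```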

    Foley Music: Learning to Generate Music from Videos

    No full text
    In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage readers to watch the supplementary video with audio turned on to experience the results. ONR MURI (N00014-16-1-2007)