122 research outputs found

    Understanding video through the lens of language

    The increasing abundance of video data online necessitates the development of systems capable of understanding such content. However, building these systems poses significant challenges, including the absence of scalable and robust supervision signals, computational complexity, and multimodal modelling. To address these issues, this thesis explores the role of language as a complementary learning signal for video, drawing inspiration from the success of self-supervised Large Language Models (LLMs) and image-language models. First, joint video-language representations are examined under the text-to-video retrieval task. This includes the study of pre-extracted multimodal features, the influence of contextual information, joint end-to-end learning of both image and video representations, and various frame aggregation methods for long-form videos. In doing so, state-of-the-art performance is achieved across a range of established video-text benchmarks. Second, this work explores the automatic generation of audio description (AD) – narrations describing the visual happenings in a video, for the benefit of visually impaired audiences. An LLM, prompted with multimodal information, including past predictions, and pretrained with partial data sources, is employed for the task. In the process, substantial advancements are achieved in the following areas: efficient speech transcription, long-form visual storytelling, referencing character names, and AD time-point prediction. Finally, audiovisual behaviour recognition is applied to the field of wildlife conservation and ethology. The approach is used to analyse vast video archives of wild primates, revealing insights into individual and group behaviour variations, with the potential for monitoring the effects of human pressures on animal habitats

    Statistical Analysis Of A Compound Power-Law Model For Repairable Systems

    Conclusions - A compound (mixed) Poisson distribution is sometimes used as an alternative to the Poisson distribution for count data. Such a compound distribution, which has a negative binomial form, occurs when the population consists of Poisson distributed individuals, but with intensities which have a gamma distribution. A similar situation can occur with a repairable system when failure intensities of each system are different. A more general situation is considered where the system failures are distributed according to nonhomogeneous Poisson processes having Power Law intensity functions with gamma distributed intensity parameter. If the failures of each system in a population of repairable systems are distributed according to a Power Law process, but with different intensities, then a compound Power Law process provides a suitable model. A test, based on the ratio of the sample variance to the sample mean of count data from s-independent systems, provides a convenient way to determine if a compound model is appropriate. When a compound Power Law model is indicated, the maximum likelihood estimates of the shape parameters of the individual systems can be computed and homogeneity can be tested. If equality of the shape parameters is indicated, then it is possible to test whether the systems are homogeneous Poisson processes versus a nonhomogeneous alternative. If deterioration within systems is suspected, then the alternative in which the shape parameter exceeds unity would be appropriate, while if systems are undergoing reliability growth the alternative would be that the shape parameter is less than unity. The other parameters can also be estimated by maximum likelihood. If the test for a compound Power Law model does not reject, then the joint maximum likelihood estimates of the compound model may be unstable, or may even fail to exist, except possibly in a specified limiting sense. When this happens, an ordinary Power Law process provides a more reasonable model. Of course, this would also include the possibility of the simpler homogeneous Poisson process. Copyright © 1987 by The Institute of Electrical and Electronics Engineers, Inc.
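
    As a rough illustration of the variance-to-mean test mentioned in the abstract above, the sketch below computes the dispersion ratio for failure counts from several systems. The function name and the sample counts are hypothetical, and the sketch assumes every system is observed over the same interval. The chi-square reference with n - 1 degrees of freedom is the standard Poisson dispersion test and is not necessarily the exact procedure used in the paper.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Return (ratio, statistic) for failure counts from independent systems.

    ratio     = sample variance / sample mean
    statistic = (n - 1) * ratio, approximately chi-square with n - 1
                degrees of freedom when all counts share one Poisson mean.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need counts from at least two systems")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero, so the ratio is undefined")
    ratio = variance(counts) / m
    return ratio, (n - 1) * ratio


# Hypothetical failure counts, one per system.
counts = [2, 9, 4, 15, 1, 7]
ratio, stat = dispersion_ratio(counts)
print(f"variance/mean = {ratio:.3f}, dispersion statistic = {stat:.3f}")
```

    A ratio well above 1 suggests more variation between systems than a single Poisson mean would produce, which is the situation in which a compound model becomes a candidate.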

    Sequential Probability Ratio Tests For The Shape Parameter Of A Nonhomogeneous Poisson Process

    Sequential probability ratio tests for the shape parameter of one or more nonhomogeneous Poisson processes, with power intensity functions, are provided. The tests can be performed when the scale parameter is an unknown nuisance parameter; the effective loss of not knowing the scale parameter is one observation per process. The resulting tests can be expressed in terms of the maximum likelihood estimators of the shape parameters for the usual fixed sample procedure. A further advantage of the present approach is that the scale parameters for different processes, in the multiple sample procedures, need not be equal. Approximations for the operating characteristic function and the average sample number are provided. Copyright © 1982 by The Institute of Electrical and Electronics Engineers, Inc
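
    For readers unfamiliar with sequential probability ratio tests, the following is a minimal generic sketch of Wald's stopping rule. It does not reproduce the paper's likelihood ratio for the shape parameter: the log-likelihood-ratio increments are supplied by the caller, and the function name, argument names, and default error rates are all assumptions made for illustration.

```python
import math


def sprt_decision(llr_increments, alpha_err=0.05, beta_err=0.10):
    """Apply Wald's approximate SPRT boundaries to a stream of increments.

    llr_increments: per-observation log-likelihood-ratio contributions
                    (H1 versus H0), computed elsewhere.
    alpha_err:      target probability of accepting H1 when H0 is true.
    beta_err:       target probability of accepting H0 when H1 is true.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta_err) / alpha_err)  # accept H1 at or above this
    lower = math.log(beta_err / (1 - alpha_err))  # accept H0 at or below this
    total = 0.0
    used = 0
    for inc in llr_increments:
        used += 1
        total += inc
        if total >= upper:
            return "accept H1", used
        if total <= lower:
            return "accept H0", used
    return "continue", used


# Hypothetical increments, purely to show the call.
print(sprt_decision([0.4, 1.1, -0.2, 0.9, 1.3]))
```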

    On The Mean Time Between Failures For Repairable Systems

    Much of the recent work on modeling repairable systems involves Poisson processes with nonconstant intensity functions, viz, nonhomogeneous Poisson processes. Since times between failures are not identically distributed when the process is nonhomogeneous, it is not clear what concept should take the place of the mean time between failures in assessing the reliability of a repairable system. A number of alternate concepts can be found in the literature. We investigate the relationship between two of the most frequently considered alternatives: the reciprocal of the intensity function, and the mean waiting time from t until the next failure. Theorem 1 states a necessary and sufficient condition for the mean time until the next failure to be asymptotically proportional to the reciprocal of the intensity function. Some examples, including the familiar log-linear and power-intensity processes satisfy this condition. A monotonicity property is also established between these two concepts which could be used to obtain conservative statistical confidence limits for the mean time until the next failure, based on results which are already available for the intensity function of the power-intensity process. However, further study of concepts such as the rate of convergence would be needed in order to determine the degree of approximation of the nominal confidence level to the actual level. Until more is known about the mean time from t until the next failure, it would be advisable to use the reciprocal of the intensity function, which has been studied more extensively, as the basis of reliability assessment for a repairable system. Copyright © 1986 by the Institute of Electrical and Electronics Engineers, Inc
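
    The two quantities compared in the abstract above can be evaluated numerically for a power-intensity process. The sketch below assumes the common parameterisation with cumulative intensity (t/theta)**beta. The function names, the parameter values, and the midpoint-rule integration are illustrative choices and are not taken from the paper.

```python
import math


def intensity(t, beta, theta):
    """Power-law intensity: (beta / theta) * (t / theta) ** (beta - 1)."""
    return (beta / theta) * (t / theta) ** (beta - 1)


def mean_wait(t, beta, theta, steps=20000, u_max=50.0):
    """Mean time from t until the next failure of a power-law process.

    Starts from M(t) = integral over x >= 0 of exp(-(L(t + x) - L(t))),
    with L(s) = (s / theta) ** beta. Substituting u = L(t + x) - L(t)
    gives M(t) = integral over u >= 0 of exp(-u) / intensity(s(u)),
    where s(u) = theta * (u + L(t)) ** (1 / beta).
    The integral is approximated by the midpoint rule on [0, u_max].
    """
    cumulative = (t / theta) ** beta
    h = u_max / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        s = theta * (u + cumulative) ** (1.0 / beta)
        total += math.exp(-u) / intensity(s, beta, theta)
    return total * h


# Hypothetical parameters: compare the two candidate "MTBF" notions at t = 500.
for beta in (0.5, 1.0, 2.0):
    recip = 1.0 / intensity(500.0, beta, 100.0)
    wait = mean_wait(500.0, beta, 100.0)
    print(f"beta={beta}: 1/intensity={recip:.3f}, mean wait={wait:.3f}")
```

    With beta = 1 the process is homogeneous and both quantities equal theta, which gives a quick sanity check on the integration.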

    OxfordVGG Submission to the EGO4D AV Transcription Challenge

    This report presents the technical details of the OxfordVGG team's submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two publicly available text normalisers. Our final submission obtained a Word Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the leaderboard. All baseline code and models are available at https://github.com/m-bain/whisperX. Comment: Technical Report
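
    For context on the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal version of that definition. It splits on whitespace only and does not reproduce the text normalisers mentioned in the report; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```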

    WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

    Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference. Comment: Accepted to INTERSPEECH 202
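
    The abstract does not spell out how VAD Cut & Merge works, so the sketch below is only a simplified illustration of the general idea of pre-segmenting speech regions into bounded chunks that can be transcribed in a batch. It is not the authors' implementation. The function name is hypothetical, the 30-second limit is an assumption chosen to match Whisper's input window, and long regions are cut at fixed offsets purely for simplicity.

```python
def cut_and_merge(segments, max_len=30.0):
    """Bound and group speech intervals given as sorted (start, end) seconds.

    Step 1 (cut):   split any interval longer than max_len into pieces of
                    at most max_len, at fixed offsets (a simplification).
    Step 2 (merge): greedily join neighbouring pieces while the span from
                    the first start to the last end stays within max_len.
    """
    pieces = []
    for start, end in segments:
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))

    merged = []
    for start, end in pieces:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Hypothetical voice-activity output for a short recording.
vad_segments = [(0.0, 5.0), (6.0, 12.0), (40.0, 45.0), (50.0, 95.0)]
for chunk in cut_and_merge(vad_segments):
    print(chunk)
```

    Because every resulting chunk fits within the assumed window, the chunks are independent of one another and could in principle be passed to a recogniser as one batch.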

    WhisperX: time-accurate speech transcription of long-form audio

    Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelvefold transcription speedup via batched inference. The code is available as open source.