1,303 research outputs found

    Taking the bite out of automated naming of characters in TV video

    No full text
    We investigate the problem of automatically labelling appearances of characters in TV or film material with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series ‘‘Buffy the Vampire Slayer”

    An aesthetic for sustainable interactions in product-service systems?

    Get PDF
    Copyright @ 2012 Greenleaf PublishingEco-efficient Product-Service System (PSS) innovations represent a promising approach to sustainability. However the application of this concept is still very limited because its implementation and diffusion is hindered by several barriers (cultural, corporate and regulative ones). The paper investigates the barriers that affect the attractiveness and acceptation of eco-efficient PSS alternatives, and opens the debate on the aesthetic of eco-efficient PSS, and the way in which aesthetic could enhance some specific inner qualities of this kinds of innovations. Integrating insights from semiotics, the paper outlines some first research hypothesis on how the aesthetic elements of an eco-efficient PSS could facilitate user attraction, acceptation and satisfaction

    Deep Architectures for Visual Recognition and Description

    Get PDF
    In recent times, digital media contents are inherently of multimedia type, consisting of the form text, audio, image and video. Several of the outstanding computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the field of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches available today and attempts their classification and analysis based on different parameters, viz., type of captioning methods (generation/retrieval), type of learning models employed, the desired output description length generated, etc. This dissertation also attempts to critically analyze the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the resultant video descriptions generated. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages are also included. In this study a novel approach for video captioning on the Microsoft Video Description (MSVD) dataset and Microsoft Video-to-Text (MSR-VTT) dataset is proposed using supervised learning techniques to train a deep combinational framework, for achieving better quality video captioning via predicting semantic tags. We develop simple shallow CNN (2D and 3D) as feature extractors, Deep Neural Networks (DNNs and Bidirectional LSTMs (BiLSTMs) as tag prediction models and Recurrent Neural Networks (RNNs) (LSTM) model as the language model. The aim of the work was to provide an alternative narrative to generating captions from videos via semantic tag predictions and deploy simpler shallower deep model architectures with lower memory requirements as solution so that it is not very memory extensive and the developed models prove to be stable and viable options when the scale of the data is increased. This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up automation process of hand gesture recognition and classification of the sign languages of the Indian classical dance form, ‘Bharatnatyam’. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single hand gestures belonging to 27 classes that were collected from (i) Google search engine (Google images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds). 2) exploring the effectiveness of CNNs for identifying and classifying the single hand gestures by optimizing the hyperparameters, and 3) evaluating the impacts of transfer learning and double transfer learning, which is a novel concept explored for achieving higher classification accuracy

    Evaluating the Fairness of Discriminative Foundation Models in Computer Vision

    Full text link
    We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Pretraining (CLIP), that are used for labeling tasks. We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy. Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning. We categorize desired behaviors based around three axes: (i) if the task concerns humans; (ii) how subjective the task is (i.e., how likely it is that people from a diverse range of backgrounds would agree on a labeling); and (iii) the intended purpose of the task and if fairness is better served by impartiality (i.e., making decisions independent of the protected attributes) or representation (i.e., making decisions to maximize diversity). Finally, we provide quantitative fairness evaluations for both binary-valued and multi-valued protected attributes over ten diverse datasets. We find that fair PCA, a post-processing method for fair representations, works very well for debiasing in most of the aforementioned tasks while incurring only minor loss of performance. However, different debiasing approaches vary in their effectiveness depending on the task. Hence, one should choose the debiasing approach depending on the specific use case.Comment: Accepted at AIES'2

    KEER2022

    Get PDF
    AvanttĂ­tol: KEER2022. DiversitiesDescripciĂł del recurs: 25 juliol 202

    Video Categorization Using Semantics and Semiotics

    Get PDF
    There is a great need to automatically segment, categorize, and annotate video data, and to develop efficient tools for browsing and searching. We believe that the categorization of videos can be achieved by exploring the concepts and meanings of the videos. This task requires bridging the gap between low-level content and high-level concepts (or semantics). Once a relationship is established between the low-level computable features of the video and its semantics, the user would be able to navigate through videos through the use of concepts and ideas (for example, a user could extract only those scenes in an action film that actually contain fights) rat her than sequentially browsing the whole video. However, this relationship must follow the norms of human perception and abide by the rules that are most often followed by the creators (directors) of these videos. These rules are called film grammar in video production literature. Like any natural language, this grammar has several dialects, but it has been acknowledged to be universal. Therefore, the knowledge of film grammar can be exploited effectively for the understanding of films. To interpret an idea using the grammar, we need to first understand the symbols, as in natural languages, and second, understand the rules of combination of these symbols to represent concepts. In order to develop algorithms that exploit this film grammar, it is necessary to relate the symbols of the grammar to computable video features. In this dissertation, we have identified a set of computable features of videos and have developed methods to estimate them. A computable feature of audio-visual data is defined as any statistic of available data that can be automatically extracted using image/signal processing and computer vision techniques. These features are global in nature and are extracted using whole images, therefore, they do not require any object detection, tracking and classification. These features include video shots, shot length, shot motion content, color distribution, key-lighting, and audio energy. We use these features and exploit the knowledge of ubiquitous film grammar to solve three related problems: segmentation and categorization of talk and game shows; classification of movie genres based on the previews; and segmentation and representation of full-length Hollywood movies and sitcoms. We have developed a method for organizing videos of talk and game shows by automatically separating the program segments from the commercials and then classifying each shot as the host\u27s or guest\u27s shot. In our approach, we rely primarily on information contained in shot transitions and utilize the inherent difference in the scene structure (grammar) of commercials and talk shows. A data structure called a shot connectivity graph is constructed, which links shots over time using temporal proximity and color similarity constraints. Analysis of the shot connectivity graph helps us to separate commercials from program segments. This is done by first detecting stories, and then assigning a weight to each story based on its likelihood of being a commercial or a program segment. We further analyze stories to distinguish shots of the hosts from those of the guests. We have performed extensive experiments on eight full-length talk shows (e.g. Larry King Live, Meet the Press, News Night) and game shows (Who Wants To Be A Millionaire), and have obtained excellent classification with 96% recall and 99% precision. http://www.cs.ucf.edu/~vision/projects/LarryKing/LarryKing.html Secondly, we have developed a novel method for genre classification of films using film previews. In our approach, we classify previews into four broad categories: comedies, action, dramas or horror films. Computable video features are combined in a framework with cinematic principles to provide a mapping to these four high-level semantic classes. We have developed two methods for genre classification; (a) a hierarchical method and (b) an unsupervised classification met hod. In the hierarchical method, we first classify movies into action and non-action categories based on the average shot length and motion content in the previews. Next, non-action movies are sub-classified into comedy, horror or drama categories by examining their lighting key. Finally, action movies are ranked on the basis of number of explosions/gunfire events. In the unsupervised method for classifying movies, a mean shift classifier is used to discover the structure of the mapping between the computable features and each film genre. We have conducted extensive experiments on over a hundred film previews and demonstrated that low-level features can be efficiently utilized for movie classification. We achieved about 87% successful classification. http://www.cs.ucf.edu/-vision/projects/movieClassification/movieClmsification.html Finally, we have addressed the problem of detecting scene boundaries in full-length feature movies. We have developed two novel approaches to automatically find scenes in the videos. Our first approach is a two-pass algorithm. In the first pass, shots are clustered by computing backward shot coherence; a shot color similarity measure that detects potential scene boundaries (PSBs) in the videos. In the second pass we compute scene dynamics for each scene as a function of shot length and the motion content in the potential scenes. In this pass, a scene-merging criterion is used to remove weak PSBs in order to reduce over-segmentation. In our second approach, we cluster shots into scenes by transforming this task into a graph-partitioning problem. This is achieved by constructing a weighted undirected graph called a shot similarity graph (SSG), where each node represents a shot and the edges between the shots are weighted by their similarities (color and motion). The SSG is then split into sub-graphs by applying the normalized cut technique for graph partitioning. The partitions obtained represent individual scenes in the video. We further extend the framework to automatically detect the best representative key frames of identified scenes. With this approach, we are able to obtain a compact representation of huge videos in a small number of key frames. We have performed experiments on five Hollywood films (Terminator II, Top Gun, Gone In 60 Seconds, Golden Eye, and A Beautiful Mind) and one TV sitcom (Seinfeld) that demonstrate the effectiveness of our approach. We achieved about 80% recall and 63% precision in our experiments. http://www.cs.ucf.edu/~vision/projects/sceneSeg/sceneSeg.htm

    Part-based clothing segmentation for person retrieval

    Full text link
    • 

    corecore