    Deep Architectures for Visual Recognition and Description

    Digital media content today is inherently multimodal, combining text, audio, image, and video. Many outstanding Computer Vision (CV) problems are being solved with modern Machine Learning (ML) techniques, and substantial research has already been carried out on Automatic Image Annotation (AIA), image captioning, and video tagging. Video captioning, i.e., automatic description generation from digital video, is a distinct and more complex problem altogether. This study compares existing video captioning approaches and classifies and analyzes them along several dimensions: type of captioning method (generation vs. retrieval), type of learning model employed, desired output description length, and so on. The dissertation also critically analyzes the benchmark datasets used in video captioning models and the evaluation metrics for assessing the quality of the generated descriptions, and includes a detailed study of important existing models, highlighting their comparative advantages and disadvantages. A novel approach to video captioning on the Microsoft Video Description (MSVD) and Microsoft Video-to-Text (MSR-VTT) datasets is proposed, using supervised learning techniques to train a deep combinational framework that achieves better-quality video captioning by predicting semantic tags. We develop simple, shallow CNNs (2D and 3D) as feature extractors, Deep Neural Networks (DNNs) and Bidirectional LSTMs (BiLSTMs) as tag prediction models, and a Recurrent Neural Network (LSTM) as the language model. The aim of the work is to provide an alternative route to generating captions from videos via semantic tag prediction, deploying simpler, shallower architectures with lower memory requirements so that the models remain stable and viable as the scale of the data increases. This study also employs deep architectures such as the Convolutional Neural Network (CNN) to speed up and automate hand gesture recognition and classification for the sign language of the Indian classical dance form 'Bharatnatyam'. This hand gesture classification work is primarily aimed at 1) building a novel dataset of 2D single-hand gestures belonging to 27 classes, collected from (i) the Google search engine (Google Images), (ii) YouTube videos (dynamic, with background considered), and (iii) professional artists under staged environment constraints (plain backgrounds); 2) exploring the effectiveness of CNNs for identifying and classifying the single-hand gestures by optimizing their hyperparameters; and 3) evaluating the impact of transfer learning and double transfer learning, a novel concept explored here for achieving higher classification accuracy.
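    The tag-then-caption pipeline described above can be pictured with a short sketch. The following is a minimal, illustrative PyTorch version: a BiLSTM predicts multi-label semantic tags from per-frame CNN features, and an LSTM language model conditions on the pooled features plus the tag probabilities. The layer sizes, tag vocabulary, fusion scheme, and class names (TagPredictor, CaptionDecoder) are assumptions made for this sketch, not the dissertation's exact architecture.

```python
# Minimal sketch of the tag-then-caption idea, using PyTorch.
# All dimensions and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class TagPredictor(nn.Module):
    """BiLSTM over per-frame CNN features -> multi-label semantic tags."""
    def __init__(self, feat_dim=512, hidden=256, num_tags=300):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_tags)

    def forward(self, frame_feats):              # (B, T, feat_dim)
        out, _ = self.bilstm(frame_feats)
        pooled = out.mean(dim=1)                 # temporal average pooling
        return torch.sigmoid(self.head(pooled))  # independent tag probabilities

class CaptionDecoder(nn.Module):
    """LSTM language model conditioned on video features + predicted tags."""
    def __init__(self, feat_dim=512, num_tags=300, vocab=10000,
                 embed=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.init_h = nn.Linear(feat_dim + num_tags, hidden)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, tag_probs, tokens):
        # Fuse pooled visual features with tag probabilities to seed the LSTM.
        ctx = torch.cat([frame_feats.mean(dim=1), tag_probs], dim=-1)
        h0 = torch.tanh(self.init_h(ctx)).unsqueeze(0)   # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out)                     # (B, L, vocab) logits

# Toy forward pass; random tensors stand in for real CNN features and tokens.
feats = torch.randn(2, 16, 512)                  # 2 clips, 16 frames each
tags = TagPredictor()(feats)
logits = CaptionDecoder()(feats, tags, torch.randint(0, 10000, (2, 12)))
print(tags.shape, logits.shape)                  # (2, 300), (2, 12, 10000)
```

    Training would supervise the tag head with multi-label tag annotations and the decoder with ground-truth captions, mirroring the supervised setup the abstract describes.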

    Ultrakinetic features of anime texts: revisioning composition theory and exploring visual rhetoric pedagogy.

    Our contemporary culture is laden with a glut of visual stimuli: advertising, packaging, television, film, the internet, digital-camera-wireless-web-access-mobile-phones. In a world filled with visually rhetorical media, it is imperative that the field of Composition continue to embrace the theorizing and use of visual rhetoric as an apt (if not vital) pursuit. It seems a responsible choice to select visual texts and/or visually rich writing assignments in order to address the current fabric of public/private communication. Furthermore, such texts can potentially set the student reader (viewer/writer/designer/composer) at ease in the sometimes daunting task of becoming a college writer and, subsequently, a professional able to communicate in the contemporary world. Visual texts that reflect the social climate students encounter each day may help break down some of the initial barriers to reading and composing. Students react differently to film, comics, and other media than to a more traditional school text. This dissertation submits that the genre of Japanese animated film, known as anime, is particularly ripe for study and use in composition. Anime offers visually rhetorical texts that are accessible yet challenging for students in the area of analysis. Anime also encompasses a wide range of themes and styles. A specific form of anime, termed in this dissertation 'ultrakinetic', describes visually rhetorical texts that highlight the presentation of movement in ranges from stillness to slowed to hyper-fast. Ultrakinetic texts reflect a 21st-century sensibility and are effective models for students who must learn to read and compose in a rapidly changing multimodal environment.

    Music as brand, with reference to the film music of John Towner Williams (with particular emphasis on Williams's 'Main Title' for Star Wars)

    In contemporary consumer culture, branding is the term given to the creation of an image or text (visual, aural, textural or multi-sensory) intended to represent a commodity or product sold by a producer or service provider. This product's commercial viability depends largely on the way it is presented (via branding) to its target market. The aim of this research report is to show that music used consciously as a branding medium, with special reference to film music (in its commodified form), has become a brand in itself, as opposed to merely a component of a multi-modal commercial product. Through analyses of a central film music theme from Star Wars: Episode IV, composed by John Williams, I aim to identify what I will term 'audio-branding techniques' within the music, thereby showing how music has come to be regarded as a brand. The audio branding techniques will relate directly to the four levels of analysis that I propose to conduct. The nature of branding implies the presence of three entities in the cultural and commercial 'transaction' that takes place: namely, the service provider (creator), the product (commodity) and the target market (consumer). I intend to argue that, as a result of powerful creative collaborations between John Williams and his various directors (not to mention his own unique talent), this composer's film music has increasingly become an audio brand which is almost commensurate with the brand status of the film itself. Williams's ability to create a symbiotic relationship between a music brand and that of a film has set him apart from most other contemporary art and commercial composers. As a result, it is not simply the actors, directors and producers associated with a movie that induce one to buy tickets to see it, but Williams's independent audio branding style as well. I thus aim to prove that his film music is an audio brand independent of, and yet also allied with, other brands.

    From corporeality to virtual reality: theorizing literacy, bodies, and technology in the emerging media of virtual, augmented, and mixed realities

    This dissertation explores the relationships between literacy, technology, and bodies in the emerging media of Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). In response to the recent, rapid emergence of new media forms, questions arise as to how and why we should prepare to compose in new digital media. To interrogate the newness accorded to new media composing, I historicize the literacy practices demanded by new media by examining digital texts, such as video games and software applications, alongside analogous "antiquated" media, such as dioramas and museum exhibits. Comparative textual analysis of analogous digital and non-digital VR, AR, and MR texts reveals that new media and "antiquated" media utilize common characteristics of dimensionality, layering, and absence/presence, respectively. The establishment of shared traits demonstrates how media operate on a continuum of mutually held textual practices; despite their distinctive forms, new media texts do not represent either a hierarchical or linear progression of maturing development. Such an understanding aids composing in new VR, AR, and MR media by enabling composers to make fuller use of prior knowledge in a rapidly evolving new media environment, a finding significant both for educators and communicators. As these technologies mature, we will continue to compose both traditional and new forms of texts. As such, we need literacy theory that attends to both the traditional and the new and also is comprehensive enough to encompass future acts of composing in media yet to emerge.

    Remixing Pedagogy: How Teachers Experience Remix as a Tool for Teaching English Language Arts

    Remix, a type of digital multimedia composition created by combining existing media to create new texts, offers high school teachers a non-traditional approach to teaching English Language Arts (ELA). As technology in the U.S. has become more accessible and affordable, literacy practices outside school classrooms have changed. While there is a growing body of research about remix and remix culture, most of it is set outside the ELA classroom, focusing on activities after school hours or on specialty courses in creative writing or technology classes. Teachers' points of view are largely left out of studies that examine in-school experiences with remix. Additionally, existing studies are often set in either higher education or elementary schools. This case study sought to understand how two high school ELA teachers experienced using remix as a tool for teaching and how practicing remix informed their pedagogies. The study revealed insight into why teachers find it challenging to practice new pedagogies in their teaching. I grounded my theoretical framework in sociocultural theories and a remix of Peirce's (1898) semiotic theory with Rosenblatt's (1938/1995) transactionalism. Designed within a case study methodology, data sources included teacher remixes, recorded conversations in online meetings, emails, texts, telephone calls, and a detailed researcher journal. Data analysis included multiple iterations of open coding of transcripts, informed by grounded theory and tools of discourse analysis, as well as visual analyses of teacher-created remixes. Key findings showed that, while teachers desired to incorporate remix teaching tools for meeting student needs, constraints of professional learning obligations, state standards, and administrator expectations limited their use of non-traditional practices. Both teachers approached remix differently, encouraging their students to construct meaning through multimodal tools while still finding paths to meeting administrative requirements through remix. Further, remix allowed teachers to increase the student-centeredness of their pedagogy and at the same time support multiple student learning styles. This study also extends prior theoretical scholarship about remix by contributing a study of knowledge-in-action, focusing on teachers as their remix experiences unfolded.

    Sensitive Video Analysis (Análise de vídeo sensível)

    Advisors: Anderson de Rezende Rocha and Siome Klein Goldenstein. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Sensitive video can be defined as any motion picture that may pose threats to its audience. Typical representatives include, but are not limited to, pornography, violence, child abuse, cruelty to animals, etc. Nowadays, with the ever more pervasive role of digital data in our lives, sensitive-content analysis represents a major concern to law enforcers, companies, tutors, and parents, due to the potential harm of such content to minors, students, workers, etc. Notwithstanding, the employment of human mediators to constantly analyze huge troves of sensitive data often leads to stress and trauma, justifying the search for computer-aided analysis. In this work, we tackle the problem in two ways. In the first, we aim at deciding whether or not a video stream presents sensitive content, which we refer to as sensitive-video classification. In the second, we aim at finding the exact moments a stream starts and ends displaying sensitive content, at the frame level, which we refer to as sensitive-content localization. For both cases, we design and develop effective and efficient methods, with a low memory footprint and suitable for deployment on mobile devices. In this vein, we provide four major contributions. The first is a novel Bag-of-Visual-Words-based pipeline for efficient, time-aware sensitive-video classification. The second is a novel high-level multimodal fusion pipeline for sensitive-content localization. The third, in turn, is a novel spatiotemporal video interest point detector and video content descriptor. Finally, the fourth contribution comprises a frame-level annotated 140-hour pornographic video dataset, the first in the literature suitable for pornography localization. An important aspect of the first three contributions is their generality: they can be employed, without pipeline modifications, to detect diverse types of sensitive content, such as those mentioned above. For validation, we choose pornography and violence, two of the commonest types of inappropriate material, as target representatives of sensitive content. We perform classification and localization experiments and report results for both types of content. The proposed solutions present an accuracy of 93% in pornography classification and allow the correct localization of 91% of pornographic content within a video stream. The results for violence are also compelling: with the proposed approaches, we reached second place in an international competition on violent scene detection. Putting both in perspective, we learned that pornography detection is easier than its violence counterpart, opening several opportunities for additional investigation by the research community. The main reason for this difference is the distinct level of subjectivity inherent to each concept: while pornography is usually more explicit, violence presents a broader spectrum of possible manifestations.
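    For readers unfamiliar with the Bag-of-Visual-Words representation underlying the first contribution, the general pattern is sketched below with scikit-learn. The random stand-in descriptors, codebook size, and linear SVM are illustrative assumptions; this is the generic BoVW recipe, not the thesis's exact time-aware pipeline.

```python
# Minimal Bag-of-Visual-Words sketch for video classification.
# Descriptors are random stand-ins for real local spatiotemporal features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def video_descriptors(n_points=200, dim=64):
    """Stand-in for local descriptors extracted from one video."""
    return rng.normal(size=(n_points, dim))

# 1) Build a visual codebook from descriptors pooled over training videos.
train_videos = [video_descriptors() for _ in range(20)]
labels = np.array([i % 2 for i in range(20)])    # toy binary labels
codebook = KMeans(n_clusters=32, n_init=4, random_state=0)
codebook.fit(np.vstack(train_videos))

# 2) Represent each video as a normalized histogram of visual words.
def bovw_histogram(desc):
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=32).astype(float)
    return hist / hist.sum()

X = np.array([bovw_histogram(d) for d in train_videos])

# 3) Train a linear classifier on the histograms (sensitive vs. not).
clf = LinearSVC().fit(X, labels)
print(clf.predict(bovw_histogram(video_descriptors())[None, :]))
```

    Swapping the stand-in descriptors for real spatiotemporal features and pooling histograms over time windows would move this sketch toward the time-aware variant the abstract mentions.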

    From whole-brain data to functional circuit models: the zebrafish optomotor response

    Detailed descriptions of brain-scale sensorimotor circuits underlying vertebrate behavior remain elusive. Recent advances in zebrafish neuroscience offer new opportunities to dissect such circuits via whole-brain imaging, behavioral analysis, functional perturbations, and network modeling. Here, we harness these tools to generate a brain-scale circuit model of the optomotor response, an orienting behavior evoked by visual motion. We show that such motion is processed by diverse neural response types distributed across multiple brain regions. To transform sensory input into action, these regions sequentially integrate eye- and direction-specific sensory streams, refine representations via interhemispheric inhibition, and demix locomotor instructions to independently drive turning and forward swimming. While experiments revealed many neural response types throughout the brain, modeling identified the dimensions of functional connectivity most critical for the behavior. We thus reveal how distributed neurons collaborate to generate behavior and illustrate a paradigm for distilling functional circuit models from whole-brain data.
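    As rough intuition for the circuit motifs named here (eye-specific inputs, interhemispheric inhibition, and demixing of turn versus forward-swim commands), the following toy two-population rate model may help. Every weight and the two-unit structure are assumptions made purely for illustration, not the paper's fitted whole-brain model.

```python
# Toy linear rate model: two hemispheric populations integrate their
# inputs, inhibit each other, and their difference/sum are read out as
# separate turn and forward-swim commands.
import numpy as np

def step(r, stim, dt=0.1, tau=1.0, w_inh=0.6):
    """One Euler step; each hemisphere is inhibited by the other."""
    drive = stim - w_inh * r[::-1]          # cross-hemisphere inhibition
    return r + dt / tau * (-r + np.maximum(drive, 0.0))

r = np.zeros(2)                             # left/right population rates
stim = np.array([1.0, 0.2])                 # asymmetric visual-motion drive
for _ in range(200):
    r = step(r, stim)

turn = r[0] - r[1]                          # demixed turning command
forward = r[0] + r[1]                       # demixed forward-swim command
print(f"turn={turn:.2f}, forward={forward:.2f}")
```

    In this toy setup, asymmetric input produces a net turn command while total activity drives forward swimming, mirroring the demixing described in the abstract.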