A new pose-based representation for recognizing actions from multiple cameras
We address the problem of recognizing actions from arbitrary views in a multi-camera system. We argue that poses are important for understanding human actions, and that the strength of the pose representation affects the overall performance of the action recognition system. Based on this idea, we present a new view-independent representation for human poses. Assuming that the data is initially provided in volumetric form, the volume of the human body is first divided into a sequence of horizontal layers, and the intersections of the body segments with each layer are then coded with enclosing circles. The circular features of all layers, namely (i) the number of circles, (ii) the area of the outer circle, and (iii) the area of the inner circle, are used to generate a pose descriptor. The pose descriptors of all frames in an action sequence are further combined to generate corresponding motion descriptors. Action recognition is then performed with a simple nearest neighbor classifier. Experiments on the benchmark IXMAS multi-view dataset demonstrate that the performance of our method is comparable to that of other methods in the literature. © 2010 Elsevier Inc. All rights reserved.
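The per-layer circle features described above can be sketched as follows. The connected-component step and the enclosing-circle approximation (radius taken as the farthest cell from the component centroid) are illustrative assumptions of this sketch, not the authors' exact procedure.

```python
# Sketch of the layered pose features: for one horizontal layer (a binary 2D
# slice of the volume), find the body-segment cross sections and report
# (number of circles, outer-circle area, inner-circle area).
import math

def layer_components(layer):
    """4-connected components of a binary 2D layer (list of row lists)."""
    h, w = len(layer), len(layer[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for sy in range(h):
        for sx in range(w):
            if layer[sy][sx] and not seen[sy][sx]:
                stack, cells = [(sy, sx)], []
                seen[sy][sx] = True
                while stack:
                    y, x = stack.pop()
                    cells.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and layer[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append(cells)
    return comps

def enclosing_circle_area(cells):
    """Approximate enclosing circle: radius = farthest cell from the centroid."""
    cy = sum(y for y, _ in cells) / len(cells)
    cx = sum(x for _, x in cells) / len(cells)
    r = max(math.hypot(y - cy, x - cx) for y, x in cells)
    return math.pi * max(r, 0.5) ** 2   # 0.5 floor so a single cell has area > 0

def layer_features(layer):
    """(number of circles, outer-circle area, inner-circle area) for one layer."""
    areas = sorted(enclosing_circle_area(c) for c in layer_components(layer))
    if not areas:
        return (0, 0.0, 0.0)
    return (len(areas), areas[-1], areas[0])
```

Concatenating these triples over all layers, and over all frames, would give the pose and motion descriptors the abstract refers to.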
Interesting faces: A graph-based approach for finding people in news
In this study, we propose a method for finding people in large collections of news photographs and videos. Our method exploits the multi-modal nature of these data sets to recognize people and does not require any supervisory input. It first uses the name of the person to populate an initial set of candidate faces. From this set, which is likely to include the faces of other people, it selects the group of most similar faces corresponding to the queried person in a variety of conditions. Our main contribution is to transform the problem of recognizing the faces of the queried person in a set of candidate faces into the problem of finding the highly connected sub-graph (the densest component) in a graph representing the similarities of faces. We also propose a novel technique for computing the similarities of faces by matching interest points extracted from the faces. The proposed method further allows the classification of new faces without needing to re-build the graph. The experiments are performed on two data sets: thousands of news photographs from Yahoo! News and over 200 news videos from TRECVID 2004. The results show that the proposed method provides significant improvements over text-based methods.
© 2009 Elsevier Ltd. All rights reserved.
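The densest-component idea above can be illustrated with a standard greedy peeling scheme (Charikar's 2-approximation): repeatedly delete the minimum-degree vertex and keep the intermediate subgraph with the highest average density. This is an assumed stand-in, not the paper's exact algorithm, but it shows how the correct faces emerge as the densest part of a face-similarity graph.

```python
# Greedy densest-subgraph sketch: vertices are candidate faces, edges connect
# sufficiently similar faces; the densest vertex set is returned as the group
# of faces belonging to the queried person.
def densest_subgraph(n, edges):
    """edges: set of frozenset({u, v}) over vertices 0..n-1.
    Returns the vertex set maximizing |E(S)| / |S| among peeling steps."""
    adj = {u: set() for u in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    m = len(edges)
    best, best_density = set(alive), m / max(len(alive), 1)
    while len(alive) > 1:
        u = min(alive, key=lambda v: len(adj[v]))   # peel a min-degree vertex
        m -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        adj[u].clear()
        alive.discard(u)
        density = m / len(alive)
        if density > best_density:
            best_density, best = density, set(alive)
    return best
```

On a graph where a clique of matching faces is surrounded by loosely connected distractors, the clique is recovered as the densest component.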
Histogram of oriented rectangles: A new pose descriptor for human action recognition
Most approaches to human action recognition tend to form complex models that require extensive parameter estimation and computation time. In this study, we show that human actions can be simply represented by pose, without dealing with a complex representation of dynamics. Based on this idea, we propose a novel pose descriptor, which we name the Histogram of Oriented Rectangles (HOR), for representing and recognizing human actions in videos. We represent each human pose in an action sequence by oriented rectangular patches extracted over the human silhouette. We then form spatial oriented histograms to represent the distribution of these rectangular patches. We make use of several matching strategies to carry the information from the spatial domain described by the HOR descriptor to the temporal domain: (i) nearest neighbor classification, which recognizes actions by matching the descriptors of each frame; (ii) global histogramming, which extends the Motion Energy Image idea of Bobick and Davis to rectangular patches; (iii) a classifier-based approach using Support Vector Machines; and (iv) an adaptation of Dynamic Time Warping to the temporal representation of the HOR descriptor. For cases when the pose descriptor alone is not sufficiently discriminative, such as differentiating the actions "jogging" and "running", we also incorporate a simple velocity descriptor as a prior to the pose-based classification step. We test our system in different configurations and experiment on two commonly used action datasets: the Weizmann dataset and the KTH dataset. The results show that our method is superior to other methods on the Weizmann dataset, with a perfect accuracy of 100%, and comparable to other methods on the KTH dataset, with a success rate close to 90%. These results demonstrate that, with a simple and compact representation, we can achieve robust recognition of human actions compared to complex representations. © 2009 Elsevier B.V. All rights reserved.
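A minimal sketch of forming a HOR-style descriptor follows. It assumes rectangle patches have already been fitted to the silhouette, each given as (cx, cy, angle_deg); the grid size and orientation binning are illustrative choices of this sketch, not the paper's exact parameters.

```python
# Bin oriented rectangles into a spatial grid of orientation histograms and
# flatten the grid into one descriptor vector per frame.
def hor_descriptor(rects, width, height, grid=3, n_orient=4):
    """rects: list of (cx, cy, angle_deg) rectangle patches inside the
    width x height silhouette bounding box. Returns an L1-normalized vector."""
    hist = [[0] * n_orient for _ in range(grid * grid)]
    for cx, cy, angle in rects:
        gx = min(int(cx / width * grid), grid - 1)    # spatial cell column
        gy = min(int(cy / height * grid), grid - 1)   # spatial cell row
        ob = int((angle % 180) / (180 / n_orient)) % n_orient  # orientation bin
        hist[gy * grid + gx][ob] += 1
    flat = [v for cell in hist for v in cell]
    total = sum(flat) or 1
    return [v / total for v in flat]
```

Matching strategy (i) from the abstract then amounts to nearest-neighbor classification over these per-frame vectors, e.g. under L1 distance.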
A line-based representation for matching words in historical manuscripts
In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matching based on matching the lines. We carry out experiments on a benchmark dataset consisting of manuscripts by George Washington, as well as on Ottoman manuscripts. © 2011 Elsevier B.V. All rights reserved.
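A matching criterion in the spirit described above can be sketched as follows: each word image is a set of line segments, and two words are compared by how well their segments align. The symmetric Chamfer-style cost below is an assumed stand-in for the paper's actual criterion.

```python
# Compare two words, each represented as line segments (x1, y1, x2, y2),
# by a symmetric mean nearest-segment distance; lower is a better match.
import math

def seg_dist(a, b):
    """Endpoint-to-endpoint distance between segments, best of both pairings."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    d1 = math.hypot(ax1 - bx1, ay1 - by1) + math.hypot(ax2 - bx2, ay2 - by2)
    d2 = math.hypot(ax1 - bx2, ay1 - by2) + math.hypot(ax2 - bx1, ay2 - by1)
    return min(d1, d2)

def word_match_cost(segs_a, segs_b):
    if not segs_a or not segs_b:
        return float("inf")
    ab = sum(min(seg_dist(a, b) for b in segs_b) for a in segs_a) / len(segs_a)
    ba = sum(min(seg_dist(b, a) for a in segs_a) for b in segs_b) / len(segs_b)
    return (ab + ba) / 2
```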
Cross-document word matching for segmentation and retrieval of Ottoman divans
Motivated by the need for automatic indexing and analysis of the huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the word. Using the idea that divans have multiple copies (versions) by different writers in different writing styles, and that word segmentation in some of those versions may be relatively easier to achieve than in others, segmentation of versions that are difficult, if not impossible, to segment with traditional techniques is performed using information carried over from a simpler version. One version of a document is used as the source dataset and another version of the same document as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words, and we improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding word boundaries and in retrieving words across documents.
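The combination idea above can be sketched as follows: a query word from the source version is matched against every run of consecutive sub-words in a target line, and the best-scoring run gives the word boundary. Comparing merged sub-words by a single width-like feature is a toy stand-in for real word-image features.

```python
# Spot a query word in a line of unsegmented sub-words by trying every
# consecutive combination and keeping the one whose merged extent best
# matches the query.
def spot_word(query_width, subword_widths, tol=0.15):
    """Return (start, end) indices of the consecutive sub-word run whose
    total width best matches query_width, or None if none is within tol."""
    best, best_err = None, tol
    for i in range(len(subword_widths)):
        total = 0
        for j in range(i, len(subword_widths)):
            total += subword_widths[j]
            err = abs(total - query_width) / query_width
            if err < best_err:
                best, best_err = (i, j + 1), err
    return best
```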
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video, using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and for localizing the sound source in an image using the AVC task. We make the following contributions: (i) we show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) we explore various architectures for the AVC task, including those for the visual stream that ingest a single image, multiple images, or a single image and multi-frame optical flow; (iii) we show that the semantic object that sounds within an image can be localized (using only the sound, with no motion or flow information); and (iv) we give a cautionary tale on how to avoid undesirable shortcuts in data preparation.
Appears in: European Conference on Computer Vision (ECCV) 201
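The AVC objective can be illustrated with a small numpy sketch: embed each modality into a shared space, convert the Euclidean distance between embeddings into a "corresponds" probability, and score aligned pairs as positives and mismatched pairs as negatives. The tiny linear embeddings and the distance-to-probability calibration here are illustrative assumptions, not the paper's networks.

```python
# Minimal AVC loss sketch: binary cross-entropy on whether an (image, audio)
# pair comes from the same moment of the same video.
import numpy as np

def embed(x, w):
    """L2-normalized linear embedding (stand-in for a deep subnetwork)."""
    z = x @ w
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)

def avc_loss(frames, audios, w_img, w_aud, alpha=10.0, bias=5.0):
    """Aligned (frame[i], audio[i]) pairs are positives; (frame[i],
    audio[i+1 mod n]) pairs serve as self-supervised negatives."""
    zi, za = embed(frames, w_img), embed(audios, w_aud)
    def p_correspond(a, b):            # small distance -> high probability
        d = np.linalg.norm(a - b, axis=-1)
        return 1.0 / (1.0 + np.exp(alpha * d - bias))
    p_pos = p_correspond(zi, za)
    p_neg = p_correspond(zi, np.roll(za, -1, axis=0))
    eps = 1e-8
    return -np.mean(np.log(p_pos + eps)) - np.mean(np.log(1 - p_neg + eps))
```

Because both labels come from the pairing itself, no manual annotation is needed, which is the self-supervision the abstract describes.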
Automatic categorization of Ottoman poems
This work is partially supported by the Scientific and Technical Research Council of Turkey (TÜBİTAK) under grant number 109E006. Authorship attribution and identifying the time period of literary works are fundamental problems in the quantitative analysis of languages. We investigate two fundamentally different machine learning text categorization methods, Support Vector Machines (SVM) and Naïve Bayes (NB), together with several style markers, for categorizing Ottoman poems according to their poets and time periods. We use the collected works (divans) of ten different Ottoman poets: two poets from each of five different hundred-year periods ranging from the 15th to the 19th century. Our experimental evaluation and statistical assessments show that it is possible to obtain highly accurate and reliable classifications and to distinguish the methods and style markers in terms of their effectiveness.
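The NB side of the comparison can be sketched compactly: a multinomial Naïve Bayes classifier over token (style-marker) frequencies attributes a poem to the poet with the highest posterior. The toy vocabulary and Laplace smoothing below are illustrative choices, not the study's exact setup.

```python
# Multinomial Naive Bayes over style-marker tokens, with Laplace smoothing.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (poet, tokens). Returns per-poet log-prior and
    log-likelihood tables plus an unseen-token fallback."""
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    vocab = set()
    for poet, tokens in docs:
        class_docs[poet] += 1
        class_words[poet].update(tokens)
        vocab.update(tokens)
    n = len(docs)
    model = {}
    for poet in class_docs:
        total = sum(class_words[poet].values())
        model[poet] = (
            math.log(class_docs[poet] / n),
            {w: math.log((class_words[poet][w] + 1) / (total + len(vocab)))
             for w in vocab},
            math.log(1 / (total + len(vocab))),   # unseen-token fallback
        )
    return model

def classify_nb(model, tokens):
    def score(poet):
        prior, table, unseen = model[poet]
        return prior + sum(table.get(w, unseen) for w in tokens)
    return max(model, key=score)
```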
Systematic evaluation of machine translation methods for image and video annotation
In this study, we present a systematic evaluation of machine translation methods applied to the image annotation problem. We use the well-studied Corel data set and the broadcast news videos from TRECVID 2003 as our datasets. We experiment with different machine translation models under different parameters. The results show that the simplest model produces the best performance. Based on this experience, we also propose a new method, based on cross-lingual information retrieval techniques, and obtain better retrieval performance. © Springer-Verlag Berlin Heidelberg 2005
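One reading of the "simplest model" finding is IBM Model 1, which can be trained by EM over paired (image regions, caption words) data. The region tokens ("blobs") standing in for quantized region features are an assumption of this sketch.

```python
# IBM Model 1 EM sketch: learn translation probabilities t(word | blob) from
# co-occurring blob/word sets, then name a region with its most likely word.
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """pairs: list of (blobs, words). Returns t[(word, blob)] = P(word | blob)."""
    blobs_v = {b for bs, _ in pairs for b in bs}
    words_v = {w for _, ws in pairs for w in ws}
    t = {(w, b): 1.0 / len(words_v) for w in words_v for b in blobs_v}
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for bs, ws in pairs:                      # E-step: expected alignments
            for w in ws:
                z = sum(t[(w, b)] for b in bs)
                for b in bs:
                    c = t[(w, b)] / z
                    count[(w, b)] += c
                    total[b] += c
        for (w, b), c in count.items():           # M-step: renormalize
            t[(w, b)] = c / total[b]
    return t

def name_region(t, blob, words):
    """Region naming: the word maximizing P(word | blob)."""
    return max(words, key=lambda w: t.get((w, blob), 0.0))
```

Even though each training image is only annotated as a whole, EM disambiguates which word belongs to which region from co-occurrence patterns across images.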
Recognizing objects and scenes in news videos
We propose a new approach to recognizing objects and scenes in news videos, motivated by the availability of large video collections. This approach treats the recognition problem as the translation of visual elements to words. The correspondences between visual elements and words are learned using methods adapted from statistical machine translation and are used to predict words for particular image regions (region naming), for entire images (auto-annotation), or to associate the automatically generated speech transcript text with the correct video frames (video alignment). Experimental results are presented on the TRECVID 2004 data set, which consists of about 150 hours of news videos associated with manual annotations and speech transcript text. The results show that retrieval performance can be improved by associating visual and textual elements. In addition, an extensive analysis of features is provided and a method for combining features is proposed. © Springer-Verlag Berlin Heidelberg 2006