
    Visual on-line learning in distributed camera networks

    Automatic detection of persons is an important application in visual surveillance. In general, state-of-the-art systems have two main disadvantages: first, a general detector usually has to be learned that is applicable to a wide range of scenes, so training is time-consuming and requires a huge amount of labeled data; second, the data is usually processed centrally, which leads to heavy network traffic. The goal of this paper is to overcome these problems with a person detection system based on distributed smart cameras (DSCs). Assuming a large number of cameras with partly overlapping views, the main idea is to reduce the model complexity of the detector by training a specific detector for each camera. These detectors are initialized by a pre-trained classifier, which is then adapted to a specific camera by co-training. In particular, for co-training we apply an on-line learning method (i.e., boosting for feature selection), where the information exchange is realized by mapping the overlapping views onto each other using a homography. This yields a compact scene-dependent representation, which allows the classifiers to be trained and evaluated on an embedded device. Moreover, since the information transfer is reduced to exchanging positions, the required network traffic is minimal. The power of the approach is demonstrated in various experiments on different publicly available data sets. In fact, we show that on-line learning and applying DSCs can benefit from each other. Index Terms — visual on-line learning, object detection, multi-camera networks
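    As a rough illustration of the co-training step described above (not the authors' code), the following sketch maps a detection position from one camera's view into an overlapping view via a homography, so the receiving camera can treat it as a positive example for its on-line boosted detector; the homography values and the detector update call are assumptions.

```python
# Minimal sketch: projecting a detection from camera A into an overlapping
# view (camera B) via a homography, so camera B can use it as a positive
# example for its on-line boosted detector. H_ab and the update API are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def project_point(H_ab, xy):
    """Map an image point (x, y) from camera A to camera B using homography H_ab."""
    x, y = xy
    p = H_ab @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Hypothetical homography, assumed to be estimated offline from ground-plane points.
H_ab = np.array([[1.02, 0.01, -15.0],
                 [0.00, 0.98,   4.0],
                 [0.00, 0.00,   1.0]])
detection_in_a = (320.0, 410.0)                 # foot point of a detected person in camera A
label_pos_in_b = project_point(H_ab, detection_in_a)
# Only the position is transmitted, which keeps network traffic minimal.
# camera_b_detector.update(image_b, label_pos_in_b, positive=True)  # assumed on-line boosting step
```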

    TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning

    Fusing data from multiple modalities provides more information to train machine learning systems. However, it is prohibitively expensive and time-consuming to label each modality with a large amount of data, which leads to the crucial problem of semi-supervised multi-modal learning. Existing methods suffer from either ineffective fusion across modalities or a lack of theoretical guarantees under proper assumptions. In this paper, we propose a novel information-theoretic approach, namely Total Correlation Gain Maximization (TCGM), for semi-supervised multi-modal learning, which is endowed with promising properties: (i) it can effectively utilize the information across different modalities of unlabeled data points to facilitate training classifiers for each modality; (ii) it has a theoretical guarantee of identifying Bayesian classifiers, i.e., the ground-truth posteriors of all modalities. Specifically, by maximizing the TC-induced loss (namely the TC gain) over the classifiers of all modalities, these classifiers can cooperatively discover the equivalence class of ground-truth classifiers, and then identify the unique ones by leveraging a limited fraction of labeled data. We apply our method to various tasks, including news classification, emotion recognition and disease prediction, and achieve state-of-the-art results. Comment: ECCV 2020 (oral)
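    For reference, the quantity that the TC gain in this abstract builds on is the standard total correlation (multi-information) of the modality variables; the paper's exact TC-gain objective may differ in form.

```latex
% Standard definition of total correlation over modalities X_1, ..., X_m;
% the paper's TC-gain loss is built on this quantity but may differ in detail.
\[
  \mathrm{TC}(X_1, \dots, X_m)
    = \sum_{i=1}^{m} H(X_i) - H(X_1, \dots, X_m)
    = D_{\mathrm{KL}}\!\Big( p(x_1, \dots, x_m) \,\Big\|\, \prod_{i=1}^{m} p(x_i) \Big)
\]
```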

    Adaptive Real-Time Image Processing for Human-Computer Interaction


    Person tracking on a mobile robot with heterogeneous inter-characteristic feedback


    Biased Competition in Visual Processing Hierarchies: A Learning Approach Using Multiple Cues

    In this contribution, we present a large-scale hierarchical system for object detection fusing bottom-up (signal-driven) processing results with top-down (model- or task-driven) attentional modulation. Specifically, we focus on the question of how the autonomous learning of invariant models can be embedded into a performing system and how such models can be used to define object-specific attentional modulation signals. Our system implements bi-directional data flow in a processing hierarchy. The bottom-up data flow proceeds from a preprocessing level to the hypothesis level, where object hypotheses created by exhaustive object detection algorithms are represented in a roughly retinotopic way. A competitive selection mechanism is used to determine the most confident hypotheses, which are used at the system level to train multimodal models that link object identity to invariant hypothesis properties. The top-down data flow originates at the system level, where the trained multimodal models are used to obtain space- and feature-based attentional modulation signals, providing biases for the competitive selection process at the hypothesis level. This results in object-specific hypothesis facilitation/suppression in certain image regions, which we show to be applicable to different object detection mechanisms. In order to demonstrate the benefits of this approach, we apply the system to the detection of cars in a variety of challenging traffic videos. Evaluating our approach on a publicly available dataset containing approximately 3,500 annotated video images from more than one hour of driving, we show strong increases in performance and generalization compared to object detection in isolation. Furthermore, we compare our results to a late hypothesis rejection approach, showing that early coupling of top-down and bottom-up information is a favorable approach, especially when processing resources are constrained.
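    As a toy illustration of this early coupling (not the paper's implementation), the sketch below multiplies bottom-up hypothesis confidences by top-down, object-specific bias factors before a simple competitive selection; all names and values are assumptions.

```python
# Illustrative sketch of early coupling: bottom-up hypothesis confidences are
# modulated by top-down biases before competitive selection. The bias model
# and all values are assumptions, not the paper's system.
import numpy as np

def competitive_selection(bottom_up_conf, top_down_bias, k=3):
    """Combine signal-driven confidences with attentional biases, then keep the
    k most confident hypotheses (a simple stand-in for the competition stage)."""
    modulated = bottom_up_conf * top_down_bias       # facilitation (>1) or suppression (<1)
    order = np.argsort(modulated)[::-1]
    return order[:k], modulated

bottom_up_conf = np.array([0.42, 0.80, 0.35, 0.77, 0.55])   # hypothetical detector confidences
top_down_bias  = np.array([1.30, 0.60, 1.10, 1.20, 0.90])   # hypothetical space/feature-based biases
selected, scores = competitive_selection(bottom_up_conf, top_down_bias)
print(selected, scores)
```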

    Transferring a generic pedestrian detector towards specific scenes.

    In recent years, significant progress has been made in learning generic pedestrian detectors from publicly available, manually labeled, large-scale training datasets. However, when a generic pedestrian detector is applied to a specific, previously unseen scene, where the testing data (target examples) does not match the training data (source examples) because of variations in viewpoint, resolution, illumination and background, its accuracy may decrease greatly. In this thesis, a new framework is proposed that automatically adapts a pre-trained generic pedestrian detector to a specific traffic scene. The framework is two-phased. In the first phase, scene-specific cues in the video surveillance sequence are explored. Utilizing this multi-cue information, both confident positive and negative examples from the target scene are selected to re-train the detector iteratively. In the second phase, a new machine learning framework is proposed, incorporating not only example labels but also example confidences. Source and target examples are re-weighted according to their confidence, optimizing the performance of the final classifier. Both methods belong to semi-supervised learning and require very little human intervention. The proposed approaches significantly improve the accuracy of the generic pedestrian detector. Their results are comparable with those of a detector trained using a large number of manually labeled frames from the target scene. Comparison with other existing approaches tackling similar problems shows that the proposed approaches outperform many contemporary methods. The works were published at the IEEE Conference on Computer Vision and Pattern Recognition in 2011 and 2012, respectively.
    Wang, Meng. Thesis (M.Phil.)--Chinese University of Hong Kong, 2012. Includes bibliographical references (leaves 42-45). Abstracts also in Chinese. Contents:
    Chapter 1 --- Introduction
        1.1 --- Pedestrian Detection
            1.1.1 --- Overview
            1.1.2 --- Statistical Learning
            1.1.3 --- Object Representation
            1.1.4 --- Supervised Statistical Learning in Object Detection
        1.2 --- Pedestrian Detection in Video Surveillance
            1.2.1 --- Problem Setting
            1.2.2 --- Challenges
            1.2.3 --- Motivations and Contributions
        1.3 --- Related Work
        1.4 --- Organization of Chapters
    Chapter 2 --- Label Inferring by Multi-Cues
        2.1 --- Data Set
        2.2 --- Method
            2.2.1 --- Confident Positive Examples of Pedestrians
            2.2.2 --- Confident Negative Examples from the Background
            2.2.3 --- Confident Negative Examples from Vehicles
            2.2.4 --- Final Scene-Specific Pedestrian Detector
        2.3 --- Experiment Results
    Chapter 3 --- Transferring a Detector by Confidence Propagation
        3.1 --- Method
            3.1.1 --- Overview
            3.1.2 --- Initial Estimation of Confidence Scores
            3.1.3 --- Re-weighting Source Samples
            3.1.4 --- Confidence-Encoded SVM
        3.2 --- Experiments
            3.2.1 --- Datasets
            3.2.2 --- Parameter Setting
            3.2.3 --- Results
    Chapter 4 --- Conclusions and Future Work
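    As a rough stand-in for the confidence-based re-weighting in the second phase (the thesis's Confidence-Encoded SVM is more elaborate), the sketch below feeds per-example confidences to an ordinary SVM as sample weights; the data, confidence values and names are illustrative assumptions.

```python
# Minimal stand-in (not the thesis's Confidence-Encoded SVM): approximating the
# idea of re-weighting source and target examples by confidence through
# per-sample weights in an ordinary SVM. All data and values are assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(200, 16))   # generic (source) examples
y_source = rng.integers(0, 2, size=200)
X_target = rng.normal(0.3, 1.0, size=(50, 16))    # scene-specific (target) examples
y_target = rng.integers(0, 2, size=50)             # labels inferred from scene cues

conf_source = np.full(200, 0.3)                    # down-weight mismatched source data
conf_target = rng.uniform(0.5, 1.0, size=50)       # confidence of the inferred target labels

X = np.vstack([X_source, X_target])
y = np.concatenate([y_source, y_target])
w = np.concatenate([conf_source, conf_target])

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=w)                     # confidence enters as a sample weight
```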

    Learning Hierarchical Representations For Video Analysis Using Deep Learning

    With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of deep models are threefold: 1) they learn features directly from the raw signal, in contrast to hand-designed features; 2) the learning can be unsupervised, which is suitable for large data where labeling everything is expensive and impractical; 3) they learn a hierarchy of features one level at a time, and this layerwise stacking of feature extractors often yields better representations. However, not many deep learning models have been proposed to solve problems in video analysis, especially for videos "in the wild". Most of them either deal with simple datasets or are limited to low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones that are more favorable in classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks: complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively evaluated on challenging state-of-the-art datasets. Besides learning low-level local features, higher-level representations are further designed to be learned in the context of the applications: data-driven concept representations and sparse representations of events are learned for complex event recognition; representations for object body parts and structures are learned for object detection in videos; and relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features better discriminative ability; second, our learned features are more compact, whereas unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.
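    A minimal sketch of the joint objective described in the second contribution, assuming a toy architecture rather than the thesis's model: an autoencoder whose code is trained with a generative (reconstruction) term plus a discriminative (classification) term.

```python
# Toy sketch (architecture and loss weighting are assumptions, not the thesis's
# model): an autoencoder trained with a joint generative + discriminative objective.
import torch
import torch.nn as nn

class DiscriminativeAutoencoder(nn.Module):
    def __init__(self, in_dim=256, code_dim=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)
        self.classifier = nn.Linear(code_dim, n_classes)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.classifier(code)

model = DiscriminativeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 256)                       # toy feature batch
y = torch.randint(0, 2, (32,))                 # toy labels

recon, logits = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.5 * nn.functional.cross_entropy(logits, y)
opt.zero_grad(); loss.backward(); opt.step()   # one joint generative + discriminative step
```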

    Classifying tracked objects in far-field video surveillance

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 67-70). Automated visual perception of the real world by computers requires classification of observed physical objects into semantically meaningful categories (such as 'car' or 'person'). We propose a partially-supervised learning framework for classification of moving objects (mostly vehicles and pedestrians) that are detected and tracked in a variety of far-field video sequences, captured by a static, uncalibrated camera. We introduce the use of scene-specific context features (such as the image position of objects) to improve classification performance in any given scene. At the same time, we design a scene-invariant object classifier, along with an algorithm to adapt this classifier to a new scene. Scene-specific context information is extracted through passive observation of unlabelled data. Experimental results are demonstrated in the context of outdoor visual surveillance of a wide variety of scenes. By Biswajit Bose. S.M.
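    As a small illustration of scene-specific context features (not the thesis's pipeline), the sketch below appends an object's normalized image position and size to its appearance descriptor before classification; all names and values are assumptions.

```python
# Illustrative sketch: augmenting an object's appearance features with
# scene-specific context such as normalized image position, so a classifier
# can exploit where objects tend to appear in a given scene. All values are toy data.
import numpy as np

def add_position_context(appearance, bbox, frame_w, frame_h):
    """Append normalized center position and box size to an appearance feature vector."""
    x, y, w, h = bbox
    cx = (x + w / 2.0) / frame_w
    cy = (y + h / 2.0) / frame_h
    context = np.array([cx, cy, w / frame_w, h / frame_h])
    return np.concatenate([appearance, context])

appearance = np.random.rand(32)     # toy appearance descriptor for a tracked object
feat = add_position_context(appearance, bbox=(600, 340, 40, 90), frame_w=720, frame_h=480)
```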