651 research outputs found
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed-up techniques, and the recent state-of-the-art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an in-depth
analysis of their challenges as well as technical improvements in recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible publication.
Learning Transferable Representations for Visual Recognition
In the last half-decade, a new renaissance of machine learning has originated from the applications of convolutional neural networks to visual recognition tasks. It is believed that a combination of big curated data and novel deep learning techniques can lead to unprecedented results. However, the increasingly large training data is still a drop in the ocean compared with scenarios in the wild. In this literature, we focus on learning transferable representations in neural networks to ensure the models stay robust, even given different data distributions. We present three exemplar topics in three chapters, respectively: zero-shot learning, domain adaptation, and generalizable adversarial attack. By zero-shot learning, we enable models to predict labels not seen in the training phase. By domain adaptation, we improve a model's performance on the target domain by mitigating its discrepancy from a labeled source model, without any target annotation. Finally, the generalizable adversarial attack focuses on learning an adversarial camouflage that ideally would work in every possible scenario. Despite sharing the same transfer learning philosophy, each of the proposed topics poses a unique challenge requiring a unique solution. In each chapter, we introduce the problem and present our solution to it. We also discuss some other researchers' approaches and compare our solution to theirs in the experiments.
Labeling and modeling large databases of videos
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 91-98).
As humans, we can say many things about the scenes surrounding us. For instance, we can tell what type of scene and location an image depicts, describe what objects live in it, their material properties, or their spatial arrangement. These comprise descriptions of a scene and are widely studied areas in computer vision. This thesis, however, hypothesizes that observers have an inherent prior knowledge that can be applied to the scene at hand. This prior knowledge can be translated into the cognizance of which objects move, or of the trajectories and velocities to expect. Conversely, when faced with unusual events such as car accidents, humans are very well tuned to identify them regardless of having observed the scene a priori. This is, in part, due to prior observations that we have for scenes with similar configurations to the current one. This thesis emulates the prior knowledge base of humans by creating a large and heterogeneous database and annotation tool for videos depicting real-world scenes. The first application of this thesis is in the area of unusual event detection. Given a short clip, the task is to identify the moving portions of the scene that depict abnormal events. We adopt a data-driven framework powered by scene matching techniques to retrieve the videos nearest to the query clip and integrate the motion information in the nearest videos. The result is a final clip with localized annotations for unusual activity. The second application lies in the area of event prediction. Given a static image, we adapt our framework to compile a prediction of motions to expect in the image. This result is crafted by integrating the knowledge of videos depicting scenes similar to the query image. With the help of scene matching, only scenes relevant to the queries are considered, resulting in reliable predictions. Our dataset, experimentation, and proposed model introduce and explore a new facet of scene understanding in images and videos.
by Jenny Yuen. Ph.D.
Unlocking the capabilities of explainable few-shot learning in remote sensing
Recent advancements have significantly improved the efficiency and
effectiveness of deep learning methods for image-based remote sensing tasks.
However, the requirement for large amounts of labeled data can limit the
applicability of deep neural networks to existing remote sensing datasets. To
overcome this challenge, few-shot learning has emerged as a valuable approach
for enabling learning with limited data. While previous research has evaluated
the effectiveness of few-shot learning methods on satellite-based datasets,
little attention has been paid to exploring the applications of these methods
to datasets obtained from UAVs, which are increasingly used in remote sensing
studies. In this review, we provide an up-to-date overview of both existing and
newly proposed few-shot classification techniques, along with appropriate
datasets that are used for both satellite-based and UAV-based data. Our
systematic approach demonstrates that few-shot learning can effectively adapt to
the broader and more diverse perspectives that UAV-based platforms can provide.
We also evaluate some SOTA few-shot approaches on a UAV disaster scene
classification dataset, yielding promising results. We emphasize the importance
of integrating XAI techniques like attention maps and prototype analysis to
increase the transparency, accountability, and trustworthiness of few-shot
models for remote sensing. Key challenges and future research directions are
identified, including tailored few-shot methods for UAVs, extending to unseen
tasks like segmentation, and developing optimized XAI techniques suited for
few-shot remote sensing problems. This review aims to provide researchers and
practitioners with an improved understanding of few-shot learning's capabilities
and limitations in remote sensing, while highlighting open problems to guide
future progress in efficient, reliable, and interpretable few-shot methods.
Comment: Under review, once the paper is accepted, the copyright will be
transferred to the corresponding journal
A computer vision system for detecting and analysing critical events in cities
Whether for commuting or leisure, cycling is a growing transport mode in many cities worldwide. However, it is still perceived as a dangerous activity. Although serious incidents related to cycling leading to major injuries are rare, the fear of getting hit or falling hinders the expansion of cycling as a major transport mode. Indeed, it has been shown that focusing on serious injuries only touches the tip of the iceberg. Near miss data can provide much more information about potential problems and how to avoid risky situations that may lead to serious incidents. Unfortunately, there is a gap in the knowledge in identifying and analysing near misses. This hinders drawing statistically significant conclusions to provide measures for the built environment that ensure a safer environment for people on bikes. In this research, we develop a method to detect and analyse near misses and their risk factors using artificial intelligence. This is accomplished by analysing video streams linked to near miss incidents within a novel framework relying on deep learning and computer vision. This framework automatically detects near misses and extracts their risk factors from video streams before analysing their statistical significance. It also provides practical solutions implemented in a camera with embedded AI (URBAN-i Box) and a cloud-based service (URBAN-i Cloud) to tackle the stated issue in real-world settings for use by researchers, policy-makers, or citizens. The research aims to provide human-centred evidence that may enable policy-makers and planners to provide a safer built environment for cycling in London, or elsewhere. More broadly, this research aims to contribute to the scientific literature with the theoretical and empirical foundations of a computer vision system that can be utilised for detecting and analysing other critical events in a complex environment. Such a system can be applied to a wide range of events, such as traffic incidents, crime, or overcrowding.
Adversarial Attacks and Defenses in Machine Learning-Powered Networks: A Contemporary Survey
Adversarial attacks and defenses in machine learning and deep neural networks
have been gaining significant attention due to the rapidly growing applications
of deep learning in the Internet and relevant scenarios. This survey provides a
comprehensive overview of the recent advancements in the field of adversarial
attack and defense techniques, with a focus on deep neural network-based
classification models. Specifically, we conduct a comprehensive classification
of recent adversarial attack methods and state-of-the-art adversarial defense
techniques based on attack principles, and present them in visually appealing
tables and tree diagrams. This is based on a rigorous evaluation of the
existing works, including an analysis of their strengths and limitations. We
also categorize the methods into counter-attack detection and robustness
enhancement, with a specific focus on regularization-based methods for
enhancing robustness. New avenues of attack are also explored, including
search-based, decision-based, drop-based, and physical-world attacks, and a
hierarchical classification of the latest defense methods is provided,
highlighting the challenges of balancing training costs with performance,
maintaining clean accuracy, overcoming the effect of gradient masking, and
ensuring method transferability. Finally, the lessons learned and open
challenges are summarized with future research opportunities recommended.
Comment: 46 pages, 21 figures
Soft Biometric Analysis: Multi-Person and Real-Time Pedestrian Attribute Recognition in Crowded Urban Environments
Traditionally, recognition systems were only based on human hard biometrics. However,
the ubiquitous CCTV cameras have raised the desire to analyze human biometrics from
far distances, without people's attendance in the acquisition process. High-resolution
face close-shots are rarely available at far distances, such that face-based systems cannot
provide reliable results in surveillance applications. Human soft biometrics such as body
and clothing attributes are believed to be more effective in analyzing human data collected
by security cameras.
This thesis contributes to the human soft biometric analysis in uncontrolled environments
and mainly focuses on two tasks: Pedestrian Attribute Recognition (PAR) and person
re-identification (re-id). We first review the literature of both tasks and highlight the
history of advancements, recent developments, and the existing benchmarks. PAR and person
re-id difficulties are due to significant distances between intra-class samples, which
originate from variations in several factors such as body pose, illumination, background,
occlusion, and data resolution. Recent state-of-the-art approaches present end-to-end
models that can extract discriminative and comprehensive feature representations from
people. The correlation between different regions of the body and dealing with limited
learning data is also the objective of many recent works. Moreover, class imbalance and
correlation between human attributes are specific challenges associated with the PAR problem.
We collect a large surveillance dataset to train a novel gender recognition model suitable
for uncontrolled environments. We propose a deep residual network that extracts several
pose-wise patches from samples and obtains a comprehensive feature representation. In
the next step, we develop a model for multiple attribute recognition at once. Considering
the correlation between human semantic attributes and class imbalance, we respectively
use a multi-task model and a weighted loss function. We also propose a multiplication
layer on top of the backbone feature extraction layers to exclude the background features
from the final representation of samples and draw the attention of the model to the
foreground area.
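The abstract does not give the form of the weighted loss, so the following is only a minimal sketch of one common way to weight a multi-attribute binary cross-entropy against class imbalance. The exponential weighting, the function names `positive_ratios` and `weighted_bce`, and the toy inputs are all assumptions made for illustration, not the thesis's actual formulation.

```python
import math

def positive_ratios(labels):
    """Fraction of positive samples per attribute.

    labels: list of rows, each row a list of 0/1 attribute labels.
    """
    n = len(labels)
    return [sum(row[j] for row in labels) / n for j in range(len(labels[0]))]

def weighted_bce(probs, targets, ratios, eps=1e-7):
    """Binary cross-entropy averaged over attributes, up-weighting rare outcomes.

    Assumed weighting (hypothetical, for illustration): a positive label on
    attribute j gets weight exp(1 - r_j) and a negative label gets exp(r_j),
    where r_j is the positive ratio of that attribute in the training set.
    """
    total = 0.0
    for p, t, r in zip(probs, targets, ratios):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        w = math.exp(1.0 - r) if t == 1 else math.exp(r)
        total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)
```

Under this assumed weighting, missing a positive on a rare attribute costs more than an equally confident false positive on it, which is the general effect a weighted loss is meant to have on imbalanced attributes.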
We address the problem of person re-id by implicitly defining the receptive fields of
deep learning classification frameworks. The receptive fields of deep learning models
determine the most significant regions of the input data for providing correct decisions.
Therefore, we synthesize a set of learning data in which the destructive regions (e.g.,
background) in each pair of instances are interchanged. A segmentation module
determines destructive and useful regions in each sample, and the labels of synthesized
instances are inherited from the sample that shared the useful regions in the synthesized
image. The synthesized learning data are then used in the learning phase and help
the model rapidly learn that the identity and background regions are not correlated.
Meanwhile, the proposed solution could be seen as a data augmentation approach that
fully preserves the label information and is compatible with other data augmentation
techniques.
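The background-interchange idea described above can be sketched roughly as follows. This is a simplified illustration, not the thesis's pipeline: it assumes both images have the same size, that a binary foreground mask is already available for each sample (standing in for the segmentation module), and the names `swap_background` and `synthesize_pair` are hypothetical.

```python
def swap_background(img_fg, mask_fg, img_bg):
    """Keep pixels of img_fg where mask_fg is 1; elsewhere take pixels of img_bg.

    Images and masks are equally sized 2D lists. Simplification: any part of
    img_bg's own foreground not covered by mask_fg remains visible.
    """
    return [
        [p_fg if m else p_bg for p_fg, m, p_bg in zip(row_fg, row_m, row_bg)]
        for row_fg, row_m, row_bg in zip(img_fg, mask_fg, img_bg)
    ]

def synthesize_pair(sample_a, sample_b):
    """Exchange backgrounds between two samples.

    Each sample is (image, foreground_mask, identity_label). Each synthesized
    image inherits the label of the sample that supplied its foreground.
    """
    img_a, mask_a, label_a = sample_a
    img_b, mask_b, label_b = sample_b
    return [
        (swap_background(img_a, mask_a, img_b), label_a),
        (swap_background(img_b, mask_b, img_a), label_b),
    ]
```

Because the label follows the foreground, the same identity appears against different backgrounds in the synthesized set, which is what lets a model learn that the two are uncorrelated.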
When re-id methods are learned in scenarios where the target person appears with identical
garments in the gallery, the visual appearance of clothes is given the most importance in
the final feature representation. Cloth-based representations are not reliable in
long-term re-id settings, as people may change their clothes. Therefore, solutions that
ignore clothing cues and focus on identity-relevant features are in demand. We transform
the original data such that the identity-relevant information of people (e.g., face and
body shape) is removed, while the identity-unrelated cues (i.e., color and texture of
clothes) remain unchanged. A model learned on the synthesized dataset predicts the
identity-unrelated cues (short-term features). Therefore, we train a second model that is
coupled with the first model and learns the embeddings of the original data such that the
similarity between the embeddings of the original and synthesized data is minimized. This
way, the second model predicts based on the identity-related (long-term) representation
of people.
To evaluate the performance of the proposed models, we use PAR and person re-id
datasets, namely BIODI, PETA, RAP, Market-1501, MSMT-V2, PRCC, LTCC, and MIT,
and compare our experimental results with state-of-the-art methods in the field.
In conclusion, the data collected from surveillance cameras have low resolution, such
that the extraction of hard biometric features is not possible, and face-based approaches
produce poor results. In contrast, soft biometrics are robust to variations in data quality.
So, we propose approaches both for PAR and person re-id to learn discriminative features
from each instance and evaluate our proposed solutions on several publicly available
benchmarks.
This thesis was prepared at the University of Beira Interior, IT Instituto de Telecomunicações, Soft Computing and Image Analysis Laboratory (SOCIA Lab), Covilhã Delegation, and was submitted to the University of Beira Interior for defense in a public examination session.
The New Reflexivity: Puzzle Films, Found Footage, and Cinematic Narration in the Digital Age
“The New Reflexivity” tracks two narrative styles of contemporary Hollywood production that have yet to be studied in tandem: the puzzle film and the found footage horror film. In early August 1999, near the end of what D.N. Rodowick refers to as “the summer of digital paranoia,” two films entered the wide-release U.S. theatrical marketplace and enjoyed surprisingly massive financial success, just as news of the “death of film” circulated widely. Though each might typically be classified as belonging to the horror genre, both the unreliable “puzzle film” The Sixth Sense and the fake-documentary “found footage film” The Blair Witch Project stood as harbingers of new narrative currents in global cinema. This dissertation looks closely at these two films, reading them as illustrative of two decidedly millennial narrative styles, styles that stepped out strikingly from the computer-generated shadows cast by big-budget Hollywood. The industrial shift to digital media that coincides with the rise of these films in the late 90s reframed the cinematic image as inherently manipulable, no longer a necessary index of physical reality. Directors become image-writers, constructing photorealistic imagery from scratch. Meanwhile, DVDs and online paratexts encourage cinephiles to digitize, to attain and interact with cinema in novel ways. “The New Reflexivity” reads The Sixth Sense and The Blair Witch Project as reflexive allegories of cinema’s and society’s encounters with new digital media. The most basic narrative tricks and conceits of puzzle films and found footage films produce an unusually intense and ludic engagement with narrative boundaries and limits, thus undermining the naturalized practices of classical Hollywood narration. Writers and directors of these films treat recorded events and narrative worlds as reviewable, remixable, and upgradeable, just as Hollywood digitizes and tries to keep up with new media. 
Though a great deal of critical attention has been paid to both puzzle and found footage films separately, no lengthy critical survey has yet been undertaken that considers these movies in terms of their shared formal and thematic concerns. Rewriting the rules of popular cinematic narration, these films encourage viewers to be suspicious of what they see onscreen, to be aware of the possibility of unreliable narration, or CGI and the “Photoshopped.” Urgent to film and cultural studies, “The New Reflexivity” suggests that these genres’ complicitous critique of new media is decidedly instructive for a networked society struggling with what it means to be digital.