1,838 research outputs found
Crop conditional Convolutional Neural Networks for massive multi-crop plant disease classification over cell phone acquired images taken on real field conditions
Convolutional Neural Networks (CNN) have demonstrated their capabilities on the agronomical field, especially for plant visual symptoms assessment. As these models grow both in the number of training images and in the number of supported crops and diseases, there exist the dichotomy of (1) generating smaller models for specific crop or, (2) to generate a unique multi-crop model in a much more complex task (especially at early disease stages) but with the benefit of the entire multiple crop image dataset variability to enrich image feature description learning.
In this work we first introduce a challenging dataset of more than one hundred-thousand images taken by cell phone in real field wild conditions. This dataset contains almost equally distributed disease stages of seventeen diseases and five crops (wheat, barley, corn, rice and rape-seed) where several diseases can be present on the same picture.
When applying existing state of the art deep neural network methods to validate the two hypothesised approaches, we obtained a balanced accuracy (BAC=0.92) when generating the smaller crop specific models and a balanced accuracy (BAC=0.93) when generating a single multi-crop model.
In this work, we propose three different CNN architectures that incorporate contextual non-image meta-data such as crop information onto an image based Convolutional Neural Network. This combines the advantages of simultaneously learning from the entire multi-crop dataset while reducing the complexity of the disease classification tasks. The crop-conditional plant disease classification network that incorporates the contextual information by concatenation at the embedding vector level obtains a balanced accuracy of 0.98 improving all previous methods and removing 71% of the miss-classifications of the former methods
What-and-Where to Match: Deep Spatially Multiplicative Integration Networks for Person Re-identification
Matching pedestrians across disjoint camera views, known as person
re-identification (re-id), is a challenging problem that is of importance to
visual recognition and surveillance. Most existing methods exploit local
regions within spatial manipulation to perform matching in local
correspondence. However, they essentially extract \emph{fixed} representations
from pre-divided regions for each image and perform matching based on the
extracted representation subsequently. For models in this pipeline, local finer
patterns that are crucial to distinguish positive pairs from negative ones
cannot be captured, and thus making them underperformed. In this paper, we
propose a novel deep multiplicative integration gating function, which answers
the question of \emph{what-and-where to match} for effective person re-id. To
address \emph{what} to match, our deep network emphasizes common local patterns
by learning joint representations in a multiplicative way. The network
comprises two Convolutional Neural Networks (CNNs) to extract convolutional
activations, and generates relevant descriptors for pedestrian matching. This
thus, leads to flexible representations for pair-wise images. To address
\emph{where} to match, we combat the spatial misalignment by performing
spatially recurrent pooling via a four-directional recurrent neural network to
impose spatial dependency over all positions with respect to the entire image.
The proposed network is designed to be end-to-end trainable to characterize
local pairwise feature interactions in a spatially aligned manner. To
demonstrate the superiority of our method, extensive experiments are conducted
over three benchmark data sets: VIPeR, CUHK03 and Market-1501.Comment: Published at Pattern Recognition, Elsevie
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in the video. Several attempts have
been made to capture frame-level salient aspects through attention but they
lack the capacity to effectively map the temporal relationships in between the
frames as they only capture a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech
Egocentric datasets, and achieves state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings which are inclusive of
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.Comment: Published in Pattern Recognition Letter
- …