2 research outputs found
Detecci贸n y seguimiento de manos en videos digitales utilizando computadores y mini-computadores
El problema del seguimiento de manos o hand tracking puede definirse como la capacidad de un sistema
computacional de poder reconocer las manos de un individuo (usuario) y hacerles un seguimiento en todo
momento. El inter麓es por el estudio del movimiento de las manos se debe a dos particularidades. En primer
lugar, se debe a que las manos son protagonistas en la realizaci麓on de varias tareas diarias del ser humano,
pues las manos son un distintivo de las diferentes actividades humanas. Las manos permite la manipulaci
麓on de objetos; de lo cual se basa una gran dimensi麓on de la interactividad del hombre con sus diferentes
herramientas de trabajo [1]. No es de sorprender que, con el reconocimiento del movimiento de las manos,
se puedan reconocer varias actividades de las personas: comer, saludar, martillar, apu藴nar, se藴nalar, etc. En
segundo lugar, las manos, junto con el rostro, son los dos mayores indicadores gestuales dentro de la comunicaci贸n no verbal; lo cual indica que en las manos hay un gran despliegue de diferentes gestos, seas y
apariencias, y por tanto, tengan una gran riqueza de significado comunicativo.Tesi
Spatiotemporal visual analysis of human actions
In this dissertation we propose four methods for the recognition of human activities. In all four of
them, the representation of the activities is based on spatiotemporal features that are automatically
detected at areas where there is a significant amount of independent motion, that is, motion that is
due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features
throughout this dissertation. The algorithms presented, however, can be used with any kind of features,
as long as the latter are well localized and have a well-defined area of support in space and time. We
introduce the utilized spatiotemporal salient points in the first method presented in this dissertation.
By extending previous work on spatial saliency, we measure the variations in the information content of
pixel neighborhoods both in space and time, and detect the points at the locations and scales for which
this information content is locally maximized. In this way, an activity is represented as a collection of
spatiotemporal salient points. We propose an iterative linear space-time warping technique in order
to align the representations in space and time and propose to use Relevance Vector Machines (RVM)
in order to classify each example into an action category. In the second method proposed in this
dissertation we propose to enhance the acquired representations of the first method. More specifically,
we propose to track each detected point in time, and create representations based on sets of trajectories,
where each trajectory expresses how the information engulfed by each salient point evolves over time.
In order to deal with imperfect localization of the detected points, we augment the observation model
of the tracker with background information, acquired using a fully automatic background estimation
algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the
body, like the hands and the head, and we further augment the tracker鈥檚 observation model using a
human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm
(LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and
RVMs for classification. In the third method that we propose, we assume that neighboring salient
points follow a similar motion. This is in contrast to the previous method, where each salient point was
tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual
descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The
latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal
neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in
translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the
corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are
extracted across the whole dataset are subsequently clustered in order to create a codebook, which is
used in order to represent the overall motion of the subjects within small temporal windows.Finally,we use boosting in order to select the most discriminative of these windows for each class, and RVMs for
classification. The fourth and last method addresses the joint problem of localization and recognition
of human activities depicted in unsegmented image sequences. Its main contribution is the use of an
implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal
localization of characteristic ensembles of spatiotemporal features. The latter are localized around
automatically detected salient points. Evidence for the spatiotemporal localization of the activity
is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in
order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct
class-specific spatiotemporal models, which encode where in space and time each codeword ensemble
appears in the training set. During testing, each activated codeword ensemble casts probabilistic
votes concerning the spatiotemporal localization of the activity, according to the information stored
during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable
hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume
which potentially engulfs the activity, and is verified by performing action category classification with
an RVM classifier