Video streaming has grown noticeably over the last decade, making the task of searching for and recommending videos increasingly difficult. Whereas information retrieval for video streaming used to rely solely on text and metadata, content-based image and video retrieval is now an active research area. To add value to users' searches, it is interesting to assess the quality and aesthetic value of the retrieved information.
In this thesis we extract several motion-related descriptors in order to aesthetically assess a database of car commercials. The videos in the database are retrieved from YouTube and labeled according to the metadata provided by the website. Specifically, three kinds of labeling are used: quality (likes/dislikes), quantity (number of views), and the combination of both. Quality and quantity each provide a binary labeling, while the combination clusters the videos into four classes.
As is usual in computer vision, the main objective is to propose a set of descriptors and to design and provide the procedures for computing their values over the corpus of videos. These descriptor values are obtained by processing the frames and reducing the resulting data to specific numbers. With them it may be possible to determine whether they carry enough information to predict the aesthetic appeal of the videos. In this project we focus on motion descriptors.
As an approach to capturing motion data, the optical flow is estimated between each pair of frames using a Matlab-friendly C++ implementation. The algorithm is based on the brightness constancy assumption between two frames, which leads to a continuous spatio-temporal objective function. This function is discretized and linearized, and the temporal dimension is removed by restricting it to two frames. The minimizer is then found by setting the gradient to zero using Iteratively Reweighted Least Squares (IRLS), a method that repeatedly recomputes weights until the zero-gradient condition is fulfilled. Each IRLS iteration yields a linear system, which is solved with the Successive Over-Relaxation (SOR) method, a variant of Gauss-Seidel with faster convergence.
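The SOR step can be sketched as follows for a generic dense system (the actual implementation works on the sparse systems arising from the discretized flow energy):

```python
import numpy as np

def sor_solve(A, b, omega=1.8, tol=1e-6, max_iter=1000):
    """Solve A x = b with Successive Over-Relaxation.

    omega in (1, 2) over-relaxes the Gauss-Seidel updates, which
    typically speeds up convergence for the diffusion-like
    systems arising in variational optical flow.
    """
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(A.shape[0]):
            # Gauss-Seidel update: x[:i] already holds this sweep's
            # fresh values, x[i+1:] still holds the previous sweep's
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            gs = (b[i] - sigma) / A[i, i]
            # Over-relax: blend the old value with the update
            x[i] = (1 - omega) * x_old[i] + omega * gs
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```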
The optical flow algorithm requires several parameters to be set. Since tuning them automatically is difficult, their values are chosen by observing how well and how efficiently the resulting flow represents the observed motion. Once the optical flow is calculated, we filter out regions with homogeneous texture, since similar pixel values in a neighborhood can induce estimation errors. To determine the texture level of different frame regions, we measure the entropy of each one, which provides a measure of pixel randomness. This is done by converting each frame to gray-scale and dividing it into 60 windows.
Afterwards, a threshold is set to decide which regions are considered low-texture. The threshold is chosen with care, since filtering too aggressively could make the extracted descriptors unrepresentative. In cases with many very homogeneous regions (e.g., completely black frames), however, a large number of vectors will be discarded no matter what threshold is set. Whenever a region's entropy falls below the threshold, the region is considered low-texture and, as a consequence, its optical flow vectors are not taken into consideration.
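A minimal sketch of this filtering, assuming a 6 x 10 window grid (the thesis only fixes the total of 60 windows) and an illustrative entropy threshold:

```python
import numpy as np

def low_texture_mask(gray, n_rows=6, n_cols=10, threshold=3.0):
    """Flag low-texture windows of a gray-scale frame via entropy.

    The frame is split into n_rows * n_cols windows (60 in total);
    windows whose gray-level entropy falls below `threshold`
    (an illustrative value) are flagged so that their optical
    flow vectors can be discarded.
    """
    h, w = gray.shape
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    for r in range(n_rows):
        for c in range(n_cols):
            win = gray[r * h // n_rows:(r + 1) * h // n_rows,
                       c * w // n_cols:(c + 1) * w // n_cols]
            hist, _ = np.histogram(win, bins=256, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]                       # avoid log(0)
            entropy = -(p * np.log2(p)).sum()  # Shannon entropy in bits
            mask[r, c] = entropy < threshold
    return mask
```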
After the texture-based filtering, the first step is to compute the angle and modulus of the estimated motion at every pixel from the flow components. To make directions easier to interpret when computing the different descriptors, the angles are quantized to the 8 cardinal points.
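A possible per-pixel conversion is sketched below; the sign convention for the vertical flow component is an assumption, not necessarily the one used in the thesis:

```python
import numpy as np

CARDINALS = ['E', 'NE', 'N', 'NW', 'W', 'SW', 'S', 'SE']

def flow_to_cardinals(u, v):
    """Convert flow components (u, v) to a per-pixel modulus and
    one of 8 cardinal directions.

    Assumes image coordinates with y growing downwards, so v is
    negated to recover conventional angles.
    """
    modulus = np.hypot(u, v)
    angle = np.arctan2(-v, u)                # radians in (-pi, pi]
    # Each cardinal covers a 45-degree sector centred on its axis
    sector = np.round(angle / (np.pi / 4)).astype(int) % 8
    return modulus, sector  # sector i corresponds to CARDINALS[i]
```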
Using the cardinals and moduli obtained, it is possible to estimate approximately which camera motion is taking place in every frame or shot. For this, only the values on the margins of each frame are taken into account. Before detecting the camera motion, weights must be applied to the cardinal values on the margins, so that directions which carry evidence about the camera motion still contribute to the N-S-E-W counts even though they do not belong to the purest pan and tilt motion types. By summing the weighted contribution of each cardinal, we obtain a percentage relative to the ideal motion type (that is, every pixel moving in the same direction), which expresses the "amount" of movement towards each N-S-E-W direction.
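The following sketch illustrates the idea with hypothetical weights, under which a diagonal direction contributes half as much as a pure axis direction (the actual weight values are not reproduced here):

```python
import numpy as np

# Hypothetical weights: each diagonal contributes to its two
# adjacent axis directions with half the weight of a pure axis vector.
WEIGHTS = {
    'N': {'N': 1.0, 'NE': 0.5, 'NW': 0.5},
    'S': {'S': 1.0, 'SE': 0.5, 'SW': 0.5},
    'E': {'E': 1.0, 'NE': 0.5, 'SE': 0.5},
    'W': {'W': 1.0, 'NW': 0.5, 'SW': 0.5},
}

def margin_percentages(sectors):
    """Weighted percentage of movement towards each N-S-E-W
    direction among the margin pixels, relative to the ideal case
    where every pixel moves the same way.

    `sectors` is an array of cardinal labels ('N', 'NE', ...)
    for the margin pixels of one frame.
    """
    labels, counts = np.unique(sectors, return_counts=True)
    hist = dict(zip(labels, counts))
    total = counts.sum()
    return {axis: 100.0 * sum(w * hist.get(card, 0)
                              for card, w in contrib.items()) / total
            for axis, contrib in WEIGHTS.items()}
```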
The most common shot type in the database is shot with a fixed camera. It is detected by thresholding the mean modulus over the margins of each frame, a criterion that should also capture frames that are fixed but show some movement on the margins because of the captured scene. If the mean modulus is below the threshold, the frame is considered fixed. Otherwise, we check for the presence of zoom. To do so, the margins are divided into 2 vertical and 2 horizontal regions, and the cardinal with the maximum percentage is obtained for each one. For each type of zoom we know the specific direction each margin should exhibit in theory, so we can compare the theoretical direction with the maximum weighted direction obtained before. When 3 or 4 of the margins' directions match the theoretical pixel motion of a given zoom type, the frame is considered to contain a zoom in or a zoom out, depending on which conditions are met. Requiring only 3 of the 4 conditions is not overly permissive, because having even 3 of them fulfilled is very unlikely unless a zoom is actually present.
If no zoom is detected, we evaluate whether a pan or tilt camera motion is present. This time, the maximum percentage is computed over all the margins together instead of per region, because the direction should be the same across the whole frame. If the maximum value exceeds the remaining cardinal percentages by more than one threshold, and the maximum value itself is higher than another threshold, the frame is labeled as pan right/left or tilt up/down depending on the cardinal attaining the maximum. This discards maxima in non-predominant directions. Finally, if none of these conditions is fulfilled, the frame is assigned a non-specific motion.
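Putting the three tests together, a frame-level classifier could look like the following sketch, where the expected zoom directions and all thresholds are illustrative assumptions:

```python
def classify_camera_motion(mean_margin_modulus, margin_dirs, percentages,
                           fixed_thr=0.5, dominance_thr=20.0, min_thr=40.0):
    """Heuristic frame-level camera motion classifier (sketch).

    `margin_dirs` maps each margin region ('top', 'bottom',
    'left', 'right') to its dominant cardinal; `percentages`
    maps each N-S-E-W cardinal to its weighted percentage over
    all margins. All thresholds are illustrative.
    """
    # 1. Fixed camera: hardly any motion on the margins.
    if mean_margin_modulus < fixed_thr:
        return 'fixed'

    # 2. Zoom: margins should move outwards (zoom in) or inwards
    #    (zoom out); accept if at least 3 of the 4 margins agree.
    zoom_in = {'top': 'N', 'bottom': 'S', 'left': 'W', 'right': 'E'}
    zoom_out = {'top': 'S', 'bottom': 'N', 'left': 'E', 'right': 'W'}
    for label, expected in (('zoom in', zoom_in), ('zoom out', zoom_out)):
        if sum(margin_dirs[m] == expected[m] for m in expected) >= 3:
            return label

    # 3. Pan/tilt: one clearly dominant direction over all margins.
    best = max(percentages, key=percentages.get)
    others = [p for c, p in percentages.items() if c != best]
    if (percentages[best] > min_thr
            and all(percentages[best] - p > dominance_thr for p in others)):
        return {'E': 'pan right', 'W': 'pan left',
                'N': 'tilt up', 'S': 'tilt down'}[best]

    return 'non-specific'
```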
Once the frame-level camera motion is determined, we detect the shots of each video by computing the Sum of Absolute Differences (SAD) of the gray-intensity pixels between consecutive frames, together with its first and second derivatives. A shot change is detected whenever the second derivative exceeds a chosen threshold. After shot detection, the mode of the camera motion types is obtained for each shot, together with the percentage of frames in the shot that share it. When this percentage exceeds a threshold, the shot is considered to have a predominant motion type, namely the one most present among its frames.
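A compact sketch of this detector follows; the threshold is illustrative and depends on frame size and intensity range:

```python
import numpy as np

def detect_shot_changes(frames, threshold=1e4):
    """Detect candidate shot boundaries from gray-scale frames.

    Computes the SAD between consecutive frames and flags a shot
    change where the second derivative of the SAD signal exceeds
    `threshold` (illustrative value).
    """
    sad = np.array([np.abs(a.astype(int) - b.astype(int)).sum()
                    for a, b in zip(frames[:-1], frames[1:])])
    d2 = np.diff(sad, n=2)   # discrete second derivative of SAD
    # +2 realigns the second-difference index with the SAD sample
    # where the spike occurs, i.e. the frame transition index
    return np.flatnonzero(d2 > threshold) + 2
```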
At this point the computed data are not yet at video level, so we need to reduce them to single values that represent each video. To obtain statistical parameters at video level, it is not feasible to accumulate every single angle and modulus value in a matrix, as that would be highly memory- and time-consuming; instead, the computation is done sequentially, keeping per-frame aggregates that suffice to compute the statistical video descriptors. Since angles have a circular nature, circular statistics are mandatory. For this purpose we store only the sum of each flow vector component over the whole set of frames, together with the sum of moduli. We also keep the number of pixels involved in each operation, since it is not constant due to the low-texture filtering. These values are all that is needed to compute the mean and standard deviation at video level. However, data such as the camera motion type across frames and shots do not admit means and standard deviations, so for them we compute the percentage of each motion type at shot and frame level.
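One common way to realize this sequential scheme is to accumulate unit-vector components for the circular statistics, as in the sketch below (a standard formulation; the thesis may sum the raw flow components instead):

```python
import numpy as np

class CircularAccumulator:
    """Sequentially accumulates flow statistics so that the
    video-level circular mean and standard deviation of the
    angles (and the mean modulus) can be computed without
    storing per-pixel values."""

    def __init__(self):
        self.sum_cos = 0.0   # sum of unit-vector x components
        self.sum_sin = 0.0   # sum of unit-vector y components
        self.sum_mod = 0.0   # sum of moduli
        self.count = 0       # pixels kept after texture filtering

    def add_frame(self, angles, moduli):
        self.sum_cos += np.cos(angles).sum()
        self.sum_sin += np.sin(angles).sum()
        self.sum_mod += moduli.sum()
        self.count += angles.size

    def video_descriptors(self):
        # Mean resultant length R in [0, 1] measures angular spread
        R = np.hypot(self.sum_cos, self.sum_sin) / self.count
        circ_mean = np.arctan2(self.sum_sin, self.sum_cos)
        circ_std = np.sqrt(-2.0 * np.log(R))  # standard circular std
        mean_mod = self.sum_mod / self.count
        return circ_mean, circ_std, mean_mod
```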
Once the data handling is done, we extract 27 different descriptors, which are evaluated using the three labeling methods previously described. Using the machine learning algorithms provided by Weka, several feature sets and classifiers are tested; the best performance is achieved with quantity labeling and the angle- and modulus-related features, reaching 60% accuracy with the SimpleCart tree classifier. Although descriptor performance is not remarkably good in general, Weka's Experimenter tool lets us find out which combinations of features and classifiers provide a statistically significant improvement over the ZeroR classifier. We observe that the accuracy with combination labeling is lower than with quality or quantity labeling. This is expected, since combination labeling involves four classes rather than a binary split; it does not mean that combination labeling works worse than the binary ones, because its improvement over ZeroR can still be larger. In fact, combination labeling is more informative: it yields a statistically significant performance when choosing the angle- and modulus-related features with the SimpleCart and SimpleLogistic classifiers, which means its accuracy is not due to chance. We also obtain significant results with quantity labeling and the same feature set with SimpleCart.
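Although the thesis runs these experiments in Weka, they can be mirrored with scikit-learn for illustration: a CART decision tree stands in for SimpleCart and a majority-class dummy stands in for ZeroR (X and y are assumed to hold the 27 descriptors and one of the labelings):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_to_baseline(X, y, folds=10):
    """Cross-validated accuracy of a CART-style tree versus a
    ZeroR-style majority-class baseline."""
    tree = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds)
    zeror = cross_val_score(DummyClassifier(strategy='most_frequent'),
                            X, y, cv=folds)
    return tree.mean(), zeror.mean()
```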
These results lead to the conclusion that camera motion is not particularly relevant for assessing aesthetics on this database. This contradicts what one might expect, since camera motion is typically used to add drama in an audiovisual context. One explanation could be that fixed and hand-held camera shots are noticeably common in the database, so camera motion does not really influence whether users like a video or not. In addition, it is well known that establishing a ground truth for people's preferences is not trivial because of their subjectivity, and this could affect the results. The lack of a database with camera motion labels is also crucial, as it makes it difficult to know whether the non-manually-labeled videos behave correctly when the camera motion detection method is applied.
In this project we verify that theoretical knowledge does not always match what is observed in practice, while also providing a simple approach to extract and analyze descriptors. This could be improved in the future by labeling the database with respect to camera motion and by segmenting the background in order to improve steady camera detection. The binary aesthetic labeling could also be improved through supervised annotation obtained by measuring involuntary physiological responses of the evaluator.