Video streaming has grown noticeably over the last decade, making the task of searching for and recommending videos increasingly difficult. Whereas information retrieval for video streaming used to rely solely on text and metadata, content-based image and video retrieval is now an active research area. To add value to users' searches, it is interesting to assess the quality and aesthetic value of the retrieved information.
In this thesis we extract several motion-related descriptors in order to aesthetically assess a database of car commercials. The videos in the database are retrieved from YouTube and labeled according to the metadata provided by the website. Specifically, three kinds of labeling are used: quality (likes/dislikes), quantity (number of views), and the combination of both. Quality and quantity each provide a binary labeling, while the combination clusters the videos into four classes.
As is usual in computer vision, the main objective is to propose a set of descriptors and to design and provide the procedures for computing their values over the corpus of videos. These descriptor values are obtained by processing the frames and reducing the resulting data to specific numbers. With them it may be possible to determine whether they carry enough information to predict the aesthetic appeal of the videos. In this project we focus on motion descriptors.
As an approach to capturing motion data, the optical flow is estimated between each pair of frames using a Matlab-friendly C++ implementation. The algorithm is based on the brightness constancy assumption between two frames, which leads to a continuous spatio-temporal objective function. This function is discretized and linearized, and the temporal dimension is removed by restricting it to two frames. The minimizer is then found by setting the gradient to zero using Iteratively Reweighted Least Squares (IRLS), a method that repeatedly recomputes weights until the zero-gradient condition is fulfilled. Each IRLS iteration yields a linear system, which is solved with the Successive Over-Relaxation (SOR) method, a variant of Gauss-Seidel with faster convergence.
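The SOR step can be sketched as follows for a generic dense system (the actual implementation works on the sparse systems arising from the discretized flow energy):

```python
import numpy as np

def sor_solve(A, b, omega=1.8, tol=1e-6, max_iter=1000):
    """Solve A x = b with Successive Over-Relaxation.

    omega in (1, 2) over-relaxes the Gauss-Seidel updates, which
    typically speeds up convergence for the diffusion-like
    systems arising in variational optical flow.
    """
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(A.shape[0]):
            # Gauss-Seidel update: x[:i] already holds this sweep's
            # fresh values, x[i+1:] still holds the previous sweep's
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            gs = (b[i] - sigma) / A[i, i]
            # Over-relax: blend the old value with the update
            x[i] = (1 - omega) * x_old[i] + omega * gs
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```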
The optical flow algorithm requires several parameters to be set. Since tuning them automatically is difficult, their values are chosen by observing how well and how efficiently the resulting flow represents the observed motion. Once the optical flow is calculated, we filter out regions with homogeneous texture, since similar pixel values in a neighborhood can induce estimation errors. To determine the texture level of different frame regions, we measure the entropy of each one, which provides a measure of pixel randomness. This is done by converting each frame to gray-scale and dividing it into 60 windows.
Afterwards, a threshold is set to decide which regions are considered low-texture. The threshold is chosen with care, since filtering too aggressively could make the extracted descriptors unrepresentative. In cases with many very homogeneous regions (e.g., completely black frames), however, a large number of vectors will be discarded no matter what threshold is set. Whenever a region's entropy falls below the threshold, the region is considered low-texture and, as a consequence, its optical flow vectors are not taken into consideration.
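A minimal sketch of this filtering, assuming a 6 x 10 window grid (the thesis only fixes the total of 60 windows) and an illustrative entropy threshold:

```python
import numpy as np

def low_texture_mask(gray, n_rows=6, n_cols=10, threshold=3.0):
    """Flag low-texture windows of a gray-scale frame via entropy.

    The frame is split into n_rows * n_cols windows (60 in total);
    windows whose gray-level entropy falls below `threshold`
    (an illustrative value) are flagged so that their optical
    flow vectors can be discarded.
    """
    h, w = gray.shape
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    for r in range(n_rows):
        for c in range(n_cols):
            win = gray[r * h // n_rows:(r + 1) * h // n_rows,
                       c * w // n_cols:(c + 1) * w // n_cols]
            hist, _ = np.histogram(win, bins=256, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]                       # avoid log(0)
            entropy = -(p * np.log2(p)).sum()  # Shannon entropy in bits
            mask[r, c] = entropy < threshold
    return mask
```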
After the texture-based filtering, the first step is to compute the angle and modulus of the estimated motion at every pixel from the flow components. To make directions easier to interpret when computing the different descriptors, the angles are quantized to the 8 cardinal points.
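A possible per-pixel conversion is sketched below; the sign convention for the vertical flow component is an assumption, not necessarily the one used in the thesis:

```python
import numpy as np

CARDINALS = ['E', 'NE', 'N', 'NW', 'W', 'SW', 'S', 'SE']

def flow_to_cardinals(u, v):
    """Convert flow components (u, v) to a per-pixel modulus and
    one of 8 cardinal directions.

    Assumes image coordinates with y growing downwards, so v is
    negated to recover conventional angles.
    """
    modulus = np.hypot(u, v)
    angle = np.arctan2(-v, u)                # radians in (-pi, pi]
    # Each cardinal covers a 45-degree sector centred on its axis
    sector = np.round(angle / (np.pi / 4)).astype(int) % 8
    return modulus, sector  # sector i corresponds to CARDINALS[i]
```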
Using the cardinals and moduli obtained, it is possible to estimate approximately which camera motion is taking place in every frame or shot. For this, only the values on the margins of each frame are taken into account. Before detecting the camera motion, weights must be applied to the cardinal values on the margins, so that directions which carry evidence about the camera motion still contribute to the N-S-E-W counts even though they do not belong to the purest pan and tilt motion types. By summing the weighted contribution of each cardinal, we obtain a percentage relative to the ideal motion type (that is, every pixel moving in the same direction), which expresses the "amount" of movement towards each N-S-E-W direction.
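The following sketch illustrates the idea with hypothetical weights, under which a diagonal direction contributes half as much as a pure axis direction (the actual weight values are not reproduced here):

```python
import numpy as np

# Hypothetical weights: each diagonal contributes to its two
# adjacent axis directions with half the weight of a pure axis vector.
WEIGHTS = {
    'N': {'N': 1.0, 'NE': 0.5, 'NW': 0.5},
    'S': {'S': 1.0, 'SE': 0.5, 'SW': 0.5},
    'E': {'E': 1.0, 'NE': 0.5, 'SE': 0.5},
    'W': {'W': 1.0, 'NW': 0.5, 'SW': 0.5},
}

def margin_percentages(sectors):
    """Weighted percentage of movement towards each N-S-E-W
    direction among the margin pixels, relative to the ideal case
    where every pixel moves the same way.

    `sectors` is an array of cardinal labels ('N', 'NE', ...)
    for the margin pixels of one frame.
    """
    labels, counts = np.unique(sectors, return_counts=True)
    hist = dict(zip(labels, counts))
    total = counts.sum()
    return {axis: 100.0 * sum(w * hist.get(card, 0)
                              for card, w in contrib.items()) / total
            for axis, contrib in WEIGHTS.items()}
```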
The most common shot type in the database is shot with a fixed camera. It is detected by thresholding the mean modulus over the margins of each frame, a criterion that should also capture frames that are fixed but show some movement on the margins because of the captured scene. If the mean modulus is below the threshold, the frame is considered fixed. Otherwise, we check for the presence of zoom. To do so, the margins are divided into 2 vertical and 2 horizontal regions, and the cardinal with the maximum percentage is obtained for each one. For each type of zoom we know the specific direction each margin should exhibit in theory, so we can compare the theoretical direction with the maximum weighted direction obtained before. When 3 or 4 of the margins' directions match the theoretical pixel motion of a given zoom type, the frame is considered to contain a zoom in or a zoom out, depending on which conditions are met. Requiring only 3 of the 4 conditions is not overly permissive, because having even 3 of them fulfilled is very unlikely unless a zoom is actually present.
If no zoom is detected, we evaluate whether a pan or tilt camera motion is present. This time, the maximum percentage is computed over all the margins together instead of per region, because the direction should be the same across the whole frame. If the maximum value exceeds the remaining cardinal percentages by more than one threshold, and the maximum value itself is higher than another threshold, the frame is labeled as pan right/left or tilt up/down depending on the cardinal attaining the maximum. This discards maxima in non-predominant directions. Finally, if none of these conditions is fulfilled, the frame is assigned a non-specific motion.
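Putting the three tests together, a frame-level classifier could look like the following sketch, where the expected zoom directions and all thresholds are illustrative assumptions:

```python
def classify_camera_motion(mean_margin_modulus, margin_dirs, percentages,
                           fixed_thr=0.5, dominance_thr=20.0, min_thr=40.0):
    """Heuristic frame-level camera motion classifier (sketch).

    `margin_dirs` maps each margin region ('top', 'bottom',
    'left', 'right') to its dominant cardinal; `percentages`
    maps each N-S-E-W cardinal to its weighted percentage over
    all margins. All thresholds are illustrative.
    """
    # 1. Fixed camera: hardly any motion on the margins.
    if mean_margin_modulus < fixed_thr:
        return 'fixed'

    # 2. Zoom: margins should move outwards (zoom in) or inwards
    #    (zoom out); accept if at least 3 of the 4 margins agree.
    zoom_in = {'top': 'N', 'bottom': 'S', 'left': 'W', 'right': 'E'}
    zoom_out = {'top': 'S', 'bottom': 'N', 'left': 'E', 'right': 'W'}
    for label, expected in (('zoom in', zoom_in), ('zoom out', zoom_out)):
        if sum(margin_dirs[m] == expected[m] for m in expected) >= 3:
            return label

    # 3. Pan/tilt: one clearly dominant direction over all margins.
    best = max(percentages, key=percentages.get)
    others = [p for c, p in percentages.items() if c != best]
    if (percentages[best] > min_thr
            and all(percentages[best] - p > dominance_thr for p in others)):
        return {'E': 'pan right', 'W': 'pan left',
                'N': 'tilt up', 'S': 'tilt down'}[best]

    return 'non-specific'
```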
Once the frame-level camera motion is determined, we detect the shots of each video by computing the Sum of Absolute Differences (SAD) of the gray-intensity pixels between consecutive frames, together with its first and second derivatives. A shot change is detected whenever the second derivative exceeds a chosen threshold. After shot detection, the mode of the camera motion types is obtained for each shot, together with the percentage of frames in the shot that share it. When this percentage exceeds a threshold, the shot is considered to have a predominant motion type, namely the one most present among its frames.
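A compact sketch of this detector follows; the threshold is illustrative and depends on frame size and intensity range:

```python
import numpy as np

def detect_shot_changes(frames, threshold=1e4):
    """Detect candidate shot boundaries from gray-scale frames.

    Computes the SAD between consecutive frames and flags a shot
    change where the second derivative of the SAD signal exceeds
    `threshold` (illustrative value).
    """
    sad = np.array([np.abs(a.astype(int) - b.astype(int)).sum()
                    for a, b in zip(frames[:-1], frames[1:])])
    d2 = np.diff(sad, n=2)   # discrete second derivative of SAD
    # +2 realigns the second-difference index with the SAD sample
    # where the spike occurs, i.e. the frame transition index
    return np.flatnonzero(d2 > threshold) + 2
```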
At this point the computed data are not yet at video level, so we need to reduce them to single values that represent each video. To obtain statistical parameters at video level, it is not feasible to accumulate every single angle and modulus value in a matrix, as that would be highly memory- and time-consuming; instead, the computation is done sequentially, keeping per-frame aggregates that suffice to compute the statistical video descriptors. Since angles have a circular nature, circular statistics are mandatory. For this purpose we store only the sum of each flow vector component over the whole set of frames, together with the sum of moduli. We also keep the number of pixels involved in each operation, since it is not constant due to the low-texture filtering. These values are all that is needed to compute the mean and standard deviation at video level. However, data such as the camera motion type across frames and shots do not admit means and standard deviations, so for them we compute the percentage of each motion type at shot and frame level.
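One common way to realize this sequential scheme is to accumulate unit-vector components for the circular statistics, as in the sketch below (a standard formulation; the thesis may sum the raw flow components instead):

```python
import numpy as np

class CircularAccumulator:
    """Sequentially accumulates flow statistics so that the
    video-level circular mean and standard deviation of the
    angles (and the mean modulus) can be computed without
    storing per-pixel values."""

    def __init__(self):
        self.sum_cos = 0.0   # sum of unit-vector x components
        self.sum_sin = 0.0   # sum of unit-vector y components
        self.sum_mod = 0.0   # sum of moduli
        self.count = 0       # pixels kept after texture filtering

    def add_frame(self, angles, moduli):
        self.sum_cos += np.cos(angles).sum()
        self.sum_sin += np.sin(angles).sum()
        self.sum_mod += moduli.sum()
        self.count += angles.size

    def video_descriptors(self):
        # Mean resultant length R in [0, 1] measures angular spread
        R = np.hypot(self.sum_cos, self.sum_sin) / self.count
        circ_mean = np.arctan2(self.sum_sin, self.sum_cos)
        circ_std = np.sqrt(-2.0 * np.log(R))  # standard circular std
        mean_mod = self.sum_mod / self.count
        return circ_mean, circ_std, mean_mod
```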
Once the data handling is done, we extract 27 different descriptors, which are evaluated using the three labeling methods previously described. Using the machine learning algorithms provided by Weka, several feature sets and classifiers are tested; the best performance is achieved with quantity labeling and the angle- and modulus-related features, reaching 60% accuracy with the SimpleCart tree classifier. Although descriptor performance is not remarkably good in general, Weka's Experimenter tool lets us find out which combinations of features and classifiers provide a statistically significant improvement over the ZeroR classifier. We observe that the accuracy with combination labeling is lower than with quality or quantity labeling. This is expected, since combination labeling involves four classes rather than a binary split; it does not mean that combination labeling works worse than the binary ones, because its improvement over ZeroR can still be larger. In fact, combination labeling is more informative: it yields a statistically significant performance when choosing the angle- and modulus-related features with the SimpleCart and SimpleLogistic classifiers, which means its accuracy is not due to chance. We also obtain significant results with quantity labeling and the same feature set with SimpleCart.
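Although the thesis runs these experiments in Weka, they can be mirrored with scikit-learn for illustration: a CART decision tree stands in for SimpleCart and a majority-class dummy stands in for ZeroR (X and y are assumed to hold the 27 descriptors and one of the labelings):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_to_baseline(X, y, folds=10):
    """Cross-validated accuracy of a CART-style tree versus a
    ZeroR-style majority-class baseline."""
    tree = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds)
    zeror = cross_val_score(DummyClassifier(strategy='most_frequent'),
                            X, y, cv=folds)
    return tree.mean(), zeror.mean()
```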
These results lead to the conclusion that camera motion is not particularly relevant for assessing aesthetics on this database. This contradicts what one might expect, since camera motion is typically used to add drama in an audiovisual context. One explanation could be that fixed and hand-held camera shots are noticeably common in the database, so camera motion does not really influence whether users like a video or not. In addition, it is well known that establishing a ground truth for people's preferences is not trivial because of their subjectivity, and this could affect the results. The lack of a database with camera motion labels is also crucial, as it makes it difficult to know whether the non-manually-labeled videos behave correctly when the camera motion detection method is applied.
In this project we verify that theoretical knowledge does not always match what is observed in practice, while also providing a simple approach to extract and analyze descriptors. This could be improved in the future by labeling the database with respect to camera motion and by segmenting the background in order to improve steady camera detection. The binary aesthetic labeling could also be improved through supervised annotation obtained by measuring involuntary physiological responses of the evaluator.