34 research outputs found

    Learning Explainable User Sentiment and Preferences for Information Filtering

    In the last decade, online social networks have enabled people to interact in many ways with each other and with content. The digital traces of such actions reveal people's preferences towards online content such as news or products. These traces often result from interactions such as sharing or liking, but also from interactions in natural language. The continuous growth of the amount of content and of digital traces has led to information overload: surrounded by large volumes of information, people are facing difficulties when searching for information relevant to their interests. To improve user experience, information systems must be able to assist users in achieving their search goals, effectively and efficiently. This thesis is concerned with two important challenges that information systems need to address in order to significantly improve search experience and overcome information overload. First, these systems need to model accurately the variety of user traces, and second, they need to meaningfully explain search results and recommendations to users. To address these challenges, this thesis proposes novel methods based on machine learning to model user sentiment and preferences for information filtering systems, which are effective, scalable, and easily interpretable by humans. We focus on two prominent types of user traces in social networks: on the one hand, user comments accompanied by unary preferences such as likes, and on the other hand, user reviews accompanied by numerical preferences such as star ratings. In both cases, we advocate that by better understanding user text through mining its semantics and modeling its structure, we can not only improve information filtering, but also explain predictions to users. 
Within this context, we aim to answer three main research questions, namely: (i) how do item semantics help to predict unary preferences; (ii) how do sentiments of free-form user texts help to predict unary preferences; and (iii) how to model fine-grained numerical preferences from user review texts. Our goal is to model and extract from user text the knowledge required to answer these questions, and to obtain insights on how to design better information filtering systems that are more effective and improve user experience. To answer the first question, we formulate the recommendation problem based on unary preferences as a top-N retrieval task and we define an appropriate dataset and metrics for measuring performance. Then, we propose and evaluate several content-based methods based on semantic similarities in the presence or absence of preferences. To answer the second question, we propose a sentiment-aware neighborhood model which integrates the sentiment of user comments with unary preferences, either through fixed or through learned mapping functions. For the latter type, we propose a learning algorithm which adapts the sentiment of user comments to unary preferences at collective or individual levels. To answer the third question, we cast the problem of modeling user attitude toward aspects of items as a weakly supervised problem, and we propose a weighted multiple-instance learning method for solving it. Lastly, we show that the learned saliency weights, apart from being easily interpretable, are useful indicators for review segmentation and summarization.
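The weighted multiple-instance idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that a review is a bag of sentences, that each sentence has a predicted score, and that the review-level rating is a saliency-weighted average of those scores. The softmax normalization, the function names, and the numbers are illustrative assumptions, not the thesis's actual model.

```python
import math


def softmax(values):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_bag(instance_scores, saliency_logits):
    """Aggregate per-sentence scores into one review-level rating.

    Assumption: the rating is the saliency-weighted average of the
    sentence scores (a common weighted-MIL formulation).
    """
    weights = softmax(saliency_logits)
    rating = sum(w * s for w, s in zip(weights, instance_scores))
    return rating, weights


# Hypothetical review with three sentences (made-up scores and logits).
sentence_scores = [4.5, 2.0, 4.0]
saliency_logits = [2.0, 0.0, 1.0]
rating, weights = predict_bag(sentence_scores, saliency_logits)
print(round(rating, 3), [round(w, 3) for w in weights])
```

Because the weights sum to one, they can be read directly as the relative contribution of each sentence, which is the sense in which such weights are interpretable and usable for segmentation or summarization.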

    DISCRIMINATIVE LEARNING AND RECOGNITION USING DICTIONARIES

    In recent years, the theory of sparse representation has emerged as a powerful tool for efficient processing of data in non-traditional ways. This is mainly due to the fact that most signals and images of interest tend to be sparse or compressible in some dictionary. In other words, they can be well approximated by a linear combination of a few elements (also known as atoms) of a dictionary. This dictionary can either be an analytic dictionary composed of wavelets or Fourier basis or it can be directly trained from data. It has been observed that dictionaries learned directly from data provide better representation and hence can improve the performance of many practical applications such as restoration and classification. In this dissertation, we study dictionary learning and recognition under supervised, unsupervised, and semi-supervised settings. In the supervised case, we propose an approach to recognize humans in unconstrained videos, where the main challenge is exploiting the identity information in multiple frames and the accompanying dynamic signature. These identity cues include face, body, and motion. Our approach is based on video-dictionaries for face and body. We design video-dictionaries to implicitly encode temporal, pose, and illumination information. Next, we propose a novel multivariate sparse representation method that jointly represents all the video data by a sparse linear combination of training data. To increase the ability of our algorithm to learn nonlinearities, we apply kernel methods to learn the dictionaries. Next, we address the problem of matching faces across changes in pose in unconstrained videos. Our approach consists of two methods based on 3D rotation and sparse representation that compensate for changes in pose. We demonstrate the superior performance of our approach over several state-of-the-art algorithms through extensive experiments on unconstrained video datasets. 
In the unsupervised case, we present an approach that simultaneously clusters images and learns dictionaries from the clusters. The method learns dictionaries in the Radon transform domain. The main feature of the proposed approach is that it provides in-plane rotation and scale invariant clustering, which is useful in many applications such as Content Based Image Retrieval (CBIR). We demonstrate through experiments that the proposed rotation and scale invariant clustering provides not only good retrieval performance but also substantial improvements and robustness compared to traditional Gabor-based and several state-of-the-art shape-based methods. We then extend the dictionary learning problem to a generalized semi-supervised formulation, where each training sample is provided with a set of possible labels and only one label among them is the true one. Such applications can be found in image and video collections where one often has only partially labeled data. For instance, given an image with multiple faces and a caption specifying the names, we can be sure that each of the faces belongs to one of the names specified, while the exact identity of each face is not known. Labeling involves a significant amount of human effort and is expensive. This has motivated researchers to develop learning algorithms from partially labeled training data. In this work, we develop dictionary learning algorithms that utilize such partially labeled data. The proposed method aims to solve the problem of ambiguously labeled multiclass-classification using an iterative algorithm. The dictionaries are updated using either soft (EM-based) or hard decision rules. Extensive evaluations on existing datasets demonstrate that the proposed method performs significantly better than state-of-the-art approaches for learning from ambiguously labeled data. 
As sparsity plays a major role in our research, we further present a sparse representation-based approach to find the salient views of 3D objects. The salient views are categorized into two groups. The first are boundary representative views that have several visible sides and object surfaces that may be attractive to humans. The second are side representative views that best represent side views of the approximating convex shape. The side representative views are class-specific views and possess the most representative power compared to other within-class views. Using the concept of characteristic view class, we first present a sparse representation-based approach for estimating the boundary representative views. With the estimated boundaries, we determine the side representative views based on a minimum reconstruction error criterion. Furthermore, to evaluate our method, we introduce the notion of geometric dictionaries built from salient views for applications in 3D object recognition, retrieval and sparse-to-full reconstruction. Through a series of experiments on four publicly available 3D object datasets, we demonstrate the effectiveness of our approach over state-of-the-art algorithms and baseline methods.
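The abstract above rests on the idea that a signal can be approximated by a linear combination of a few dictionary atoms. A minimal sketch of that idea follows, using plain matching pursuit as a stand-in sparse coder. The choice of matching pursuit, the tiny hand-built dictionary, and all names are assumptions for illustration; the dissertation's own coding and dictionary learning algorithms are not specified here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, atoms, n_nonzero):
    """Greedy sparse coding: signal ~ sum_k coeffs[k] * atoms[k].

    Assumes every atom has unit Euclidean norm. Each step picks the atom
    most correlated with the residual and subtracts its contribution.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        correlations = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(correlations[k]))
        coeffs[best] += correlations[best]
        residual = [r - correlations[best] * a
                    for r, a in zip(residual, atoms[best])]
    return coeffs, residual


# Hypothetical overcomplete dictionary in R^3: standard basis plus a diagonal atom.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]

# Signal built from two atoms: 3 * atoms[3] + 2 * atoms[2].
signal = [3.0 * s, 3.0 * s, 2.0]
coeffs, residual = matching_pursuit(signal, atoms, n_nonzero=2)
print([round(c, 6) for c in coeffs])
```

In this example two atoms are enough to reconstruct the signal, whereas the standard basis alone would need three coefficients. That is the compressibility property the abstract refers to.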

    Domain Transfer Learning for Object and Action Recognition

    Visual recognition has always been a fundamental problem in computer vision. Its task is to learn visual categories using labeled training data and then identify unlabeled new instances of those categories. However, due to the large variations in visual data, visual recognition is still a challenging problem. Handling the variations in captured images is important for real-world applications where unconstrained data acquisition scenarios are widely prevalent. In this dissertation, we first address the variations between training and testing data. Particularly, for cross-domain object recognition, we propose a Grassmann manifold-based domain adaptation approach to model the domain shift using the geodesic connecting the source and target domains. We further measure the distance between two data points from different domains by integrating the distance of their projections through all the intermediate subspaces along the geodesic. Our proposed approach that exploits all the intermediate subspaces along the geodesic produces a more accurate metric. For cross-view action recognition, we present two effective approaches to learn transferable dictionaries and view-invariant sparse representations. In the first approach, we learn a set of transferable dictionaries where each dictionary corresponds to one camera view. The set of dictionaries is learned simultaneously from sets of correspondence videos taken at different views with the aim of encouraging each video in the set to have the same sparse representation. In the second approach, we relax this constraint by encouraging correspondence videos to have similar sparse representations. In addition, we learn a common dictionary that is incoherent to view-specific dictionaries for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. 
In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. In order to handle the more general variations in captured images, we also exploit the semantic information to learn discriminative feature representations for visual recognition. Class labels are often organized in a hierarchical taxonomy based on their semantic meanings. We propose a novel multi-layer hierarchical dictionary learning framework for region tagging. Specifically, we learn a node-specific dictionary for each semantic label in the taxonomy and preserve the hierarchical semantic structure in the relationship among these node-dictionaries. Our approach can also transfer knowledge from semantic labels at higher levels to help learn the classifiers for semantic labels at lower levels. Moreover, we exploit the semantic attributes for boosting the performance of visual recognition. We encode objects or actions based on attributes that describe them as high-level concepts. We consider two types of attributes. One type of attributes is generated by humans, while the second type is data-driven attributes extracted from data using dictionary learning methods. Attribute-based representation may exhibit variations due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented and its solution is guaranteed to be at least a (1-1/e)-approximation to the optimum.
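The greedy selection with a (1-1/e) guarantee mentioned above can be sketched as follows. For a monotone submodular objective under a cardinality constraint, repeatedly adding the element with the largest marginal gain is the standard greedy scheme with that guarantee. The objective used here is a simple set-coverage function chosen for illustration; it is an assumption and not one of the three criteria proposed in the dissertation. The attribute names and sets are made up.

```python
def coverage(selected, covers):
    """Monotone submodular objective: number of distinct items covered."""
    covered = set()
    for name in selected:
        covered |= covers[name]
    return len(covered)


def greedy_select(covers, budget):
    """Pick up to `budget` attributes, each time maximizing marginal gain."""
    selected = []
    for _ in range(budget):
        current = coverage(selected, covers)
        best_name, best_gain = None, 0
        for name in sorted(covers):  # sorted for deterministic tie-breaking
            if name in selected:
                continue
            gain = coverage(selected + [name], covers) - current
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining attribute adds anything
            break
        selected.append(best_name)
    return selected


# Hypothetical attributes, each "covering" the ids of samples it helps separate.
covers = {
    "striped": {1, 2, 3, 4},
    "furry": {3, 4, 5},
    "four_legged": {5, 6},
    "has_wheels": {1, 2},
}
chosen = greedy_select(covers, budget=2)
print(chosen, coverage(chosen, covers))
```

Note that the second pick is the attribute adding the most new coverage given the first pick, not the one that covers the most on its own. This diminishing-returns behavior is what the approximation guarantee relies on.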

    Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions

    This dissertation addresses the problem of learning video representations, which is defined here as transforming the video so that its essential structure is made more visible or accessible for action recognition and quantification. In the literature, a video can be represented by a set of images, by modeling motion or temporal dynamics, and by a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize and assess actions: (1) image-set models via aggregating subset features given by regularizing normalized CNNs, (2) image-set models via inter-frame principal recovery and sparsely coding residual actions, (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection with a motion model added, (4) spatiotemporal models, namely 3D graphs and 3D CNNs, that model time as a space dimension, and (5) supervised hashing by jointly learning embedding and quantization. State-of-the-art performance is achieved for tasks such as quantifying facial pain and human diving. 
Primary conclusions of this dissertation are categorized as follows: (i) Image sets can capture facial actions that are about collective representation; (ii) Sparse and low-rank representations can have the expression, identity and pose cues untangled and can be learned via an image-set model and also a linear model; (iii) Norm is related to recognizability; similarity metrics and loss functions matter; (iv) Combining the MIL-based boosting tracker with the Particle Filter motion model induces a good trade-off between appearance similarity and motion consistency; (v) Segmenting an object locally makes it amenable to assigning shape priors; it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (vi) Representing videos as 3D graphs works locally in both space and time; 3D CNNs work effectively when given temporally meaningful clips; (vii) Rich labeled images or videos help to learn better hash functions after learning binary embedded codes than random projections do. In addition, the models proposed for videos can be adapted to other sequential images, such as volumetric medical images, which are not included in this dissertation.

    Deliverable D1.1 State of the art and requirements analysis for hypervideo

    This deliverable presents a state-of-the-art and requirements analysis report for hypervideo, authored as part of WP1 of the LinkedTV project. Initially, we present some use-case (viewer) scenarios in the LinkedTV project, and through the analysis of the distinctive needs and demands of each scenario we point out the technical requirements from a user-side perspective. Subsequently, we study methods for the automatic and semi-automatic decomposition of the audiovisual content in order to effectively support the annotation process. Considering that the multimedia content comprises different types of information, i.e., visual, textual and audio, we report various methods for the analysis of these three different streams. Finally, we present various annotation tools which could integrate the developed analysis results so as to effectively support users (video producers) in the semi-automatic linking of hypervideo content, and based on them we report on the initial progress in building the LinkedTV annotation tool. For each of the different classes of techniques discussed in the deliverable, we present the evaluation results from the application of one such method from the literature to a dataset well-suited to the needs of the LinkedTV project, and we indicate the future technical requirements that should be addressed in order to achieve higher levels of performance (e.g., in terms of accuracy and time-efficiency), as necessary.

    Recognition of Human Actions in Video Footage (ビデオ映像に対する人間動作の認識)

    Our overall purpose in this dissertation is the automatic construction of a large-scale action database from Web data, which could support further exploration of action recognition. We conducted large-scale experiments on 100 human actions and 12 non-human actions and obtained promising results. This dissertation consists of six chapters. In the following, we briefly introduce the content of each chapter. In Chapter 1, recent approaches to action recognition, as well as the necessity of building a large-scale action database and its difficulties, are described. Then our work to solve the problem is concisely explained. In Chapter 2, the first work, which introduces a framework for automatically extracting relevant video shots of specific actions from Web videos, is described in detail. This framework first selects relevant videos among thousands of Web videos for a given action using tag co-occurrence and then divides the selected videos into video shots. Video shots are then ranked based on their visual linkage. The top-ranked video shots are expected to be the shots most related to the action. Moreover, our method of adopting Web images for shot ranking is also introduced. Finally, large-scale experiments on 100 human actions and 12 non-human actions and their results are described. In Chapter 3, the second work, which aims to further improve the shot ranking of the above framework by proposing a novel ranking method, is introduced. Our proposed ranking method, called VisualTextualRank, is an extension of a conventional method, VisualRank, which is applied to shot ranking in Chapter 2. VisualTextualRank effectively employs both textual information and visual information extracted from the data. Our experimental results showed that using our method instead of the conventional ranking method yields more relevant shots. In Chapter 4, the third work, which aims to obtain more informative and representative features of videos, is described. 
Based on a conventional method of extracting spatio-temporal features, which was adopted in Chapter 2 and Chapter 3, we propose to extract spatio-temporal features with triangulation of dense SURF keypoints. Shape features of the triangles, along with visual features and motion features of their points, are taken into account to form our features. By applying our method of feature extraction to the framework introduced in Chapter 2, we show that more relevant video shots can be retrieved at the top. Furthermore, the effectiveness of our method is also validated on action classification for UCF-101 and UCF-50, which are well-known large-scale data sets. The experimental results demonstrate that our features are comparable and complementary to the state of the art. In Chapter 5, the final work, which focuses on recognition of hand-motion-based actions, is introduced. We propose a system of hand detection and tracking for unconstrained videos and extract hand-movement-based features from detected and tracked hand regions. These features are expected to help improve results for hand-motion-based actions. To evaluate the performance of our system on hand detection, we use the Video-Pose2.0 dataset, which is a challenging dataset with uncontrolled videos. To validate the effectiveness of our features, we conduct experiments on fine-grained action recognition with the "playing instruments" group in the UCF-101 data set. The experimental results show the efficiency of our system. In Chapter 6, our works are summarized with their major points and findings. We also consider the potential of applying the results obtained by our works to further research. The University of Electro-Communications, 201
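The shot ranking by "visual linkage" described in the abstract above builds on VisualRank, which applies a PageRank-style random walk to a visual similarity graph. A minimal sketch of that baseline follows, assuming a precomputed symmetric shot-to-shot similarity matrix. The matrix values, the damping factor, and the function name are illustrative assumptions. The textual extension (VisualTextualRank) is not reproduced here because the abstract does not specify it.

```python
def visual_rank(similarity, damping=0.85, iterations=100):
    """PageRank-style power iteration over a visual similarity matrix.

    similarity[i][j] is the (non-negative) visual similarity of shots i and j.
    Columns are normalized so each shot distributes its score to its neighbors.
    """
    n = len(similarity)
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1.0 - damping) / n
            + damping * sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            for i in range(n)
        ]
    return rank


# Hypothetical similarities: shots 0-2 resemble each other, shot 3 is an outlier.
similarity = [
    [0.00, 0.90, 0.80, 0.10],
    [0.90, 0.00, 0.85, 0.10],
    [0.80, 0.85, 0.00, 0.05],
    [0.10, 0.10, 0.05, 0.00],
]
scores = visual_rank(similarity)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order, [round(s, 3) for s in scores])
```

Shots that are visually similar to many other shots accumulate score, so an outlier shot that is weakly linked to the rest ends up at the bottom of the ranking.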