A matrix factorization framework for jointly analyzing multiple nonnegative data
Methods based on nonnegative matrix factorization provide one of the simplest and most effective approaches to text mining. However, their applicability is mainly limited to analyzing a single data source. In this paper, we propose a novel joint matrix factorization framework that analyzes multiple data sources together by exploiting their shared and individual structures. The proposed framework is flexible enough to handle arbitrary sharing configurations encountered in real-world data. We derive an efficient algorithm for learning the factorization and show that its convergence is theoretically guaranteed. We demonstrate the utility and effectiveness of the proposed framework in two real-world applications: improving social media retrieval using auxiliary sources, and cross-social media retrieval. Representing each social media source by its textual tags, we show for both applications that retrieval performance exceeds that of existing state-of-the-art techniques. The proposed solution provides a generic framework that is applicable in a wider data mining context, wherever one needs to exploit mutual and individual knowledge present across multiple data sources.
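The idea of shared and individual structure in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes just two sources whose matrices share the same rows (for example a common tag vocabulary), a squared-error objective, and Lee-Seung style multiplicative updates. The names `joint_nmf`, `joint_loss`, `k_shared` and `k_ind` are hypothetical.

```python
import numpy as np

def joint_nmf(X1, X2, k_shared, k_ind, n_iter=200, eps=1e-9, seed=0):
    """Approximate X1 ~ [Ws, U1] @ H1 and X2 ~ [Ws, U2] @ H2.

    Ws is a basis shared by both sources; U1 and U2 are source-specific.
    Assumption: X1 and X2 are nonnegative and have the same number of rows.
    """
    if X1.shape[0] != X2.shape[0]:
        raise ValueError("X1 and X2 must share the row (feature) dimension")
    rng = np.random.default_rng(seed)
    m = X1.shape[0]
    Ws = rng.random((m, k_shared))
    U1 = rng.random((m, k_ind))
    U2 = rng.random((m, k_ind))
    H1 = rng.random((k_shared + k_ind, X1.shape[1]))
    H2 = rng.random((k_shared + k_ind, X2.shape[1]))
    for _ in range(n_iter):
        # Update the coefficient matrices with the bases held fixed.
        W1, W2 = np.hstack([Ws, U1]), np.hstack([Ws, U2])
        H1 *= (W1.T @ X1) / (W1.T @ W1 @ H1 + eps)
        H2 *= (W2.T @ X2) / (W2.T @ W2 @ H2 + eps)
        # Update the shared basis using residual information from both sources.
        R1, R2 = W1 @ H1, W2 @ H2
        S1, S2 = H1[:k_shared], H2[:k_shared]
        Ws *= (X1 @ S1.T + X2 @ S2.T) / (R1 @ S1.T + R2 @ S2.T + eps)
        # Update each individual basis using its own source only.
        R1 = np.hstack([Ws, U1]) @ H1
        R2 = np.hstack([Ws, U2]) @ H2
        U1 *= (X1 @ H1[k_shared:].T) / (R1 @ H1[k_shared:].T + eps)
        U2 *= (X2 @ H2[k_shared:].T) / (R2 @ H2[k_shared:].T + eps)
    return Ws, U1, U2, H1, H2

def joint_loss(X1, X2, Ws, U1, U2, H1, H2):
    """Sum of squared reconstruction errors over both sources."""
    E1 = X1 - np.hstack([Ws, U1]) @ H1
    E2 = X2 - np.hstack([Ws, U2]) @ H2
    return float((E1 ** 2).sum() + (E2 ** 2).sum())
```

Tying the first `k_shared` basis columns across sources is what lets one source inform the other; the paper's framework generalizes this to arbitrary sharing configurations among more than two sources.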
Noisy multi-label semi-supervised dimensionality reduction
Noisy labeled data represent a rich source of information that is often
easily accessible and cheap to obtain, but label noise can also have many
negative consequences if not accounted for. How to fully utilize noisy labels
has been studied extensively within the framework of standard supervised
machine learning over a period of several decades. However, very little
research has been conducted on solving the challenge posed by noisy labels in
non-standard settings. This includes situations where only a fraction of the
samples are labeled (semi-supervised) and each high-dimensional sample is
associated with multiple labels. In this work, we present a novel
semi-supervised and multi-label dimensionality reduction method that
effectively utilizes information from both noisy multi-labels and unlabeled
data. With the proposed Noisy multi-label semi-supervised dimensionality
reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled
data are labeled simultaneously via a specially designed label propagation
algorithm. NMLSDR then learns a projection matrix for reducing the
dimensionality by maximizing the dependence between the enlarged and denoised
multi-label space and the features in the projected space. Extensive
experiments on synthetic data, benchmark datasets, as well as a real-world case
study, demonstrate the effectiveness of the proposed algorithm and show that it
outperforms state-of-the-art multi-label feature extraction algorithms.
Comment: 38 page
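To make the label propagation idea in the NMLSDR abstract concrete, here is a minimal sketch of generic graph-based label propagation in closed form (the classical local-and-global-consistency scheme). It is a stand-in, not the specially designed algorithm of the paper. The function name `propagate_labels`, the affinity matrix `W`, and the parameter `alpha` are assumptions for illustration.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9):
    """Spread (possibly noisy) multi-label scores over a similarity graph.

    W: (n, n) symmetric nonnegative affinity matrix with positive row sums.
    Y: (n, c) initial label matrix; rows of unlabeled samples are all zero.
    alpha: in (0, 1); larger values trust the graph more than the given labels.
    Returns an (n, c) matrix of soft label scores.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetrically normalized affinity: S = D^{-1/2} W D^{-1/2}.
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    n = W.shape[0]
    # Closed-form fixed point of F = alpha * S @ F + (1 - alpha) * Y.
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
```

Because every sample's score is a blend of its own label and its neighbors' scores, unlabeled samples receive labels and isolated noisy labels are damped by the surrounding consensus. Thresholding the returned scores would yield the enlarged label set that a later projection step could use.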
Harnessing Teamwork in Networks: Prediction, Optimization, and Explanation
Teams are increasingly indispensable to achievement in any organization. Despite organizations' substantial dependence on teams, fundamental knowledge about the conduct of team-enabled operations is lacking, especially at the social, cognitive, and information levels in relation to team performance and network dynamics. The goal of this dissertation is to create new instruments to predict, optimize, and explain teams' performance in the context of composite networks (i.e., social-cognitive-information networks).
Understanding the dynamic mechanisms that drive the success of high-performing teams can provide key insights into building the best teams and hence lift the productivity and profitability of organizations. For this purpose, novel predictive models have been developed to forecast the long-term performance of teams (point prediction) as well as the pathway to impact (trajectory prediction). A joint predictive model that explores the relationship between team-level and individual-level performance has also been proposed.
For an existing team, it is often desirable to optimize its performance by expanding the team with a new member who has certain expertise, or by finding a new candidate to replace an existing under-performing member. I have developed graph-kernel-based performance optimization algorithms that consider both structural matching and skill matching to address these enhancement scenarios. I have also worked towards real-time team optimization by leveraging reinforcement learning techniques.
With the increased complexity of the machine learning models for predicting and optimizing teams, it is critical to acquire a deeper understanding of model behavior. For this purpose, I have investigated explainable prediction, which provides the explanation behind a performance prediction, and explainable optimization, which gives reasons why the model's recommendations are good candidates for certain enhancement scenarios.
Dissertation/Thesis
Doctoral Dissertation Computer Science 201
Machine learning for improving the quality of citizen science data
Citizen science is a paradigm in which volunteers from the general public participate in scientific studies, often by performing data collection. This paradigm is especially useful if the scope of the study is too broad to be covered by a limited number of trained scientists. Although citizen scientists can contribute large quantities of data, data quality is often a concern due to variability in the skills of volunteers. In my thesis, I investigate applying machine learning techniques to improve the quality of data submitted to citizen science projects. The context of my work is eBird, which is one of the largest citizen science projects in existence. In the eBird project, citizen scientists act as a large global network of human sensors, recording observations of bird species and submitting these observations to a centralized database where they are used for ecological research such as species distribution modeling and reserve design. Machine learning can be used to improve data quality by modeling an observer's skill level, developing an automated data verification model, and discovering groups of misidentified species.