Advances in Spectral Learning with Applications to Text Analysis and Brain Imaging
Spectral learning algorithms are becoming increasingly popular in data-rich domains, driven in part by recent advances in large scale randomized SVD, and in spectral estimation of Hidden Markov Models. Extensions of these methods lead to statistical estimation algorithms which are not only fast, scalable, and useful on real data sets, but are also provably correct.
Following this line of research, we make two contributions. First, we
propose a set of spectral algorithms for text analysis and natural
language processing. In particular, we propose fast and scalable
spectral algorithms for learning word embeddings -- low dimensional
real vectors (called Eigenwords) that capture the “meaning” of words from their context. Second, we show how similar spectral methods can be applied to analyzing brain images.
State-of-the-art approaches to learning word embeddings are slow to
train or lack theoretical grounding; we propose three spectral
algorithms that overcome these limitations. All three algorithms
harness the multi-view nature of text data, i.e., the left and right
context of each word, and share three characteristics:
1. They are fast to train and scalable.
2. They have strong theoretical properties.
3. They can induce context-specific embeddings, i.e., different embeddings for “river bank” versus “Bank of America”.
They also have lower sample complexity and hence higher statistical
power for rare words. We provide theory that establishes
relationships between these algorithms and optimality criteria for the
estimates they provide. We also perform a thorough qualitative and
quantitative evaluation of Eigenwords and demonstrate their superior performance over state-of-the-art approaches.
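The multi-view idea can be illustrated with a minimal sketch: treat each word's left and right contexts as two views, whiten the context count matrices, and take a truncated SVD of their cross-covariance. Everything here is an illustrative assumption (toy corpus, dimension k, diagonal whitening); the actual Eigenwords algorithms use large corpora, CCA-style scaling, and randomized SVD.

```python
import numpy as np

# Toy corpus; the real algorithms operate on massive corpora.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Left-context and right-context count matrices (the two "views").
L = np.zeros((V, V))
R = np.zeros((V, V))
for t, w in enumerate(corpus):
    if t > 0:
        L[idx[w], idx[corpus[t - 1]]] += 1
    if t < len(corpus) - 1:
        R[idx[w], idx[corpus[t + 1]]] += 1

def whiten(M):
    # Crude diagonal whitening, a stand-in for full CCA scaling.
    return M / np.sqrt(M.sum(axis=1, keepdims=True) + 1e-8)

# Truncated SVD of the whitened cross-covariance between the views.
U, s, Vt = np.linalg.svd(whiten(L) @ whiten(R).T)
k = 3
embeddings = U[:, :k]  # one k-dimensional "eigenword" per vocabulary item
print(embeddings.shape)
```

Context-specific embeddings then follow by projecting an occurrence's actual left and right neighbors through the same learned maps.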
Next, we turn to the task of using spectral learning methods for brain imaging data.
Methods like Sparse Principal Component Analysis (SPCA), Non-negative Matrix Factorization (NMF), and Independent Component Analysis (ICA) have been used to obtain state-of-the-art accuracies in a variety of machine learning problems. However, their usage in brain imaging, though increasing, is limited: they are typically applied as out-of-the-box techniques and are seldom tailored to the domain-specific constraints and knowledge of medical imaging, which makes their results difficult to interpret.
In order to address the above shortcomings, we propose
Eigenanatomy (EANAT), a general framework for sparse matrix factorization. Its goal is to statistically learn the boundaries of
and connections between brain regions by weighing both the data and prior neuroanatomical knowledge.
Although EANAT incorporates some neuroanatomical prior knowledge in the form of connectedness and smoothness constraints, it can still be difficult for clinicians to interpret the results in specific domains where network-specific hypotheses exist. We thus extend EANAT and present a novel framework for prior-constrained sparse decomposition of matrices derived from brain imaging data, called Prior Based Eigenanatomy (p-Eigen). We formulate our solution as a prior-constrained l1-penalized (sparse) principal component analysis. Experimental evaluation confirms that p-Eigen extracts biologically relevant, patient-specific functional parcels and that it significantly aids classification of Mild Cognitive Impairment compared to state-of-the-art competing approaches.
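A minimal sketch of the core mechanism, l1-penalized rank-1 PCA with a per-voxel prior, is below. The data, the prior, and the penalty schedule are all hypothetical, and p-Eigen itself adds anatomical constraints not shown here; dividing the penalty by the prior simply makes prior-favoured voxels cheaper to keep.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 60))    # e.g. subjects x voxels (synthetic)
prior = np.ones(60)
prior[:10] = 3.0                 # hypothetical prior favouring one region

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Rank-1 l1-penalized PCA by alternating power iterations with
# soft-thresholding; the per-voxel penalty encodes the prior.
lam = 2.0 / prior
u = rng.normal(size=40)
u /= np.linalg.norm(u)
for _ in range(200):
    v = soft_threshold(X.T @ u, lam)
    nv = np.linalg.norm(v)
    if nv == 0:
        break
    v /= nv
    u = X @ v
    u /= np.linalg.norm(u)

print(int((v != 0).sum()), "of", X.shape[1], "voxels selected")
```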
Modeling Dynamic User Interests: A Neural Matrix Factorization Approach
In recent years, there has been significant interest in understanding users'
online content consumption patterns. But the unstructured, high-dimensional,
and dynamic nature of such data makes extracting valuable insights challenging.
Here we propose a model that combines the simplicity of matrix factorization
with the flexibility of neural networks to efficiently extract nonlinear
patterns from massive text data collections relevant to consumers' online
consumption patterns. Our model decomposes a user's content consumption journey
into nonlinear user and content factors that are used to model their dynamic
interests. This natural decomposition allows us to summarize each user's
content consumption journey with a dynamic probabilistic weighting over a set
of underlying content attributes. The model is fast to estimate, easy to
interpret and can harness external data sources as an empirical prior. These
advantages make our method well suited to the challenges posed by modern
datasets. We use our model to understand the dynamic news consumption interests
of Boston Globe readers over five years. Thorough qualitative studies,
including a crowdsourced evaluation, highlight our model's ability to
accurately identify nuanced and coherent consumption patterns. These results
are supported by our model's superior and robust predictive performance over
several competitive baseline methods.
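One ingredient of such a model, a probabilistic weighting over latent content attributes, can be sketched as a factorization whose user factors pass through a softmax. The data, dimensions, squared loss, and plain gradient descent here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, k = 20, 30, 4
# Hypothetical user-item consumption counts.
Y = rng.poisson(2.0, size=(n_users, n_items)).astype(float)

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Factorization in which each user gets a probabilistic weighting
# (softmax of the user factors) over k latent content attributes.
U = rng.normal(size=(n_users, k))
V = rng.normal(size=(k, n_items))
lr = 0.05
for _ in range(500):
    P = softmax(U)                  # rows sum to 1: interpretable weights
    R = P @ V - Y                   # residual under squared loss
    gV = P.T @ R / n_users
    gP = R @ V.T
    # Softmax Jacobian-vector product for the user-factor gradient.
    gU = P * (gP - (gP * P).sum(axis=1, keepdims=True))
    U -= lr * gU
    V -= lr * gV
print(float(np.abs(P.sum(axis=1) - 1).max()))
```

Because each user's weights lie on the simplex, a user's journey can be read off directly as a distribution over the learned attributes.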
Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
We propose MIC (Multiple Inclusion Criterion), a framework for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity: it selects a subset of the feature groups and also selects features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two-level sparsity: it selects a subset of the features and then decides which of the tasks each feature should be added to. Lastly, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed to select a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.
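The two-part coding idea can be sketched on plain (ungrouped) stepwise selection: a feature enters only if the bits it saves on the residuals exceed the bits needed to describe it. The coding scheme below (log p nats per feature index plus half log n per coefficient) is a simplified stand-in for MIC's actual codes, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[2, 7, 11]] = [2.0, -1.5, 1.0]     # three truly predictive features
y = X @ beta + rng.normal(size=n)

def description_length(residual_ss, k):
    # Two-part code: cost of the data given the model, plus roughly
    # (log p + 0.5 * log n) nats per selected feature.
    return 0.5 * n * np.log(residual_ss / n) + k * (np.log(p) + 0.5 * np.log(n))

selected = []
best_dl = description_length(((y - y.mean()) ** 2).sum(), 0)
improved = True
while improved:
    improved = False
    for j in set(range(p)) - set(selected):
        cols = selected + [j]
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = ((y - X[:, cols] @ coef) ** 2).sum()
        dl = description_length(rss, len(cols))
        if dl < best_dl:
            best_dl, best_j, improved = dl, j, True
    if improved:
        selected.append(best_j)
print(sorted(selected))
```

Selection stops automatically once no remaining feature pays for its own description, which is what gives MDL-style criteria their resistance to overfitting.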
Efficient Feature Selection in the Presence of Multiple Feature Classes
We present an information-theoretic approach to feature selection when the data possesses feature classes. Feature classes are pervasive in real data. For example, in gene expression data, the genes which serve as features may be divided into classes based on their membership in gene families or pathways. When doing word sense disambiguation or named entity extraction, features fall into classes including adjacent words, their parts of speech, and the topic and venue of the document the word is in. When predictive features occur predominantly in a small number of feature classes, our information-theoretic approach significantly improves feature selection. Experiments on real and synthetic data demonstrate substantial improvement in predictive accuracy over the standard L0 penalty-based stepwise and streamwise feature selection methods, as well as over Lasso and Elastic Nets, all of which are oblivious to the existence of feature classes.
A Risk Comparison of Ordinary Least Squares vs Ridge Regression
We compare the risk of ridge regression to a simple variant of ordinary least
squares, in which one simply projects the data onto a finite dimensional
subspace (as specified by a Principal Component Analysis) and then performs an
ordinary (un-regularized) least squares regression in this subspace. This note
shows that the risk of this ordinary least squares method is within a constant
factor (namely 4) of the risk of ridge regression.
Comment: Appearing in JMLR 14, June 201
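The two estimators being compared can be made concrete with a small simulation. The data, the subspace dimension k, and the penalty lam are arbitrary choices here; the note's factor-of-4 bound concerns risk in a fixed-design analysis, not any single draw.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 10, 4                 # k = dimension of the PCA subspace
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Ridge regression with penalty lam.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# PCA-projected OLS: project onto the top-k principal components,
# run unregularized least squares there, then map back.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k].T                         # p x k basis of the PCA subspace
gamma, *_ = np.linalg.lstsq(X @ P, y, rcond=None)
beta_pcr = P @ gamma

print(float(np.linalg.norm(X @ beta_ridge - y)),
      float(np.linalg.norm(X @ beta_pcr - y)))
```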
Unique in what sense? Heterogeneous relationships between multiple types of uniqueness and popularity in music
How does our society appreciate the uniqueness of cultural products? This
fundamental puzzle has intrigued scholars in many fields, including psychology,
sociology, anthropology, and marketing. It has been theorized that cultural
products that balance familiarity and novelty are more likely to become
popular. However, a cultural product's novelty is typically multifaceted. This
paper uses songs as a case study to study the multiple facets of uniqueness and
their relationship with success. We first unpack the multiple facets of a
song's novelty or uniqueness and, next, measure its impact on a song's
popularity. We employ a series of statistical models to study the relationship
between a song's popularity and novelty associated with its lyrics, chord
progressions, or audio properties. Our analyses performed on a dataset of over
fifty thousand songs find a consistently negative association between all types
of song novelty and popularity. Overall, we found a song's lyrics uniqueness to
have the most significant association with its popularity. However, audio
uniqueness was the strongest predictor of a song's popularity, conditional on
the song's genre. We further found the theme and repetitiveness of a song's
lyrics to mediate the relationship between the song's popularity and novelty.
Broadly, our results contradict the "optimal distinctiveness theory" (balance
between novelty and familiarity) and call for an investigation into the
multiple dimensions along which a cultural product's uniqueness could manifest.
Comment: Accepted at the International AAAI Conference on Web and Social Media
(ICWSM, 2023). Special Recognition Award at the 7th International Conference on
Computational Social Science (IC2S2, 2021)
Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm, Subsampled Randomized Hadamard Transform Dual Ridge Regression (SRHT-DRR), runs in time O(np log(n)) and works by preconditioning the design matrix with a Randomized Walsh-Hadamard Transform followed by subsampling of features. We provide risk bounds for SRHT-DRR in the fixed design setting and show experimental results on synthetic and real datasets.
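The pipeline can be sketched end to end: random sign flips, a Walsh-Hadamard transform over the feature axis, uniform subsampling of features, then dual ridge on the sketched design. Dimensions, the penalty, and the subsampling rate are illustrative assumptions, and this loop-based Python transform is for clarity, not speed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, p_sub, lam = 50, 1024, 256, 1.0   # p >> n; keep p_sub features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def fwht(A):
    # Fast Walsh-Hadamard transform along axis 0 (length a power of 2).
    A = A.copy().astype(float)
    h, m = 1, A.shape[0]
    while h < m:
        for i in range(0, m, 2 * h):
            a, b = A[i:i + h].copy(), A[i + h:i + 2 * h].copy()
            A[i:i + h], A[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return A / np.sqrt(m)

# SRHT preconditioning of the features: signs (D), Hadamard (H),
# then uniform subsampling of p_sub of the transformed features.
signs = rng.choice([-1.0, 1.0], size=p)
Xh = fwht((X * signs).T).T              # transform acts on the feature axis
cols = rng.choice(p, size=p_sub, replace=False)
Xs = Xh[:, cols] * np.sqrt(p / p_sub)   # rescale to keep norms unbiased

# Dual ridge regression on the sketched n x p_sub design.
K = Xs @ Xs.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)
y_hat = K @ alpha
print(float(np.linalg.norm(y_hat - y)))
```

Since the signed Hadamard transform is orthonormal, it spreads energy evenly across features without changing norms, which is what makes uniform subsampling safe afterwards.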
Metric Learning for Graph-based Domain Adaptation
In many domain adaptation formulations, one assumes access to a large amount of unlabeled data from the domain of interest (the target domain), some portion of which may be labeled, and a large amount of labeled data from other domains, known as source domain(s). Motivated by the fact that labeled data is hard to obtain in any domain, we design algorithms for the setting in which there exists a large amount of unlabeled data from all domains, a small portion of which may be labeled. We build on recent advances in graph-based semi-supervised learning and supervised metric learning. Given all instances, labeled and unlabeled, from all domains, we build a large similarity graph between them, where an edge exists between two instances if they are close according to some metric. Instead of using a predefined metric, as is commonly done, we feed the labeled instances into metric-learning algorithms and (re)construct a data-dependent metric, which is then used to construct the graph. We employ different types of edges depending on the domain identity of the two vertices touching them, and learn the weights of each edge. Experimental results show that our approach leads to a significant reduction in classification error across domains and performs better than two state-of-the-art models on the task of sentiment classification.
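A toy sketch of the pipeline: a Mahalanobis-style metric estimated from the labeled points stands in for a learned metric, followed by a similarity graph and label propagation. The two synthetic "domains", the covariance-whitening metric, and the propagation scheme are all illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical domains: source (fully labeled), target (a few labels).
Xs = rng.normal(size=(30, 5)) + 1.0
Xt = rng.normal(size=(30, 5)) - 1.0
X = np.vstack([Xs, Xt])
y = np.full(60, -1)                        # -1 marks unlabeled
y[:30] = (Xs[:, 0] > 1.0).astype(int)
y[30:35] = (Xt[:5, 0] > -1.0).astype(int)

# Data-dependent metric: whiten by the labeled points' covariance,
# a simple stand-in for a learned Mahalanobis metric.
lab = y >= 0
C = np.cov(X[lab].T) + 1e-3 * np.eye(5)
W_half = np.linalg.cholesky(np.linalg.inv(C))
Z = X @ W_half                             # distances in Z use the metric

# Gaussian similarity graph under that metric, then label propagation
# to score the unlabeled target points.
d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
f = np.where(lab, y, 0.5).astype(float)
for _ in range(20):
    f = W @ f / W.sum(1)
    f[lab] = y[lab]                        # clamp the labeled points
pred = (f[~lab] > 0.5).astype(int)
print(pred.shape)
```

Domain-typed edges, as in the abstract, would amount to scaling within-domain and cross-domain entries of W differently before propagating.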