Thumbs up? Sentiment Classification using Machine Learning Techniques
We consider the problem of classifying documents not by topic, but by overall
sentiment, e.g., determining whether a review is positive or negative. Using
movie reviews as data, we find that standard machine learning techniques
definitively outperform human-produced baselines. However, the three machine
learning methods we employed (Naive Bayes, maximum entropy classification, and
support vector machines) do not perform as well on sentiment classification as
on traditional topic-based categorization. We conclude by examining factors
that make the sentiment classification problem more challenging.
Comment: To appear in EMNLP-200
Generalized Model Selection For Unsupervised Learning In High Dimensions
In this paper we describe an approach to model selection in unsupervised learning. This approach determines both the feature set and the number of clusters. To this end we first derive an objective function that explicitly incorporates this generalization. We then evaluate two schemes for model selection: one using this objective function (a Bayesian estimation scheme that selects the best model structure using the marginal or integrated likelihood) and the second based on a technique using a cross-validated likelihood criterion. In the first scheme, for a particular application in document clustering, we derive a closed-form solution of the integrated likelihood by assuming an appropriate form of the likelihood function and prior. Extensive experiments are carried out to ascertain the validity of both approaches and all results are verified by comparison against ground truth. In our experiments the Bayesian scheme using our objective function gave better results than cross-validatio..
Clustering with model-level constraints
In this paper we describe a systematic approach to uncovering multiple clusterings underlying a dataset. In contrast to previous approaches, the proposed method uses information about structures that are not desired and consequently is very useful in an exploratory datamining setting. Specifically, the problem is formulated as constrained model-based clustering where the constraints are placed at a model-level. Two variants of an EM algorithm, for this constrained model, are derived. The performance of both variants is compared against a state-of-the-art information bottleneck algorithm on both synthetic and real datasets.
OLAP over Imprecise Data With Domain Constraints
Several recent works have focused on OLAP over imprecise data, where
each fact can be a region, instead of a point, in a multi-dimensional
space. They have provided a multiple-world semantics for such data,
and developed efficient solutions to answer OLAP aggregation queries
over the imprecise facts. These solutions, however, assume that the
imprecise facts can be interpreted independently of one another, a
key assumption that is often violated in practice. Indeed, imprecise
facts in real-world applications are often correlated, and such
correlations can be captured as domain integrity constraints (e.g.,
repairs with the same customer names and models took place in the same
city, or a text span can refer to a person or a city, but not both).
In this paper we provide a solution to answer OLAP aggregation queries
over imprecise data, in the presence of such domain constraints. We first
describe a relatively simple yet powerful constraint language, and define
what it means to take into account such constraints in query answering.
Next, we prove that OLAP queries can be answered efficiently given a
database D* of fact marginals. We then exploit the regularities in the
constraint space (captured in a constraint hypergraph) and the fact space
to efficiently construct D*. Extensive experiments over real-world and
synthetic data demonstrate the effectiveness of our approach.
OLAP Over Uncertain and Imprecise Data
We extend the OLAP data model to represent data ambiguity, specifically imprecision and uncertainty, and introduce an allocation-based approach to the semantics of aggregation queries over such data. We identify three natural query properties and use them to shed light on alternative query semantics. While there is much work on representing and querying ambiguous data, to our knowledge this is the first paper to handle both imprecision and uncertainty in an OLAP setting.