5,791 research outputs found

    From Review to Rating: Exploring Dependency Measures for Text Classification

    Various text analysis techniques exist, which attempt to uncover unstructured information from text. In this work, we explore using statistical dependence measures for textual classification, representing text as word vectors. Student satisfaction scores on a 3-point scale and their free text comments written about university subjects are used as the dataset. We have compared two textual representations: a frequency word representation and term frequency relationship to word vectors, and found that word vectors provide a greater accuracy. However, these word vectors have a large number of features which aggravates the burden of computational complexity. Thus, we explored using a non-linear dependency measure for feature selection by maximizing the dependence between the text reviews and corresponding scores. Our quantitative and qualitative analysis on a student satisfaction dataset shows that our approach achieves comparable accuracy to the full feature vector, while being an order of magnitude faster in testing. These text analysis and feature reduction techniques can be used for other textual data applications such as sentiment analysis. Comment: 8 pages
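As a rough illustration of dependence-based feature selection, the sketch below ranks features by a biased empirical Hilbert-Schmidt Independence Criterion (HSIC) estimate against the scores. The abstract does not name its dependency measure, so HSIC, the Gaussian kernel, the bandwidth `sigma`, and all function names here are assumptions chosen for illustration, not the authors' method.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of scalars (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 1/n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total
             for j in range(n)] for i in range(n)]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two equally long lists of scalars."""
    n = len(x)
    kc = center(gaussian_gram(x, sigma))
    lc = center(gaussian_gram(y, sigma))
    # trace(Kc Lc) equals the elementwise sum because both matrices are symmetric.
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


def select_features(rows, scores, k):
    """Return the indices of the k features with the highest HSIC to the scores."""
    n_features = len(rows[0])
    ranked = sorted(range(n_features),
                    key=lambda f: hsic([r[f] for r in rows], scores),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical toy data: feature 0 tracks the score, feature 1 is constant.
    rows = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]
    scores = [1.0, 2.0, 3.0, 1.0, 3.0, 2.0]
    print(select_features(rows, scores, 1))
```

Scoring each feature independently keeps the sketch short; it ignores redundancy between features, which a joint selection procedure would have to handle.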

    The Expressive Power of Word Embeddings

    We seek to better understand the difference in quality of the several publicly released embeddings. We propose several tasks that help to distinguish the characteristics of different embeddings. Our evaluation of sentiment polarity and synonym/antonym relations shows that embeddings are able to capture surprisingly nuanced semantics even in the absence of sentence structure. Moreover, benchmarking the embeddings shows great variance in quality and characteristics of the semantics captured by the tested embeddings. Finally, we show the impact of varying the number of dimensions and the resolution of each dimension on the effective useful features captured by the embedding space. Our contributions highlight the importance of embeddings for NLP tasks and the effect of their quality on the final results. Comment: submitted to ICML 2013, Deep Learning for Audio, Speech and Language Processing Workshop. 8 pages, 8 figures

    An Efficient Dual Approach to Distance Metric Learning

    Distance metric learning is of fundamental interest in machine learning because the distance metric employed can significantly affect the performance of many learning methods. Quadratic Mahalanobis metric learning is a popular approach to the problem, but typically requires solving a semidefinite programming (SDP) problem, which is computationally expensive. Standard interior-point SDP solvers typically have a complexity of $O(D^{6.5})$ (with $D$ the dimension of input data), and can thus only practically solve problems exhibiting less than a few thousand variables. Since the number of variables is $D(D+1)/2$, this implies a limit upon the size of problem that can practically be solved of around a few hundred dimensions. The complexity of the popular quadratic Mahalanobis metric learning approach thus limits the size of problem to which metric learning can be applied. Here we propose a significantly more efficient approach to the metric learning problem based on the Lagrange dual formulation of the problem. The proposed formulation is much simpler to implement, and therefore allows much larger Mahalanobis metric learning problems to be solved. The time complexity of the proposed method is $O(D^3)$, which is significantly lower than that of the SDP approach. Experiments on a variety of datasets demonstrate that the proposed method achieves an accuracy comparable to the state-of-the-art, but is applicable to significantly larger problems. We also show that the proposed method can be applied to solve more general Frobenius-norm regularized SDP problems approximately.
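The scaling figures quoted in this abstract can be made concrete with a small sketch. It counts the free entries of a symmetric D x D matrix, D(D+1)/2, and compares the two asymptotic growth rates with constant factors ignored, so the ratio is only an indication of relative growth, not a runtime prediction. The function names and the sample dimensions are hypothetical.

```python
def num_variables(d):
    """Free entries of a symmetric d x d Mahalanobis matrix: d(d+1)/2."""
    return d * (d + 1) // 2


def growth_ratio(d):
    """Ratio d**6.5 / d**3 of the two stated complexities, constants ignored."""
    return d ** 6.5 / d ** 3


if __name__ == "__main__":
    # Sample dimensions are arbitrary illustration values.
    for d in (10, 100, 500):
        print(d, num_variables(d), growth_ratio(d))
```

For example, `num_variables(100)` is 5050, so even a 100-dimensional input already produces several thousand SDP variables.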