5 research outputs found
A comparison of the effect of feature selection and balancing strategies upon the sentiment classification of Portuguese news stories
Sentiment classification of news stories using supervised learning is a mature task in the field of Natural Language Processing. Supervised learning strategies rely upon training data to induce a classifier. Training data can be imbalanced, typically with the neutral class as the majority class. This imbalance can bias the induced classifier towards the majority class. Balancing and feature selection can mitigate the effects of imbalanced data. This paper surveys a number of common balancing and feature selection techniques, and applies them to an imbalanced data set of manually labelled Brazilian agricultural news stories. The strategies were appraised with a 90:10 holdout evaluation and compared with a baseline strategy. We found that: 1. the feature selection strategies provided no identifiable advantage over a baseline method and 2. balancing produced an advantage over baseline, with random oversampling producing the best results. FAPESP (grant 11/20451-1
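Random oversampling, the balancing strategy the abstract reports as most effective, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_oversample`, the `seed` parameter, and the toy labels are assumptions made for illustration. The idea shown is to duplicate randomly chosen minority-class examples until every class matches the size of the majority class.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    have as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the majority class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: the neutral class dominates, as the abstract describes.
docs = [f"story {i}" for i in range(9)]
tags = ["neutral"] * 6 + ["positive"] * 2 + ["negative"] * 1
balanced_docs, balanced_tags = random_oversample(docs, tags)
print(Counter(balanced_tags))
```

In practice the oversampling would be applied only to the training portion of the 90:10 holdout split, so that duplicated stories do not leak into the evaluation data.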
Causation generalization through the identification of equivalent nodes in causal sparse graphs constructed from text using node similarity strategies
Causal Bayesian Graphs can be constructed from causal information in text. These graphs can be sparse because the cause or effect event can be expressed in various ways to represent the same information. This sparseness can corrupt inferences made on the graph. This paper proposes to reduce sparseness by merging equivalent nodes and their edges. This paper presents a number of experiments that evaluate the applicability of node similarity techniques to detect equivalent nodes. The experiments found that techniques that rely upon a combination of node contents and structural information are the most accurate strategies. Specifically, we employed: 1. node name similarity and 2. a combination of node name similarity and common neighbours (SMCN). In addition, SMCN returns "better" equivalent nodes than the string matching strategy. São Paulo Research Foundation (FAPESP) (grants 2013/12191-5, 2011/22749-8 and 2011/20451-1
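The general idea of combining node name similarity with common neighbours can be sketched as follows. The abstract does not define SMCN, so this is not the paper's measure: the weighting parameter `alpha`, the use of `difflib.SequenceMatcher` for string similarity, the Jaccard overlap of neighbour sets, and the toy graph are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity between two node names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def neighbour_overlap(graph, a, b):
    """Jaccard overlap of the neighbour sets of a and b, in [0, 1].
    `graph` maps a node name to the set of its neighbours."""
    na = graph.get(a, set()) - {b}
    nb = graph.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def combined_similarity(graph, a, b, alpha=0.5):
    """Weighted mix of name similarity and structural overlap.
    `alpha` is a hypothetical weight, not a value from the paper."""
    return (alpha * name_similarity(a, b)
            + (1 - alpha) * neighbour_overlap(graph, a, b))


# Toy causal graph: two phrasings of the same cause share an effect.
graph = {
    "heavy rain": {"flood"},
    "heavy rains": {"flood"},
    "drought": {"crop loss"},
}
print(combined_similarity(graph, "heavy rain", "heavy rains"))
print(combined_similarity(graph, "heavy rain", "drought"))
```

Node pairs whose combined score exceeds some chosen threshold would be candidates for merging, which is how sparseness would be reduced in this sketch.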
Lexical resources for the identification of causative relations in Portuguese texts
The identification of causal relations from text is a mature problem in Natural Language Processing. There are a number of resources and tools to aid causative relation extraction in English, but there seems to be a limited number of resources for Portuguese. This paper presents a number of resources which are designed to aid the researcher and the practitioner to extract causative relations from Portuguese texts. FAPESP (grant number: 11/20451-1
Multilevel refinement based on neighborhood similarity
The multilevel graph partitioning strategy aims to reduce the computational cost of the partitioning algorithm by applying it to a coarsened version of the original graph. This strategy is very useful when large-scale networks are analyzed. To improve the multilevel solution, refinement algorithms have been used in the uncoarsening phase. Typical refinement algorithms exploit network properties, for example minimum cut or modularity, but they do not exploit features from domain-specific networks. For instance, in social networks, partitions with a high clustering coefficient or high similarity between vertices indicate a better solution. In this paper, we propose a refinement algorithm (RSim) which is based on neighborhood similarity. We compare RSim with: 1. two algorithms from the literature and 2. one baseline strategy, on twelve real networks. Results indicate that RSim is competitive with methods evaluated for general domains, but for social networks it surpasses the competing refinement algorithms. CNPq (grant 151836-/2013-2) FAPESP (grants 2011/22749-8, 11/20451-1 and 2013/12191-5) CAPE
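To make the notion of refinement by neighborhood similarity concrete, here is a minimal sketch of one greedy refinement pass. It is not RSim, whose details the abstract does not give: the Jaccard measure over closed neighborhoods, the function names `jaccard` and `refine_once`, the move rule, and the toy graph are all assumptions for illustration.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of the closed neighborhoods of u and v
    (each vertex is counted as part of its own neighborhood)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def refine_once(adj, part):
    """One greedy pass: move each vertex to the partition whose members
    among its neighbours are, in total, most similar to it."""
    new_part = dict(part)
    for u in adj:
        # Accumulate similarity to neighbours, grouped by their partition.
        score = {}
        for v in adj[u]:
            p = new_part[v]
            score[p] = score.get(p, 0.0) + jaccard(adj, u, v)
        if score:
            best = max(score, key=score.get)
            # Move only on a strict improvement over the current partition.
            if score[best] > score.get(new_part[u], 0.0):
                new_part[u] = best
    return new_part


# Toy network: two triangles joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
# Vertex 2 starts in the wrong partition.
initial = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
print(refine_once(adj, initial))
```

In a multilevel setting, a pass like this would be run at each level of the uncoarsening phase, after the partition of the coarser graph has been projected onto the finer one.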