29,674 research outputs found
Two-layer classification and distinguished representations of users and documents for grouping and authorship identification
Most studies on authorship identification reported a drop in the identification result when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. There are at least 3 novelties in this paper. First, the two-layer approach allows applying authorship identification over larger number of authors (tested over 100 authors), and it is extendable. The authors are divided into groups that contain smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author inside the selected group. In order to extract the groups linking similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is introducing a new user representation that is different from document representation. Without the proposed user representation, the clustering over documents will result in documents of author(s) distributed over several clusters, instead of a single cluster membership for each author. Third, the extracted clusters are descriptive and meaningful of their users as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predicts the group and author of a given document. The results show that the documents are highly correlated with the extracted corresponding groups, and the proposed model can be accurately trained to determine the group and the author identity
Recommended from our members
Search engine For Twitter sentiment analysis
textThe purpose of sentiment analysis is to determine the attitude of a writer or a speaker with respect to some topic or his feeling in a document. Thanks to the rise of social media, nowadays there are numerous data generated by users. Mining and categorizing these data will not only bring profits for companies, but also benefit the nation. Sentiment analysis not only enables business decision makers to better understand customers' behaviors, but also allows customers to know how the public feel about a product before purchasing. On the other hand, the aggregation of emotions will effectively measure the public response toward an event or news. For example, the level of distress and sadness will increase significantly after terror attacks or natural disaster. In our project, we are going to build a search engine that allows users to check the sentiment of his query. Some of previous researches on classifying sentiment of messages on micro-blogging services like Twitter have tried to solve this problem but they have ignored neutral tweets, which will result in problematic results (12). Our sentiment analysis will also be based on tweets collected from twitter, since twitter can offer sufficient and real-time corpora for analysis. We will preprocess each tweet in the training set and label it as positive, negative or neutral. As we use words in the tweet as the feature for our model, different features will be used. We will show that accuracy achieved by different machine learning algorithms (NaĆÆve Bayes, Maximum Entropy) can be improved with a feature vector obtained by using bigrams (5). In our practice, we find that Naive Bayes has better performance than Maximum Entropy.Statistic
Exploring The Value Of Folksonomies For Creating Semantic Metadata
Finding good keywords to describe resources is an on-going problem: typically we select such words manually from a thesaurus of terms, or they are created using automatic keyword extraction techniques. Folksonomies are an increasingly well populated source of unstructured tags describing web resources. This paper explores the value of the folksonomy tags as potential source of keyword metadata by examining the relationship between folksonomies, community produced annotations, and keywords extracted by machines. The experiment has been carried-out in two ways: subjectively, by asking two human indexers to evaluate the quality of the generated keywords from both systems; and automatically, by measuring the percentage of overlap between the folksonomy set and machine generated keywords set. The results of this experiment show that the folksonomy tags agree more closely with the human generated keywords than those automatically generated. The results also showed that the trained indexers preferred the semantics of folksonomy tags compared to keywords extracted automatically. These results can be considered as evidence for the strong relationship of folksonomies to the human indexerās mindset, demonstrating that folksonomies used in the del.icio.us bookmarking service are a potential source for generating semantic metadata to annotate web resources
- ā¦