26,322 research outputs found
From Task Classification Towards Similarity Measures for Recommendation in Crowdsourcing Systems
Task selection in micro-task markets can be supported by recommender systems
to help individuals to find appropriate tasks. Previous work showed that for
the selection process of a micro-task the semantic aspects, such as the
required action and the comprehensibility, are rated more important than
factual aspects, such as the payment or the required completion time. This work
gives a foundation to create such similarity measures. Therefore, we show that
an automatic classification based on task descriptions is possible.
Additionally, we propose similarity measures to cluster micro-tasks according
to semantic aspects.Comment: Work in Progress Paper at HCOMP 201
Variation of word frequencies across genre classification tasks
This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in
realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of its significance in carrying out classification in varying environments
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.
ForestHash: Semantic Hashing With Shallow Random Forests and Tiny Convolutional Networks
Hash codes are efficient data representations for coping with the ever
growing amounts of data. In this paper, we introduce a random forest semantic
hashing scheme that embeds tiny convolutional neural networks (CNN) into
shallow random forests, with near-optimal information-theoretic code
aggregation among trees. We start with a simple hashing scheme, where random
trees in a forest act as hashing functions by setting `1' for the visited tree
leaf, and `0' for the rest. We show that traditional random forests fail to
generate hashes that preserve the underlying similarity between the trees,
rendering the random forests approach to hashing challenging. To address this,
we propose to first randomly group arriving classes at each tree split node
into two groups, obtaining a significantly simplified two-class classification
problem, which can be handled using a light-weight CNN weak learner. Such
random class grouping scheme enables code uniqueness by enforcing each class to
share its code with different classes in different trees. A non-conventional
low-rank loss is further adopted for the CNN weak learners to encourage code
consistency by minimizing intra-class variations and maximizing inter-class
distance for the two random class groups. Finally, we introduce an
information-theoretic approach for aggregating codes of individual trees into a
single hash code, producing a near-optimal unique hash for each class. The
proposed approach significantly outperforms state-of-the-art hashing methods
for image retrieval tasks on large-scale public datasets, while performing at
the level of other state-of-the-art image classification techniques while
utilizing a more compact and efficient scalable representation. This work
proposes a principled and robust procedure to train and deploy in parallel an
ensemble of light-weight CNNs, instead of simply going deeper.Comment: Accepted to ECCV 201
Predicting the Quality of Short Narratives from Social Media
An important and difficult challenge in building computational models for
narratives is the automatic evaluation of narrative quality. Quality evaluation
connects narrative understanding and generation as generation systems need to
evaluate their own products. To circumvent difficulties in acquiring
annotations, we employ upvotes in social media as an approximate measure for
story quality. We collected 54,484 answers from a crowd-powered
question-and-answer website, Quora, and then used active learning to build a
classifier that labeled 28,320 answers as stories. To predict the number of
upvotes without the use of social network features, we create neural networks
that model textual regions and the interdependence among regions, which serve
as strong benchmarks for future research. To our best knowledge, this is the
first large-scale study for automatic evaluation of narrative quality.Comment: 7 pages, 2 figures. Accepted at the 2017 IJCAI conferenc
Non-Standard Words as Features for Text Categorization
This paper presents categorization of Croatian texts using Non-Standard Words
(NSW) as features. Non-Standard Words are: numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in Croatian language are determined
according to Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. Text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, the statistic measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
are achieved using the first feature set (NSW frequencies) with the
categorization accuracy of 87%. This suggests that the NSWs should be
considered as features in highly inflectional languages, such as Croatian. NSW
based features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs should be considered
for further Croatian texts categorization experiments.Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419,
201
- …