Corpus specificity in LSA and Word2vec: the role of out-of-domain documents
Latent Semantic Analysis (LSA) and Word2vec are some of the most widely used
word embeddings. Despite the popularity of these techniques, the precise
mechanisms by which they acquire new semantic relations between words remain
unclear. In the present article we investigate whether the capacity of LSA and
Word2vec to identify relevant semantic dimensions increases with corpus size.
One intuitive hypothesis is that this capacity should increase as the amount of
data increases. However, if the corpus grows in topics that are not specific to
the domain of interest, the signal-to-noise ratio may weaken. Here we set out to
examine and distinguish these alternative hypotheses. To investigate the effect
of corpus specificity and size on word embeddings, we study two ways of
progressively eliminating documents: eliminating random documents vs.
eliminating documents
unrelated to a specific task. We show that Word2vec can take advantage of all
the documents, obtaining its best performance when it is trained with the whole
corpus. On the contrary, the specialization (removal of out-of-domain
documents) of the training corpus, accompanied by a decrease of dimensionality,
can increase LSA word-representation quality while speeding up the processing
time. Furthermore, we show that the specialization without the decrease in LSA
dimensionality can produce a strong performance reduction in specific tasks.
From a cognitive-modeling point of view, we point out that LSA's word-knowledge
acquisition may not be efficiently exploiting higher-order co-occurrences and
global relations, whereas Word2vec does.
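The two elimination strategies described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how relatedness to the task is measured, so the `relevance` score (share of tokens found in a hand-picked `domain_terms` set), the function names, and the toy documents are all hypothetical.

```python
import random

def relevance(doc, domain_terms):
    """Hypothetical relatedness score: share of tokens that are domain terms."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t in domain_terms for t in tokens) / len(tokens)

def drop_random(docs, n_drop, seed=0):
    """Remove n_drop documents chosen uniformly at random."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(docs)), len(docs) - n_drop))
    return [d for i, d in enumerate(docs) if i in keep]

def drop_out_of_domain(docs, domain_terms, n_drop):
    """Remove the n_drop documents with the lowest relevance score."""
    ranked = sorted(range(len(docs)), key=lambda i: relevance(docs[i], domain_terms))
    dropped = set(ranked[:n_drop])
    return [d for i, d in enumerate(docs) if i not in dropped]

# Toy corpus (assumed): two in-domain and two out-of-domain documents.
docs = [
    "the cat chased the mouse",
    "stock markets fell sharply today",
    "a dog and a cat played",
    "parliament passed the budget bill",
]
domain_terms = {"cat", "dog", "mouse"}

specialized = drop_out_of_domain(docs, domain_terms, n_drop=2)
randomized = drop_random(docs, n_drop=2, seed=0)
```

Each reduced corpus would then be used to train the embedding model under comparison, so that the effect of corpus size can be separated from the effect of corpus specificity.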
Non-Standard Words as Features for Text Categorization
This paper presents the categorization of Croatian texts using Non-Standard
Words (NSW) as features. Non-Standard Words include numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in the Croatian language are determined
according to the Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. A text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, statistical measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
are achieved using the first feature set (NSW frequencies) with the
categorization accuracy of 87%. This suggests that the NSWs should be
considered as features in highly inflectional languages, such as Croatian.
NSW-based features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs should be considered
for further Croatian text categorization experiments.
Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419,
201
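As a rough illustration of the first representation in the abstract above (NSW frequencies as features), the sketch below counts a few NSW categories in a text. The regular expressions, the category names, and the sample sentence are assumptions made for this example; they do not reproduce the Croatian NSW taxonomy used in the paper.

```python
import re

# Hypothetical patterns, ordered from most to least specific so that a date
# or a currency amount is not counted again as a plain number.
NSW_PATTERNS = [
    ("date", re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")),
    ("currency", re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b")),
    ("acronym", re.compile(r"\b[A-Z]{2,}\b")),
    ("number", re.compile(r"\b\d+(?:[.,]\d+)?\b")),
]

def nsw_frequencies(text):
    """Count matches per NSW category, blanking out each match once counted."""
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS:
        # subn returns the rewritten string and the number of substitutions.
        remaining, n = pattern.subn(" ", remaining)
        counts[name] = n
    return counts

# Assumed sample sentence containing a date, an acronym, a number and two amounts.
sample = "Dana 12.05.2014. HNB je u 3 banke objavio tečaj: 100 EUR = 760 kn."
counts = nsw_frequencies(sample)
```

The resulting dictionary has one entry per category, so a document is described by a handful of counts rather than by a full vocabulary, which is the dimensionality reduction the abstract refers to.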
Individual differences in the perception of similarity and difference.
Thematically related concepts like coffee and milk are judged to be more similar than thematically unrelated concepts like coffee and lemonade. We investigated whether thematic relations exert a small effect that occurs consistently across participants (i.e., a generalized model), or a large effect that occurs inconsistently across participants (i.e., an individualized model). We also examined whether difference judgments mirrored similarity or whether these judgments were, in fact, non-inverse. Five studies demonstrated the necessity of an individualized model for both perceived similarity and difference, and additionally provided evidence that thematic relations affect similarity more than difference. Results suggest that models of similarity and difference must be attuned to large and consistent individual variability in the weighting of thematic relations.
What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries
We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering platform (CQA) as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to be different to web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.