819 research outputs found
Transductive Learning with String Kernels for Cross-Domain Text Classification
For many text classification tasks, there is a major problem posed by the
lack of labeled data in a target domain. Although classifiers for a target
domain can be trained on labeled text data from a related source domain, the
accuracy of such classifiers is usually lower in the cross-domain setting.
Recently, string kernels have obtained state-of-the-art results in various text
classification tasks such as native language identification or automatic essay
scoring. Moreover, classifiers based on string kernels have been found to be
robust to the distribution gap between different domains. In this paper, we
formally describe an algorithm composed of two simple yet effective
transductive learning approaches to further improve the results of string
kernels in cross-domain settings. By adapting string kernels to the test set
without using the ground-truth test labels, we report significantly better
accuracy rates in cross-domain English polarity classification.Comment: Accepted at ICONIP 2018. arXiv admin note: substantial text overlap
with arXiv:1808.0840
Name Disambiguation from link data in a collaboration graph using temporal and topological features
In a social community, multiple persons may share the same name, phone number
or some other identifying attributes. This, along with other phenomena, such as
name abbreviation, name misspelling, and human error leads to erroneous
aggregation of records of multiple persons under a single reference. Such
mistakes affect the performance of document retrieval, web search, database
integration, and more importantly, improper attribution of credit (or blame).
The task of entity disambiguation partitions the records belonging to multiple
persons with the objective that each decomposed partition is composed of
records of a unique person. Existing solutions to this task use either
biographical attributes, or auxiliary features that are collected from external
sources, such as Wikipedia. However, for many scenarios, such auxiliary
features are not available, or they are costly to obtain. Besides, the attempt
of collecting biographical or external data sustains the risk of privacy
violation. In this work, we propose a method for solving entity disambiguation
task from link information obtained from a collaboration network. Our method is
non-intrusive of privacy as it uses only the time-stamped graph topology of an
anonymized network. Experimental results on two real-life academic
collaboration networks show that the proposed method has satisfactory
performance.Comment: The short version of this paper has been accepted to ASONAM 201
Text Mining Infrastructure in R
During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.
- …