13,126 research outputs found
Multiple Accounts Detection on Facebook Using Semi-Supervised Learning on Graphs
In social networks, a single user may create multiple accounts to spread his
or her opinions and to influence others by actively commenting on different news
pages. It would be beneficial to both social networks and their communities to
demote such abnormal activities, and the first step is to detect those
accounts. However, the detection is challenging, because these accounts may
have very realistic names and reasonable activity patterns. In this paper, we
investigate three different approaches, and propose using graph embedding
together with semi-supervised learning, to predict whether a pair of accounts
are created by the same user. We carry out extensive experimental analyses to
understand how changes in the input data and algorithmic parameters /
optimization affect the prediction performance. We also discover that local
information is more important than global information for such prediction,
and point out the threshold leading to the best results. We test the proposed
approach with 6700 Facebook pages from the Middle East, and achieve an
average accuracy of 0.996 and an AUC (area under the curve) of 0.952 for users
with the same name; with the U.S. 2016 election dataset, we obtain a best AUC
of 0.877 for users with different names.
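For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how such a score can be computed for a pair classifier: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions, not data or code from the paper.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = "same user", 0 = "different users".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = auc_from_scores(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large datasets.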
Leveraging Subjective Human Annotation for Clustering Historic Newspaper Articles
The New York Public Library is participating in the Chronicling America
initiative to develop an online searchable database of historically significant
newspaper articles. Microfilm copies of the newspapers are scanned and high
resolution Optical Character Recognition (OCR) software is run on them. The
text from the OCR provides a wealth of data and opinion for researchers and
historians. However, categorization of articles provided by the OCR engine is
rudimentary and a large number of the articles are labeled editorial without
further grouping. Manually sorting articles into fine-grained categories is
time-consuming, if not impossible, given the size of the corpus. This paper
studies techniques for automatic categorization of newspaper articles so as to
enhance search and retrieval on the archive. We explore unsupervised (e.g.
KMeans) and semi-supervised (e.g. constrained clustering) learning algorithms
to develop article categorization schemes geared towards the needs of
end-users. A pilot study was designed to understand whether there was unanimous
agreement amongst patrons regarding how articles can be categorized. It was
found that the task was very subjective and consequently automated algorithms
that could deal with subjective labels were used. While the small scale pilot
study was extremely helpful in designing machine learning algorithms, a much
larger system needs to be developed to collect annotations from users of the
archive. The "BODHI" system currently being developed is a step in that
direction, allowing users to correct wrongly scanned OCR text and to provide
keywords and tags for frequently used newspaper articles. On successful
implementation of the beta version of this system, we hope that it can be
integrated with existing software being developed for the Chronicling America
project.
Automatic Keyword Extraction for Text Summarization: A Survey
In recent times, data has been growing rapidly in every domain, such as news,
social media, banking, and education. Given this excess of data, there is a
need for an automatic summarizer capable of summarizing the data, especially
the textual data in the original document, without losing any critical
content. Text summarization has emerged as an important research area in the
recent past. In this regard, a review of existing work on the text
summarization process is useful for carrying out further research. In this
paper, recent literature on automatic keyword extraction and text summarization
is presented, since the text summarization process is highly dependent on
keyword extraction. This literature includes discussion of the different
methodologies used for keyword extraction and text summarization. It also
discusses the different databases used for text summarization in several
domains, along with evaluation metrics. Finally, it briefly discusses the
issues and research challenges faced by researchers, along with future
directions. Comment: 12 pages, 4 figures
An Optimization Framework for Semi-Supervised and Transfer Learning using Multiple Classifiers and Clusterers
Unsupervised models can provide supplementary soft constraints to help
classify new, "target" data since similar instances in the target set are more
likely to share the same class label. Such models can also help detect possible
differences between training and target distributions, which is useful in
applications where concept drift may take place, as in transfer learning
settings. This paper describes a general optimization framework that takes as
input class membership estimates from existing classifiers learnt on previously
encountered "source" data, as well as a similarity matrix from a cluster
ensemble operating solely on the target data to be classified, and yields a
consensus labeling of the target data. This framework admits a wide range of
loss functions and classification/clustering methods. It exploits properties of
Bregman divergences in conjunction with Legendre duality to yield a principled
and scalable approach. A variety of experiments show that the proposed
framework can yield results substantially superior to those provided by popular
transductive learning techniques or by naively applying classifiers learnt on
the original task to the target data.
Automated Extraction of Socio-political Events from News (AESPEN): Workshop and Shared Task Report
We describe our effort on automated extraction of socio-political events from
news in the scope of a workshop and a shared task we organized at the Language
Resources and Evaluation Conference (LREC 2020). We believe the event
extraction studies in computational linguistics and social and political
sciences should further support each other in order to enable large-scale
socio-political event information collection across sources, countries, and
languages. The event consists of two tracks: regular research papers and a
shared task on event sentence coreference identification (ESCI). All
submissions were reviewed by five members of the program committee. The
workshop attracted research papers related to evaluation of machine learning
methodologies, language resources, material conflict forecasting, and a shared
task participation report in the scope of socio-political event information
collection. It has shown us the volume and variety of both the data sources and
event information collection approaches related to socio-political events and
the need to fill the gap between automated text processing techniques and
requirements of social and political sciences.
Semi-supervised Bootstrapping approach for Named Entity Recognition
The aim of Named Entity Recognition (NER) is to identify references of named
entities in unstructured documents, and to classify them into pre-defined
semantic categories. NER often benefits from added background knowledge in the
form of gazetteers. However, using such a collection does not deal with name
variants and cannot resolve the ambiguities associated with identifying the
entities in context and associating them with predefined categories. We present
a semi-supervised NER approach that starts by identifying named entities with a
small set of training data. Using the identified named entities, the word and
the context features are used to define the pattern. This pattern of each named
entity category is used as a seed pattern to identify the named entities in the
test set. Pattern scoring and the tuple value score enable the generation of
new patterns to identify the named entity categories. We have evaluated the
proposed system for English with the tagged (IEER) and untagged (CoNLL 2003)
named entity corpora, and for Tamil with documents from the FIRE corpus,
obtaining an average F-measure of 75% for both languages. Comment: 13 pages,
2 figures, 5 tables
Edge-labeling Graph Neural Network for Few-shot Learning
In this paper, we propose a novel edge-labeling graph neural network (EGNN),
which adapts a deep neural network on the edge-labeling graph, for few-shot
learning. The previous graph neural network (GNN) approaches in few-shot
learning have been based on the node-labeling framework, which implicitly
models the intra-cluster similarity and the inter-cluster dissimilarity. In
contrast, the proposed EGNN learns to predict the edge-labels rather than the
node-labels on the graph, which enables the evolution of an explicit clustering
by iteratively updating the edge-labels with direct exploitation of both
intra-cluster similarity and inter-cluster dissimilarity. It is also well
suited to handling various numbers of classes without retraining, and can
be easily extended to perform transductive inference. The parameters of the
EGNN are learned by episodic training with an edge-labeling loss to obtain a
well-generalizable model for unseen low-data problems. On both supervised
and semi-supervised few-shot image classification tasks with two benchmark
datasets, the proposed EGNN significantly improves performance over
existing GNNs. Comment: accepted to CVPR 201
A New Vision of Collaborative Active Learning
Active learning (AL) is a learning paradigm where an active learner has to
train a model (e.g., a classifier) which is in principle trained in a
supervised way, but in AL it has to be done by means of a data set with
initially unlabeled samples. To get labels for these samples, the active
learner has to ask an oracle (e.g., a human expert) for labels. The goal is to
maximize the performance of the model and to minimize the number of queries at
the same time. In this article, we first briefly discuss the state of the art
and our own preliminary work in the field of AL. Then, we propose the concept of
collaborative active learning (CAL). With CAL, we will overcome some of the
harsh limitations of current AL. In particular, we envision scenarios where an
expert may be wrong for various reasons, there might be several or even many
experts with different expertise, the experts may label not only samples but
also knowledge at a higher level such as rules, and we consider that the
labeling costs depend on many conditions. Moreover, in a CAL process human
experts will profit by improving their own knowledge, too. Comment: 16 pages,
6 figures
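The basic AL loop described in this abstract (query an oracle for labels while keeping the number of queries small) can be sketched as follows. This is a minimal illustration using uncertainty sampling on a one-dimensional threshold classifier, not the collaborative scheme the article proposes; the names `fit_threshold` and `active_learn`, the toy pool, and the simulated oracle are all assumptions made for the example, and the seed set is assumed to contain one example of each class.

```python
def fit_threshold(labeled):
    """Place the decision threshold midway between the two classes seen so far."""
    highest_negative = max(x for x, y in labeled if y == 0)
    lowest_positive = min(x for x, y in labeled if y == 1)
    return (highest_negative + lowest_positive) / 2.0


def active_learn(pool, oracle, seed, budget):
    """Uncertainty sampling: always query the sample closest to the boundary."""
    labeled = [(x, oracle(x)) for x in seed]        # seed must cover both classes
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # The model is least certain about the point nearest its threshold.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))      # one oracle call per query
    return fit_threshold(labeled), len(labeled)


# Hypothetical setup: 21 evenly spaced samples, true boundary at 0.62.
pool = [i / 20 for i in range(21)]
oracle = lambda x: 1 if x >= 0.62 else 0            # stands in for a human expert
threshold, n_labels = active_learn(pool, oracle, seed=[0.0, 1.0], budget=5)
```

In this toy run the learner brackets the true boundary between the two nearest pool points after labeling 7 of the 21 samples, which illustrates the trade-off between model performance and query count. The oracle here is always correct; relaxing that assumption is one of the limitations the abstract says CAL aims to address.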
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
The amount of text that is generated every day is increasing dramatically.
This tremendous volume of mostly unstructured text cannot be simply processed
and perceived by computers. Therefore, efficient and effective techniques and
algorithms are required to discover useful patterns. Text mining is the task of
extracting meaningful information from text, which has gained significant
attention in recent years. In this paper, we describe several of the most
fundamental text mining tasks and techniques including text pre-processing,
classification and clustering. Additionally, we briefly explain text mining in
biomedical and health care domains. Comment: the format of some references has
been updated
Semi-supervised Deep Representation Learning for Multi-View Problems
While neural networks for learning representations of multi-view data have
been previously proposed as one of the state-of-the-art multi-view dimension
reduction techniques, how to make the representation discriminative with only a
small amount of labeled data is not well-studied. We introduce a
semi-supervised neural network model, named Multi-view Discriminative Neural
Network (MDNN), for multi-view problems. MDNN finds nonlinear view-specific
mappings by projecting samples to a common feature space using multiple coupled
deep networks. It is capable of leveraging both labeled and unlabeled data to
project multi-view data so that samples from different classes are separated
and those from the same class are clustered together. It also uses the
correlation between views to exploit the available information in
both the labeled and unlabeled data. Extensive experiments conducted on four
datasets demonstrate the effectiveness of the proposed algorithm for multi-view
semi-supervised learning. Comment: Accepted to IEEE Big Data 2018. 9 pages