
    Multiple Accounts Detection on Facebook Using Semi-Supervised Learning on Graphs

    In social networks, a single user may create multiple accounts to spread his or her opinions and to influence others by actively commenting on different news pages. It would benefit both social networks and their communities to demote such abnormal activity, and the first step is to detect those accounts. The detection is challenging, however, because these accounts may have very realistic names and reasonable activity patterns. In this paper, we investigate three different approaches and propose using graph embedding together with semi-supervised learning to predict whether a pair of accounts was created by the same user. We carry out extensive experimental analyses to understand how changes in the input data and in algorithmic parameters and optimization affect the prediction performance. We also find that local information is more important than global information for this prediction, and we identify the threshold that leads to the best results. We test the proposed approach on 6,700 Facebook pages from the Middle East and achieve an average accuracy of 0.996 and an AUC (area under the curve) of 0.952 for users with the same name; on the U.S. 2016 election dataset, we obtain a best AUC of 0.877 for users with different names.
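    The sketch below illustrates the general recipe this abstract describes, not the authors' implementation: embed an account-interaction graph, form pair features, and train a semi-supervised classifier on partially labeled pairs. The toy graph, the spectral embedding, the Hadamard pair features, and the self-training classifier are all illustrative stand-ins.

```python
# Minimal sketch (not the paper's code): graph embedding + semi-supervised
# classification of account pairs. Graph, pairs, and labels are toy data.
import numpy as np
import networkx as nx
from sklearn.manifold import SpectralEmbedding
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy account graph: an edge means two accounts commented on the same page.
G = nx.karate_club_graph()  # stand-in for the real Facebook account graph
A = nx.to_numpy_array(G)

# Embed each account into a low-dimensional space.
emb = SpectralEmbedding(n_components=8, affinity="precomputed").fit_transform(A)

# Pair features: element-wise (Hadamard) product of the two embeddings.
pairs = [(0, 1), (0, 33), (2, 3), (5, 16), (8, 30), (13, 19)]
X = np.array([emb[i] * emb[j] for i, j in pairs])

# Semi-supervised labels: 1 = same user, 0 = different user, -1 = unlabeled.
y = np.array([1, 0, 1, -1, -1, 0])

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict(X))  # predicted same-user / different-user label per pair
```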

    Leveraging Subjective Human Annotation for Clustering Historic Newspaper Articles

    The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the newspapers are scanned, and high-resolution Optical Character Recognition (OCR) software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of articles are labeled "editorial" without further grouping. Manually sorting articles into fine-grained categories is time consuming, if not impossible, given the size of the corpus. This paper studies techniques for the automatic categorization of newspaper articles so as to enhance search and retrieval on the archive. We explore unsupervised (e.g., KMeans) and semi-supervised (e.g., constrained clustering) learning algorithms to develop article categorization schemes geared towards the needs of end-users. A pilot study was designed to understand whether there was unanimous agreement amongst patrons regarding how articles can be categorized. The task was found to be very subjective, and consequently automated algorithms that could deal with subjective labels were used. While the small-scale pilot study was extremely helpful in designing the machine learning algorithms, a much larger system needs to be developed to collect annotations from users of the archive. The "BODHI" system currently being developed is a step in that direction, allowing users to correct erroneous OCR output and to provide keywords and tags for frequently used newspaper articles. On successful implementation of the beta version of this system, we hope that it can be integrated with existing software being developed for the Chronicling America project.
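    As a rough illustration of the unsupervised baseline mentioned above (bag-of-words features clustered with KMeans), here is a toy sketch; the article snippets are invented, and a constrained-clustering variant (e.g., COP-KMeans with must-link/cannot-link constraints from patron annotations) would be layered on top.

```python
# Illustrative sketch only: TF-IDF features plus KMeans on placeholder texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "Mayor announces new bridge construction downtown",
    "Editorial: the city must invest in public schools",
    "Stock prices fall amid railroad strike fears",
    "Opinion piece on the upcoming municipal election",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster id per article; pairwise constraints from patron
               # annotations would constrain these assignments in the
               # semi-supervised (constrained clustering) setting
```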

    Automatic Keyword Extraction for Text Summarization: A Survey

    In recent times, data has been growing rapidly in every domain, such as news, social media, banking, and education. Due to this excess of data, there is a need for automatic summarizers capable of summarizing data, especially the textual data in a document, without losing any critical information. Text summarization has emerged as an important research area in the recent past, and a review of existing work on the text summarization process is useful for carrying out further research. In this paper, recent literature on automatic keyword extraction and text summarization is presented, since the text summarization process depends heavily on keyword extraction. This literature includes a discussion of the different methodologies used for keyword extraction and text summarization. It also discusses the different datasets used for text summarization in several domains, along with evaluation metrics. Finally, it briefly discusses the issues and research challenges faced by researchers, along with future directions.
    Comment: 12 pages, 4 figures
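    As a small example of one family of methods such surveys cover, the sketch below shows purely statistical (TF-IDF) keyword extraction; the documents are invented and this is not a method proposed by the survey itself.

```python
# Toy TF-IDF keyword extraction: rank a document's terms by TF-IDF weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The central bank raised interest rates to curb inflation.",
    "Researchers propose a new model for automatic text summarization.",
    "The summarization model extracts keywords before ranking sentences.",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()

# Top-3 keywords of the last document by TF-IDF weight.
row = tfidf[-1]
print([terms[i] for i in row.argsort()[::-1][:3]])
```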

    An Optimization Framework for Semi-Supervised and Transfer Learning using Multiple Classifiers and Clusterers

    Unsupervised models can provide supplementary soft constraints to help classify new, "target" data, since similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This paper describes a general optimization framework that takes as input class membership estimates from existing classifiers learnt on previously encountered "source" data, as well as a similarity matrix from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework admits a wide range of loss functions and classification/clustering methods. It exploits properties of Bregman divergences in conjunction with Legendre duality to yield a principled and scalable approach. A variety of experiments show that the proposed framework can yield results substantially superior to those provided by popular transductive learning techniques or by naively applying classifiers learnt on the original task to the target data.
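    The snippet below is only a toy analogue of the framework's inputs and output, not its Bregman-divergence/Legendre-duality optimization: it blends source-classifier class estimates with a cluster-ensemble similarity matrix via one step of neighborhood averaging. All matrices are random placeholders.

```python
# Toy consensus labeling: classifier probabilities smoothed by a
# cluster-ensemble co-association (similarity) matrix over target data.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                                  # target samples, classes

P = rng.dirichlet(np.ones(k), size=n)        # source classifiers' class estimates
S = rng.random((n, n)); S = (S + S.T) / 2    # symmetric co-association similarity
np.fill_diagonal(S, 0)
W = S / S.sum(axis=1, keepdims=True)         # row-normalised similarity

alpha = 0.5                                  # trust placed in the cluster structure
consensus = (1 - alpha) * P + alpha * (W @ P)
print(consensus.argmax(axis=1))              # consensus label per target sample
```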

    Automated Extraction of Socio-political Events from News (AESPEN): Workshop and Shared Task Report

    We describe our effort on the automated extraction of socio-political events from news in the scope of a workshop and a shared task we organized at the Language Resources and Evaluation Conference (LREC 2020). We believe that event extraction studies in computational linguistics and in the social and political sciences should further support each other in order to enable large-scale socio-political event information collection across sources, countries, and languages. The event consists of two tracks: regular research papers and a shared task on event sentence coreference identification (ESCI). All submissions were reviewed by five members of the program committee. The workshop attracted research papers related to the evaluation of machine learning methodologies, language resources, material conflict forecasting, and a shared task participation report in the scope of socio-political event information collection. It has shown us the volume and variety of both the data sources and the event information collection approaches related to socio-political events, as well as the need to fill the gap between automated text processing techniques and the requirements of the social and political sciences.

    Semi-supervised Bootstrapping approach for Named Entity Recognition

    The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, using such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define patterns. The pattern of each named entity category is used as a seed pattern to identify named entities in the test set. Pattern scoring and a tuple value score enable the generation of new patterns to identify the named entity categories. We have evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and obtain an average F-measure of 75% for both languages.
    Comment: 13 pages, 2 figures, 5 tables
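    Below is a hand-rolled sketch of the bootstrapping idea (seed entities yield context patterns, which in turn yield new candidate entities); the corpus, seeds, and extraction rules are toy assumptions and do not reproduce the paper's pattern and tuple scoring scheme.

```python
# Minimal bootstrapping loop: seed entities -> context patterns -> candidates.
import re

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
    "The report is the work of analysts.",
]
seeds = {"Paris", "Berlin"}

# Patterns = the two words immediately following a known entity.
patterns = set()
for sent in corpus:
    for ent in seeds:
        m = re.search(re.escape(ent) + r"\s+(\w+\s+\w+)", sent)
        if m:
            patterns.add(m.group(1))          # e.g. "is the"

# Candidates = capitalised tokens that precede a learned pattern.
candidates = set()
for sent in corpus:
    for pat in patterns:
        m = re.search(r"([A-Z]\w+)\s+" + re.escape(pat), sent)
        if m and m.group(1) not in seeds:
            candidates.add(m.group(1))

print(candidates)  # {'Madrid'} -- would be scored before joining the seed set
```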

    Edge-labeling Graph Neural Network for Few-shot Learning

    In this paper, we propose a novel edge-labeling graph neural network (EGNN), which adapts a deep neural network to the edge-labeling graph, for few-shot learning. Previous graph neural network (GNN) approaches to few-shot learning have been based on the node-labeling framework, which implicitly models intra-cluster similarity and inter-cluster dissimilarity. In contrast, the proposed EGNN learns to predict edge labels rather than node labels on the graph, which enables the evolution of an explicit clustering by iteratively updating the edge labels with direct exploitation of both intra-cluster similarity and inter-cluster dissimilarity. It is also well suited to varying numbers of classes without retraining, and it can easily be extended to perform transductive inference. The parameters of the EGNN are learned by episodic training with an edge-labeling loss to obtain a model that generalizes well to unseen low-data problems. On both supervised and semi-supervised few-shot image classification tasks with two benchmark datasets, the proposed EGNN significantly improves performance over existing GNNs.
    Comment: accepted to CVPR 2019
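    The NumPy snippet below is only a loose caricature of the alternating node and edge updates the abstract describes; the learned similarity networks, episodic training, and edge-labeling loss are omitted, and all shapes and features are synthetic.

```python
# Toy alternating node/edge update in the spirit of an edge-labeling graph.
import numpy as np

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 8))               # 5 support/query embeddings

def edge_labels(x):
    # Edge feature ~ similarity of the two endpoint node features.
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    return np.exp(-d)                         # high = same class, low = different

E = edge_labels(nodes)
E /= E.sum(axis=1, keepdims=True)
nodes = E @ nodes                             # node update: edge-weighted aggregation
E = edge_labels(nodes)                        # edge update from the refined nodes
print(E.round(2))                             # predicted intra/inter-cluster affinities
```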

    A New Vision of Collaborative Active Learning

    Active learning (AL) is a learning paradigm where an active learner has to train a model (e.g., a classifier) that is in principle trained in a supervised way, but in AL this has to be done with a data set whose samples are initially unlabeled. To obtain labels for these samples, the active learner has to ask an oracle (e.g., a human expert) for labels. The goal is to maximize the performance of the model while minimizing the number of queries. In this article, we first briefly discuss the state of the art and our own preliminary work in the field of AL. Then, we propose the concept of collaborative active learning (CAL). With CAL, we aim to overcome some of the harsh limitations of current AL. In particular, we envision scenarios where an expert may be wrong for various reasons, where there might be several or even many experts with different expertise, where the experts may label not only samples but also knowledge at a higher level such as rules, and where the labeling costs depend on many conditions. Moreover, in a CAL process human experts will also profit by improving their own knowledge.
    Comment: 16 pages, 6 figures
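    For readers new to the paradigm, the sketch below shows a plain pool-based active learning loop with uncertainty sampling; the collaborative extensions the article proposes (multiple, possibly erring experts, higher-level feedback, cost models) are not represented, and the data and query budget are illustrative.

```python
# Baseline pool-based active learning with least-confident sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Seed the labeled set with one example per class; the rest forms the pool.
labeled = [int(np.flatnonzero(y == 0)[0]), int(np.flatnonzero(y == 1)[0])]
pool = [i for i in range(len(y)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                            # budget of 10 oracle queries
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)        # least-confident sampling
    query = pool.pop(int(uncertainty.argmax()))
    labeled.append(query)                      # the "oracle" reveals y[query]

print(clf.score(X, y))                         # model quality after the queries
```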

    A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

    The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot simply be processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, and it has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques, including text pre-processing, classification, and clustering. Additionally, we briefly explain text mining in the biomedical and health care domains.
    Comment: the format of some references has been updated
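    As a compact illustration of the pre-processing plus classification tasks such a survey discusses, here is a bag-of-words text classification pipeline; the texts and labels are invented and the pipeline is not taken from the paper.

```python
# Toy pre-processing + classification pipeline: TF-IDF features and Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "patient shows elevated blood pressure and fatigue",
    "the team won the championship game last night",
    "new clinical trial tests a diabetes treatment",
    "the striker scored twice in the final match",
]
labels = ["health", "sports", "health", "sports"]

model = make_pipeline(TfidfVectorizer(stop_words="english", lowercase=True),
                      MultinomialNB())
model.fit(texts, labels)
print(model.predict(["trial results for a new blood treatment"]))
```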

    Semi-supervised Deep Representation Learning for Multi-View Problems

    While neural networks for learning representations of multi-view data have previously been proposed as state-of-the-art multi-view dimension reduction techniques, how to make the representation discriminative with only a small amount of labeled data is not well studied. We introduce a semi-supervised neural network model, named Multi-view Discriminative Neural Network (MDNN), for multi-view problems. MDNN finds nonlinear view-specific mappings by projecting samples to a common feature space using multiple coupled deep networks. It is capable of leveraging both labeled and unlabeled data to project multi-view data so that samples from different classes are separated and those from the same class are clustered together. It also uses the correlation between views to exploit the available information in both the labeled and unlabeled data. Extensive experiments conducted on four datasets demonstrate the effectiveness of the proposed algorithm for multi-view semi-supervised learning.
    Comment: accepted to IEEE Big Data 2018; 9 pages
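    The PyTorch sketch below captures only the broad idea suggested by the abstract: coupled view-specific encoders map both views into a shared space and are trained with a supervised loss on the labeled subset plus an alignment term over all pairs. The architecture, loss weights, and synthetic data are assumptions, not the authors' MDNN.

```python
# Rough sketch of coupled view-specific encoders for semi-supervised
# multi-view learning; data and network sizes are synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d1, d2, h, k = 64, 20, 30, 16, 3
x1, x2 = torch.randn(n, d1), torch.randn(n, d2)      # two views of the same samples
y = torch.randint(0, k, (n,))
labeled = torch.zeros(n, dtype=torch.bool); labeled[:16] = True

enc1 = nn.Sequential(nn.Linear(d1, h), nn.ReLU(), nn.Linear(h, h))
enc2 = nn.Sequential(nn.Linear(d2, h), nn.ReLU(), nn.Linear(h, h))
head = nn.Linear(h, k)                                # shared classifier head
opt = torch.optim.Adam([*enc1.parameters(), *enc2.parameters(),
                        *head.parameters()], lr=1e-2)

for _ in range(100):
    z1, z2 = enc1(x1), enc2(x2)
    sup = nn.functional.cross_entropy(head(z1[labeled]), y[labeled])
    align = ((z1 - z2) ** 2).mean()                   # pull the two views together
    loss = sup + 0.1 * align
    opt.zero_grad(); loss.backward(); opt.step()

print(head(enc1(x1)).argmax(1)[:10])                  # predicted classes from view 1
```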