Feature Ranking for Text Classifiers
Feature selection based on feature ranking has received much
attention from researchers in the field of text classification. The
major reasons are the scalability, ease of use, and fast computation
of ranking methods.
However, compared to the search-based feature selection methods such
as wrappers and filters, they suffer from poor performance. This is
linked to their major deficiencies, including: (i) feature ranking
is problem-dependent; (ii) they ignore term dependencies, including
redundancies and correlations; and (iii) they usually fail on
unbalanced data.
While using feature ranking methods for dimensionality reduction, we
should be aware of these drawbacks, which arise from the function of
feature ranking methods. In this thesis, a set of solutions is
proposed to handle the drawbacks of feature ranking and boost their
performance. First, an evaluation framework called feature
meta-ranking is proposed to evaluate ranking measures. The framework
is based on a newly proposed Differential Filter Level Performance
(DFLP) measure. It is proved that, in ideal cases, the performance
of a text classifier is a monotonic, non-decreasing function of the
number of features. We then validate, theoretically and empirically,
the effectiveness of DFLP as a meta-ranking measure to evaluate and
compare feature ranking methods. The meta-ranking framework is also
examined on a stopword extraction problem, where we use it to
select an appropriate feature ranking measure for building
domain-specific stoplists. The proposed framework is evaluated with
SVM and Rocchio text classifiers on six benchmark data sets. The
meta-ranking method suggests that in searching for a proper feature
ranking measure, the backward feature ranking is as important as the
forward one.
Second, we show that the destructive effect of term redundancy gets
worse as we decrease the feature ranking threshold. This implies that
aggressive feature selection should be accompanied by effective
redundancy reduction in addition to feature ranking. An algorithm based
on extracting term dependency links using an information theoretic
inclusion index is proposed to detect and handle term dependencies.
The dependency links are visualized by a tree structure called a
term dependency tree. By grouping the nodes of the tree into two
categories, including hub and link nodes, a heuristic algorithm is
proposed to handle the term dependencies by merging or removing the
link nodes. The proposed method of redundancy reduction is evaluated
with SVM and Rocchio classifiers on four benchmark data sets.
According to the results, redundancy reduction is more effective on
weak classifiers since they are more sensitive to term redundancies.
The results also suggest that aggressive feature selection is not
recommended for feature ranking methods that compact the information
into a small number of features.
Finally, to deal with class imbalance at the feature level using ranking
methods, a local feature ranking scheme called the reverse
discrimination approach is proposed. The proposed method is applied
to a highly unbalanced social network discovery problem. In this
case study, the problem of learning a social network is translated
into a text classification problem using newly proposed actor and
relationship modeling. Since social networks are usually sparse
structures, the corresponding text classifiers become highly
unbalanced. Experimental assessment of the reverse discrimination
approach validates the effectiveness of the local feature ranking
method to improve the classifier performance when dealing with
unbalanced data. The application itself suggests a new approach to
learning social structures from textual data.
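As an illustration of what a feature ranking method does, the sketch below scores each term against a target class and sorts the vocabulary by score. The abstract does not name the ranking measures studied, so the chi-square statistic used here is an assumption, and the function names and the toy corpus are hypothetical. The top of the ranking gives candidate features; the bottom of the ranking gives candidate stopwords, in the spirit of the backward ranking mentioned above.

```python
def chi_square(term, docs, labels, target):
    """Chi-square score of a term for one class, from a 2x2 contingency table.

    docs is a list of token sets; labels is the parallel list of class names.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1          # term present, target class
        elif has_term:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term (or the class) never varies:
    # the term carries no discriminative information.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def rank_features(docs, labels, target):
    """Return the vocabulary sorted from most to least discriminative."""
    vocab = sorted({t for tokens in docs for t in tokens})
    scores = {t: chi_square(t, docs, labels, target) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))


# Hypothetical toy corpus, for illustration only.
docs = [
    {"goal", "match", "the"},
    {"goal", "team", "the"},
    {"vote", "party", "the"},
    {"vote", "election", "the"},
]
labels = ["sport", "sport", "politics", "politics"]

ranked = rank_features(docs, labels, "sport")
top_features = ranked[:2]         # forward ranking: keep the best terms
stopword_candidates = ranked[-1:] # backward ranking: least informative terms
```

Note that the score treats every term independently, which is exactly the limitation (ii) listed above: two perfectly redundant terms receive the same high score.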
Predicting links in ego-networks using temporal information
Link prediction appears as a central problem of network science, as it calls
for unfolding the mechanisms that govern the micro-dynamics of the network. In
this work, we are interested in ego-networks, that is, the mere information on
the interactions of a node with its neighbors, in the context of social
relationships. As the structural information is very poor, we rely on another
source of information to predict links among egos' neighbors: the timing of
interactions. We define several features to capture different kinds of temporal
information and apply machine learning methods to combine these various
features and improve the quality of the prediction. We demonstrate the
efficiency of this temporal approach on a cellphone interaction dataset,
pointing out features which prove themselves to perform well in this context,
in particular the temporal profile of interactions and elapsed time between
contacts.
Comment: submitted to EPJ Data Science
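To make the idea of temporal features concrete, the sketch below computes two quantities for a pair of neighbors of the same ego: the smallest elapsed time between a contact with one and a contact with the other, and the similarity of their hour-of-day interaction profiles. These definitions are assumptions chosen for illustration; the abstract does not give the exact feature definitions, and the function names and timestamps are hypothetical.

```python
import math


def min_elapsed(times_u, times_v):
    """Smallest gap (in seconds) between any contact with u and any with v."""
    return min(abs(a - b) for a in times_u for b in times_v)


def hourly_profile(times):
    """Count interactions per hour of day from timestamps given in seconds."""
    profile = [0] * 24
    for t in times:
        profile[(t // 3600) % 24] += 1
    return profile


def cosine(p, q):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Hypothetical timestamps of the ego's contacts with neighbors u and v.
times_u = [3600, 7200]
times_v = [3700, 90000]

features = {
    "min_elapsed": min_elapsed(times_u, times_v),
    "profile_similarity": cosine(hourly_profile(times_u),
                                 hourly_profile(times_v)),
}
```

Feature vectors of this kind, one per candidate pair, could then be passed to any standard supervised classifier to combine them.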
RankMerging: A supervised learning-to-rank framework to predict links in large social networks
Uncovering unknown or missing links in social networks is a difficult task
because of their sparsity and because links may represent different types of
relationships, characterized by different structural patterns. In this paper,
we define a simple yet efficient supervised learning-to-rank framework, called
RankMerging, which aims at combining information provided by various
unsupervised rankings. We illustrate our method on three different kinds of
social networks and show that it substantially improves on the performance of
unsupervised ranking metrics. We also compare it to other combination
strategies based on standard methods. Finally, we explore various aspects of
RankMerging, such as feature selection and parameter estimation and discuss its
area of relevance: the prediction of an adjustable number of links on large
networks.
Comment: 43 pages, published in Machine Learning Journal
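For readers unfamiliar with rank combination, the sketch below merges several rankings of candidate links with a Borda count. This is not RankMerging itself, whose algorithm is not described in the abstract; Borda count is a standard unsupervised aggregation rule, used here only to show what combining rankings means. The function name and the candidate pairs are hypothetical.

```python
def borda_merge(rankings):
    """Merge rankings of the same items: earlier positions earn more points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # First place earns n points, last place earns 1 point.
            scores[item] = scores.get(item, 0) + (n - position)
    # Highest total first; ties broken by item order for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of candidate links produced by three
# unsupervised metrics (for example, three different similarity indices).
ranking_1 = [("a", "b"), ("a", "c"), ("b", "c")]
ranking_2 = [("a", "c"), ("a", "b"), ("b", "c")]
ranking_3 = [("a", "c"), ("b", "c"), ("a", "b")]

merged = borda_merge([ranking_1, ranking_2, ranking_3])
predicted_links = merged[:2]  # predict an adjustable number of links
```

A supervised approach would instead learn from known links how much to trust each input ranking, rather than weighting them equally as done here.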
Neural-Brane: Neural Bayesian Personalized Ranking for Attributed Network Embedding
Network embedding methodologies, which learn a distributed vector representation for each vertex in a network, have attracted considerable interest in recent years. Existing works have demonstrated that vertex representations learned through an embedding method provide superior performance in many real-world applications, such as node classification, link prediction, and community detection. However, most existing methods for network embedding use only the topological information of a vertex, ignoring the rich set of nodal attributes (such as user profiles in an online social network, or textual content in a citation network) that is abundant in real-life networks. A joint network embedding that takes into account both attributional and relational information captures more complete network information and could further enrich the learned vector representations. In this work, we present Neural-Brane, a novel Neural Bayesian Personalized Ranking based Attributed Network Embedding. For a given network, Neural-Brane extracts latent feature representations of its vertices using a purpose-designed neural network model that unifies network topological information and nodal attributes. In addition, it uses a Bayesian personalized ranking objective, which exploits the proximity ordering between a similar node pair and a dissimilar node pair. We evaluate the quality of the vertex embeddings produced by Neural-Brane by solving node classification and clustering tasks on four real-world datasets. Experimental results demonstrate the superiority of our proposed method over existing state-of-the-art methods.
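The Bayesian personalized ranking objective mentioned above can be sketched for a single triplet as the negative log-sigmoid of the score difference between a similar pair and a dissimilar pair. Scoring pairs by the dot product of their embeddings is an assumption made here for illustration; the abstract does not specify how Neural-Brane scores pairs, and the function names and vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(anchor, similar, dissimilar):
    """BPR loss for one triplet: -log(sigmoid(s_similar - s_dissimilar)).

    Uses the identity -log(sigmoid(x)) = log(1 + exp(-x)).
    Adequate for small margins; very negative margins would overflow exp.
    """
    margin = dot(anchor, similar) - dot(anchor, dissimilar)
    return math.log1p(math.exp(-margin))


# Hypothetical two-dimensional embeddings.
anchor = [1.0, 0.0]
near = [1.0, 0.0]   # embedding of a node similar to the anchor
far = [0.0, 1.0]    # embedding of a node dissimilar to the anchor

loss_correct_order = bpr_loss(anchor, near, far)
loss_wrong_order = bpr_loss(anchor, far, near)
```

The loss is small when the similar node scores higher than the dissimilar one and grows when the ordering is violated, so minimizing it over many triplets pushes the embeddings toward the desired proximity ordering.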
A framework for interrogating social media images to reveal an emergent archive of war
The visual image has long been central to how war is seen, contested and legitimised, remembered and forgotten. Archives are pivotal to these ends as is their ownership and access, from state and other official repositories through to the countless photographs scattered and hidden from a collective understanding of what war looks like in individual collections and dusty attics. With the advent and rapid development of social media, however, the amateur and the professional, the illicit and the sanctioned, the personal and the official, and the past and the present, all seem to inhabit the same connected and chaotic space. However, to even begin to render intelligible the complexity, scale and volume of what war looks like in social media archives is a considerable task, given the limitations of any traditional human-based method of collection and analysis. We thus propose the production of a series of ‘snapshots’, using computer-aided extraction and identification techniques to try to offer an experimental way in to conceiving a new imaginary of war. We were particularly interested in testing to see if twentieth century wars, obviously initially captured via pre-digital means, had become more ‘settled’ over time in terms of their remediated presence today through their visual representations and connections on social media, compared with wars fought in digital media ecologies (i.e. those fought and initially represented amidst the volume and pervasiveness of social media images). To this end, we developed a framework for automatically extracting and analysing war images that appear in social media, using both the features of the images themselves, and the text and metadata associated with each image. The framework utilises a workflow comprising four core stages: (1) information retrieval, (2) data pre-processing, (3) feature extraction, and (4) machine learning. Our corpus was drawn from the social media platforms Facebook and Flickr.