124 research outputs found
Detecting Sockpuppets in Deceptive Opinion Spam
This paper explores the problem of sockpuppet detection in deceptive opinion
spam using authorship attribution and verification approaches. Two methods are
explored. The first is a feature subsampling scheme that uses the KL-Divergence
on stylistic language models of an author to find discriminative features. The
second is a transduction scheme, spy induction that leverages the diversity of
authors in the unlabeled test set by sending a set of spies (positive samples)
from the training set to retrieve hidden samples in the unlabeled test set
using nearest and farthest neighbors. Experiments using ground truth sockpuppet
data show the effectiveness of the proposed schemes.Comment: 18 pages, Accepted at CICLing 2017, 18th International Conference on
Intelligent Text Processing and Computational Linguistic
Near-optimal Linear Decision Trees for k-SUM and Related Problems
We construct near-optimal linear decision trees for a variety of decision problems in combinatorics and discrete geometry. For example, for any constant
k
, we construct linear decision trees that solve the
k
-SUM problem on
n
elements using
O
(
n
log
2
n
) linear queries. Moreover, the queries we use are comparison queries, which compare the sums of two
k
-subsets; when viewed as linear queries, comparison queries are 2
k
-sparse and have only { −1,0,1} coefficients. We give similar constructions for sorting sumsets A+B and for solving the SUBSET-SUM problem, both with optimal number of queries, up to poly-logarithmic terms.
Our constructions are based on the notion of “inference dimension,” recently introduced by the authors in the context of active classification with comparison queries. This can be viewed as another contribution to the fruitful link between machine learning and discrete geometry, which goes back to the discovery of the VC dimension
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
We present MCRapper, an algorithm for efficient computation of Monte-Carlo
Empirical Rademacher Averages (MCERA) for families of functions exhibiting
poset (e.g., lattice) structure, such as those that arise in many pattern
mining tasks. The MCERA allows us to compute upper bounds to the maximum
deviation of sample means from their expectations, thus it can be used to find
both statistically-significant functions (i.e., patterns) when the available
data is seen as a sample from an unknown distribution, and approximations of
collections of high-expectation functions (e.g., frequent patterns) when the
available data is a small sample from a large dataset. This feature is a strong
improvement over previously proposed solutions that could only achieve one of
the two. MCRapper uses upper bounds to the discrepancy of the functions to
efficiently explore and prune the search space, a technique borrowed from
pattern mining itself. To show the practical use of MCRapper, we employ it to
develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining.
TFP-R gives guarantees on the probability of including any false positives
(precision) and exhibits higher statistical power (recall) than existing
methods offering the same guarantees. We evaluate MCRapper and TFP-R and show
that they outperform the state-of-the-art for their respective tasks
Do Prices Coordinate Markets?
Walrasian equilibrium prices can be said to coordinate markets: They support
a welfare optimal allocation in which each buyer is buying bundle of goods that
is individually most preferred. However, this clean story has two caveats.
First, the prices alone are not sufficient to coordinate the market, and buyers
may need to select among their most preferred bundles in a coordinated way to
find a feasible allocation. Second, we don't in practice expect to encounter
exact equilibrium prices tailored to the market, but instead only approximate
prices, somehow encoding "distributional" information about the market. How
well do prices work to coordinate markets when tie-breaking is not coordinated,
and they encode only distributional information?
We answer this question. First, we provide a genericity condition such that
for buyers with Matroid Based Valuations, overdemand with respect to
equilibrium prices is at most 1, independent of the supply of goods, even when
tie-breaking is done in an uncoordinated fashion. Second, we provide
learning-theoretic results that show that such prices are robust to changing
the buyers in the market, so long as all buyers are sampled from the same
(unknown) distribution
Identifying Human Kinase-Specific Protein Phosphorylation Sites by Integrating Heterogeneous Information from Various Sources
Phosphorylation is an important type of protein post-translational modification. Identification of possible phosphorylation sites of a protein is important for understanding its functions. Unbiased screening for phosphorylation sites by in vitro or in vivo experiments is time consuming and expensive; in silico prediction can provide functional candidates and help narrow down the experimental efforts. Most of the existing prediction algorithms take only the polypeptide sequence around the phosphorylation sites into consideration. However, protein phosphorylation is a very complex biological process in vivo. The polypeptide sequences around the potential sites are not sufficient to determine the phosphorylation status of those residues. In the current work, we integrated various data sources such as protein functional domains, protein subcellular location and protein-protein interactions, along with the polypeptide sequences to predict protein phosphorylation sites. The heterogeneous information significantly boosted the prediction accuracy for some kinase families. To demonstrate potential application of our method, we scanned a set of human proteins and predicted putative phosphorylation sites for Cyclin-dependent kinases, Casein kinase 2, Glycogen synthase kinase 3, Mitogen-activated protein kinases, protein kinase A, and protein kinase C families (avaiable at http://cmbi.bjmu.edu.cn/huphospho). The predicted phosphorylation sites can serve as candidates for further experimental validation. Our strategy may also be applicable for the in silico identification of other post-translational modification substrates
- …