On Tree-Based Neural Sentence Modeling
Neural networks with tree-based sentence encoders have shown better results
on many downstream tasks. Most existing tree-based encoders adopt syntactic
parsing trees as the explicit structure prior. To study the effectiveness of
different tree structures, we replace the parsing trees with trivial trees
(i.e., binary balanced tree, left-branching tree and right-branching tree) in
the encoders. Though trivial trees contain no syntactic information, those
encoders get competitive or even better results on all of the ten downstream
tasks we investigated. This surprising result indicates that explicit syntax
guidance may not be the main contributor to the superior performances of
tree-based neural sentence modeling. Further analysis show that tree modeling
gives better results when crucial words are closer to the final representation.
Additional experiments give more clues on how to design an effective tree-based
encoder. Our code is open-source and available at
https://github.com/ExplorerFreda/TreeEnc.
Comment: To appear at EMNLP 2018.
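A minimal sketch of the three trivial tree structures compared above, with a toy recursive composition over them. The builders and the tanh composition are illustrative stand-ins (the paper's encoders are tree LSTMs); all names here are hypothetical.

```python
# Build the three "trivial" tree structures over a token sequence and
# encode each with a toy recursive composition (a stand-in for the
# Tree-LSTM cell used in practice).
import numpy as np

def right_branching(tokens):
    # (w1, (w2, (w3, ...))): each word attaches to the tree of the rest.
    if len(tokens) == 1:
        return tokens[0]
    return (tokens[0], right_branching(tokens[1:]))

def left_branching(tokens):
    if len(tokens) == 1:
        return tokens[0]
    return (left_branching(tokens[:-1]), tokens[-1])

def balanced(tokens):
    # Split as evenly as possible at every level.
    if len(tokens) == 1:
        return tokens[0]
    mid = len(tokens) // 2
    return (balanced(tokens[:mid]), balanced(tokens[mid:]))

def encode(node, emb, W):
    # Leaves are word embeddings; internal nodes compose their two
    # children with a single tanh layer.
    if isinstance(node, str):
        return emb[node]
    left, right = (encode(child, emb, W) for child in node)
    return np.tanh(W @ np.concatenate([left, right]))

tokens = "the cat sat on the mat".split()
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in set(tokens)}
W = rng.normal(size=(8, 16))
for build in (left_branching, right_branching, balanced):
    print(build.__name__, encode(build(tokens), emb, W)[:3])
```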
Broadword Implementation of Parenthesis Queries
We continue the line of research started in "Broadword Implementation of
Rank/Select Queries", proposing broadword (a.k.a. SWAR, "SIMD Within A
Register") algorithms for finding matching closed parentheses and the k-th far
closed parenthesis. Our algorithms work in time O(log w) on a word of w bits,
and contain no branches and no test instructions. On 64-bit (and wider)
architectures, these algorithms make it possible to avoid costly tabulations
while providing a very significant speedup over for-loop implementations.
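The broadword algorithms themselves are intricate register-level bit manipulations; as a point of reference, here is a minimal for-loop implementation of findclose, the matching-closed-parenthesis query they accelerate, over a parenthesis sequence packed into a machine word. This is the slow baseline the abstract alludes to, not the branch-free O(log w) algorithm.

```python
# For-loop reference implementation of findclose: given the position of
# an opening parenthesis, find its matching closing parenthesis within a
# word. Bits encode parentheses: 1 = '(' and 0 = ')', with the leftmost
# symbol stored in the least significant bit.
def findclose(word, i, w=64):
    """Position of the ')' matching the '(' at bit position i."""
    excess = 0
    for j in range(i, w):
        excess += 1 if (word >> j) & 1 else -1
        if excess == 0:
            return j
    return -1  # the match lies beyond this word

seq = "(()(()))"
word = sum(1 << i for i, c in enumerate(seq) if c == "(")
print(findclose(word, 0, len(seq)))  # 7: closes the outermost pair
print(findclose(word, 3, len(seq)))  # 6: closes the inner "(())" group
```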
Decision Forest: A Nonparametric Approach to Modeling Irrational Choice
Customer behavior is often assumed to follow weak rationality, which implies
that adding a product to an assortment will not increase the choice probability
of another product in that assortment. However, an increasing amount of
research has revealed that customers are not necessarily rational when making
decisions. In this paper, we propose a new nonparametric choice model that
relaxes this assumption and can model a wider range of customer behavior, such
as decoy effects between products. In this model, each customer type is
associated with a binary decision tree, which represents a decision process for
making a purchase based on checking for the existence of specific products in
the assortment. Together with a probability distribution over customer types,
we show that the resulting model -- a decision forest -- is able to represent
any customer choice model, including models that are inconsistent with weak
rationality. We theoretically characterize the depth of the forest needed to
fit a data set of historical assortments and prove that with high probability,
a forest whose depth scales logarithmically in the number of assortments is
sufficient to fit most data sets. We also propose two practical algorithms --
one based on column generation and one based on random sampling -- for
estimating such models from data. Using synthetic data and real transaction
data exhibiting non-rational behavior, we show that the model outperforms both
rational and non-rational benchmark models in out-of-sample predictive ability.
Comment: The paper is forthcoming in Management Science (accepted on July 25,
2021).
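A minimal sketch of the decision-forest choice model just described, using a hypothetical tree representation: internal nodes test whether a given product is in the assortment, leaves name the purchased option, and a distribution over trees yields choice probabilities. The example forest reproduces a decoy effect that violates weak rationality.

```python
# Each customer type is a binary tree: internal nodes ask "is product x
# offered?", leaves name the purchased option (0 = no-purchase). The
# classes and example trees are illustrative, not the paper's code.
from dataclasses import dataclass

@dataclass
class Leaf:
    choice: int                 # product purchased (0 = no-purchase)

@dataclass
class Node:
    product: int                # product whose presence is checked
    if_in: "Node | Leaf"        # branch taken when the product is offered
    if_out: "Node | Leaf"       # branch taken when it is not

def choose(tree, assortment):
    while isinstance(tree, Node):
        tree = tree.if_in if tree.product in assortment else tree.if_out
    return tree.choice

def choice_probs(forest, assortment):
    # forest: list of (probability, tree) pairs summing to 1 over types.
    probs = {}
    for weight, tree in forest:
        c = choose(tree, assortment)
        probs[c] = probs.get(c, 0.0) + weight
    return probs

# Type 1 buys product 2 only when "decoy" product 3 is also offered;
# type 2 always prefers product 1.
t1 = Node(3, if_in=Leaf(2), if_out=Leaf(0))
t2 = Node(1, if_in=Leaf(1), if_out=Leaf(0))
forest = [(0.4, t1), (0.6, t2)]
print(choice_probs(forest, {1, 2}))      # {0: 0.4, 1: 0.6}
print(choice_probs(forest, {1, 2, 3}))   # adding 3 raises P(choose 2)
```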
Simple and Efficient Fully-Functional Succinct Trees
The fully-functional succinct tree representation of Navarro and Sadakane
(ACM Transactions on Algorithms, 2014) supports a large number of operations in
constant time using 2n + o(n) bits. However, the full idea is hard to
implement. Only a simplified version with O(log n) operation time has been
implemented and shown to be practical and competitive. We describe a new
variant of the original idea that is much simpler to implement and has
worst-case time O(log log n) for the operations. An implementation based on
this version is experimentally shown to be superior to existing
implementations.
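For concreteness, a sketch of the balanced-parentheses representation these structures are built on, with a linear-time scan answering one representative query (subtree size). The succinct structures answer the same queries in (near-)constant time within 2n + o(n) bits; the scan below is only a semantic reference, not the paper's variant.

```python
# An ordinal tree encoded as balanced parentheses: '(' on entering a
# node in preorder, ')' on leaving. Every node is one "()" pair, so a
# tree of n nodes takes 2n bits.
def encode_bp(tree):
    # tree: (label, [children]) -> balanced-parentheses string
    label, children = tree
    return "(" + "".join(encode_bp(c) for c in children) + ")"

def find_close(bp, i):
    # O(n) excess scan; the succinct structures replace this loop.
    excess = 0
    for j in range(i, len(bp)):
        excess += 1 if bp[j] == "(" else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

def subtree_size(bp, i):
    # A node is identified by the position of its '('.
    return (find_close(bp, i) - i + 1) // 2

tree = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
bp = encode_bp(tree)                    # "((()())())"
print(bp, subtree_size(bp, 0), subtree_size(bp, 1))  # 5 nodes, 3 under b
```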
Multiclass Learning Approaches: A Theoretical Comparison with Implications
We theoretically analyze and compare the following five popular multiclass
classification methods: One vs. All, All Pairs, Tree-based classifiers, Error
Correcting Output Codes (ECOC) with randomly generated code matrices, and
Multiclass SVM. In the first four methods, the classification is based on a
reduction to binary classification. We consider the case where the binary
classifier comes from a class of VC dimension d, and in particular from the
class of halfspaces over ℝ^d. We analyze both the estimation error and
the approximation error of these methods. Our analysis reveals interesting
conclusions of practical relevance regarding the success of the different
approaches under various conditions. Our proof technique employs tools from VC
theory to analyze the approximation error of hypothesis classes. This is
in sharp contrast to most, if not all, previous uses of VC theory, which only
deal with estimation error.
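As a concrete instance of the first reduction, here is a minimal One-vs-All sketch over halfspaces, using regularized least squares as a stand-in for an arbitrary binary learner; the data and all names are synthetic illustrations, not the paper's experimental setup.

```python
# One-vs-All: train one binary halfspace per class (class c vs. the
# rest) and predict with the largest-margin halfspace.
import numpy as np

def train_one_vs_all(X, y, n_classes, reg=1e-3):
    Xb = np.hstack([X, np.ones((len(X), 1))])     # add bias term
    W = np.zeros((n_classes, Xb.shape[1]))
    A = Xb.T @ Xb + reg * np.eye(Xb.shape[1])
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)           # class c vs. the rest
        W[c] = np.linalg.solve(A, Xb.T @ t)       # least-squares halfspace
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xb @ W.T, axis=1)            # most confident class

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [4, 0], [2, 4]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)
W = train_one_vs_all(X, y, 3)
print("train accuracy:", (predict(W, X) == y).mean())
```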
Prediction of protein-protein interactions using one-class classification methods and integrating diverse data
This research addresses the problem of prediction of protein-protein interactions (PPI)
when integrating diverse kinds of biological information. This task has been commonly
viewed as a binary classification problem (whether any two proteins do or do not interact)
and several different machine learning techniques have been employed to solve this
task. However, the nature of the data creates two major problems that can
affect results. First, the classes are imbalanced: the number of positive
examples (pairs of proteins that really interact) is much smaller than the
number of negative ones. Second, the selection of negative examples can rest
on unreliable assumptions, which can bias the classification results.
Here we propose the use of one-class classification (OCC) methods to deal with the task of
prediction of PPI. OCC methods utilise examples of just one class to generate
a predictive model that is consequently independent of the kind of negative
examples selected; such approaches are also known to cope well with imbalanced
class problems. We have
designed and carried out a performance evaluation study of several OCC methods for this
task, and have found that the Parzen density estimation approach outperforms the rest. We
also undertook a comparative performance evaluation between the Parzen OCC method
and several conventional learning techniques, considering different scenarios, for example
varying the number of negative examples used for training purposes. We found that the
Parzen OCC method in general performs competitively with traditional approaches and in
many situations outperforms them. Finally we evaluated the ability of the Parzen OCC
approach to predict new potential PPI targets, and validated these results by searching for
biological evidence in the literature.
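A minimal sketch of a Parzen-style one-class classifier of the kind the study evaluates, assuming scikit-learn's KernelDensity as the density estimator and synthetic feature vectors in place of the integrated biological data: fit a kernel density estimate on positive pairs only and accept a query pair when its density clears a threshold chosen from the training scores alone.

```python
# Parzen OCC: model only the positive (interacting) class with a kernel
# density estimate; no negative examples are needed for training.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
pos = rng.normal(loc=0.0, scale=1.0, size=(300, 5))   # interacting pairs
query = np.vstack([rng.normal(0, 1, (5, 5)),          # near the class
                   rng.normal(6, 1, (5, 5))])         # far from the class

kde = KernelDensity(kernel="gaussian", bandwidth=0.8).fit(pos)

# Accept a query if its log-density exceeds the 5th percentile of the
# training log-densities (i.e., reject only the most atypical 5%).
threshold = np.quantile(kde.score_samples(pos), 0.05)
predicted_interacting = kde.score_samples(query) >= threshold
print(predicted_interacting)
```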
Hellinger Distance Trees for Imbalanced Streams
Classifiers trained on data sets possessing an imbalanced class distribution
are known to exhibit poor generalisation performance. This is known as the
imbalanced learning problem. The problem becomes particularly acute when we
consider incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare class identification. As
accuracy may provide a misleading impression of performance on imbalanced data,
existing stream classifiers based on accuracy can suffer poor minority class
performance on imbalanced streams, with the result being low minority class
recall rates. In this paper we address this deficiency by proposing the use of
the Hellinger distance measure as a very fast decision-tree split criterion.
We demonstrate that using the Hellinger distance achieves a statistically
significant improvement in recall rates on imbalanced data streams, with an
acceptable increase in the false positive rate.
Comment: 6 pages, 2 figures, to be published in Proceedings of the 22nd
International Conference on Pattern Recognition (ICPR) 2014.
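A sketch of the split criterion itself for the two-class, binary-split case: the Hellinger distance between the two classes' distributions across the branches. Because it compares within-class proportions rather than raw counts, the criterion is insensitive to class skew; the counts below are made up for illustration.

```python
# Hellinger distance of a candidate binary split. Each class's counts
# are normalized by its own total, so the class priors cancel out.
import math

def hellinger_split(pos_left, pos_right, neg_left, neg_right):
    P = pos_left + pos_right              # minority-class total
    N = neg_left + neg_right              # majority-class total
    return math.sqrt(
        (math.sqrt(pos_left / P) - math.sqrt(neg_left / N)) ** 2
        + (math.sqrt(pos_right / P) - math.sqrt(neg_right / N)) ** 2
    )

# A split separating the classes well scores near sqrt(2) ~ 1.414;
# a useless split scores near 0, regardless of the 1:100 imbalance.
print(hellinger_split(pos_left=9, pos_right=1, neg_left=10, neg_right=990))
print(hellinger_split(pos_left=5, pos_right=5, neg_left=500, neg_right=500))
```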
Bagging and boosting classification trees to predict churn.
Keywords: Bagging; Boosting; Classification; Churn.