Footballonomics: The Anatomy of American Football; Evidence from 7 years of NFL game data
Do NFL teams make rational decisions? What factors potentially affect the
probability of winning a game in the NFL? How can a team come back from a
demoralizing interception? In this study we begin by examining the hypothesis
of rational coaching, that is, coaching decisions are always rational with
respect to the maximization of the expected points scored. We reject this
hypothesis by analyzing the decisions made in the past 7 NFL seasons for two
particular plays: (i) the Point(s) After Touchdown (PAT) and (ii) the
fourth-down decisions. Having rejected the rational coaching hypothesis, we move on to
examine how the detailed game data collected can potentially inform game-day
decisions. While NFL team personnel certainly have intuition about which
factors are crucial for winning a game, in this work we take a data-driven
approach and provide quantifiable evidence using a large dataset of NFL games
for the 7-year period between 2009 and 2015. In particular, we use a logistic
regression model to identify the impact and the corresponding statistical
significance of factors such as possession time, penalty yards, and the
balance between passing and rushing offense. Our results clearly imply that
avoiding turnovers is the best strategy for winning a game, but turnovers can
be overcome by keeping the offense on the field longer. Finally, we
combine our descriptive model with statistical bootstrap in order to provide a
prediction engine for upcoming NFL games. Our evaluations indicate that even by
only considering a small number of (straightforward) factors, we can achieve a
very good prediction accuracy. In particular, the average accuracy during
seasons 2014 and 2015 is approximately 63%. This performance is comparable to
the more complicated state-of-the-art prediction systems, while it outperforms
expert analysts 60% of the time.
Comment: Working study. The paper has been presented at the Machine Learning
and Data Mining for Sports Analytics 2016 workshop and accepted at PLOS ONE.
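The modeling pipeline described in the abstract (a logistic regression over game-level factors, combined with a statistical bootstrap for prediction) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' dataset: the feature set, the coefficients used to generate labels, and all function names are assumptions for the example.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, n_iter=2000):
    """Plain gradient-descent logistic regression (illustrative stand-in)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted win probability
        g = p - y                               # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def bootstrap_win_prob(X, y, x_new, n_boot=200, seed=0):
    """Average the predicted win probability over bootstrap refits."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample games with replacement
        w, b = fit_logreg(X[idx], y[idx], n_iter=500)
        probs.append(1.0 / (1.0 + np.exp(-(x_new @ w + b))))
    return float(np.mean(probs))

# Synthetic game-level features (home minus away differences); the generating
# coefficients below are made up for illustration.
rng = np.random.default_rng(42)
n = 600
turnover_diff = rng.normal(0, 1, n)     # turnovers committed minus forced
possession_diff = rng.normal(0, 1, n)   # possession-time difference
X = np.column_stack([turnover_diff, possession_diff])
y = (-1.5 * turnover_diff + 0.8 * possession_diff > 0).astype(float)

w, b = fit_logreg(X, y)
acc = ((X @ w + b > 0).astype(float) == y).mean()
```

On this synthetic data the fitted model recovers a negative weight on the turnover difference, mirroring the abstract's finding that avoiding turnovers matters most; the bootstrap averages predictions across refits to stabilize the game-day forecast.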
SamBaTen: Sampling-based Batch Incremental Tensor Decomposition
Tensor decompositions are invaluable tools in analyzing multimodal datasets.
In many real-world scenarios, such datasets are far from static; on the
contrary, they tend to grow over time. For instance, in an online social network
setting, as we observe new interactions over time, our dataset gets updated in
its "time" mode. How can we maintain a valid and accurate tensor decomposition
of such a dynamically evolving multimodal dataset, without having to re-compute
the entire decomposition after every single update? In this paper we introduce
SamBaTen, a Sampling-based Batch Incremental Tensor Decomposition algorithm,
which incrementally maintains the decomposition given new updates to the tensor
dataset. SamBaTen is able to scale to datasets that the state-of-the-art in
incremental tensor decomposition is unable to operate on, due to its ability to
effectively summarize the existing tensor and the incoming updates, and perform
all computations in the reduced summary space. We extensively evaluate SamBaTen
using synthetic and real datasets. Indicatively, SamBaTen achieves comparable
accuracy to state-of-the-art incremental and non-incremental techniques, while
being 25-30 times faster. Furthermore, SamBaTen scales to very large sparse and
dense dynamically evolving tensors of dimensions up to 100K x 100K x 100K where
state-of-the-art incremental approaches were not able to operate.
Ensemble Node Embeddings using Tensor Decomposition: A Case-Study on DeepWalk
Node embeddings have been attracting increasing attention during the past
years. In this context, we propose a new ensemble node embedding approach,
called TenSemble2Vec: we first generate multiple embeddings using existing
techniques and then feed them as multiview input to the state-of-the-art
tensor decomposition model PARAFAC2, which learns shared lower-dimensional
representations of the nodes. Contrary to other embedding
methods, our TenSemble2Vec takes advantage of the complementary information
from different methods or the same method with different hyper-parameters,
which bypasses the challenge of choosing models. Extensive tests using
real-world data validate the efficiency of the proposed method.
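The ensemble idea can be illustrated with a minimal sketch. PARAFAC2 itself handles views of different dimensionality and is more involved to implement; as a simplified, hedged stand-in, the code below fuses multiple same-width embedding views by normalizing, concatenating, and taking a truncated SVD to obtain a shared node representation. All names and sizes are made up for the example.

```python
import numpy as np

def ensemble_embed(views, dim):
    """Fuse multiple node-embedding matrices into one shared representation.

    views: list of (n_nodes, d) matrices, e.g. DeepWalk runs with different
    hyper-parameters. This SVD-based fusion is an illustrative simplification
    of the PARAFAC2 model used by TenSemble2Vec.
    """
    # z-score each view so that no single method dominates the fusion
    normed = [(V - V.mean(0)) / (V.std(0) + 1e-9) for V in views]
    stacked = np.concatenate(normed, axis=1)            # (n_nodes, total dims)
    U, S, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :dim] * S[:dim]                         # shared low-dim factors

# Two noisy "views" of the same two-community structure: complementary
# embeddings should reinforce the shared signal.
rng = np.random.default_rng(0)
base = np.repeat([[1.0], [-1.0]], 10, axis=0)           # 20 nodes, 2 communities
views = [base @ rng.standard_normal((1, 4))
         + 0.05 * rng.standard_normal((20, 4)) for _ in range(2)]
Z = ensemble_embed(views, dim=2)
```

In this toy setting the first shared component cleanly separates the two communities, which is the kind of complementary-information gain the abstract argues for.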
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a (possibly incomplete) bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can
extend the user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of
information of online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
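The core idea of projecting words, threads, and classes into one embedding space can be illustrated with a toy sketch: a thread is represented by an (here unweighted, for simplicity) average of its words' vectors, a class by the average of its seed keywords' vectors, and a thread is assigned to the most similar class. The word vectors and keyword sets below are made up for the example; REST's actual pipeline is multi-step and learns weights at each projection.

```python
import numpy as np

def avg_vec(words, word_vecs):
    """Average the vectors of in-vocabulary words (unweighted simplification
    of REST's weighted embedding step)."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return np.mean(vecs, axis=0)

def classify_thread(thread_words, class_keywords, word_vecs):
    """Assign the thread to the class whose keyword centroid is most similar."""
    t = avg_vec(thread_words, word_vecs)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(class_keywords,
               key=lambda c: cos(t, avg_vec(class_keywords[c], word_vecs)))

# Toy 2-d word vectors (standing in for embeddings trained on forum posts).
word_vecs = {
    "attack":  np.array([1.0, 0.0]),
    "exploit": np.array([0.9, 0.1]),
    "ddos":    np.array([1.0, 0.2]),
    "sell":    np.array([0.0, 1.0]),
    "market":  np.array([0.1, 1.0]),
    "service": np.array([0.0, 0.9]),
}
classes = {"alerts": ["attack", "exploit"], "services": ["sell", "market"]}
```

Out-of-vocabulary words (like a user's loosely chosen extra keywords) are simply skipped, which hints at why the approach degrades gracefully when the seed keyword set is incomplete.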