Experiments in terabyte searching, genomic retrieval and novelty detection for TREC 2004
In TREC2004, Dublin City University took part in three tracks: Terabyte (in collaboration with University College Dublin), Genomic, and Novelty. In this paper we discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we developed in the last year to support our experiments in large-scale, distributed information retrieval, and which underlies all of the track experiments described in this document.
Estimating an NBA player's impact on his team's chances of winning
Traditional NBA player evaluation metrics are based on scoring differential
or some pace-adjusted linear combination of box score statistics like points,
rebounds, assists, etc. These measures treat performances with the outcome of
the game still in question (e.g. tie score with five minutes left) in exactly
the same way as they treat performances with the outcome virtually decided
(e.g. when one team leads by 30 points with one minute left). Because they
ignore the context in which players perform, these measures can result in
misleading estimates of how players help their teams win. We instead use a win
probability framework for evaluating the impact NBA players have on their
teams' chances of winning. We propose a Bayesian linear regression model to
estimate an individual player's impact, after controlling for the other players
on the court. We introduce several posterior summaries to derive rank-orderings
of players within their team and across the league. This allows us to identify
highly paid players with low impact relative to their teammates, as well as
players whose high impact is not captured by existing metrics.
Comment: To appear in the Journal of Quantitative Analysis of Sports
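The win-probability framework above can be illustrated with a minimal Bayesian linear regression sketch. Everything concrete here is a hypothetical assumption for illustration (the conjugate Gaussian prior, the variance values, the toy stint data), not the paper's actual model or data: each row of X encodes which players are on the court (+1 home, -1 away), and y is the change in home-team win probability over that stint.

```python
import numpy as np

def bayesian_lr_posterior(X, y, prior_var=1.0, noise_var=0.25):
    """Posterior mean and covariance for w in y = Xw + eps,
    with prior w ~ N(0, prior_var * I) and noise eps ~ N(0, noise_var).
    Conjugacy gives the posterior in closed form."""
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

# Toy data (hypothetical): 3 players observed over 4 stints.
# Columns are player indicators; y is the win-probability change.
X = np.array([[ 1, -1,  0],
              [ 1,  0, -1],
              [-1,  1,  0],
              [ 0,  1, -1]], dtype=float)
y = np.array([0.10, 0.05, -0.08, 0.02])

mean, cov = bayesian_lr_posterior(X, y)
ranking = np.argsort(-mean)  # players ordered by posterior mean impact
```

A posterior summary like `ranking` (or credible intervals from `cov`) is the kind of rank-ordering the abstract describes, here reduced to its simplest conjugate form.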
Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
Myriad of graph-based algorithms in machine learning and data mining require
parsing relational data iteratively. These algorithms are implemented in a
large-scale distributed environment in order to scale to massive data sets. To
accelerate these large-scale graph-based iterative computations, we propose
delta-based accumulative iterative computation (DAIC). Different from
traditional iterative computations, which iteratively update the result based
on the result from the previous iteration, DAIC updates the result by
accumulating the "changes" between iterations. With DAIC, we can process only the
"changes" and skip negligible updates. Furthermore, we can perform DAIC
asynchronously to bypass the high-cost synchronous barriers in heterogeneous
distributed environments. Based on the DAIC model, we design and implement an
asynchronous graph processing framework, Maiter. We evaluate Maiter on a local
cluster as well as on the Amazon EC2 cloud. The results show that Maiter achieves
as much as 60x speedup over Hadoop and outperforms other state-of-the-art
frameworks.
Comment: ScienceCloud 2012, TKDE 201
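The DAIC pattern (propagate and accumulate only the inter-iteration "changes" instead of recomputing full results) can be sketched with a delta-based PageRank, a standard example of accumulative iterative computation. This is a synchronous, single-machine illustration of the idea, not Maiter's distributed, asynchronous implementation; the damping factor and tolerance are the usual defaults, assumed here for illustration.

```python
def delta_pagerank(adj, damping=0.85, tol=1e-10):
    """Delta-based accumulative PageRank on an adjacency dict
    (node -> list of out-neighbors, no dangling nodes).
    Each round, the current deltas are accumulated into the
    result and only the propagated changes move to the next
    round -- the DAIC pattern."""
    n = len(adj)
    rank = {v: 0.0 for v in adj}
    delta = {v: (1 - damping) / n for v in adj}  # initial change
    while max(abs(d) for d in delta.values()) > tol:
        new_delta = {v: 0.0 for v in adj}
        for v, d in delta.items():
            rank[v] += d                     # accumulate the change
            share = damping * d / len(adj[v])
            for u in adj[v]:                 # propagate only the change
                new_delta[u] += share
        delta = new_delta
    return rank

# 3-node cycle: by symmetry every node should converge to rank 1/3
ranks = delta_pagerank({0: [1], 1: [2], 2: [0]})
```

Because deltas shrink geometrically (by the damping factor each round), most updates become negligible quickly, which is exactly what the abstract exploits: skip them, or apply them asynchronously without barriers.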
Size-Change Termination as a Contract
Termination is an important but undecidable program property, which has led
to a large body of work on static methods for conservatively predicting or
enforcing termination. One such method is the size-change termination approach
of Lee, Jones, and Ben-Amram, which operates in two phases: (1) abstract
programs into "size-change graphs," and (2) check these graphs for the
size-change property: the existence of paths that lead to infinite decreasing
sequences.
We transpose these two phases with an operational semantics that accounts for
the run-time enforcement of the size-change property, postponing (or entirely
avoiding) program abstraction. This choice has two key consequences: (1)
size-change termination can be checked at run-time and (2) termination can be
rephrased as a safety property analyzed using existing methods for systematic
abstraction.
We formulate run-time size-change checks as contracts in the style of Findler
and Felleisen. The result complements existing contracts that enforce partial
correctness specifications to obtain contracts for total correctness. Our
approach combines the robustness of the size-change principle for termination
with the precise information available at run-time. It has tunable overhead and
can check for nontermination without the conservativeness necessary in static
checking. To obtain a sound and computable termination analysis, we apply
existing abstract interpretation techniques directly to the operational
semantics, avoiding the need for custom abstractions for termination. The
resulting analyzer is competitive with existing, purpose-built analyzers.
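A run-time size-change check in the spirit described above can be sketched as a contract that forces every nested recursive call to strictly decrease a user-supplied natural-number measure of its arguments. The decorator below is an illustrative Python analogue of a Findler-Felleisen-style contract, not the authors' system; the `measure` parameter and error message are assumptions for the sketch.

```python
import functools

def size_change_contract(measure):
    """Contract enforcing a size-change condition at run time:
    each nested recursive call through the decorated function
    must strictly decrease `measure` (a well-founded natural-
    number measure of the arguments). A violation raises
    immediately instead of looping forever."""
    def wrap(f):
        stack = []  # measures of the calls currently active
        @functools.wraps(f)
        def guarded(*args):
            m = measure(*args)
            if stack and not (m < stack[-1]):
                raise RuntimeError(
                    f"size-change violation: measure {m} "
                    f"does not decrease below {stack[-1]}")
            stack.append(m)
            try:
                return f(*args)
            finally:
                stack.pop()
        return guarded
    return wrap

@size_change_contract(measure=lambda n: n)
def fact(n):
    # each recursive call decreases n, so the contract is satisfied
    return 1 if n == 0 else n * fact(n - 1)
```

A non-decreasing recursion (e.g. `f(n)` calling `f(n)`) is caught on its first nested call, which matches the abstract's point that run-time checking flags nontermination without static conservativeness.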
Searching for superspreaders of information in real-world social media
A number of predictors have been suggested to detect the most influential
spreaders of information in online social media across various domains such as
Twitter or Facebook. In particular, degree, PageRank, k-core and other
centralities have been adopted to rank the spreading capability of users in
information dissemination media. So far, validation of the proposed predictors
has been done by simulating the spreading dynamics rather than following real
information flow in social networks. Consequently, only model-dependent and
contradictory results about the best predictor have been obtained so far. Here,
we address this issue directly. We search for influential spreaders by
following the real spreading dynamics in a wide range of networks. We find that
the widely-used degree and PageRank fail in ranking users' influence. We find
that the best spreaders are consistently located in the k-core across
dissimilar social platforms such as Twitter, Facebook, Livejournal and
scientific publishing in the American Physical Society. Furthermore, when the
complete global network structure is unavailable, we find that the sum of the
nearest neighbors' degrees is a reliable local proxy for a user's influence. Our
analysis provides practical instructions for the optimal design of strategies for
"viral" information dissemination in relevant applications.
Comment: 12 pages, 7 figures
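Both quantities the abstract leans on, the k-core index and the sum of nearest neighbors' degrees, are easy to compute from an adjacency list. The sketch below uses plain dictionaries for an undirected graph; the function names and the peeling-based k-core routine are illustrative choices, not code from the paper.

```python
def neighbor_degree_sum(adj):
    """Local influence proxy from the abstract: for each node,
    the sum of its nearest neighbors' degrees. Needs only the
    node's neighborhood, not the global network structure."""
    deg = {v: len(ns) for v, ns in adj.items()}
    return {v: sum(deg[u] for u in ns) for v, ns in adj.items()}

def core_numbers(adj):
    """k-core index of each node, by repeatedly peeling nodes of
    degree < k for k = 1, 2, ...; a node's core number is the
    largest k for which it survives the peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    core, alive, k = {}, set(adj), 0
    while alive:
        k += 1
        changed = True
        while changed:
            peel = [v for v in alive if deg[v] < k]
            changed = bool(peel)
            for v in peel:
                core[v] = k - 1
                alive.discard(v)
                for u in adj[v]:
                    if u in alive:
                        deg[u] -= 1
    return core

# path 0-1-2-3 and a triangle as tiny examples
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

On the path, the interior nodes get the larger neighbor-degree sums; on the triangle, every node sits in the 2-core, the innermost shell where the abstract locates the best spreaders.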
Ranking to Learn: Feature Ranking and Selection via Eigenvector Centrality
In an era where accumulating data is easy and storing it inexpensive, feature
selection plays a central role in helping to reduce the high-dimensionality of
huge amounts of otherwise meaningless data. In this paper, we propose a
graph-based method for feature selection that ranks features by identifying the
most important ones within an arbitrary set of cues. Mapping the problem onto an
affinity graph, where features are the nodes, the solution is given by assessing
the importance of nodes through indicators of centrality, in particular
Eigenvector Centrality (EC). The gist of EC is to estimate the importance of a
feature as a function of the importance of its neighbors. Ranking central nodes
singles out candidate features that turn out to be effective from a
classification point of view, as shown in a thorough experimental section.
Our approach has been tested on 7 diverse datasets from recent literature
(e.g., biological data and object recognition, among others), and compared
against filter, embedded, and wrapper methods. The results are remarkable in
terms of accuracy, stability, and low execution time.
Comment: Preprint version - Lecture Notes in Computer Science - Springer 201
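The eigenvector-centrality ranking can be sketched with power iteration on a feature affinity graph. The affinity used here (absolute Pearson correlation between feature columns) is an assumption for illustration; the paper's actual graph-weighting kernel may differ.

```python
import numpy as np

def ec_feature_ranking(X, iters=200):
    """Rank the columns (features) of X by eigenvector centrality:
    build a feature-feature affinity matrix, then score each
    feature by its component in the dominant eigenvector, found
    by power iteration. Returns feature indices, most central
    first."""
    A = np.abs(np.corrcoef(X, rowvar=False))  # nodes = features
    np.fill_diagonal(A, 0.0)                  # no self-affinity
    v = np.ones(A.shape[1]) / np.sqrt(A.shape[1])
    for _ in range(iters):
        v = A @ v                             # propagate importance
        v /= np.linalg.norm(v)                # keep it normalized
    return np.argsort(-v)

# toy data: features 0 and 1 nearly collinear, feature 2 unrelated
X = np.column_stack([[1, 2, 3, 4, 5],
                     [2, 4, 6, 8, 10.1],
                     [1, -1, 1, -1, 1]])
ranking = ec_feature_ranking(X)
```

Because the affinity matrix is nonnegative and symmetric, power iteration converges to the Perron eigenvector, so a feature's score is exactly "a function of the importance of its neighbors" as the abstract puts it; the weakly connected third feature lands last.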