Social Ranking Techniques for the Web
The proliferation of social media has the potential for changing the
structure and organization of the web. In the past, scientists have looked at
the web as a large connected component to understand how the topology of
hyperlinks correlates with the quality of the information a page contains, and
they have proposed techniques to rank the information in web pages. We argue
that information from web pages and network data on social relationships can be
combined to create a personalized and socially connected web. In this paper, we
look at the web as a composition of two networks, one consisting of information
in web pages and the other of personal data shared on social media web sites.
Together, they allow us to analyze how social media channels the flow of
information from person to person and how to use the structure of the social
network to rank, deliver, and organize information specifically for each
individual user. We validate our social ranking concepts through a ranking
experiment conducted on web pages that users shared on Google Buzz and Twitter.
Comment: 7 pages, ASONAM 201
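The personalization idea above can be sketched as a PageRank variant whose teleport (restart) vector is biased toward pages shared by a user's social contacts. Everything below, the web graph, the page names, and the shares, is an invented toy example, not the paper's system:

```python
# Hypothetical sketch of socially personalized PageRank: the teleport
# (restart) vector concentrates on pages shared by a user's contacts.
# Graph, page names, and shares are invented for illustration.

def personalized_pagerank(links, teleport, damping=0.85, iters=100):
    """links: page -> outgoing pages; teleport: page -> restart probability."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) * teleport.get(n, 0.0) for n in nodes}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling page: send its mass back through the teleport vector
                for n in nodes:
                    new[n] += damping * rank[page] * teleport.get(n, 0.0)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
shared = ["b", "d"]  # pages the user's contacts shared on social media
teleport = {p: 1.0 / len(shared) for p in shared}
rank = personalized_pagerank(web, teleport)
```

Pages endorsed by the social layer ("b" and "d") gain rank relative to a purely link-based ordering, which is the kind of per-user, socially biased ranking the abstract argues for.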
Optimizing XML Compression
The eXtensible Markup Language (XML) provides a powerful and flexible means
of encoding and exchanging data. As it turns out, its main advantage as an
encoding format (namely, its requirement that all open and close markup tags
be present and properly balanced) also yields one of its main disadvantages:
verbosity. XML-conscious compression techniques seek to overcome this drawback.
Many of these techniques first separate XML structure from the document
content, and then compress each independently. Further compression gains can be
realized by identifying and compressing together document content that is
highly similar, thereby amortizing the storage costs of auxiliary information
required by the chosen compression algorithm. Additionally, the proper choice
of compression algorithm is an important factor not only for the achievable
compression gain, but also for access performance. Hence, choosing a
compression configuration that optimizes compression gain requires one to
determine (1) a partitioning strategy for document content, and (2) the best
available compression algorithm to apply to each set within this partition. In
this paper, we show that finding an optimal compression configuration with
respect to compression gain is an NP-hard optimization problem. This problem
remains intractable even if one considers a single compression algorithm for
all content. We also describe an approximation algorithm for selecting a
partitioning strategy for document content based on the branch-and-bound
paradigm.
Comment: 16 pages, extended version of a paper accepted for XSym 200
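As a rough illustration of the configuration-selection problem (not the paper's branch-and-bound algorithm), one can try several off-the-shelf compressors on each candidate grouping of document content and keep the cheapest; the container contents below are invented:

```python
# Illustrative sketch only: pick, per content partition, the standard
# compressor minimizing output size, and compare compressing two highly
# similar containers separately versus together. Container data is invented.
import bz2
import lzma
import zlib

COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def best_compressor(blob):
    """Return (algorithm name, compressed size) minimizing size for blob."""
    sizes = {name: len(fn(blob)) for name, fn in COMPRESSORS.items()}
    return min(sizes.items(), key=lambda kv: kv[1])

containers = [
    b"price 10.99 currency USD " * 40,    # highly similar content ...
    b"price 11.49 currency USD " * 40,    # ... to this container
    b"lorem ipsum dolor sit amet " * 40,  # dissimilar content
]

# Grouping similar containers amortizes per-stream overhead and lets the
# compressor exploit cross-container redundancy.
separate = sum(best_compressor(c)[1] for c in containers[:2])
merged = best_compressor(containers[0] + containers[1])[1]
```

The exhaustive search over all partitions that this mimics is exactly what the abstract shows to be NP-hard, which motivates the approximation algorithm.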
Factorization in Formal Languages
We consider several novel aspects of unique factorization in formal
languages. We reprove the familiar fact that the set uf(L) of words having
unique factorization into elements of L is regular if L is regular, and from
this deduce a quadratic upper and lower bound on the length of the shortest
word not in uf(L). We observe that uf(L) need not be context-free if L is
context-free.
Next, we consider variations on unique factorization. We define a notion of
"semi-unique" factorization, where every factorization has the same number of
terms, and show that, if L is regular or even finite, the set of words having
such a factorization need not be context-free. Finally, we consider additional
variations, such as unique factorization "up to permutation" and "up to
subset".
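For a finite language L, membership in uf(L) can be checked directly by counting factorizations with dynamic programming; this toy sketch (with an invented L) makes the objects studied above concrete:

```python
# Toy sketch for a finite language L (invented here): count the factorizations
# of a word into elements of L; the word is in uf(L) iff the count is 1.

def count_factorizations(word, L):
    """Number of ways to write word as a concatenation of elements of L."""
    ways = [0] * (len(word) + 1)
    ways[0] = 1  # the empty word has exactly one (empty) factorization
    for i in range(1, len(word) + 1):
        for f in L:
            if len(f) <= i and word[i - len(f):i] == f:
                ways[i] += ways[i - len(f)]
    return ways[len(word)]

L = {"a", "ab", "ba"}
# "aba" factors as a|ba and ab|a, so it is not in uf(L);
# "aab" factors only as a|ab, so it is in uf(L).
```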
Patterns of Individual Shopping Behavior
Much of economic theory is built on observations of aggregate, rather than
individual, behavior. Here, we present novel findings on human shopping
patterns at the resolution of a single purchase. Our results suggest that much
of our seemingly elective activity is actually driven by simple routines. While
the interleaving of shopping events creates randomness at the small scale, on
the whole consumer behavior is largely predictable. We also examine
income-dependent differences in how people shop, and find that wealthy
individuals are more likely to bundle shopping trips. These results validate
previous work on mobility from cell phone data, while describing the
unpredictability of behavior at higher resolution.
Comment: 4 pages, 5 figures
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT also enables a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
Comment: (the name of the third co-author was inadvertently omitted from the
previous version)
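A minimal sketch of the RLBWT component on a toy string: compute the BWT by sorting rotations (fine for a demo; real indexes use linear-time suffix-array construction) and store it as character runs.

```python
# Minimal RLBWT sketch: Burrows-Wheeler transform via sorted rotations,
# stored as (character, run length) pairs. Demo-quality only.

def bwt(s):
    s += "$"  # unique sentinel, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(t):
    runs = []
    for ch in t:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

text = "banana" * 4  # repetitive input -> few BWT runs
runs = run_length_encode(bwt(text))
```

On repetitive text the number of runs grows much more slowly than the text itself, which is why the structures above can take space proportional to the run count rather than the text length.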
A New View on Worst-Case to Average-Case Reductions for NP Problems
We study the result by Bogdanov and Trevisan (FOCS, 2003), who show that
under reasonable assumptions, there is no non-adaptive worst-case to
average-case reduction that bases the average-case hardness of an NP-problem on
the worst-case complexity of an NP-complete problem. We replace the hiding and
the heavy samples protocol in [BT03] by employing the histogram verification
protocol of Haitner, Mahmoody and Xiao (CCC, 2010), which proves to be very
useful in this context. Once the histogram is verified, our hiding protocol is
directly public-coin, whereas the intuition behind the original protocol
inherently relies on private coins.
Dictionary-based methods for information extraction
In this paper, we present a general method for information extraction that
exploits the features of data compression techniques. We first define and
focus our attention on the so-called dictionary of a sequence. Dictionaries
are intrinsically interesting, and a study of their features can be of great
use in investigating the properties of the sequences from which they have been
extracted, e.g. DNA strings. We then describe a procedure for string
comparison between dictionary-created sequences (or artificial texts) that
gives very good results in several contexts. We finally present some results
on self-consistent classification problems.
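One way to make the idea concrete (a hedged sketch, not the paper's exact procedure) is to extract an LZ78-style parsing dictionary from each sequence and compare sequences by the overlap of their dictionaries; the sequences below are invented:

```python
# Hedged sketch: LZ78-style parsing dictionary of a sequence, with sequences
# compared by the Jaccard overlap of their dictionaries. Data is invented.

def lz78_dictionary(s):
    """Set of phrases produced by an LZ78-style incremental parse of s."""
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases

def dictionary_similarity(a, b):
    """Jaccard overlap of the parsing dictionaries of a and b."""
    da, db = lz78_dictionary(a), lz78_dictionary(b)
    return len(da & db) / len(da | db)

x = "acgtacgtacgt" * 3  # two sequences from a similar "source" ...
y = "acgtacgtacga" * 3  # ... share many dictionary phrases
z = "ttttggggcccc" * 3  # a dissimilar sequence shares few
```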
Features of the Extension of a Statistical Measure of Complexity to Continuous Systems
We discuss some aspects of the extension to continuous systems of a
statistical measure of complexity introduced by Lopez-Ruiz, Mancini and Calbet
(LMC) [Phys. Lett. A 209 (1995) 321]. In general, the extension of a quantity
from the discrete to the continuous case is not a trivial process and requires
some choices. In the present study, several possibilities appear available. One
of them is examined in detail. Some interesting properties desirable for any
measure of complexity are found to hold for this particular extension.
Comment: 22 pages, 0 figures
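For reference, the discrete LMC measure whose continuous extension is discussed above is the product C = H · D of the normalized Shannon entropy H and the disequilibrium D, the squared distance from the equiprobable distribution:

```python
# Discrete LMC statistical complexity: C = H * D, normalized Shannon
# entropy times the disequilibrium (distance from equiprobability).
import math

def lmc_complexity(p):
    """LMC statistical complexity of a discrete distribution p."""
    n = len(p)
    h = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)  # in [0, 1]
    d = sum((pi - 1.0 / n) ** 2 for pi in p)  # distance from equiprobability
    return h * d

# Both extremes count as "simple": perfect order (H = 0) and perfect
# disorder (D = 0) give C = 0; intermediate distributions give C > 0.
```

It is precisely the 1/N reference point inside D that makes the continuous-limit extension non-trivial and forces the choices the abstract mentions.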