
    How Part-of-Speech Tags Affect Text Retrieval and Filtering Performance

    Natural language processing (NLP) applied to information retrieval (IR) and filtering problems may assign part-of-speech tags to terms and, more generally, modify queries and documents. Analytic models can predict the performance of a text filtering system as it incorporates changes suggested by NLP, allowing us to make precise statements about the average effect of NLP operations on IR. Here we provide a model of retrieval and tagging that lets us both compute the performance change due to syntactic parsing and understand which factors affect performance and how. In addition to a prediction of performance with tags, upper and lower bounds for retrieval performance are derived, giving the best and worst effects of including part-of-speech tags. Empirical grounds for selecting sets of tags are considered.
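
    As a rough illustration of how a filtering system might act on part-of-speech information, the toy Python sketch below tags query terms with a hypothetical dictionary-based tagger and scores documents only on terms carrying allowed tags. The tagger, tag set, and scoring rule are illustrative assumptions, not the analytic model described in the abstract.

```python
# Toy illustration (not the paper's analytic model): restrict term matching
# to particular part-of-speech tags, as one NLP-modified filtering step.
# TOY_TAGS and filter_score are hypothetical stand-ins for a real tagger/system.

TOY_TAGS = {
    "retrieval": "NN", "filtering": "NN", "system": "NN", "fast": "JJ",
    "parse": "VB", "parsing": "NN", "models": "NNS", "predict": "VB",
}

def tag(term: str) -> str:
    """Return a (toy) part-of-speech tag for a term; default to noun."""
    return TOY_TAGS.get(term.lower(), "NN")

def filter_score(query, document, allowed_tags=frozenset({"NN", "NNS"})):
    """Count query terms that appear in the document and carry an allowed tag."""
    doc_terms = {t.lower() for t in document.split()}
    return sum(
        1 for q in query.split()
        if q.lower() in doc_terms and tag(q) in allowed_tags
    )

if __name__ == "__main__":
    q = "fast retrieval models"
    d = "analytic models predict retrieval and filtering performance"
    print(filter_score(q, d))                    # nouns only -> 2 ('retrieval', 'models')
    print(filter_score(q, d, frozenset({"JJ"}))) # adjectives only -> 0 ('fast' not in document)
```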

    The Fence Complexity of Persistent Sets

    We study the psync complexity of concurrent sets in the non-volatile shared memory model. Flush instructions are used to force shared state to be written back to non-volatile memory, and they must typically be accompanied by expensive fence instructions that enforce ordering among such flushes. Collectively, we refer to a flush and a fence as a psync. The safety property of strict linearizability forces crashed operations to take effect before the crash or not take effect at all; the weaker property of durable linearizability enforces this requirement only for operations that have completed prior to the crash event. We consider lock-free implementations of list-based sets and prove two lower bounds. We prove that for any durably linearizable lock-free set there must exist an execution in which some process performs at least one redundant psync as part of an update operation. We introduce an extension of strict linearizability specialized for persistent sets that we call strict limited effect (SLE) linearizability. SLE linearizability explicitly ensures that operations do not take effect after a crash, which better reflects the original intentions of strict linearizability. We show that it is impossible to implement SLE linearizable lock-free sets in which read-only (or search) operations neither flush nor fence. We undertake an empirical study of persistent sets that examines various algorithmic design techniques and the impact of flush instructions in practice. We present concurrent set algorithms that provide matching upper bounds and rigorously evaluate them against existing persistent sets to expose the impact of algorithmic design and safety properties on psync complexity in practice, as well as the cost of recovering the data structure following a system crash.
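
    The sketch below is not one of the paper's algorithms; it merely models flushes and fences as counters attached to a persistent sorted list so that psyncs per operation can be counted. Names such as PsyncModel and PersistentList are invented for illustration, the single-threaded list stands in for a real lock-free one, and real code would issue cache-line write-backs (e.g. CLWB) and store fences instead of incrementing counters.

```python
# Hedged sketch: count flushes and fences ("psyncs") issued by updates to a
# persistent sorted linked list.  All names and the cost accounting are
# illustrative assumptions, not the paper's algorithms.

class PsyncModel:
    def __init__(self):
        self.flushes = 0
        self.fences = 0

    def flush(self, obj):        # modelled write-back of 'obj' to persistence
        self.flushes += 1

    def fence(self):             # modelled ordering fence for preceding flushes
        self.fences += 1

    def psync(self, obj):        # one flush followed by one fence = one psync
        self.flush(obj)
        self.fence()

class Node:
    def __init__(self, key, nxt=None):
        self.key, self.nxt = key, nxt

class PersistentList:
    """Sorted singly linked list; every successful update pays psyncs."""
    def __init__(self, model):
        self.model = model
        self.head = Node(float("-inf"), Node(float("inf")))

    def insert(self, key):
        pred, curr = self.head, self.head.nxt
        while curr.key < key:
            pred, curr = curr, curr.nxt
        if curr.key == key:
            return False                 # duplicate: this read-only path issues no psync
        node = Node(key, curr)
        self.model.psync(node)           # persist the new node ...
        pred.nxt = node
        self.model.psync(pred)           # ... then persist the link pointing to it
        return True

model = PsyncModel()
s = PersistentList(model)
for k in (3, 1, 2):
    s.insert(k)
print(model.flushes, model.fences)       # 6 6: two psyncs per successful insert
```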

    Generic Proofs of Consensus Numbers for Abstract Data Types

    The power of shared data types to solve consensus in asynchronous wait-free systems is a fundamental question in distributed computing, but it has largely been considered only for specific data types. We consider general classes of abstract shared data types and classify the operations on those data types by the knowledge about past operations that processes can extract from the state of the shared object. We prove upper and lower bounds on the number of processes which can use data types in these classes to solve consensus. Our results generalize the consensus numbers known for a wide variety of specific shared data types, such as compare-and-swap, augmented queues and stacks, registers, and cyclic queues. Further, since the classification is based directly on the semantics of operations, one can use the bounds we present to determine the consensus number of a new data type from its specification. We show that, using sets of operations which can detect the first change to the shared object state, or even one at a fixed distance from the beginning of the execution, any number of processes can solve consensus. However, if operations can only detect one of the most recent changes rather than one of the first, then fewer processes can solve consensus. In general, if each operation can either change the shared state or read it, but not both, then the number of processes which can solve consensus is limited by the number of consecutive recent operations which can be viewed by a single operation. Allowing operations that both change and read the shared state can allow consensus algorithms with more processes, but if such operations can only see one change a fixed number of operations in the past, then the number of processes which can solve consensus is bounded by a small constant.
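
    For intuition about why operations that can observe the first change to the shared state are universal, a standard construction (not taken from this paper) solves consensus for any number of processes with a single compare-and-swap object. The hedged Python sketch below simulates that construction; a lock stands in for the atomicity of hardware CAS, and the names are illustrative.

```python
# Hedged sketch: wait-free consensus from a compare-and-swap object -- the
# classic construction behind "detect the first change => unbounded consensus
# number".  Python has no hardware CAS, so a lock simulates its atomicity; the
# algorithm itself never waits on other processes' proposals.

import threading

class CASRegister:
    """Single compare-and-swap register, initially holding None."""
    def __init__(self):
        self._value = None
        self._lock = threading.Lock()    # simulates the atomicity of CAS

    def compare_and_swap(self, expected, new):
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old                   # the first writer's value stays visible forever

def consensus(register, my_proposal):
    """Every process decides the value of whoever changed the register first."""
    old = register.compare_and_swap(None, my_proposal)
    return my_proposal if old is None else old

if __name__ == "__main__":
    reg = CASRegister()
    decisions = [None] * 8
    def worker(i):
        decisions[i] = consensus(reg, f"proposal-{i}")
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    assert len(set(decisions)) == 1      # agreement: all processes decide the same value
    print(decisions[0])
```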

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translate into more accurate search, learning, and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions and leverage the theoretical analysis to develop a fast algorithm that obtains an exact solution to the problem. The suggested solution provides the tightest estimation of the L2-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-means clustering, can operate more accurately using the proposed compression and distance-reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic, as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
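
    The sketch below illustrates the problem setting only, with a deliberately looser bound than the paper's optimal one: each sequence keeps its largest orthonormal DFT coefficients plus its residual energy, and the contribution of the unshared "tail" coefficients is bounded with the triangle inequality rather than by solving the paper's optimization problem. The function names and the coefficient budget are illustrative assumptions.

```python
# Hedged sketch: lower/upper bounds on Euclidean distance between two sequences
# compressed with *different* sets of orthonormal DFT coefficients.  This is a
# simple triangle-inequality bound, not the paper's tightest (optimal) solution.

import numpy as np

def compress(x, k):
    """Keep the k largest-magnitude orthonormal DFT coefficients of x."""
    X = np.fft.fft(x, norm="ortho")           # orthonormal transform: distances preserved
    keep = np.argsort(np.abs(X))[-k:]
    coeffs = {i: X[i] for i in keep}
    residual = np.sum(np.abs(X) ** 2) - sum(np.abs(c) ** 2 for c in coeffs.values())
    return coeffs, max(residual, 0.0)         # kept coefficients + discarded energy

def distance_bounds(cx, ex, cy, ey):
    """(lower, upper) bounds on ||x - y|| from two different coefficient sets."""
    common = cx.keys() & cy.keys()
    exact = sum(np.abs(cx[i] - cy[i]) ** 2 for i in common)   # shared indices: exact term
    # Tail energies: all coefficient energy outside the shared index set.
    tail_x = ex + sum(np.abs(cx[i]) ** 2 for i in cx.keys() - common)
    tail_y = ey + sum(np.abs(cy[i]) ** 2 for i in cy.keys() - common)
    a, b = np.sqrt(tail_x), np.sqrt(tail_y)
    lower = np.sqrt(exact + (a - b) ** 2)     # reverse triangle inequality on the tails
    upper = np.sqrt(exact + (a + b) ** 2)     # triangle inequality on the tails
    return lower, upper

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.standard_normal(256)
y = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal(256)

cx, ex = compress(x, 8)
cy, ey = compress(y, 8)
lo, hi = distance_bounds(cx, ex, cy, ey)
print(f"true {np.linalg.norm(x - y):.3f}  lower {lo:.3f}  upper {hi:.3f}")
```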

    Efficient Representation and Encoding of Distributive Lattices

    This thesis presents two new representations of distributive lattices with an eye towards efficiency in both time and space. Distributive lattices are a well-known class of partially-ordered sets having two natural operations called meet and join. Improving on all previous results, we develop an efficient data structure for distributive lattices that supports meet and join operations in O(log n) time, where n is the size of the lattice. The structure occupies O(n log n) bits of space, which is as compact as any known data structure and within a logarithmic factor of the information-theoretic lower bound given by enumeration. The second representation is a bitstring encoding of a distributive lattice that uses approximately 1.26n bits. This is within a small constant factor of the best known upper and lower bounds for this problem. A lattice can be encoded or decoded in O(n log n) time.
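
    For readers unfamiliar with meet and join, the sketch below (not the thesis's data structure) uses Birkhoff's representation of a finite distributive lattice as the downsets of a poset of join-irreducibles, stored as bitmasks so that meet is bitwise AND and join is bitwise OR. It is the naive baseline against which an O(log n)-time, O(n log n)-bit structure would be compared; the poset and function names are illustrative.

```python
# Hedged sketch: a distributive lattice as downsets of a small poset, with each
# element stored as a bitmask.  Meet = intersection (AND), join = union (OR).
# This enumerative representation is a baseline, not the thesis's structure.

from itertools import product

def downsets(order, n):
    """All downward-closed subsets (as bitmasks) of a poset on {0..n-1}.

    `order` is a set of pairs (a, b) meaning a <= b in the poset.
    """
    closed = []
    for bits in range(1 << n):
        # b in the set implies a in the set, for every relation a <= b.
        if all(not (bits >> b & 1) or (bits >> a & 1) for a, b in order):
            closed.append(bits)
    return closed

def meet(x, y):
    return x & y          # intersection of downsets

def join(x, y):
    return x | y          # union of downsets

if __name__ == "__main__":
    # Poset with 0 <= 2 and 1 <= 2; its downsets form a 5-element distributive lattice.
    lattice = downsets({(0, 2), (1, 2)}, 3)
    print(len(lattice), "elements:", [bin(e) for e in lattice])
    # Distributivity check: x ∧ (y ∨ z) == (x ∧ y) ∨ (x ∧ z) for all triples.
    assert all(meet(x, join(y, z)) == join(meet(x, y), meet(x, z))
               for x, y, z in product(lattice, repeat=3))
```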

    Lower Bounds for Symbolic Computation on Graphs: Strongly Connected Components, Liveness, Safety, and Diameter

    A model of computation that is widely used in the formal analysis of reactive systems is symbolic algorithms. In this model, access to the input graph is restricted to symbolic operations, which are expensive in comparison to standard RAM operations. We give lower bounds on the number of symbolic operations for basic graph problems such as the computation of the strongly connected components and of the approximate diameter, as well as for fundamental problems in model checking such as safety, liveness, and co-liveness. Our lower bounds are linear in the number of vertices of the graph, even for constant-diameter graphs. For none of these problems were lower bounds on the number of symbolic operations known before. The lower bounds show an interesting separation of these problems from the reachability problem, which can be solved with O(D) symbolic operations, where D is the diameter of the graph. Additionally, we present an approximation algorithm for the graph diameter which requires Õ(n√D) symbolic steps to achieve a (1+ε)-approximation for any constant ε > 0. This compares to O(n·D) symbolic steps for the (naive) exact algorithm and O(D) symbolic steps for a 2-approximation. Finally, we give a refined analysis of the strongly connected components algorithm of Gentilini et al., showing that it uses an optimal number of symbolic steps, proportional to the sum of the diameters of the strongly connected components.
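
    To make the cost model concrete, the sketch below (not from the paper) treats one successor-image computation over a whole vertex set as one symbolic operation and counts how many such operations a plain forward-reachability computation uses, which is bounded by the graph's diameter. The SymbolicGraph class and edge-list encoding are illustrative assumptions; real symbolic algorithms operate on BDD-encoded transition relations.

```python
# Hedged sketch of the symbolic cost model: the algorithm may only touch the
# graph through set-level image (Post) operations, and we count those
# operations.  Forward reachability needs O(D) of them, the baseline against
# which the paper's linear lower bounds are contrasted.

class SymbolicGraph:
    def __init__(self, edges):
        self.edges = edges              # iterable of (u, v) pairs
        self.symbolic_ops = 0

    def post(self, s):
        """One symbolic operation: the set of successors of the vertex set s."""
        self.symbolic_ops += 1
        return {v for (u, v) in self.edges if u in s}

def reach(graph, sources):
    """Vertices reachable from `sources`; issues O(D) Post operations."""
    reached = set(sources)
    frontier = set(sources)
    while frontier:
        frontier = graph.post(frontier) - reached
        reached |= frontier
    return reached

if __name__ == "__main__":
    # A path 0 -> 1 -> 2 -> 3 plus a shortcut edge 0 -> 2.
    g = SymbolicGraph([(0, 1), (1, 2), (2, 3), (0, 2)])
    print(sorted(reach(g, {0})))        # [0, 1, 2, 3]
    print(g.symbolic_ops)               # 3 Post images: bounded by the graph's depth
```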