2,297 research outputs found

    TopSig: Topology Preserving Document Signatures

    Performance comparisons between file signatures and inverted files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as language and probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these have not so far been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from the theoretical perspective they position the file signatures model in the class of Vector Space retrieval models. Comment: 12 pages, 8 figures, CIKM 201
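
    To make the idea of a document signature concrete, the following is a minimal sketch of a generic random-projection signature compared by Hamming distance. It is an illustration under stated assumptions, not the TopSig implementation: the function names (`term_vector`, `signature`, `hamming`), the 64-bit width, the whitespace tokenizer, and the hash-seeded +1/-1 term vectors are all hypothetical choices made for this example.

```python
import hashlib
import random


def term_vector(term, bits=64):
    """Return a pseudo-random +1/-1 vector for a term (hypothetical scheme).

    The vector is seeded from a hash of the term, so the same term always
    maps to the same vector without storing a vocabulary.
    """
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.choice((-1, 1)) for _ in range(bits)]


def signature(text, bits=64):
    """Sum the term vectors of a document and keep only the signs as bits."""
    totals = [0] * bits
    for term in text.lower().split():  # assumption: naive whitespace tokens
        for i, sign in enumerate(term_vector(term, bits)):
            totals[i] += sign
    sig = 0
    for i, total in enumerate(totals):
        if total > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    query = signature("signature file retrieval")
    doc_a = signature("file signatures for text retrieval")
    doc_b = signature("big data storage formats")
    print(hamming(query, doc_a), hamming(query, doc_b))
```

    Ranking by Hamming distance between fixed-width bit strings is what places this kind of model alongside vector space retrieval: the signature is a compressed, binarized document vector.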

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity, and we show, with an extensive experimental analysis and comparison with several other state-of-the-art encoders, that the query processing speed of Variable-Byte is preserved. Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
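
    For readers unfamiliar with the baseline, here is a minimal sketch of plain Variable-Byte coding applied to the gaps of a sorted posting list. It does not implement the partitioned representation the abstract proposes. The byte layout (7 data bits per byte, least significant group first, high bit set on every byte except the last of a value) is one common convention and is an assumption here; the function names are hypothetical.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 data bits per byte.

    Assumed layout: the high bit is set on every byte except the last
    byte of each value (least significant group first).
    """
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while v >= 128:
            out.append((v & 0x7F) | 0x80)  # more bytes follow
            v >>= 7
        out.append(v)  # final byte of this value, high bit clear
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def to_gaps(doc_ids):
    """Turn a sorted posting list into differences between neighbours."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the posting list from its gaps by a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


if __name__ == "__main__":
    postings = [3, 7, 130, 131, 20000]
    encoded = vbyte_encode(to_gaps(postings))
    print(len(encoded), from_gaps(vbyte_decode(encoded)))
```

    Every value costs at least one whole byte, however small the gap. That is the space inefficiency on dense lists that the abstract refers to.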

    Dynamic Hilbert clustering based on convex set for web services aggregation

    In recent years, web services run by big corporations and application-specific data centers have been embraced by companies worldwide. Web services provide several benefits when compared to other communication technologies. However, they still suffer from congestion and bottlenecks, as well as significant delay, due to the tremendous load caused by a large number of web service requests from end users. Clustering and then aggregating similar web service messages as one compressed message can potentially reduce network traffic. This paper proposes dynamic Hilbert clustering as a new model for clustering web services based on convex set similarity. Mathematically, the suggested model computes the degree of similarity between Simple Object Access Protocol (SOAP) messages and then clusters them into groups with high similarity. Next, each cluster is aggregated as a compact message that is finally encoded with fixed-length or Huffman coding. Experimental results show that the suggested model performs better than conventional clustering techniques in terms of compression ratio. The suggested model has produced the best results, reaching up to 15 with fixed-length and up to 20 with Huffman
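
    The final encoding step mentioned above contrasts fixed-length and Huffman coding. The sketch below compares the two on a single string by counting output bits. It is a generic textbook illustration, not the paper's pipeline: the clustering and aggregation stages are omitted, the function names are hypothetical, and character-level symbols are an assumption.

```python
import heapq
from collections import Counter
from math import ceil, log2


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code over text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {symbol: 1 for symbol in freq}  # a lone symbol still needs 1 bit
    # Heap entries: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {symbol: 0}) for i, (symbol, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in left.items()}
        merged.update({s: depth + 1 for s, depth in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(text):
    """Return (fixed_length_bits, huffman_bits) needed to encode text."""
    if not text:
        return (0, 0)
    freq = Counter(text)
    fixed_width = max(1, ceil(log2(len(freq))))
    lengths = huffman_code_lengths(text)
    huffman_total = sum(freq[s] * lengths[s] for s in freq)
    return (len(text) * fixed_width, huffman_total)


if __name__ == "__main__":
    message = "<price>10</price><price>12</price>"
    print(encoded_bits(message))
```

    Huffman coding gains over fixed-length codes only when symbol frequencies are skewed, which is plausible for aggregated messages that repeat the same markup.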

    Enhanced web services performance by compression and similarity-based aggregation of SOAP traffic

    Many organizations around the world have adopted Web services, server farms hosted by large enterprises, and data centres for various applications. Web services offer several advantages over other communication technologies. However, they still have high latency and often suffer from congestion and bottlenecks due to the massive load generated by large numbers of end users issuing Web service requests. Simple Object Access Protocol (SOAP) is the basic Extensible Markup Language (XML) communication protocol of Web services and is widely used over the Internet. SOAP provides interoperability by establishing access among Web servers and clients from the same or different platforms. However, because of the verbosity of the XML format, the encoded messages are often larger than the actual payload, causing dense traffic over the network. This thesis proposes three innovative techniques capable of reducing the size of small, as well as very large, messages. Furthermore, new redundancy-aware SOAP Web message aggregation models (Binary-tree, Two-bit, and One-bit XML status trees) are proposed to enable Web servers to aggregate SOAP responses and send them back as one compact aggregated message, thereby reducing the required bandwidth and latency and improving the overall performance of Web services. Fractal as a mathematical model provides powerful self-similarity measurements for the fragments of regular and irregular geometric objects in their numeric representations. Fractal mathematical parameters are introduced to compute SOAP message similarities, applied to the numeric representation of SOAP messages. Furthermore, SOAP fractal similarities are developed to devise a new unsupervised auto-clustering technique. A fast fractal-similarity-based clustering technique is proposed with the aim of speeding up the computations for the selection of similar messages to be aggregated together in order to achieve greater reduction

    Review of Extreme Multilabel Classification

    Extreme multilabel classification, or XML, is an active area of interest in machine learning. Compared to traditional multilabel classification, the number of labels here is extremely large, hence the name. Classical one-versus-all classification will not scale in this case because of the large number of labels, and the same is true of other classifiers. Embedding the labels as well as the features into a smaller label space is an essential first step. Other issues include the existence of head and tail labels, where tail labels are labels that occur in a relatively small number of the given samples. The existence of tail labels creates issues during embedding. This area has invited the application of a wide range of approaches, including bit compression motivated by compressed sensing, tree-based embeddings, deep-learning-based latent space embeddings (including those using attention weights), linear-algebra-based embeddings such as SVD, clustering, and hashing, to name a few. The community has come up with a useful set of metrics to correctly identify the predictions for head or tail labels. Comment: 46 pages, 13 figures

    Approaches to creating anonymous patient database

    Health care providers, health plans and health care clearinghouses collect patient medical data derived from their normal operations every day. These patient data can greatly benefit the health care organization if data mining techniques are applied to them. However, individually identifiable patient information needs to be protected in accordance with the Health Insurance Portability and Accountability Act (HIPAA), and the quality of patient data also needs to be ensured in order for data mining tasks to achieve accurate results. This thesis describes a patient data transformation system which transforms patient data into high-quality, anonymous patient records that are suitable for data mining purposes. This document discusses the underlying technologies, the features implemented in the prototype, and the methodologies used in developing the software. The prototype emphasizes patient privacy and the quality of the patient data, as well as software scalability and portability. Preliminary experience of its use is presented. A performance analysis of the system's behavior has also been done

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution to a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the technology that fulfills its requirements. However, every big data application has different data characteristics, and thus the corresponding data fits into a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed