TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these have so far not been linked to general-purpose, signature-file-based
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state-of-the-art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings, and from a theoretical perspective they position the file
signatures model in the class of Vector Space retrieval models.
Comment: 12 pages, 8 figures, CIKM 201
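The link to semantic hashing can be illustrated with a minimal random-projection signature: each term contributes a weight-scaled pseudo-random pattern, and only the signs of the accumulated vector are kept, so nearby documents get nearby bit strings. This is a generic locality-preserving sketch under assumed details (CRC32-seeded projections, 64-bit signatures), not TopSig's exact construction:

```python
import numpy as np
from zlib import crc32

def signature(term_weights, bits=64):
    """Sum, per term, a weight-scaled pseudo-random pattern seeded by the
    term's CRC32, then keep only the signs: a random-projection bit
    signature (illustrative sketch, not TopSig's exact construction).
    `term_weights` maps term -> non-negative weight (e.g. TF-IDF)."""
    acc = np.zeros(bits)
    for term, w in term_weights.items():
        rng = np.random.default_rng(crc32(term.encode()))
        acc += w * rng.standard_normal(bits)
    return acc > 0  # boolean bit vector

def hamming(a, b):
    """Signature distance = number of differing bits."""
    return int(np.count_nonzero(a != b))
```

Because the per-term patterns are seeded by a hash of the term itself, two documents are signed consistently without sharing any global vocabulary table, which is what makes such signatures cheap to build at index time.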
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representations for integer sequences. However, its compression ratio is usually
not competitive with that of more sophisticated encoders, especially when the
integers to be compressed are small, which is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
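Plain (unpartitioned) Variable-Byte, which the partitioning scheme above builds on, can be sketched in a few lines. This follows one common convention, 7 payload bits per byte with the high bit marking a value's last byte; byte order and stop-bit placement vary between implementations:

```python
def vbyte_encode(nums):
    """Variable-Byte: 7 data bits per byte, least-significant group first;
    the high bit set marks the final byte of a value. Non-negative ints only."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, continuation byte
            n >>= 7
        out.append(n | 0x80)      # terminator byte has the high bit set
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode: accumulate 7-bit groups until a stop byte."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums
```

Small integers cost one byte each, which is why, as the abstract notes, partitioning an inverted list into blocks of small gaps is where the 2x space saving comes from.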
Dynamic Hilbert clustering based on convex set for web services aggregation
In recent years, web services run by big corporations and application-specific data centers have been embraced by companies worldwide. Web services provide several benefits compared to other communication technologies. However, they still suffer from congestion and bottlenecks, as well as significant delay, due to the tremendous load caused by a large number of web service requests from end users. Clustering and then aggregating similar web services as one compressed message can potentially reduce network traffic. This paper proposes dynamic Hilbert clustering, a new model for clustering web services based on convex-set similarity. Mathematically, the suggested models compute the degree of similarity between simple object access protocol (SOAP) messages and then cluster them into groups with high similarity. Next, each cluster is aggregated into a compact message that is finally encoded with fixed-length or Huffman coding. The experimental results show that the suggested model performs better than conventional clustering techniques in terms of compression ratio, producing the best results of up to 15 with fixed-length coding and up to 20 with Huffman coding.
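The cluster-then-aggregate pipeline can be sketched with stand-in choices: the paper's Hilbert/convex-set similarity is replaced here by a simple token-set Jaccard measure, and deflate stands in for the fixed-length/Huffman coders, so only the overall shape of the technique is illustrated:

```python
import zlib

def jaccard(a, b):
    """Token-set similarity in [0, 1]; a stand-in for the paper's
    convex-set similarity over numeric SOAP representations."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def greedy_cluster(msgs, threshold=0.5):
    """Greedily assign each message to the first cluster whose
    representative (first member) it resembles closely enough."""
    clusters = []
    for m in msgs:
        for c in clusters:
            if jaccard(m, c[0]) >= threshold:
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters

def aggregate_and_compress(cluster):
    """Similar messages share long substrings, so one compression stream
    over the concatenation beats compressing each message alone."""
    blob = "\x00".join(cluster).encode()
    return zlib.compress(blob)
```

The payoff is in `aggregate_and_compress`: for near-duplicate SOAP responses, the joint stream is far smaller than the sum of individually compressed messages, which is the traffic reduction the paper targets.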
Enhanced web services performance by compression and similarity-based aggregation of SOAP traffic
Many organizations around the world have adopted Web services, server farms hosted by large enterprises, and data centres for various applications. Web services offer several advantages over other communication technologies. However, they still have high latency and often suffer congestion and bottlenecks due to the massive load generated by large numbers of end users making Web service requests. Simple Object Access Protocol (SOAP) is the basic Extensible Markup Language (XML) communication protocol of Web services that is widely used over the Internet. SOAP provides interoperability by establishing access among Web servers and clients on the same or different platforms. However, the XML format is verbose, and its encoded messages are often larger than the actual payload, causing dense traffic over the network. This thesis proposes three innovative techniques capable of reducing small, as well as very large, messages. Furthermore, new redundancy-aware SOAP Web message aggregation models (Binary-tree, Two-bit, and One-bit XML status trees) are proposed to enable Web servers to aggregate SOAP responses and send them back as one compact aggregated message, thereby reducing the required bandwidth and latency and improving the overall performance of Web services. Fractals, as a mathematical model, provide powerful self-similarity measurements for the fragments of regular and irregular geometric objects in their numeric representations. Fractal mathematical parameters are introduced to compute SOAP message similarities, applied to the numeric representation of SOAP messages. Furthermore, SOAP fractal similarities are developed to devise a new unsupervised auto-clustering technique. A fast fractal-similarity-based clustering technique is proposed with the aim of speeding up the computations for the selection of similar messages to be aggregated together in order to achieve greater reduction
Review of Extreme Multilabel Classification
Extreme multilabel classification, or XML, is an active area of interest in
machine learning. Compared to traditional multilabel classification, the
number of labels is extremely large; hence the name extreme multilabel
classification. Classical one-versus-all classification won't scale in
this case due to the large number of labels, and the same is true for other
classifiers. Embedding the labels, as well as the features, into a smaller
space is an essential first step. Moreover, other issues include the
existence of head and tail labels, where tail labels are labels that occur
in relatively few of the given samples. The existence of tail labels creates
issues during embedding. This area has invited a wide range of approaches,
ranging from bit compression motivated by compressed sensing, tree-based
embeddings, and deep-learning-based latent-space embeddings (including the
use of attention weights), to linear-algebra-based embeddings such as SVD,
clustering, and hashing, to name a few. The community has come up with a
useful set of metrics to correctly evaluate predictions for head and tail
labels.
Comment: 46 pages, 13 figure
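One of the listed approaches, linear-algebra-based label embedding via SVD, can be sketched on toy data: the n x L binary label matrix is factored, regressors would then be trained against the k-dimensional targets, and predicted scores are mapped back to the full label space. Real XMLC systems use sparse randomized solvers; this is only the core idea:

```python
import numpy as np

def svd_label_embedding(Y, k):
    """Compress an (n samples x L labels) binary matrix into k dimensions
    via truncated SVD. Returns the k-dim training targets and the decoder
    that maps k-dim scores back to the L-dim label space."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Z = U[:, :k] * s[:k]   # k-dim targets, one row per sample
    decode = Vt[:k]        # (k x L) map back to label scores
    return Z, decode

# Toy example: 6 samples, 5 labels with two correlated label groups.
Y = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)
Z, decode = svd_label_embedding(Y, k=2)
Y_hat = Z @ decode  # reconstruction of label scores from 2 dimensions
```

The tail-label problem mentioned above shows up here directly: rare labels contribute little to the leading singular vectors, so a low-rank decoder tends to wash them out.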
Approaches to creating anonymous patient database
Health care providers, health plans, and health care clearinghouses collect patient medical data derived from their normal operations every day. These patient data can greatly benefit a health care organization if data mining techniques are applied to them. However, individually identifiable patient information needs to be protected in accordance with the Health Insurance Portability and Accountability Act (HIPAA), and the quality of the patient data also needs to be ensured in order for data mining tasks to achieve accurate results. This thesis describes a patient data transformation system which transforms patient data into high-quality, anonymous patient records suitable for data mining purposes. This document discusses the underlying technologies, the features implemented in the prototype, and the methodologies used in developing the software. The prototype emphasizes patient privacy and the quality of the patient data, as well as software scalability and portability. Preliminary experience of its use is presented. A performance analysis of the system's behavior has also been done
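A common shape for such a transformation, dropping direct identifiers and generalizing quasi-identifiers, can be sketched as follows. The field names and coarsening rules here are hypothetical illustrations in the spirit of HIPAA de-identification; the thesis's actual transformation rules are not given in the abstract:

```python
def anonymize(records, quasi_identifiers, generalize):
    """Drop direct identifiers (hypothetical 'name'/'ssn' fields) and
    coarsen quasi-identifier fields via the supplied `generalize` map
    (field name -> coarsening function), a typical de-identification step."""
    out = []
    for r in records:
        clean = {k: v for k, v in r.items() if k not in ("name", "ssn")}
        for f in quasi_identifiers:
            clean[f] = generalize[f](clean[f])
        out.append(clean)
    return out

patients = [{"name": "A", "ssn": "1", "zip": "26506", "age": 34, "dx": "flu"}]
rules = {
    "zip": lambda z: z[:3] + "**",                       # truncate ZIP
    "age": lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",  # bucket ages
}
anon = anonymize(patients, ["zip", "age"], rules)
# anon[0] == {"zip": "265**", "age": "30-39", "dx": "flu"}
```

The tension the thesis describes is visible even here: coarser buckets give stronger anonymity but reduce the data quality available to downstream mining.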
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real-world problem, big data systems are no exception
to this rule. As far as the storage aspect of any big data system is
concerned, the primary facet is the storage infrastructure, and NoSQL seems
to be the right technology to fulfill its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
a feature and use-case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph, and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects
are also discussed.