An examination of fast similarity search trees with gating
The emergence of complex data objects that must be indexed and queried in databases has created a need for access methods that are both generic and efficient. Traditional search algorithms that only check specified fields and keys are no longer effective. Tree-structured indexing techniques based on metric spaces are widely used to solve this problem. Unfortunately, these data structures can be slow, because computing the distance between two points in a metric space can be expensive. This thesis explores data structures for the evaluation of range queries in general metric spaces. The performance limitations of metric spaces are analyzed and opportunities for improvement are discussed. The thesis culminates with the introduction of the Fast Similarity Search Tree as a viable alternative to existing methodologies.
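To illustrate why distance computations dominate the cost of a metric range query, here is a minimal sketch of generic pivot-based filtering with the triangle inequality. It is not the Fast Similarity Search Tree from the thesis; the class name `PivotIndex`, the single-pivot choice, and the Euclidean metric are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any function satisfying the metric axioms would do."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PivotIndex:
    """Hypothetical single-pivot index (illustrative only)."""

    def __init__(self, points: Sequence[Point],
                 dist: Callable[[Point, Point], float] = euclidean) -> None:
        if not points:
            raise ValueError("need at least one point")
        self.dist = dist
        self.pivot = points[0]  # assumption: the first point serves as the pivot
        # Precompute d(x, pivot) once, at build time.
        self.entries = [(p, dist(p, self.pivot)) for p in points]
        self.distance_calls = 0  # distance evaluations used by the last query

    def range_query(self, query: Point, radius: float) -> List[Point]:
        """Return all indexed points within `radius` of `query`."""
        d_qp = self.dist(query, self.pivot)
        self.distance_calls = 1
        hits = []
        for point, d_xp in self.entries:
            # Triangle inequality: d(q, x) >= |d(q, p) - d(x, p)|.
            # If that lower bound exceeds the radius, x cannot match,
            # so the real distance is never computed.
            if abs(d_qp - d_xp) > radius:
                continue
            self.distance_calls += 1
            if self.dist(query, point) <= radius:
                hits.append(point)
        return hits


index = PivotIndex([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)])
result = index.range_query((0.5, 0.0), radius=1.0)
```

In this toy run, two of the four candidates are discarded using only the precomputed pivot distances, so the query evaluates the metric three times instead of five.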
Correlating neural and symbolic representations of language
Analysis methods which enable us to better understand the representations and
functioning of neural models of language are increasingly needed as deep
learning becomes the dominant approach in NLP. Here we present two methods
based on Representational Similarity Analysis (RSA) and Tree Kernels (TK) which
allow us to directly quantify how strongly the information encoded in neural
activation patterns corresponds to information represented by symbolic
structures such as syntax trees. We first validate our methods on the case of a
simple synthetic language for arithmetic expressions with clearly defined
syntax and semantics, and show that they exhibit the expected pattern of
results. We then apply our methods to correlate neural representations of
English sentences with their constituency parse trees. Comment: ACL 201
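The core idea of Representational Similarity Analysis can be sketched as follows: compute pairwise similarities among the same items in two representation spaces, then correlate the two lists of similarities. This is a minimal illustration, not the paper's implementation. The function names are hypothetical, cosine similarity is assumed for the neural side, and a Jaccard overlap of node labels stands in for a tree kernel.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def jaccard(a, b):
    """Set overlap; an assumed stand-in for a tree kernel over parse trees."""
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rsa_score(items_a, items_b, sim_a, sim_b):
    """Correlate the pairwise similarities of the same items in two spaces."""
    pairs = list(combinations(range(len(items_a)), 2))
    sims_a = [sim_a(items_a[i], items_a[j]) for i, j in pairs]
    sims_b = [sim_b(items_b[i], items_b[j]) for i, j in pairs]
    return pearson(sims_a, sims_b)


# Toy data (invented for illustration): three "sentences", each with an
# activation vector and a set of syntactic node labels.
vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels = [{"S", "NP"}, {"S", "VP"}, {"S", "NP", "VP"}]

self_score = rsa_score(vectors, vectors, cosine, cosine)
cross_score = rsa_score(vectors, labels, cosine, jaccard)
```

A score near 1 indicates that items which are close in one space are also close in the other; comparing a space with itself gives exactly that upper bound.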
Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata
Many social Web sites allow users to annotate the content with descriptive
metadata, such as tags, and more recently to organize content hierarchically.
These types of structured metadata provide valuable evidence for learning how a
community organizes knowledge. For instance, we can aggregate many personal
hierarchies into a common taxonomy, also known as a folksonomy, that will aid
users in visualizing and browsing social content, and also help them in
organizing their own content. However, learning from social metadata presents
several challenges, since it is sparse, shallow, ambiguous, noisy, and
inconsistent. We describe an approach to folksonomy learning based on
relational clustering, which exploits structured metadata contained in personal
hierarchies. Our approach clusters similar hierarchies using their structure
and tag statistics, then incrementally weaves them into a deeper, bushier tree.
We study folksonomy learning using social metadata extracted from the
photo-sharing site Flickr, and demonstrate that the proposed approach addresses
the challenges. Moreover, compared to previous work, the approach produces
larger, more accurate folksonomies, and in addition, scales better. Comment: 10 pages, To appear in the Proceedings of ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD) 201
Learning Word Representations with Hierarchical Sparse Coding
We propose a new method for learning word representations using hierarchical
regularization in sparse coding inspired by the linguistic study of word
meanings. We show an efficient learning algorithm based on stochastic proximal
methods that is significantly faster than previous approaches, making it
possible to perform hierarchical sparse coding on a corpus of billions of word
tokens. Experiments on various benchmark tasks (word similarity ranking,
analogies, sentence completion, and sentiment analysis) demonstrate that the
method outperforms or is competitive with state-of-the-art methods. Our word
representations are available at
http://www.ark.cs.cmu.edu/dyogatam/wordvecs/
Identifying Web Tables - Supporting a Neglected Type of Content on the Web
The abundance of data on the Internet facilitates the improvement of
extraction and processing tools. The trend toward open data publishing
encourages the adoption of structured formats like CSV and RDF. However, there
is still a plethora of unstructured data on the Web which we assume contains
semantics. For this reason, we propose an approach to derive semantics from web
tables, which are still the most popular publishing tool on the Web. The paper
also discusses methods and services for unstructured data extraction and
processing, as well as machine learning techniques to enhance such a workflow.
The eventual result is a framework to process, publish, and visualize linked
open data. The software enables table extraction from various open data
sources in HTML format and automatic export to RDF, making the
data linked. The paper also evaluates machine learning techniques
in conjunction with string similarity functions applied to a table
recognition task. Comment: 9 pages, 4 figure
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since XML became popular for data representation and exchange over the Web, the number of XML documents in circulation has increased rapidly. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm, PCXSS, that places heterogeneous XML documents into groups according to their structural and semantic similarity. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, avoiding the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.
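The progressive, document-to-cluster comparison described above can be sketched roughly as follows. This is not PCXSS or CPSim: the representation of a document as a set of element paths, the Jaccard similarity, the function name `progressive_cluster`, and the threshold value are all assumptions chosen to keep the example small.

```python
def progressive_cluster(docs, threshold=0.5):
    """Assign each document to its most similar existing cluster, or open a new one.

    `docs` is a list of sets of element paths (an assumed representation).
    Each document is compared only against cluster summaries, never against
    other individual documents.
    """
    clusters = []  # each cluster: {"summary": set of paths, "members": [indices]}
    for idx, doc in enumerate(docs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            union = doc | cluster["summary"]
            # Jaccard overlap between the document and the cluster summary.
            sim = len(doc & cluster["summary"]) / len(union) if union else 1.0
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["summary"] |= doc  # the summary grows as documents join
        else:
            clusters.append({"summary": set(doc), "members": [idx]})
    return [cluster["members"] for cluster in clusters]


# Toy documents (invented): two book-like and two order-like structures.
docs = [
    {"book/title", "book/author"},
    {"book/title", "book/author", "book/year"},
    {"order/id", "order/item"},
    {"order/id", "order/item", "order/qty"},
]
groups = progressive_cluster(docs, threshold=0.5)
```

With n documents and k clusters this performs on the order of n·k comparisons rather than n² pairwise ones, which is the efficiency argument the abstract makes.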
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic properties and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.