27,151 research outputs found
Non-hierarchical Structures: How to Model and Index Overlaps?
Overlap is a common phenomenon seen when structural components of a digital
object are neither disjoint nor nested inside each other. Overlapping
components resist reduction to a structural hierarchy, and tree-based indexing
and query processing techniques cannot be used for them. Our solution to this
data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a
novel extension of the XML data model for non-hierarchical structures. We
introduce an algorithm for constructing TGSA from annotated documents; the
algorithm can efficiently process non-hierarchical structures and is associated
with formal proofs, ensuring that transformation of the document to the data
model is valid. To enable high performance query analysis in large data
repositories, we further introduce an extension of XML pre-post indexing for
non-hierarchical structures, which can process both reachability and
overlapping relationships.Comment: The paper has been accepted at the Balisage 2014 conferenc
HIERARCHICAL CLUSTERING USING LEVEL SETS
Over the past several decades, clustering algorithms have earned their place as a go-to solution for database mining. This paper introduces a new concept which is used to develop a new recursive version of DBSCAN that can successfully perform hierarchical clustering, called Level- Set Clustering (LSC). A level-set is a subset of points of a data-set whose densities are greater than some threshold, ‘t’. By graphing the size of each level-set against its respective ‘t,’ indents are produced in the line graph which correspond to clusters in the data-set, as the points in a cluster have very similar densities. This new algorithm is able to produce the clustering result with the same O(n log n) time complexity as DBSCAN and OPTICS, while catching clusters the others missed
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.Comment: 49 page
Context and Keyword Extraction in Plain Text Using a Graph Representation
Document indexation is an essential task achieved by archivists or automatic
indexing tools. To retrieve relevant documents to a query, keywords describing
this document have to be carefully chosen. Archivists have to find out the
right topic of a document before starting to extract the keywords. For an
archivist indexing specialized documents, experience plays an important role.
But indexing documents on different topics is much harder. This article
proposes an innovative method for an indexing support system. This system takes
as input an ontology and a plain text document and provides as output
contextualized keywords of the document. The method has been evaluated by
exploiting Wikipedia's category links as a termino-ontological resources
A Graph Theoretic Approach for Object Shape Representation in Compositional Hierarchies Using a Hybrid Generative-Descriptive Model
A graph theoretic approach is proposed for object shape representation in a
hierarchical compositional architecture called Compositional Hierarchy of Parts
(CHOP). In the proposed approach, vocabulary learning is performed using a
hybrid generative-descriptive model. First, statistical relationships between
parts are learned using a Minimum Conditional Entropy Clustering algorithm.
Then, selection of descriptive parts is defined as a frequent subgraph
discovery problem, and solved using a Minimum Description Length (MDL)
principle. Finally, part compositions are constructed by compressing the
internal data representation with discovered substructures. Shape
representation and computational complexity properties of the proposed approach
and algorithms are examined using six benchmark two-dimensional shape image
datasets. Experiments show that CHOP can employ part shareability and indexing
mechanisms for fast inference of part compositions using learned shape
vocabularies. Additionally, CHOP provides better shape retrieval performance
than the state-of-the-art shape retrieval methods.Comment: Paper : 17 pages. 13th European Conference on Computer Vision (ECCV
2014), Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, pp
566-581. Supplementary material can be downloaded from
http://link.springer.com/content/esm/chp:10.1007/978-3-319-10578-9_37/file/MediaObjects/978-3-319-10578-9_37_MOESM1_ESM.pd
- …