2 research outputs found

    Resource Description and Selection for Similarity Search in Metric Spaces: Problems and Problem-Solving Approaches

    With ever-increasing amounts of data and a growing diversity of data types across application contexts, there is a strong need for large-scale, flexible indexing and search techniques. Metric access methods (MAMs) provide this flexibility because they assume only that the dissimilarity between two data objects is modeled by a distance metric. Furthermore, scalable solutions can be built with the help of distributed MAMs. Both IF4MI and RS4MI, which are presented in this thesis, are metric access methods. IF4MI belongs to the group of centralized MAMs. It is based on an inverted file and thus offers a hybrid access method that provides text retrieval capabilities in addition to content-based search in arbitrary metric spaces. In contrast to IF4MI, RS4MI is a distributed MAM based on resource description and selection techniques. Here, data objects are physically distributed. However, RS4MI is by no means restricted to a certain type of distributed information retrieval system. Various application fields for the resource description and selection techniques are possible, for example in the context of visual analytics. Because only a metric space is assumed, possible application fields go far beyond content-based image retrieval, which serves as the example scenario here.
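    The metric-space assumption mentioned in the abstract can be illustrated with a minimal sketch. It is not IF4MI or RS4MI; it is a generic single-pivot range search that uses the triangle inequality to skip distance computations, which is the basic property metric access methods rely on. All names (`euclidean`, `build_pivot_table`, `range_search`) are hypothetical, and Euclidean distance is only a stand-in for an arbitrary metric.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Stand-in metric; any function satisfying the metric postulates works."""
    return math.dist(a, b)


def build_pivot_table(data: Sequence[Point], pivot: Point, dist: Metric) -> List[float]:
    """Precompute the distance from the pivot to every data object."""
    return [dist(pivot, obj) for obj in data]


def range_search(
    query: Point,
    radius: float,
    data: Sequence[Point],
    pivot: Point,
    pivot_dists: Sequence[float],
    dist: Metric,
) -> Tuple[List[Point], int]:
    """Return all objects within `radius` of `query`, plus the number of
    query-to-object distance computations that were actually needed."""
    d_qp = dist(query, pivot)
    results: List[Point] = []
    computed = 0
    for obj, d_po in zip(data, pivot_dists):
        # Triangle inequality gives the lower bound |d(q,p) - d(p,o)| <= d(q,o).
        # If the bound already exceeds the radius, obj cannot be a result.
        if abs(d_qp - d_po) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    pivot = (0.0, 0.0)
    table = build_pivot_table(data, pivot, euclidean)
    hits, computed = range_search((0.5, 0.0), 1.0, data, pivot, table, euclidean)
    print(hits, computed)
```

    Because the pruning step uses only the triangle inequality, the same code works unchanged for any data type for which a metric is supplied.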

    Towards a universal information distance for structured data

    The similarity of objects is one of the most fundamental concepts in any collection of complex information; similarity, along with techniques for storing and indexing sets of values based on it, is a concept of ever-increasing importance as inherently unordered data sets become ever more common. Examples of such data sets include collections of images, multimedia, and semi-structured data. There are, however, two largely separate classes of related research. On the one hand, techniques such as clustering and similarity search give general treatments over sets of data. Results are domain-independent, typically relying only on the existence of an anonymous distance metric over the set in question. On the other hand, results in the domain of similarity measurement are often limited to the context of pairwise comparison over individual objects, and are not typically set in a wider context. Published algorithms are scattered over various demand-led subject areas, including, for example, bioinformatics, library sciences, and crime detection. Few, if any, of the published algorithms satisfy the properties of a distance metric. We have identified a distance metric, Ensemble Distance, which we believe can help to bridge this gap. Ensemble Distance is a non-Euclidean distance metric which we believe can be used in the treatment of many classes of structured data. For any complex type where a useful characterisation exists in the form of an ensemble, we can produce a distance metric for that type. This will in turn allow use of the complex type within off-the-shelf clustering and similarity search algorithms; this would be a major result in the management of complex data sets.
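    The "distance metric properties" referred to above are non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. The sketch below is a hedged illustration of how a candidate distance function could be tested against them on a finite sample. It does not implement Ensemble Distance, which the abstract does not define; `check_metric_postulates` is a hypothetical helper, and a clean result on a sample can refute but never prove that a function is a metric.

```python
from itertools import product
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def check_metric_postulates(
    sample: Sequence[T],
    dist: Callable[[T, T], float],
    tol: float = 1e-9,
) -> List[str]:
    """Return the sorted names of the metric postulates that `dist`
    violates somewhere on `sample` (an empty list means none were found)."""
    violations = set()
    for x, y in product(sample, repeat=2):
        d_xy = dist(x, y)
        if d_xy < -tol:
            violations.add("non-negativity")
        # d(x, y) == 0 must hold exactly when x == y.
        if (x == y) != (abs(d_xy) <= tol):
            violations.add("identity")
        if abs(d_xy - dist(y, x)) > tol:
            violations.add("symmetry")
    for x, y, z in product(sample, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            violations.add("triangle inequality")
    return sorted(violations)


if __name__ == "__main__":
    numbers = [0, 1, 3, 7]
    # Absolute difference is a metric on the real numbers.
    print(check_metric_postulates(numbers, lambda a, b: abs(a - b)))
    # Squared difference is a common dissimilarity but not a metric.
    print(check_metric_postulates(numbers, lambda a, b: (a - b) ** 2))
```

    A function that fails such a check can still be useful for pairwise comparison, but it cannot safely be plugged into similarity search or clustering algorithms that prune using the triangle inequality.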