Efficient access methods for very large distributed graph databases
Subgraph searching is an essential problem in graph databases, but it is also challenging due to the NP-complete subgraph isomorphism sub-problem involved. Filter-Then-Verify (FTV) methods mitigate performance overheads by using an index to prune out graphs that cannot fit the query in a filtering stage, reducing the number of subgraph isomorphism evaluations in a subsequent verification stage. In real applications such as molecular substructure searching, subgraph searching has to be applied to very large databases (tens of millions of graphs). Previous surveys have identified the FTV solutions GraphGrepSX (GGSX) and CT-Index as the best ones for large databases (thousands of graphs); however, they cannot reach reasonable performance on very large ones (tens of millions of graphs). This paper proposes a generic approach for the distributed implementation of FTV solutions. In addition, three previous methods that improve the performance of GGSX and CT-Index are adapted for execution in clusters. The evaluation shows how the resulting solutions provide a great performance improvement (between 70% and 90% filtering time reduction) in a centralized configuration and how they may be used to achieve efficient subgraph searching over very large databases in cluster configurations. This work has been co-funded by the Ministerio de Economía y Competitividad of the Spanish government and by Mestrelab Research S.L. through the project NEXTCHROM (RTC-2015-3812-2) of the call Retos-Colaboración of the program Programa Estatal de Investigación, Desarrollo e Innovación Orientada a los Retos de la Sociedad. The authors wish to thank the financial support provided by Xunta de Galicia under the Project ED431B 2018/28S.
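The filtering stage of an FTV method can be sketched with a toy feature index: each database graph is summarised by a feature set, and a graph can only contain the query if its feature set covers the query's. The index design (vertex labels as features) and the data below are illustrative assumptions, not the actual GGSX or CT-Index structures.

```python
def build_index(db):
    """Map each graph id to its feature set (here: the vertex labels it contains)."""
    return {gid: {label for label, _ in graph} for gid, graph in db.items()}

def filter_candidates(index, query_features):
    """Filtering stage: prune graphs whose features cannot cover the query's."""
    return [gid for gid, feats in index.items() if query_features <= feats]

# db: graph id -> list of (vertex label, adjacency) pairs (structure elided here)
db = {
    "g1": [("C", []), ("O", []), ("N", [])],
    "g2": [("C", []), ("C", [])],
}
index = build_index(db)
candidates = filter_candidates(index, {"C", "O"})
# Only the surviving candidates go on to the expensive
# subgraph isomorphism verification stage.
print(candidates)  # ['g1']
```

In a distributed setting, the index (and hence the filtering work) can be partitioned across cluster nodes, which is what makes the filtering time the natural target for the reported 70% to 90% reduction.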
Zero-knowledge de novo algorithms for analyzing small molecules using mass spectrometry
In the analysis of mass spectra, if a superset of the molecules thought to be in a sample is known a priori, then there are well-established techniques for the identification of the molecules, such as database search and spectral libraries. Linear molecules are chains of subunits. For example, a peptide is a linear molecule with an “alphabet” of 20 possible amino acid subunits. A peptide of length six will have 20^6 = 64,000,000 different possible outcomes. Small molecules, such as sugars and metabolites, are not constrained to linear structures and may branch. These molecules are encoded as undirected graphs rather than simply linear chains. An undirected graph with six subunits (each of which has 20 possible outcomes) will have 20^6 · 2^(6 choose 2) = 2,097,152,000,000 possible outcomes. The vast number of complex graphs which small molecules can form can render databases and spectral libraries impossibly large to use, or incomplete, as many metabolites may still be unidentified. In the absence of a usable database or spectral library, an alphabet of subunits may be used to connect peaks in the fragmentation spectra; each connection represents a neutral loss of an alphabet mass. This technique is called “de novo sequencing” and relies on the alphabet being known in advance. Often the alphabet of m/z difference values allowed by de novo analysis is not known or is incomplete. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum from a collection. Once an alphabet is obtained, it is informative to find common substructures among the peaks connected by the alphabet. This is the same as finding the largest isomorphic subgraphs on the de novo graphs from all pairs of fragmentation spectra.
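The de novo graph construction described above can be sketched as follows: two peaks are connected whenever their m/z difference matches an alphabet mass within a tolerance. The peak values, alphabet masses, and tolerance below are invented for illustration.

```python
def de_novo_edges(peaks, alphabet, tol=0.01):
    """Connect peak pairs whose m/z difference matches an alphabet mass.

    Each returned edge represents a putative neutral loss of one subunit.
    """
    edges = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = abs(peaks[j] - peaks[i])
            if any(abs(diff - m) <= tol for m in alphabet):
                edges.append((peaks[i], peaks[j]))
    return edges

peaks = [100.0, 157.05, 257.05, 271.1]     # illustrative fragment m/z values
alphabet = [57.05, 100.0]                  # hypothetical subunit masses
print(de_novo_edges(peaks, alphabet))      # [(100.0, 157.05), (157.05, 257.05)]
```

The connected components of the resulting graph are the de novo graphs on which the common-substructure search then operates.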
This maximal subgraph isomorphism problem is a generalization of the subgraph isomorphism problem, which asks whether a graph G1 has a subgraph isomorphic to a graph G2. Subgraph isomorphism is NP-complete. A novel method of efficiently finding common substructures among the subspectra induced by the alphabet is proposed. This method is then combined with a novel form of hashing, eschewing evaluation of all pairs of fragmentation spectra. These methods are generalized to Euclidean graphs embedded in Z^n.
Efficient Similarity Search in Structured Data
Modern database applications are characterized by two major aspects: the use of complex data types
with internal structure and the need for new data analysis methods. The focus of database users has shifted from
simple queries to complex analyses of the data, known as knowledge discovery in databases. Important
tasks in this area are the grouping of data objects (clustering), the classification of new data objects or the
detection of exceptional data objects (outlier detection). Most algorithms for solving those problems are based on
similarity search in databases. This makes efficient similarity search in large databases of structured objects
an important basic operation for modern database applications.
In this thesis we develop efficient methods for similarity search in large databases of
structured data and improve the efficiency of existing query processing techniques.
For the data objects, only a tree or graph structure is assumed, which can be extended with arbitrary attribute
information.
Starting with an analysis of the demands of two example applications, image retrieval and protein docking, several
important requirements for similarity measures are identified. One aspect is the adaptability of the similarity search method to the
requirements of the user and the application domain. This can even imply a change of the similarity measure
between two successive queries of the same user. An explanation component that makes clear why objects are considered
similar by the system is a necessary precondition for a purposeful adaptation of the measure. Consequently,
the edit distance, well known from string processing, is a common similarity measure for graph-structured
objects. Its ability to visualize corresponding substructures and the possibility of weighting
single operations are the reasons for this popularity.
But it turns out that the edit distance and similar measures for tree structures are computationally extremely
expensive, which makes them unsuitable for today's large and still growing databases. Therefore, we develop a
multi-step query processing architecture which reduces the number of necessary distance calculations significantly.
This is achieved by employing suitable filter methods.
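The multi-step architecture can be illustrated with a minimal filter-and-refine range query. The two distance functions below are simple stand-ins chosen so that the filter is a provable lower bound of the exact measure (the pruning condition that makes the multi-step approach lossless); they are not the actual filters developed in the thesis.

```python
def filter_distance(a, b):
    """Cheap lower bound of the exact distance, e.g. difference in object size."""
    return abs(len(a) - len(b))

def exact_distance(a, b):
    """Expensive exact measure; a real system would use a tree or graph edit distance."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def range_query(db, query, eps):
    """Multi-step range query: filter first, verify only the survivors."""
    results = []
    for obj in db:
        if filter_distance(query, obj) > eps:
            continue                           # lower bound exceeds eps: safe to prune
        if exact_distance(query, obj) <= eps:  # expensive verification step
            results.append(obj)
    return results

db = ["abcd", "abce", "xyz", "abcdefgh"]
print(range_query(db, "abcd", 1))  # ['abcd', 'abce']
```

Because the filter never exceeds the exact distance, no true result is lost; only the number of expensive exact computations shrinks.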
Furthermore, we show that by easing certain restrictions
on the similarity measure, a significant performance gain can be obtained without reducing the quality of the
measure. To achieve this, matchings of substructures (vertices or edges) of the data objects are determined.
An additional cost function for those matchings makes it possible to derive a similarity measure for structured data, called
the edge matching distance, from
the cost optimal matching of the substructures. But even for this new similarity measure, efficiency can be improved
significantly by using a multi-step query processing approach. This allows the use of the edge matching distance for
knowledge discovery applications in large databases. Within the thesis, the properties of our new similarity search methods
are proved both theoretically and through experiments.
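The edge matching distance idea (a cost-optimal one-to-one matching between the edge sets of two graphs, with a penalty for unmatched edges) can be sketched by brute force for tiny graphs. The cost function, penalty value, and labelled edges below are illustrative assumptions, not the thesis's actual definitions; a real implementation would use a polynomial-time assignment algorithm instead of enumerating permutations.

```python
from itertools import permutations

def edge_cost(e1, e2):
    """Cost of matching two labelled edges: 0 if the labels agree, 1 otherwise."""
    return 0 if e1 == e2 else 1

def edge_matching_distance(edges1, edges2, penalty=1):
    if len(edges1) < len(edges2):
        edges1, edges2 = edges2, edges1
    best = float("inf")
    # Try every assignment of edges from the larger set onto the smaller one.
    for perm in permutations(edges1, len(edges2)):
        cost = sum(edge_cost(a, b) for a, b in zip(perm, edges2))
        cost += penalty * (len(edges1) - len(edges2))  # unmatched edges
        best = min(best, cost)
    return best

g1 = [("C", "C"), ("C", "O")]              # edges as label pairs
g2 = [("C", "C"), ("C", "O"), ("C", "N")]
print(edge_matching_distance(g1, g2))      # 1
```

Relaxing the edit distance to such a matching-based measure is exactly the kind of restriction-easing the abstract refers to: the optimal matching is far cheaper to compute than a full edit sequence, yet still admits filter-based multi-step query processing.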
Social Network Data Management
With the increasing usage of online social networks, the semantic web's graph-structured RDF framework, and the rising adoption of networks in various fields from biology to social science, there is a rapidly growing need for indexing, querying, and analyzing massive graph-structured data. Facebook has amassed over 500 million users, creating huge volumes of highly connected data. Governments have made RDF datasets containing billions of triples available to the public. In the life sciences, researchers have started to connect disparate data sets of research results into one giant network of valuable information. Clearly, networks are becoming increasingly popular and growing rapidly in size, requiring scalable solutions for network data management.
This thesis focuses on the following aspects of network data management. We present a hierarchical index structure for external memory storage of network data that aims to maximize data locality. We propose efficient algorithms to answer subgraph matching queries against network databases and discuss effective pruning strategies to improve performance. We show how adaptive cost models can speed up subgraph matching query answering by assigning budgets to index retrieval operations and adjusting the query plan while executing.
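A minimal sketch of subgraph matching with pruning, in the spirit of the strategies mentioned above: candidate data vertices must carry the query vertex's label and at least its degree, and backtracking enforces edge consistency. The adjacency-dict representation and the data are invented for illustration; the thesis's actual algorithms, index retrieval operations, and cost models are not reproduced here.

```python
def match(query, data):
    """Enumerate embeddings of `query` into `data`.

    Graphs are dicts: vertex -> (label, set of neighbour vertices).
    """
    qv = list(query)

    def degree(g, v):
        return len(g[v][1])

    def extend(assign):
        if len(assign) == len(qv):
            yield dict(assign)
            return
        u = qv[len(assign)]
        for v in data:
            if v in assign.values():
                continue
            # Pruning: label and degree filters reject most candidates cheaply.
            if data[v][0] != query[u][0] or degree(data, v) < degree(query, u):
                continue
            # Consistency: already-mapped query neighbours must map to data neighbours.
            if all(assign[w] in data[v][1] for w in query[u][1] if w in assign):
                assign[u] = v
                yield from extend(assign)
                del assign[u]

    return list(extend({}))

query = {"a": ("X", {"b"}), "b": ("Y", {"a"})}
data = {1: ("X", {2}), 2: ("Y", {1, 3}), 3: ("X", {2})}
print(match(query, data))  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 2}]
```

The pruning filters here are the cheapest possible; the adaptive cost models described above would, in addition, decide at runtime which such filters are worth their retrieval cost.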
We develop a cloud oriented social network database, COSI, which handles massive network datasets too large for a single computer by partitioning the data across multiple machines and achieving high performance query answering through asynchronous parallelization and cluster-aware heuristics.
Tracking multiple standing queries against a social network database is much faster with our novel multi-view maintenance algorithm, which exploits common substructures between queries.
To capture uncertainty inherent in social network querying, we define probabilistic subgraph matching queries over deterministic graph data and propose algorithms to answer them efficiently.
Finally, we introduce a general relational machine learning framework and rule-based language, Probabilistic Soft Logic, to learn from and probabilistically reason about social network data, and we describe applications to information integration and information fusion.
Context-based classification of objects in topographic data
Large-scale topographic databases model real world features as vector data objects. These can be point, line or area features. Each of these map objects is assigned to a
descriptive class; for example, an area feature might be classed as a building, a garden or a road. Topographic data is subject to continual updates from cartographic surveys
and ongoing quality improvement. One of the most important aspects of this is the assignment and verification of a class description for each area feature. These attributes
can be added manually but, due to the vast volume of data involved, automated techniques for classifying these polygons are desirable.
Analogy is a key thought process that underpins learning and has been the subject of much research in the field of artificial intelligence (AI). An analogy identifies
structural similarity between a well-known source domain and a less familiar target domain. In many cases, information present in the source can then be mapped to the
target, yielding a better understanding of the latter. The solution of geometric analogy problems has been a fruitful area of AI research. We observe that there is a correlation
between objects in geometric analogy problem domains and map features in topographic data. We describe two topographic area feature classification tools that use
descriptions of neighbouring features to identify analogies between polygons: content vector matching (CVM) and context structure matching (CSM). CVM and CSM classify an area feature by matching its neighbourhood context against those of analogous polygons whose class is known.
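The neighbourhood-matching idea behind CVM can be sketched as nearest-neighbour classification over class-frequency vectors: each polygon is described by the classes of its neighbours, and an unlabelled polygon takes the class of the labelled polygon with the most similar neighbourhood. The similarity measure (cosine) and the data below are illustrative assumptions, not the published CVM or CSM definitions.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse class-frequency vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(context, labelled):
    """Assign the class of the labelled polygon whose neighbourhood is most similar.

    labelled: list of (neighbourhood class-frequency vector, class) pairs.
    """
    best = max(labelled, key=lambda pair: cosine(context, pair[0]))
    return best[1]

labelled = [
    ({"building": 3, "road": 1}, "garden"),   # invented training neighbourhoods
    ({"road": 4}, "building"),
]
unknown = {"building": 2, "road": 1}
print(classify(unknown, labelled))  # 'garden'
```

CSM additionally matches the structure connecting the neighbours rather than just their frequencies, which explains why it classifies fewer features (fewer structural analogies exist) at comparable accuracy.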
Both classifiers were implemented and then tested on high-quality topographic polygon data supplied by Ordnance Survey (Great Britain). Area features were found to exhibit a high degree of variation in their neighbourhoods. CVM correctly classified 85.38% of the 79.03% of features it attempted to classify. The accuracy for CSM was 85.96% of the 62.96% of features it tried to identify. Thus, CVM can classify 25.53% more features than CSM, but is slightly less accurate. Both techniques excelled at identifying the feature classes that predominate in suburban data. Our structure-based classification approach may also benefit other types of spatial data, such as topographic line data, small-scale topographic data, raster data, architectural plans and circuit diagrams.
- …