Partition strategies for incremental Mini-Bucket
Probabilistic graphical models such as Markov random fields and Bayesian networks
provide powerful frameworks for knowledge representation and reasoning
over models with large numbers of variables. Unfortunately, exact inference
problems on graphical models are generally NP-hard, which has led to significant
interest in approximate inference algorithms.
Incremental mini-bucket is a framework for approximate inference that provides
upper and lower bounds on the exact partition function. It starts from
a model with completely relaxed constraints, i.e. with the smallest possible
regions, and incrementally adds larger regions to the approximation. Current
approximate inference algorithms provide tight upper bounds on the exact partition
function but loose or trivial lower bounds.
This project focuses on researching partitioning strategies that improve the
lower bounds obtained with mini-bucket elimination, working within the framework
of incremental mini-bucket.
We start from the idea that variables that are highly correlated should be
reasoned about together, and we develop a strategy for region selection based
on that idea. We implement the strategy and explore ways to improve it, and
finally we measure the results obtained using the strategy and compare them to
several baselines.
We find that our strategy obtains tighter lower bounds than both of our baselines. We
also rule out two possible explanations for the improvement.
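As an illustration of the kind of bound involved, the following is a minimal sketch of the basic mini-bucket relaxation for a single eliminated variable. It is not the region selection strategy of this project; the function name `mini_bucket_bounds` and the two-factor setup are assumptions made for the example. When the bucket of a variable x is split into two mini-buckets, summing x out of one and maximizing (or minimizing) it out of the other gives an upper (or lower) bound on the exact sum, provided the factors are non-negative.

```python
import numpy as np


def mini_bucket_bounds(f1, f2):
    """Return (lower, exact, upper) for Z = sum_{x,a,b} f1[x, a] * f2[x, b].

    Hypothetical helper for illustration. Both factors must be non-negative
    and share the variable x on axis 0.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)

    # Exact value: x is eliminated jointly from both factors.
    exact = np.einsum("xa,xb->", f1, f2)

    # Relaxation: x is eliminated separately in each mini-bucket.
    # One mini-bucket keeps the sum; the other uses max (upper) or min (lower).
    sum_f1 = f1.sum(axis=0).sum()
    upper = sum_f1 * f2.max(axis=0).sum()
    lower = sum_f1 * f2.min(axis=0).sum()
    return lower, exact, upper


rng = np.random.default_rng(0)
lower, exact, upper = mini_bucket_bounds(rng.random((3, 4)), rng.random((3, 5)))
print(lower, exact, upper)
```

The lower bound obtained with `min` is often far from the exact value, which is consistent with the observation above that lower bounds tend to be loose. Which factors are grouped into the same mini-bucket is the partitioning choice that the project studies.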
Improving query performance on dynamic graphs
Querying large models efficiently often imposes high demands on system resources such as memory, processing time, disk access or network latency. The situation becomes more complicated when data are highly interconnected, e.g. in the form of graph structures, and when data sources are heterogeneous, partly coming from dynamic systems and partly stored in databases. These situations are now common in many social networking applications and geo-location systems, which require specialized and efficient query algorithms in order to make informed decisions on time. In this paper, we propose an algorithm to improve the memory consumption and time performance of this type of query by reducing the number of elements to be processed, focusing only on the information that is relevant to the query without compromising the accuracy of its results. To this end, the reduced subset of data is selected depending on the type of query and its constituent filters. Three case studies are used to evaluate the performance of our proposal, obtaining significant speedups in all cases.
This work is partially supported by the European Commission (FEDER) and the Spanish Government under projects APOLO (US-1264651), HORATIO (RTI2018-101204-B-C21), EKIPMENT-PLUS (P18-FR-2895) and COSCA (PGC2018-094905B-I00).
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Cyber security is one of the most significant technical challenges in current
times. Detecting adversarial activities and preventing the theft of intellectual
property and customer data are high priorities for corporations and government
agencies around the world. Cyber defenders need to analyze massive-scale,
high-resolution network flows to identify, categorize, and mitigate attacks
involving networks spanning institutional and national boundaries. Many of the
cyber attacks can be described as subgraph patterns, with prominent examples
being insider infiltrations (path queries), denial of service (parallel paths)
and malicious spreads (tree queries). This motivates us to explore subgraph
matching on streaming graphs in a continuous setting. The novelty of our work
lies in using the subgraph distributional statistics collected from the
streaming graph to determine the query processing strategy. We introduce a
"Lazy Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the vertex
neighborhood. We also propose a metric named "Relative Selectivity" that is
used to select between different query processing strategies. Our experiments
on real online news data, a network traffic stream and a synthetic social
network benchmark demonstrate 10-100x speedups over selectivity-agnostic
approaches.
Comment: in 18th International Conference on Extending Database Technology
(EDBT) (2015)
Fast and Robust Archetypal Analysis for Representation Learning
We revisit a pioneering unsupervised learning technique called archetypal
analysis, which is related to successful data analysis methods such as sparse
coding and non-negative matrix factorization. Since it was proposed, archetypal
analysis has not gained much popularity even though it produces more
interpretable models than other alternatives. Because no efficient
implementation has ever been made publicly available, its application to
important scientific problems may have been severely limited. Our goal is to
bring archetypal analysis back into favour. We propose a fast optimization
scheme using an active-set strategy, and provide an efficient open-source
implementation interfaced with Matlab, R, and Python. Then, we demonstrate the
usefulness of archetypal analysis for computer vision tasks, such as codebook
learning, signal classification, and large image collection visualization.
Processing Structured Data Streams
A large amount of data is generated daily from different sources such as social networks, recommendation systems or geolocation systems. Moreover, this information tends to grow exponentially every year. Companies have discovered that processing these data may be important for obtaining useful conclusions that support decision-making or the detection and resolution of problems in a more efficient way, for instance, through the study of trends, habits or customs of the population. The information provided by these sources typically consists of a non-structured and continuous data flow, where the relations among data elements form graph structures. Inevitably, the processing performance progressively decreases as the size of the data increases. For this reason, non-structured information is usually handled by taking into account only the most recent data and discarding the rest, since they are considered not relevant when drawing conclusions. However, this approach is not enough in the case of sources that provide graph-structured data, since it is necessary to consider spatial features as well as temporal features. These spatial features refer to the relationships among the data elements.
For example, some cases where it is important to consider spatial aspects are marketing techniques, which require information on the location of users and their possible needs, or the detection of diseases, which uses data about genetic relationships among subjects or the geographic scope.
It is worth highlighting three main contributions from this dissertation. First, we provide a comparative study of seven of the most common processing platforms for working with huge graphs and the languages that are used to query them. This study measures the performance of the queries in terms of execution time, and the syntax complexity of the languages according to three parameters: number of characters, number of operators and number of internal variables. We carried out this study in order to choose the most suitable technology to develop our proposal.
Second, we propose three methods to reduce the set of data to be processed by a query when working with large graphs, namely spatial, temporal and random approximations. These methods are based on Approximate Query Processing techniques and consist in discarding the information that is considered not relevant for the query. The reduction of the data is performed online with the processing and considers both spatial and temporal aspects of the data. Since discarding information in the source data may decrease the validity of the results, we also define the transformation error obtained with these methods in terms of accuracy, precision and recall.
Finally, we present a preprocessing algorithm, called the SDR algorithm, that is also used to reduce the set of data to be processed, but without compromising the accuracy of the results. It calculates a subgraph from the source graph that contains only the relevant information for a given query. Since this technique is a preprocessing algorithm, it is run offline before the actual processing begins. In addition, an incremental version of the algorithm is developed in order to update the subgraph as new information arrives in the system.
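The error measures named in this abstract for the approximation methods (accuracy, precision and recall) can be illustrated with a minimal sketch. It compares the result of a query on a reduced graph with the result on the full graph. The function `error_metrics`, the toy degree query and the convention of returning 1.0 for an empty denominator are assumptions made for this example, not definitions taken from the dissertation.

```python
from collections import Counter


def error_metrics(exact, approx, universe):
    """Accuracy, precision and recall of an approximate result set.

    Hypothetical helper: `exact` is the answer on the full graph, `approx`
    the answer on the reduced graph, `universe` all candidate elements.
    """
    exact, approx, universe = set(exact), set(approx), set(universe)
    tp = len(exact & approx)             # returned and correct
    fp = len(approx - exact)             # returned but wrong
    fn = len(exact - approx)             # correct but missed
    tn = len(universe - exact - approx)  # correctly not returned
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 1.0     # assumed convention
    accuracy = (tp + tn) / len(universe) if universe else 1.0
    return accuracy, precision, recall


def vertices_with_min_degree(edges, k):
    """Toy query: vertices that appear in at least k edges."""
    degree = Counter(v for edge in edges for v in edge)
    return {v for v, d in degree.items() if d >= k}


edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
reduced = edges[:3]  # some edges discarded, standing in for an approximation
vertices = {v for edge in edges for v in edge}

exact = vertices_with_min_degree(edges, 2)
approx = vertices_with_min_degree(reduced, 2)
print(error_metrics(exact, approx, vertices))
```

In this toy case discarding edges can only lower vertex degrees, so the reduced graph produces no false positives and the loss shows up in recall only.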