772 research outputs found

    A mathematical programming approach to overlapping community detection

    Get PDF
    We propose a new optimization model to detect overlapping communities in networks. The model elaborates suggestions contained in Zhang et al. (2007), in which overlapping communities were identified through the use of a fuzzy membership function, calculated as the outcome of a mathematical programming problem. In our approach, we retain the idea of using both mathematical programming and fuzzy membership to detect overlapping communities, but we replace the fuzzy objective function proposed there with another one, based on the Newman and Girvan's definition of modularity. Next, we formulate a new mixed-integer linear programming model to calculate optimal overlapping communities. After some computational tests, we provide some evidence that our new proposal can fix some biases of the previous model, that is, its tendency of calculating communities composed of almost all nodes. Conversely, our new model can reveal other structural properties, such as nodes or communities acting as bridges between communities. Finally, as mathematical programming can be used only for moderate size networks due to its computation time, we proposed two heuristic algorithms to solve the largest instances, that compare favourably to other methodologies. (c) 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

    The algorithm by Ferson et al. is surprisingly fast: An NP-hard optimization problem solvable in almost linear time with high probability

    Full text link
    We start with the algorithm of Ferson et al. (\emph{Reliable computing} {\bf 11}(3), p.~207--233, 2005), designed for solving a certain NP-hard problem motivated by robust statistics. First, we propose an efficient implementation of the algorithm and improve its complexity bound to O(nlogn+n2ω)O(n \log n+n\cdot 2^\omega), where ω\omega is the clique number in a certain intersection graph. Then we treat input data as random variables (as it is usual in statistics) and introduce a natural probabilistic data generating model. On average, we get 2ω=O(n1/loglogn)2^\omega = O(n^{1/\log\log n}) and ω=O(logn/loglogn)\omega = O(\log n / \log\log n). This results in average computing time O(n1+ϵ)O(n^{1+\epsilon}) for ϵ>0\epsilon > 0 arbitrarily small, which may be considered as ``surprisingly good'' average time complexity for solving an NP-hard problem. Moreover, we prove the following tail bound on the distribution of computation time: ``hard'' instances, forcing the algorithm to compute in time 2Ω(n)2^{\Omega(n)}, occur rarely, with probability tending to zero faster than exponentially with nn \rightarrow \infty

    Geometric, Feature-based and Graph-based Approaches for the Structural Analysis of Protein Binding Sites : Novel Methods and Computational Analysis

    Get PDF
    In this thesis, protein binding sites are considered. To enable the extraction of information from the space of protein binding sites, these binding sites must be mapped onto a mathematical space. This can be done by mapping binding sites onto vectors, graphs or point clouds. To finally enable a structure on the mathematical space, a distance measure is required, which is introduced in this thesis. This distance measure eventually can be used to extract information by means of data mining techniques

    Approximate model composition for explanation generation

    Get PDF
    This thesis presents a framework for the formulation of knowledge models to sup¬ port the generation of explanations for engineering systems that are represented by the resulting models. Such models are automatically assembled from instantiated generic component descriptions, known as modelfragments. The model fragments are of suffi¬ cient detail that generally satisfies the requirements of information content as identified by the user asking for explanations. Through a combination of fuzzy logic based evidence preparation, which exploits the history of prior user preferences, and an approximate reasoning inference engine, with a Bayesian evidence propagation mechanism, different uncertainty sources can be han¬ dled. Model fragments, each representing structural or behavioural aspects of a com¬ ponent of the domain system of interest, are organised in a library. Those fragments that represent the same domain system component, albeit with different representation detail, form parts of the same assumption class in the library. Selected fragments are assembled to form an overall system model, prior to extraction of any textual infor¬ mation upon which to base the explanations. The thesis proposes and examines the techniques that support the fragment selection mechanism and the assembly of these fragments into models. In particular, a Bayesian network-based model fragment selection mechanism is de¬ scribed that forms the core of the work. The network structure is manually determined prior to any inference, based on schematic information regarding the connectivity of the components present in the domain system under consideration. The elicitation of network probabilities, on the other hand is completely automated using probability elicitation heuristics. These heuristics aim to provide the information required to select fragments which are maximally compatible with the given evidence of the fragments preferred by the user. Given such initial evidence, an existing evidence propagation algorithm is employed. The preparation of the evidence for the selection of certain fragments, based on user preference, is performed by a fuzzy reasoning evidence fab¬ rication engine. This engine uses a set of fuzzy rules and standard fuzzy reasoning mechanisms, attempting to guess the information needs of the user and suggesting the selection of fragments of sufficient detail to satisfy such needs. Once the evidence is propagated, a single fragment is selected for each of the domain system compo¬ nents and hence, the final model of the entire system is constructed. Finally, a highly configurable XML-based mechanism is employed to extract explanation content from the newly formulated model and to structure the explanatory sentences for the final explanation that will be communicated to the user. The framework is illustratively applied to a number of domain systems and is compared qualitatively to existing compositional modelling methodologies. A further empirical assessment of the performance of the evidence propagation algorithm is carried out to determine its performance limits. Performance is measured against the number of frag¬ ments that represent each of the components of a large domain system, and the amount of connectivity permitted in the Bayesian network between the nodes that stand for the selection or rejection of these fragments. Based on this assessment recommenda¬ tions are made as to how the framework may be optimised to cope with real world applications

    Multiple graph matching and applications

    Get PDF
    En aplicaciones de reconocimiento de patrones, los grafos con atributos son en gran medida apropiados. Normalmente, los vértices de los grafos representan partes locales de los objetos i las aristas relaciones entre estas partes locales. No obstante, estas ventajas vienen juntas con un severo inconveniente, la distancia entre dos grafos no puede ser calculada en un tiempo polinómico. Considerando estas características especiales el uso de los prototipos de grafos es necesariamente omnipresente. Las aplicaciones de los prototipos de grafos son extensas, siendo las más habituales clustering, clasificación, reconocimiento de objetos, caracterización de objetos i bases de datos de grafos entre otras. A pesar de la diversidad de aplicaciones de los prototipos de grafos, el objetivo del mismo es equivalente en todas ellas, la representación de un conjunto de grafos. Para construir un prototipo de un grafo todos los elementos del conjunto de enteramiento tienen que ser etiquetados comúnmente. Este etiquetado común consiste en identificar que nodos de que grafos representan el mismo tipo de información en el conjunto de entrenamiento. Una vez este etiquetaje común esta hecho, los atributos locales pueden ser combinados i el prototipo construido. Hasta ahora los algoritmos del estado del arte para calcular este etiquetaje común mancan de efectividad o bases teóricas. En esta tesis, describimos formalmente el problema del etiquetaje global i mostramos una taxonomía de los tipos de algoritmos existentes. Además, proponemos seis nuevos algoritmos para calcular soluciones aproximadas al problema del etiquetaje común. La eficiencia de los algoritmos propuestos es evaluada en diversas bases de datos reales i sintéticas. En la mayoría de experimentos realizados los algoritmos propuestos dan mejores resultados que los existentes en el estado del arte.In pattern recognition, the use of graphs is, to a great extend, appropriate and advantageous. Usually, vertices of the graph represent local parts of an object while edges represent relations between these local parts. However, its advantages come together with a sever drawback, the distance between two graph cannot be optimally computed in polynomial time. Taking into account this special characteristic the use of graph prototypes becomes ubiquitous. The applicability of graphs prototypes is extensive, being the most common applications clustering, classification, object characterization and graph databases to name some. However, the objective of a graph prototype is equivalent to all applications, the representation of a set of graph. To synthesize a prototype all elements of the set must be mutually labeled. This mutual labeling consists in identifying which nodes of which graphs represent the same information in the training set. Once this mutual labeling is done the set can be characterized and combined to create a graph prototype. We call this initial labeling a common labeling. Up to now, all state of the art algorithms to compute a common labeling lack on either performance or theoretical basis. In this thesis, we formally describe the common labeling problem and we give a clear taxonomy of the types of algorithms. Six new algorithms that rely on different techniques are described to compute a suboptimal solution to the common labeling problem. The performance of the proposed algorithms is evaluated using an artificial and several real datasets. In addition, the algorithms have been evaluated on several real applications. These applications include graph databases and group-wise image registration. In most of the tests and applications evaluated the presented algorithms have showed a great improvement in comparison to state of the art applications

    Towards Certain Fixes with Editing Rules and Master Data

    Get PDF
    A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions , and a class of editing rules . A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm. </jats:p
    corecore