14 research outputs found

    A new iterative algorithm for computing a quality approximate median of strings based on edit operations

    This paper presents a new algorithm for computing an approximation to the median of a set of strings. The approximate median is obtained through successive improvements of a partial solution. In each iteration, the edit distance from the partial solution to every string in the set is computed, recording the frequency of each edit operation at each position of the approximate median. A goodness index for each edit operation is then computed by multiplying its frequency by its cost. Operations are tested in decreasing order of this index to check whether applying them to the partial solution yields an improvement; if so, a new iteration begins from the new approximate median. The algorithm finishes when all operations have been examined without a better solution being found. Comparative experiments involving Freeman chain codes encoding 2D shapes and the Copenhagen chromosome database show that the quality of the approximate median string is similar to that of benchmark approaches while converging much faster. This work is partially supported by the Spanish CICYT under project DPI2006-15542-C04-01, the Spanish MICINN through project TIN2009-14205-CO4-01, and by the Spanish research program Consolider Ingenio 2010: MIPRCV (CSD2007-00018).
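    The improvement loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it accepts a candidate edit operation whenever recomputing the total edit distance shows a strict decrease, rather than prioritizing operations by the frequency-times-cost goodness index, and all names are illustrative.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def total_distance(s, strings):
    """Sum of edit distances from s to every string in the set."""
    return sum(edit_distance(s, t) for t in strings)

def approximate_median(strings, alphabet):
    """Greedy perturbation search starting from the set median."""
    best = min(strings, key=lambda s: total_distance(s, strings))
    best_cost = total_distance(best, strings)
    improved = True
    while improved:               # restart the sweep after each improvement
        improved = False
        for i in range(len(best) + 1):
            candidates = [best[:i] + c + best[i:] for c in alphabet]    # insertions
            if i < len(best):
                candidates.append(best[:i] + best[i + 1:])              # deletion
                candidates += [best[:i] + c + best[i + 1:] for c in alphabet]  # substitutions
            for cand in candidates:
                cost = total_distance(cand, strings)
                if cost < best_cost:
                    best, best_cost = cand, cost
                    improved = True
                    break
            if improved:
                break
    return best
```

    Because each candidate evaluation recomputes distances to the whole set, this naive version is expensive; the paper's goodness index exists precisely to try the most promising operations first instead of sweeping all of them.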

    Boosting Perturbation-Based Iterative Algorithms to Compute the Median String

    The most competitive heuristics for computing the median string are perturbation-based iterative algorithms. Given the complexity of this problem, which is NP-hard under many formulations, the computational cost of an exact solution is not affordable. This work addresses the heuristic algorithms that solve this problem, with emphasis on their initialization and on the policy used to order candidate edit operations; both factors weigh significantly on the solution. The choice of initial string influences the algorithm's speed of convergence, as does the criterion used to select the modification applied in each iteration. To obtain the initial string, we use the median of a subset of the original dataset; this subset is obtained by applying the Half-Space Proximal (HSP) test to the median of the dataset, which provides sufficient diversity among the members of the subset while fulfilling the centrality criterion. Similarly, we analyze the stop condition of the algorithm, improving its performance without substantially degrading the quality of the solution. To analyze the results of our experiments, we measured the execution time of each proposed modification, the number of edit distances computed, and the quality of the solution obtained. With these experiments, we empirically validated our proposal. This work was supported in part by the Comisión Nacional de Investigación Científica y Tecnológica - Programa de Formación de Capital Humano Avanzado (CONICYT-PCHA)/Doctorado Nacional/2014-63140074 through a Ph.D. Scholarship, in part by the European Union's Horizon 2020 under Marie Skłodowska-Curie Grant 690941, in part by the Millennium Institute for Foundational Research on Data (IMFD), and in part by FONDECYT-CONICYT under Grant 1170497. The work of Óscar Pedreira was supported in part by Xunta de Galicia/FEDER-UE under Grant CSI ED431G/01 and Grant GRC ED431C 2017/58, in part by the Office of the Vice President for Research and Postgraduate Studies of the Universidad Católica de Temuco under VIPUCT Project 2020EM-PS-08, and in part by FEQUIP 2019-INRN-03 of the Universidad Católica de Temuco.
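    The HSP test used above for subset selection can be sketched generically. The sketch follows the standard formulation of the test (repeatedly take the closest remaining element, then discard every element closer to it than to the query); names are illustrative, and a toy Euclidean metric stands in for the edit distance used in the paper.

```python
import math

def hsp_neighbors(q, points, dist):
    """Half-Space Proximal test: repeatedly take the closest remaining
    point as a neighbor, then discard every point that lies closer to
    that neighbor than to the query q."""
    remaining = [p for p in points if p != q]
    neighbors = []
    while remaining:
        v = min(remaining, key=lambda p: dist(q, p))
        neighbors.append(v)
        # keep only points at least as close to q as to the new neighbor
        remaining = [u for u in remaining if dist(u, v) >= dist(u, q)]
    return neighbors

# toy usage with Euclidean distance; for strings, dist would be the edit distance
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
```

    The filtering step is what yields diversity: once a direction around the query is covered by a neighbor, nearby points in the same half-space are pruned, so the surviving subset spreads around the query while staying close to it.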

    Pivot Selection for Median String Problem

    The Median String Problem is W[1]-hard under the Levenshtein distance, so approximation heuristics are used. Perturbation-based heuristics have proved very competitive in terms of the trade-off between approximation accuracy and convergence speed. However, their computational burden increases with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e., pivots, that are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 88% while matching the approximation accuracy of the state-of-the-art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014-63140074 through a Ph.D. Scholarship; Universidad Católica de la Santísima Concepción through research project DIN-01/2016; the European Union's Horizon 2020 under Marie Skłodowska-Curie grant agreement 690941; the Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant 1170497; and, for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58.
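    The size reduction can be illustrated by scoring candidate medians against a pivot subset instead of the full set; the sketch below counts edit-distance evaluations to make the saving visible. It is a simplified stand-in for the paper's method, with hypothetical names, a naive first-improvement search, and pivots passed in directly rather than selected by the authors' procedure.

```python
def levenshtein(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class CountingDist:
    """Wraps a distance function and counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn, self.calls = fn, 0
    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)

def greedy_median(start, refs, alphabet, dist):
    """First-improvement perturbation search; candidates are scored
    against `refs` only -- the whole set, or a small pivot subset."""
    total = lambda s: sum(dist(s, t) for t in refs)
    best, best_cost = start, total(start)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) + 1):
            cands = [best[:i] + c + best[i:] for c in alphabet]
            if i < len(best):
                cands.append(best[:i] + best[i + 1:])
                cands += [best[:i] + c + best[i + 1:] for c in alphabet]
            for cand in cands:
                cost = total(cand)
                if cost < best_cost:
                    best, best_cost, improved = cand, cost, True
                    break
            if improved:
                break
    return best
```

    Each candidate evaluation costs one distance computation per reference string, so shrinking `refs` from the whole set to a handful of pivots directly cuts the dominant cost, which is what the 88% figure above measures.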

    Using Learned Conditional Distributions as Edit Distance


    Merge-and-simplify operation for compact combinatorial pyramid definition

    Image pyramids have been employed for years in digital image processing. They make it possible to store and use different scales/levels of detail of an image. To represent all the topological information of the different levels, combinatorial pyramids have proved to be of great interest. However, when an explicit representation is used, one drawback of this structure is the memory space required to store such a pyramid. In this paper, this drawback is addressed by defining a compact version of combinatorial pyramids. This definition is based on a new operation, called "merge-and-simplify", which simultaneously merges regions and simplifies their boundaries. Our experiments show that the memory footprint of our solution is much smaller than that of the original version. Moreover, our solution is also faster to compute, because our pyramid has fewer levels than the original one.

    Multimodal biometrics scheme based on discretized eigen feature fusion for identical twins identification

    Twins multimodal biometrics identification (TMBI) has consistently been an interesting and valuable area of study. Given its high dependency and acceptance, TMBI contributes greatly to the identification of twins from biometric traits. The variation of features resulting from multimodal biometrics feature extraction determines the distinctive characteristics possessed by a twin. However, many of these features are inessential: they increase the size of the search space and hinder generalization. The key challenge is therefore to single out the most salient features, those able to accurately recognize twins using multimodal biometrics. In twins identification, effective design of the methodology and of the fusion process is important to success, as these processes manage and integrate vital information, including the highly selective biometric characteristics possessed by each twin. In the multimodal biometrics twins identification domain, extracting the best features from multiple traits of twins and the biometrics fusion process remain unresolved. This research designs a new, more effective multimodal biometrics twins identification scheme by introducing Dis-Eigen feature-based fusion, which generates a unified representation of distinctive features from numerous modalities of twins. First, the Aspect United Moment Invariant (AUMI) was used as a global feature in extracting the shape and style of the twins' handwriting and fingerprints. Then, the feature-based fusion was examined in terms of its generalization. Next, to achieve better classification accuracy, the Dis-Eigen feature-based fusion algorithm was used. A total of eight distinctive classifiers were used across four different training and testing environment settings.
    Accordingly, the most salient features of the Dis-Eigen feature-based fusion were trained and tested to determine classification accuracy. The results show that twins identification improved as the intra-class similarity error decreased while, at the same time, the inter-class similarity error increased. Hence, with the application of diverse classifiers, the identification rate improved to more than 93%. The experimental outcomes show that the proposed method, evaluated using Receiver Operating Characteristic (ROC) analysis, considerably improves twins handwriting-fingerprint identification, with a 90.25% identification rate at a False Acceptance Rate (FAR) of 0.01%, 93.15% at a FAR of 0.5%, and 98.69% at a FAR of 1.00%. The proposed solution is thus a promising alternative for twins identification applications.

    Interactive Pattern Recognition applied to Natural Language Processing

    This thesis is about Pattern Recognition. In the last decades, huge efforts have been made to develop automatic systems able to rival human capabilities in this field. Although these systems achieve high productivity rates, they are not precise enough in most situations. Humans, on the contrary, are very accurate but comparatively much slower. This poses an interesting question: the possibility of benefiting from both worlds by constructing cooperative systems. This thesis presents diverse contributions to this kind of collaborative approach. The point is to improve Pattern Recognition systems by properly introducing a human operator into the system. We call this Interactive Pattern Recognition (IPR). Firstly, a general proposal for IPR is stated, with the aim of developing a framework from which new applications in this area can easily be derived. Some interesting IPR issues are also introduced; multi-modality and adaptive learning are examples of extensions that fit naturally into IPR. Secondly, we focus on a specific application: a novel method to obtain high-quality speech transcriptions, Computer Assisted Speech Transcription (CAST). We start by proposing a CAST formalization and then cope with different implementation alternatives. Practical issues, such as the system response time, are also taken into account in order to allow for a practical implementation of CAST. Word graphs and probabilistic error-correcting parsing are the tools used to reach an alternative formulation that allows the use of CAST in a real scenario. Afterwards, a special application within the general IPR framework is discussed, intended to test the IPR capabilities in an extreme environment where no input pattern is available and the system only has access to the user's actions to produce a hypothesis. Specifically, we focus here on providing assistance in the problem of text generation. Rodríguez Ruiz, L. (2010). Interactive Pattern Recognition applied to Natural Language Processing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8479

    APIC: A method for automated pattern identification and classification

    Machine Learning (ML) is a transformative technology at the forefront of many modern research endeavours. The technology is generating a tremendous amount of attention from researchers and practitioners, providing new approaches to solving complex classification and regression tasks. While concepts such as Deep Learning have existed for many years, the computational power for realising the utility of these algorithms in real-world applications has only recently become available. This dissertation investigated the efficacy of a novel, general method for deploying ML in a variety of complex tasks, where feature selection, data-set labelling, model definition, and the training process were determined automatically. Models were developed in an iterative fashion and evaluated using both training and validation data sets. The proposed method was evaluated on three distinct case studies, each describing a complex classification task that would typically require significant input from human experts. The results demonstrate that the proposed method compares with, and often outperforms, less general methods designed specifically for each task. Feature selection, data-set annotation, model design, and the training process were optimised by the method, producing less complex, comparably accurate classifiers with lower dependency on computational power and human expert intervention. In chapter 4, the proposed method demonstrated improved efficacy over comparable systems, automatically identifying and classifying complex application protocols traversing IP networks. In chapter 5, the proposed method was able to discriminate between normal and anomalous traffic, maintaining accuracy in excess of 99% while reducing false alarms to a mere 0.08%. Finally, in chapter 6, the proposed method discovered more optimal classifiers than those implemented by comparable methods, with classification scores rivalling those achieved by state-of-the-art systems.
    The findings of this research concluded that a fully automated, general method exhibiting efficacy in a wide variety of complex classification tasks with minimal expert intervention is possible. The method and the various artefacts produced in each case study of this dissertation are thus significant contributions to the field of ML.