8 research outputs found

    WinnER: A Winner-Take-All Hashing-Based Unsupervised Model for Entity Resolution Problems

    In this study, we propose an end-to-end unsupervised learning model for Entity Resolution problems on string data sets, that is, for finding strings that describe the same real-world object while differing as strings. An innovative prototype selection algorithm is used to create a rich space that is both Euclidean and a dissimilarity space. Part of this work is a thorough presentation of the theoretical benefits of such a space. We then present an embedding scheme based on rank-ordered vectors that circumvents the Curse of Dimensionality. The core of our framework is a locality-sensitive hashing algorithm named Winner-Take-All, which reduces our model's run time while maintaining high scores in the similarity-checking phase. For that phase, we adopt the Kendall Tau rank correlation coefficient, a widely accepted measure for comparing rankings. Finally, we use two state-of-the-art frameworks to evaluate our methodology consistently on a well-known Entity Resolution data set.
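    The two building blocks named in this abstract can be illustrated with a minimal sketch. It is not the WinnER implementation: it assumes the commonly described form of Winner-Take-All hashing (for each random permutation, record the position of the largest value among the first few permuted components) and the tau-a variant of Kendall's coefficient. The names `wta_hash`, `kendall_tau`, `num_hashes`, `window`, and `seed` are hypothetical.

```python
import random
from itertools import combinations


def wta_hash(vector, num_hashes=8, window=4, seed=0):
    """Winner-Take-All codes: one small integer per random permutation.

    Assumed form: permute the components, look at the first `window`
    of them, and keep the position of the largest value.
    """
    rng = random.Random(seed)  # fixed seed so all vectors share permutations
    n = len(vector)
    codes = []
    for _ in range(num_hashes):
        perm = rng.sample(range(n), n)   # a random permutation of the indices
        first_k = perm[:window]          # the window this hash inspects
        winner = max(range(window), key=lambda j: vector[first_k[j]])
        codes.append(winner)
    return codes


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(a)), 2))
    score = 0
    for i, j in pairs:
        s = (a[i] - a[j]) * (b[i] - b[j])
        score += (s > 0) - (s < 0)       # +1 concordant, -1 discordant, 0 tie
    return score / len(pairs)


v = [0.3, 1.2, 0.7, 2.5, 0.1, 1.9]
rescaled = [2 * x + 1 for x in v]        # same ordering, different values
codes_v = wta_hash(v)
codes_rescaled = wta_hash(rescaled)
```

    Because the codes depend only on the ordering of the components, `codes_v` and `codes_rescaled` are identical, which is why a rank-based measure such as Kendall Tau is a natural companion for comparing the underlying vectors.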

    NNMap: A method to construct a good embedding for nearest neighbor classification

    This paper aims to address the practical shortcomings of the nearest neighbor classifier. We define a quantitative criterion for assessing embedding quality in nearest neighbor classification, and present a method called NNMap to construct a good embedding. Furthermore, an efficient distance is obtained in the embedded vector space, which can speed up nearest neighbor classification. The quantitative quality criterion is proposed as a local structure descriptor of the sample data distribution: embedding quality corresponds to the quality of the local structure. In the NNMap framework, one-dimensional embeddings act as weak classifiers with pseudo-losses defined on the amount of local structure preserved by the embedding. Based on this property, NNMap reduces the problem of embedding construction to the classical boosting problem. An important property of NNMap is that the embedding optimization criterion is appropriate for both vector and non-vector data, and equally valid in both metric and non-metric spaces. The effectiveness of the new method is demonstrated by experiments conducted on the MNIST handwritten digit dataset, the CMU PIE face image dataset, and datasets from the UCI machine learning repository.

    Properties of embedding methods for similarity searching in metric spaces


    An Analogy Based Costing System For Injection Molds Based Upon Geometry Similarity With Wavelets

    The injection molding industry is large and diversified. However, there is no universally accepted way to bid molds, despite the fact that the mold and its related design account for 50% of the total cost of an injection-molded part over its lifetime. This is due both to the structure of the industry and to technical difficulties in developing an automated and practical cost estimation system. The technical challenges include the lack of a common data format for parts and molds; the need to account comprehensively for a wide variety of mold types, designs, complexities, numbers of cavities, and other factors that directly affect cost; and the robustness of the estimate to variations in build time and cost. In this research, we propose a new mold cost estimation approach based on clustered features of parts. Geometry similarity is used to estimate the complexity of a mold from a 2D image showing one orthographic view of the injection-molded part. Wavelet descriptors of boundaries, together with other inherent shape properties such as size and number of boundaries, are used to describe the complexity of the part. Regression models are then built to predict costs. In addition to mean estimates, prediction intervals are calculated to support risk management.

    Content-Based Image Retrieval Using Self-Organizing Maps


    Efficient Image Retrieval through Vantage Objects

    We describe a new indexing structure for general image retrieval that relies solely on a distance function giving the similarity between two images. For each image object in the database, its distance to each of a set of m predetermined vantage objects is calculated; the m-vector of these distances specifies a point in the m-dimensional vantage space. The database objects that are similar (in terms of the distance function) to a given query object can then be found by an efficient nearest-neighbor search on these points. We demonstrate the viability of our approach through experimental results obtained with a database of about 48,000 hieroglyphic polylines.
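    The vantage-space construction described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' index: it assumes the distance function is a metric, uses 2D points with Euclidean distance as stand-ins for images, and does a linear scan in place of the efficient nearest-neighbor search. The names `embed`, `vantage_distance`, and `nearest_in_vantage_space` are hypothetical.

```python
import math


def embed(obj, vantage_objects, dist):
    """Map an object to its m-vector of distances to the m vantage objects."""
    return [dist(obj, v) for v in vantage_objects]


def vantage_distance(p, q):
    """Maximum coordinate difference between two points in vantage space.

    If `dist` is a metric, the triangle inequality makes this value a
    lower bound on the true distance between the original objects.
    """
    return max(abs(a - b) for a, b in zip(p, q))


def nearest_in_vantage_space(query, database, vantage_objects, dist):
    """Linear-scan stand-in for the nearest-neighbor search on the points."""
    q = embed(query, vantage_objects, dist)
    index = [(embed(o, vantage_objects, dist), o) for o in database]
    return min(index, key=lambda entry: vantage_distance(q, entry[0]))[1]


# Toy data: 2D points play the role of image objects (an assumption).
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
vantage = [(0.0, 0.0), (5.0, 5.0)]
result = nearest_in_vantage_space((0.9, 0.1), points, vantage, math.dist)
```

    In a real system the database embeddings would be computed once and stored, so that only m distance evaluations are needed per query before the search in vantage space.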

    3D Shape Similarity Through Structural Descriptors

    Due to recent improvements in 3D object acquisition, visualization, and modeling techniques, the number of available 3D models is growing steadily, and there is an increasing demand for tools supporting the automatic search for 3D objects and their sub-parts in digital archives. While there are already techniques for rapidly extracting knowledge from massive volumes of text (like Google [htt]), it is harder to structure, filter, organize, retrieve, and maintain archives of digital shapes such as images, 3D objects, 3D animations, and virtual or augmented reality. This situation suggests that a primary future challenge in computer graphics will be how to find models having a similar global and/or local appearance. Shape descriptors, and the methodologies used to compare them, play an important role in achieving this task. For this reason, a first contribution of this thesis is a critical analysis of the most representative geometric and structural shape descriptors with respect to a set of properties that shape descriptors should have. This analysis aims to highlight the differences between descriptors in order to better understand where one descriptor fails and another succeeds. As a second contribution, the thesis investigates the problem of using a structural descriptor for shape comparison. A large class of structural shape descriptors can easily be encoded as directed, acyclic, attributed graphs, so the problem of comparing structural descriptors is approached as a graph matching problem. The techniques used for graph comparison have exponential computational complexity, and it is therefore necessary to define an algorithmic approximation of the optimal solution. The methods for comparing structural descriptors commonly used in the computer graphics community consist of heuristic graph matching algorithms for specific application tasks, while a general approach that can incorporate different heuristics for different application tasks is lacking. The second contribution presented in this thesis aims to define a framework that expresses the optimal algorithm for computing the maximal common subgraph in a formalization into which heuristics can be plugged directly, in order to achieve different approximations of the optimal solution according to the specific case. Implemented heuristics for graph matching that is robust to structural noise in the graphs are discussed and tested on sub-part correspondence between similar 3D objects, and on shape retrieval with respect to different structural graph descriptors.

    Resource Description and Selection for Similarity Search in Metric Spaces: Problems and Problem-Solving Approaches

    In times of an ever-increasing amount of data and a growing diversity of data types in different application contexts, there is a strong need for large-scale and flexible indexing and search techniques. Metric access methods (MAMs) provide this flexibility, because they only assume that the dissimilarity between two data objects is modeled by a distance metric. Furthermore, scalable solutions can be built with the help of distributed MAMs. Both IF4MI and RS4MI, which are presented in this thesis, are metric access methods. IF4MI belongs to the group of centralized MAMs. It is based on an inverted file and thus offers a hybrid access method that provides text retrieval capabilities in addition to content-based search in arbitrary metric spaces. In contrast to IF4MI, RS4MI is a distributed MAM based on resource description and selection techniques. Here, data objects are physically distributed. However, RS4MI is by no means restricted to a particular type of distributed information retrieval system. Various application fields for the resource description and selection techniques are possible, for example in the context of visual analytics. Due to the metric space assumption, possible application fields go far beyond the content-based image retrieval applications that provide the example scenario here.