9 research outputs found

    CBR-based Recommender Systems for Research Topic Finding

    Get PDF

    Approximate string matching methods for duplicate detection and clustering tasks

    Get PDF
    Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. Despite the large number of existing string similarity metrics, there is a need for more precise approximate string matching methods to improve the efficiency of computer-driven data processing, thus decreasing labor-intensive human involvement. This work introduces a family of novel string similarity methods that outperform a number of well-known and widely used string similarity functions. The new algorithms are designed to overcome the most common problem of the existing methods: the lack of context sensitivity. In this evaluation, the Longest Approximately Common Prefix (LACP) method achieved the highest values of average precision and maximum F1 on three out of four medical informatics datasets used. LACP also demonstrated the lowest execution time among the evaluated algorithms, ensured by its linear computational complexity. An online interactive spell checker of biomedical terms was developed based on the LACP method. The main goal of the spell checker was to evaluate the LACP method and to make it possible to estimate the similarity of result sets at a glance. The Shortest Path Edit Distance (SPED) outperformed all evaluated similarity functions and achieved the highest possible values of the average precision and maximum F1 measures on the bioinformatics datasets. The SPED design was inspired by the preceding work on the Markov Random Field Edit Distance (MRFED). SPED eradicates two shortcomings of the MRFED: prolonged execution time and moderate performance. Four modifications of the Histogram Difference (HD) method demonstrated the best performance on the majority of the life and social sciences data sources used in the experiments. The modifications of the HD algorithm were achieved using several re-scorers: HD with Normalized Smith-Waterman Re-scorer, HD with TFIDF and Jaccard re-scorers, HD with the Longest Common Prefix and TFIDF re-scorers, and HD with the Unweighted Longest Common Prefix Re-scorer. Another contribution of this dissertation is an extensive evaluation of string similarity methods for duplicate detection and clustering tasks in the life and social sciences, bioinformatics, and medical informatics domains. The experimental results are illustrated with precision-recall charts and a number of tables presenting the average precision, maximum F1, and execution time.
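
    The dissertation's exact definitions are not reproduced in this abstract, but the core LACP idea can be sketched. Below is a minimal Python illustration, assuming similarity is measured as the length of the longest shared prefix containing at most a small budget of mismatches, normalized by the mean string length; both the mismatch budget and the normalization are assumptions for illustration, not the published formulation. A single left-to-right scan gives the linear time the abstract mentions.

```python
def lacp_similarity(a: str, b: str, max_mismatches: int = 1) -> float:
    """Toy longest-approximately-common-prefix similarity (one linear scan).

    The mismatch budget and the mean-length normalization are
    illustrative choices, not the dissertation's exact formulation.
    """
    prefix_len = 0
    mismatches = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        prefix_len += 1
    mean_len = (len(a) + len(b)) / 2
    return prefix_len / mean_len if mean_len else 0.0

# Misspelled biomedical terms still share a long approximate prefix:
# lacp_similarity("hyperlipidemia", "hyperlipidaemia")  -> about 0.76
```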

    Relaxation of Subgraph Queries Delivering Empty Results

    Get PDF
    Graph databases with the property graph model are used in multiple domains, including social networks, biology, and data integration. They provide schema-flexible storage for data with differing degrees of structure and support complex, expressive queries such as subgraph isomorphism queries. The flexibility and expressiveness of graph databases make it difficult for users to express queries correctly and can lead to unexpected query results, e.g. empty results. Therefore, we propose a relaxation approach for subgraph isomorphism queries that is able to automatically rewrite a graph query such that the rewritten query is similar to the original query and returns a non-empty result set. In detail, we present relaxation operations applicable to a query, cardinality estimation heuristics, and strategies for prioritizing graph query elements to be relaxed. To determine the similarity between the original query and its relaxed variants, we propose a novel cardinality-based graph edit distance. The feasibility of our approach is shown by using real-world queries from the DBpedia query log.
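
    As a rough illustration of how such a rewrite loop can be organized, the sketch below relaxes a toy query (a set of required labels and property constraints) one operation at a time and runs the least-relaxed variant first. The relaxation operations, their costs, and the `run_query` callback are all assumptions standing in for the paper's operations, prioritization strategies, and cardinality-based graph edit distance.

```python
import heapq
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    labels: frozenset   # required node labels, e.g. frozenset({"Person"})
    props: frozenset    # required (key, value) pairs

def relaxations(q: Query):
    """Yield single-step relaxed variants with an assumed step cost:
    dropping a property is treated as a smaller change than a label."""
    for p in q.props:
        yield Query(q.labels, q.props - {p}), 1.0
    for l in q.labels:
        yield Query(q.labels - {l}, q.props), 2.0

def relax_until_nonempty(query: Query, run_query, max_queries: int = 25):
    """Try the cheapest (most similar) relaxed variants first and stop
    at the first one whose result set is non-empty."""
    tie = itertools.count()                    # heap tie-breaker
    frontier = [(0.0, next(tie), query)]
    seen = set()
    while frontier and max_queries > 0:
        cost, _, cand = heapq.heappop(frontier)
        if cand in seen:
            continue
        seen.add(cand)
        max_queries -= 1
        result = run_query(cand)               # executes against the graph DB
        if result:
            return cand, result, cost
        for nxt, step in relaxations(cand):
            heapq.heappush(frontier, (cost + step, next(tie), nxt))
    return None
```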

    Matching Vehicle License Plate Numbers Using License Plate Recognition and Text Mining Techniques

    Get PDF
    License plate recognition (LPR) technology has been widely applied in many different transportation applications such as enforcement, vehicle monitoring, and access control. In most applications involving enforcement (e.g. cashless toll collection, congestion charging) and access control (e.g. car parking), a plate is recognized at one location (or checkpoint) and compared against a list of authorized vehicles. In this research I dealt with applications where a vehicle is detected at two locations and there is no reference list for vehicle identification. There seems to have been very little effort in the past to exploit all the information generated by LPR systems. Nowadays, LPR machines can recognize most characters on vehicle plates even under the harshest practical conditions. Therefore, even though the equipment is not perfect in terms of plate reading, it is still possible to judge with certain confidence whether a pair of imperfect readings, in the form of sequenced characters (strings), most likely belongs to the same vehicle. The challenge is to design a matching procedure that decides whether or not they belong to the same vehicle. In view of the aforementioned problem, this research designed and assessed a matching procedure that takes advantage of a similarity measure called the edit distance (ED) between two strings. The ED measures the minimum editing cost of converting one string into another. The study first assessed a simple case of a dual LPR setup using the traditional ED formulation with 0 or 1 cost assignments (i.e. 0 if a pair-wise character is the same, and 1 otherwise). For this dual setup, the research further proposed a symbol-based weight function using a probabilistic approach that takes as input the conditional probability matrix of character associations. This new formulation outperformed the original ED formulation. Lastly, the research incorporated passage time information into the procedure. With this, the performance of the matching procedure improved considerably, resulting in a high positive matching rate and a much lower false matching rate (about 2%).
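
    The two cost models described above are easy to make concrete. The sketch below shows the classic dynamic-programming edit distance with 0/1 costs, plus a weighted substitution cost derived from a character-confusion table; the table entries here are placeholders, not the conditional probabilities estimated in the research.

```python
def edit_distance(a: str, b: str, sub_cost=None) -> float:
    """Dynamic-programming edit distance. With sub_cost=None this is the
    classic formulation (0 for a match, 1 for any edit); passing a
    function lets substitution costs reflect LPR misread likelihoods."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

# Placeholder confusion probabilities for visually similar characters:
CONFUSION = {("8", "B"): 0.30, ("0", "O"): 0.40, ("1", "I"): 0.35}

def confusion_cost(x: str, y: str) -> float:
    """Cheaper substitutions for pairs the reader is likely to confuse."""
    if x == y:
        return 0.0
    p = CONFUSION.get((x, y)) or CONFUSION.get((y, x)) or 0.01
    return 1.0 - p

# edit_distance("ABC8123", "ABCB123")                  -> 1.0
# edit_distance("ABC8123", "ABCB123", confusion_cost)  -> 0.7
```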

    An Investigation and Application of Biology and Bioinformatics for Activity Recognition

    Get PDF
    Activity recognition in a smart home context is inherently difficult due to the variable nature of human activities and the tracking artifacts introduced by video-based tracking systems. This thesis addresses the activity recognition problem by introducing a biologically-inspired chemotactic approach and bioinformatics-inspired sequence alignment techniques to recognise spatial activities. The approaches are demonstrated in real-world conditions to improve robustness and recognise activities in the presence of innate activity variability and tracking noise.
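
    For readers unfamiliar with the bioinformatics side, the sketch below shows how a standard global alignment (Needleman-Wunsch) can score two activity traces encoded as symbol sequences, tolerating the insertions and deletions that tracking noise introduces. The encoding and the scoring parameters are illustrative assumptions, not the thesis's configuration.

```python
def align_score(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two activity
    traces encoded as symbols (e.g. discretized room or zone IDs)."""
    m, n = len(seq1), len(seq2)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            hit = match if seq1[i - 1] == seq2[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + hit,  # match / mismatch
                          s[i - 1][j] + gap,      # gap in seq2
                          s[i][j - 1] + gap)      # gap in seq1
    return s[m][n]

# A noisy observed trace scores highly against its activity template
# despite a duplicated symbol from a tracking glitch:
# align_score("KKSSTT", "KKSTT")  -> 9
```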

    Arbitrary Keyword Spotting in Handwritten Documents

    Get PDF
    Despite the existence of electronic media in today’s world, a considerable amount of written communication is in paper form, such as books, bank cheques, and contracts. There is an increasing demand for the automation of information extraction, classification, search, and retrieval of documents. The goal of this research is to develop a complete methodology for the spotting of arbitrary keywords in handwritten document images. We propose a top-down approach to the spotting of keywords in document images. Our approach is composed of two major steps: segmentation and decision. In the former, we generate the word hypotheses. In the latter, we decide whether a generated word hypothesis is a specific keyword or not. We carry out the decision step through a two-level classification where, first, we assign an input image to a keyword or non-keyword class, and then transcribe the image if it is passed as a keyword. By reducing the problem from the image domain to the text domain, we address not only the search problem in handwritten documents, but also classification and retrieval, without the need for transcription of the whole document image. The main contribution of this thesis is the development of a generalized minimum edit distance for handwritten words and the proof that this distance is equivalent to an Ergodic Hidden Markov Model (EHMM). To the best of our knowledge, this work is the first to present an exact 2D model for the temporal information in handwriting while satisfying practical constraints. Other contributions of this research include: 1) removal of page margins based on corner detection in projection profiles; 2) removal of noise patterns in handwritten images using expectation maximization and fuzzy inference systems; 3) extraction of text lines based on fast Fourier-based steerable filtering; 4) segmentation of characters based on skeletal graphs; and 5) merging of broken characters based on graph partitioning. Our experiments with a benchmark database of handwritten English documents and a real-world collection of handwritten French documents indicate that, even without any word/document-level training, our results are comparable with two state-of-the-art word spotting systems for English and French documents.
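
    As a small, concrete example of the preprocessing listed under contribution 1, the sketch below trims page margins using plain projection profiles with a density threshold; it is a simplified stand-in for the corner-detection approach actually described in the thesis.

```python
import numpy as np

def trim_margins(binary_img: np.ndarray, ink_fraction: float = 0.01):
    """Crop page margins via projection profiles: rows and columns whose
    ink density stays below a threshold are treated as margin. This is a
    simplified illustration, not the thesis's corner-detection method.

    binary_img: 2-D array, 1 = ink, 0 = background.
    """
    row_profile = binary_img.mean(axis=1)   # ink density per row
    col_profile = binary_img.mean(axis=0)   # ink density per column
    rows = np.where(row_profile > ink_fraction)[0]
    cols = np.where(col_profile > ink_fraction)[0]
    if rows.size == 0 or cols.size == 0:
        return binary_img                   # blank page: nothing to trim
    return binary_img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```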

    Systems and models for secure fallback authentication

    Get PDF
    Fallback authentication (FA) techniques such as security questions, Email resets, and SMS resets have significant security flaws that easily undermine the primary method of authentication. Security questions have been shown to be often guessable. Email resets assume a secure channel of communication and pose the threat of the avalanche effect; where one compromised email account can compromise a series of other accounts. SMS resets also assume a secure channel of communication and are vulnerable to attacks on telecommunications protocols. Additionally, all of these FA techniques are vulnerable to the known adversary. The known adversary is any individual with elevated knowledge of a potential victim, or elevated access to a potential victim's devices that uses these privileges with malicious intent, undermining the most commonly used FA techniques. An authentication system is only as strong as its weakest link; in many cases this is the FA technique used. As a result of that, we explore one new and one altered FA system: GeoPassHints a geographic authentication system paired with a secret note, as well as GeoSQ, an autobiographical authentication scheme that relies on location data to generate questions. We also propose three models to quantify the known adversary in order to establish an improved measurement tool for security research. We test GeoSQ and GeoPassHints for usability, security, and deployability through a user study with paired participants (n=34). We also evaluate the models for the purpose of measuring vulnerabilities to the known adversary by correlating the scores obtained in each model to the successful guesses that our participant pairs made

    Retrieval of Vehicle License Number from a Database Using Imperfect Input

    No full text
    Vision-based car license plate recognition seems to be an easy task, but in practice it is hard to achieve a high recognition rate. Lighting conditions have a huge influence on all vision-based systems, such that a perfect car license plate recognition system is still unavailable. In some specific applications, these unstable results can still be useful if the recognition work is supported by a license plate database. For example, a database can contain vehicle information for all cars parked in a vision-based intelligent parking lot; when a vehicle owner wants to retrieve his/her car, we can retrieve candidate cars from the database according to the edit distance between the input license plate and the license plates in the database. Edit distance is a powerful tool for measuring the similarity between two strings. When comparing two license plates, we use the edit distance technique to calculate the difference between them, using the Chamfer distance to define the cost of the character editing operations: inserting a character, deleting a character, and replacing a character. The edit distance between two license plates then represents their similarity. We modify the method of calculating the edit distance by considering the neighborhood relationship of the characters in the source string when editing the source string into the destination string. The underlying method was first proposed as the Markov edit distance by J. Wei [Wei04]. We modify two clique potential functions from J. Wei's paper to fit license plate comparison and obtain a finer edit distance when the Markov relationship is considered. The modified Markov edit distance is very useful when comparing license plates with reshuffled characters.
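
    Wei's formulation is based on Markov random field clique potentials and is not reproduced here, but the flavor of a context-sensitive substitution cost can be sketched. In the toy cost function below, a replacement becomes cheaper when the neighboring characters already agree (the context supports the hypothesis of a misread), and the base cost comes from a placeholder glyph-shape table standing in for the Chamfer distance. Such a cost plugs into a standard edit distance dynamic program that passes character positions to the cost function; the discount factors are illustrative assumptions.

```python
# Placeholder shape distances; a real system would render the glyphs and
# compute a Chamfer distance between their edge maps.
SHAPE = {frozenset("8B"): 0.2, frozenset("0O"): 0.1, frozenset("2Z"): 0.3}

def shape_dist(x: str, y: str) -> float:
    """Toy stand-in for a Chamfer distance between two character glyphs."""
    return SHAPE.get(frozenset(x + y), 1.0)

def context_sub_cost(a: str, b: str, i: int, j: int) -> float:
    """Substitution cost for a[i] -> b[j] that considers its neighbors,
    in the spirit of a pairwise clique potential: a visually plausible
    misread is discounted when the surrounding characters agree."""
    if a[i] == b[j]:
        return 0.0
    cost = shape_dist(a[i], b[j])           # base cost from glyph shapes
    left_ok = i > 0 and j > 0 and a[i - 1] == b[j - 1]
    right_ok = i + 1 < len(a) and j + 1 < len(b) and a[i + 1] == b[j + 1]
    if left_ok and right_ok:
        cost *= 0.5                         # strong contextual support
    elif left_ok or right_ok:
        cost *= 0.75
    return cost
```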

    Portable Stolen Vehicle Detector

    No full text
    In this paper, a portable stolen vehicle detector is proposed. The detector combines a GPS sensor, a color CCD camera, a license plate recognition program, and a remote database with a smart query method. The hardware architecture and user operation diagram are discussed. The portable stolen vehicle recognition system is designed for police officers who work at the roadside, so convenience is very important in the design of the system operation, and the required user operations are kept as few as possible. Traditionally, a query key to the database may be partial but must be correct, whereas the recognition program cannot guarantee that all of its results are correct. Thus, we employ the modified Markov edit distance method to calculate matching scores and find the best candidates. Moreover, we provide the prototype design chart of the detector and test a trial combination using a notebook PC and other hardware. Real outdoor experimental results demonstrate that our system can work at the roadside to recognize license plates and verify whether a vehicle is stolen or not.
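
    The "smart query" step can be sketched as a nearest-neighbor lookup: rank the stolen-vehicle plates by string distance to the possibly imperfect LPR reading and return the most plausible candidates. Plain Levenshtein distance is used below for brevity; the paper's modified Markov edit distance would slot in as the scoring function.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with a rolling row; a weighted or
    context-sensitive variant would replace this unchanged."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_candidates(reading: str, stolen_plates, k: int = 5):
    """Rank the stolen-vehicle database by distance to the (possibly
    imperfect) LPR reading and return the k most plausible matches."""
    return sorted(stolen_plates, key=lambda p: levenshtein(reading, p))[:k]

# best_candidates("AB(1234", ["AB-1234", "XY-9999", "AB-1284"], k=2)
#   -> ["AB-1234", "AB-1284"]
```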