13 research outputs found
Improved Algorithms for Approximate String Matching (Extended Abstract)
The problem of approximate string matching is important in many different
areas such as computational biology, text processing and pattern recognition. A
great effort has been made to design efficient algorithms addressing several
variants of the problem, including comparison of two strings, approximate
pattern identification in a string or calculation of the longest common
subsequence that two strings share.
We designed an output sensitive algorithm solving the edit distance problem
between two strings of lengths n and m respectively in time
O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance
between the two strings. This worst-case time bound sets the quadratic factor
of the algorithm independent of the longest string length and improves existing
theoretical bounds for this problem. The implementation of our algorithm excels
also in practice, especially in cases where the two strings compared differ
significantly in length. Source code of our algorithm is available at
http://www.cs.miami.edu/\~dimitris/edit_distanceComment: 10 page
Pitch- and spectral-based dynamic time warping methods for comparing field recordings of harmonic avian vocalizations
Quantitative measures of acoustic similarity can reveal patterns of shared vocal behavior in social species. Many methods for computing similarity have been developed, but their performance has not been extensively characterized in noisy environments and with vocalizations characterized by complex frequency modulations. This paper describes methods of bioacoustic comparison based on dynamic time warping (DTW) of the fundamental frequency or spectrogram. Fundamental frequency is estimated using a Bayesian particle filter adaptation of harmonic template matching. The methods were tested on field recordings of flight calls from superb starlings, Lamprotornis superbus, for how well they could separate distinct categories of call elements (motifs). The fundamental-frequency-based method performed best, but the spectrogram-based method was less sensitive to noise. Both DTW methods provided better separation of categories than spectrographic cross correlation, likely due to substantial variability in the duration of superb starling flight call motifs
Medical record linkage in health information systems by approximate string matching and clustering
BACKGROUND: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity. METHODS: The proposed method is in three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pair records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. And the third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e.: real-time) proximity detection when inserting a new identity