85 research outputs found

    Search Engine Similarity Analysis: A Combined Content and Rankings Approach

    How different are search engines? The search engine wars are a favorite topic of on-line analysts, as two of the biggest companies in the world, Google and Microsoft, battle for dominance of the web search space. Differences in search engine popularity can be explained by their effectiveness or by other factors, such as familiarity with the most popular first engine, peer imitation, or force of habit. In this work we present a thorough analysis of the affinity of the two major search engines, Google and Bing, along with DuckDuckGo, which goes to great lengths to emphasize its privacy-friendly credentials. To do so, we collected search results using a comprehensive set of 300 unique queries for two time periods in 2016 and 2019, and developed a new similarity metric that leverages both the content and the ranking of search responses. We evaluated the characteristics of the metric against other metrics and approaches that have been proposed in the literature, and used it to investigate (1) the similarities of search engine results, (2) the evolution of their affinity over time, (3) which aspects of the results influence similarity, and (4) how the metric differs across different kinds of search services. We found that Google stands apart, but Bing and DuckDuckGo are largely indistinguishable from each other. Comment: A shorter version of this paper was accepted at the 21st International Conference on Web Information Systems Engineering (WISE 2020). The final authenticated version is available online at https://doi.org/10.1007/978-3-030-62008-0_

    Medical record linkage in health information systems by approximate string matching and clustering

    BACKGROUND: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity. METHODS: The proposed method has three steps. The first step is to standardise and index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pairs of records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e., real-time) proximity detection when inserting a new identity.
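    The three steps in the abstract above (blocking, pairwise matching, clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard Jaro-Winkler similarity rather than the "Porter-Jaro-Winkler" variant the abstract names, and it replaces the graph and agglomerative clustering methods with plain connected components. The record fields (`name`, `birth_year`), the blocking key, and the 0.9 threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def cluster_duplicates(records, threshold=0.9):
    """Return clusters of record indices judged to be the same identity."""
    # Step 1: standardise names and group records by a blocking variable.
    names = [r["name"].strip().lower() for r in records]
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[record["birth_year"]].append(idx)

    # Union-find structure used to merge linked records (step 3).
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: compare pairs only within a block and link similar ones.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if jaro_winkler(names[a], names[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

    Blocking is what keeps the comparison count manageable: only records sharing a blocking key are compared, so the quadratic cost applies per block rather than to the whole file. The trade-off is that an error in the blocking field itself hides a true duplicate.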

    Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
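    The field-level Bloom filter encoding mentioned in the abstract above can be sketched as follows. This is a commonly used construction (character bigrams hashed into a bit array, compared with the Dice coefficient), offered as an assumption about the general technique and not as the authors' exact scheme. The filter length, number of hash functions, padding character, and secret key are hypothetical, and the EM-based parameter estimation is not shown.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lower-cased field value."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, num_bits=1000, num_hashes=30):
    """Encode one field as the set of bit positions set in its Bloom filter.

    Bit positions are derived by double hashing: a keyed SHA-256 digest is
    split into two integers h1 and h2, and position i is (h1 + i * h2) mod
    num_bits.
    """
    positions = set()
    for gram in bigrams(value):
        digest = hmac.new(secret, gram.encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % num_bits)
    return frozenset(positions)


def dice(filter_a, filter_b):
    """Dice coefficient between two Bloom filters given as position sets."""
    if not filter_a and not filter_b:
        return 1.0
    return 2 * len(filter_a & filter_b) / (len(filter_a) + len(filter_b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed by the data custodians
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    jones = bloom_encode("Jones", key)
    print(f"smith vs smyth: {dice(smith, smyth):.2f}")
    print(f"smith vs jones: {dice(smith, jones):.2f}")
```

    Because similar strings share most of their bigrams, their filters share most of their set bits, so the Dice score can stand in for a string comparator without exposing the underlying values. Those per-field scores are the kind of comparison outcome that a probabilistic model would then weight.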

    Accuracy and completeness of patient pathways – the benefits of national data linkage in Australia

    Background - The technical challenges associated with national data linkage, and the extent of cross-border population movements, are explored as part of a pioneering research project. The project involved linking state-based hospital admission records and death registrations across Australia for a national study of hospital related deaths. Methods - The project linked over 44 million morbidity and mortality records from four Australian states between 1 July 1999 and 31 December 2009 using probabilistic methods. The accuracy of the linkage was measured through a comparison with jurisdictional keys sourced from individual states. The extent of cross-border population movement between these states was also assessed. Results - Data matching identified almost twelve million individuals across the four Australian states. The percentage of individuals from one state with records found in another ranged from 3% to 5%. Using jurisdictional keys to measure linkage quality, results indicate a high matching efficiency (F-measure of 97% to 99%), with linkage processing taking only a matter of days. Conclusions - The results demonstrate the feasibility and accuracy of undertaking cross-jurisdictional linkage for national research. The benefits are substantial, particularly in relation to capturing the full complement of records in patient pathways as a result of cross-border population movements. The project identified a sizeable ‘mobile’ population with hospital records in more than one state. Research studies that focus on a single jurisdiction will under-enumerate the extent of hospital usage by individuals in the population. It is important that researchers understand and are aware of the impact of this missing hospital activity on their studies. The project highlights the need for an efficient and accurate data linkage system to support national research across Australia.
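    The F-measure used above to report linkage quality is the harmonic mean of precision and recall over linked record pairs. A minimal sketch, assuming the links produced by the linkage and the reference links (for example, those implied by jurisdictional keys) are both available as pairs of record identifiers; the identifiers in the test data are made up for illustration.

```python
def f_measure(predicted_pairs, true_pairs):
    """Harmonic mean of pairwise precision and recall.

    Pairs are treated as unordered, so ("a", "b") and ("b", "a") count as
    the same link.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of links that are correct
    recall = true_positives / len(truth)         # share of true links that were found
    return 2 * precision * recall / (precision + recall)
```

    With 3 predicted links of which 2 are correct, against 4 true links, precision is 2/3, recall is 1/2, and the F-measure is 4/7, about 0.57.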

    Parental and infant characteristics and childhood leukemia in Minnesota

    Background: Leukemia is the most common childhood cancer. With the exception of Down syndrome, prenatal radiation exposure, and higher birth weight, particularly for acute lymphoid leukemia (ALL), few risk factors have been firmly established. Translocations present in neonatal blood spots and the young age peak of diagnosis suggest that early-life factors are involved in childhood leukemia etiology. Methods: We investigated the association between birth characteristics and childhood leukemia through linkage of the Minnesota birth and cancer registries using a case-cohort study design. Cases included 560 children with ALL and 87 with acute myeloid leukemia (AML) diagnoses from 28 days to 14 years. The comparison group was comprised of 8,750 individuals selected through random sampling of the birth cohort from 1976–2004. Cox proportional hazards regression specific for case-cohort studies was used to compute hazard ratios (HR) and 95% confidence intervals (CIs). Results: Male sex (HR = 1.41, 95% CI 1.16–1.70), white race (HR = 2.32, 95% CI 1.13–4.76), and maternal birth interval ≥ 3 years (HR = 1.31, 95% CI 1.01–1.70) increased ALL risk, while maternal age increased AML risk (HR = 1.21/5 year age increase, 95% CI 1.0–1.47). Higher birth weights (>3798 grams) (HR_ALL = 1.46, 1.08–1.98; HR_AML = 1.97, 95% CI 1.07–3.65), and one minute Apgar scores ≤ 7 (HR_ALL = 1.30, 95% CI 1.05–1.61; HR_AML = 1.62, 95% CI 1.01–2.60) increased risk for both types of leukemia. Sex was not a significant modifier of the association between ALL and other covariates, with the exception of maternal education. Conclusion: We confirmed known risk factors for ALL: male sex, high birth weight, and white race. We have also provided data that supports an increased risk for AML following higher birth weights, and demonstrated an association with low Apgar scores.