The Flexible Group Spatial Keyword Query
We present a new class of service for location based social networks, called
the Flexible Group Spatial Keyword Query, which enables a group of users to
collectively find a point of interest (POI) that optimizes an aggregate cost
function combining both spatial distances and keyword similarities. In
addition, our query service allows users to consider the tradeoffs between
obtaining a sub-optimal solution for the entire group and obtaining an
optimized solution but only for a subgroup.
We propose algorithms to process three variants of the query: (i) the group
nearest neighbor with keywords query, which finds a POI that optimizes the
aggregate cost function for the whole group of size n, (ii) the subgroup
nearest neighbor with keywords query, which finds the optimal subgroup and a
POI that optimizes the aggregate cost function for a given subgroup size m (m
<= n), and (iii) the multiple subgroup nearest neighbor with keywords query,
which finds optimal subgroups and corresponding POIs for each of the subgroup
sizes in the range [m, n]. We design query processing algorithms based on
branch-and-bound and best-first paradigms. Finally, we provide theoretical
bounds and conduct extensive experiments with two real datasets which verify
the effectiveness and efficiency of the proposed algorithms.
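To make the three query variants concrete, here is a minimal brute-force sketch. It is not the branch-and-bound or best-first algorithm from the paper. The abstract does not define the aggregate cost function, so this sketch assumes a weighted sum of Euclidean distance and Jaccard keyword distance, aggregated by summation; the names `user_cost`, `best_poi`, and the weight `alpha` are hypothetical.

```python
from math import hypot

def jaccard_distance(a, b):
    """Keyword dissimilarity in [0, 1]; 0 means identical keyword sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def user_cost(user, poi, alpha=0.5):
    """Assumed per-user cost: weighted sum of spatial and keyword distance."""
    (ux, uy), user_keywords = user
    (px, py), poi_keywords = poi
    spatial = hypot(ux - px, uy - py)
    return alpha * spatial + (1 - alpha) * jaccard_distance(user_keywords, poi_keywords)

def best_poi(users, pois, m=None, alpha=0.5):
    """Return (total_cost, poi_index, subgroup) for subgroup size m.

    With m=None the whole group is used (variant i). For a fixed POI and a
    sum aggregate, the best subgroup of size m is simply the m users with
    the smallest individual costs (variant ii).
    """
    m = len(users) if m is None else m
    best = None
    for idx, poi in enumerate(pois):
        costs = sorted((user_cost(u, poi, alpha), i) for i, u in enumerate(users))
        chosen = costs[:m]
        total = sum(c for c, _ in chosen)
        if best is None or total < best[0]:
            best = (total, idx, sorted(i for _, i in chosen))
    return best

def best_poi_per_size(users, pois, m, alpha=0.5):
    """Variant iii: one answer for every subgroup size in [m, n]."""
    return {size: best_poi(users, pois, size, alpha)
            for size in range(m, len(users) + 1)}
```

This exhaustive scan evaluates every user against every POI, which is the cost the paper's index-based algorithms are designed to avoid; it is useful only as a reference for what the queries return.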
Bulk Insertions into xBR+-trees
Bulk insertion refers to the process of updating an existing index by inserting a large batch of new data, treating the items of this batch as a whole rather than inserting them one by one. Bulk insertion is related to bulk loading, which refers to the process of creating an index from scratch when the dataset to be indexed is available beforehand. The xBR+-tree is a balanced, disk-resident, Quadtree-based index for point data, which is very efficient for processing spatial queries. In this paper, we present the first algorithm for bulk insertion into xBR+-trees. This algorithm incorporates extensions of techniques that we have recently developed for bulk loading xBR+-trees. Moreover, using real and artificial datasets of various cardinalities, we present an experimental comparison of this algorithm vs. inserting items one by one for updating xBR+-trees, regarding performance (I/O and execution time) and the characteristics of the resulting trees. We also present experimental results regarding the query-processing efficiency of xBR+-trees built by bulk insertions vs. xBR+-trees built by inserting items one by one.
Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs
We demonstrate that a graph-based search algorithm, relying on the
construction of an approximate neighborhood graph, can directly work with
challenging non-metric and/or non-symmetric distances without resorting to
metric-space mapping and/or distance symmetrization, which, in turn, lead to
substantial performance degradation. Although straightforward metrization
and symmetrization are usually ineffective, we find that constructing an index
using a modified, e.g., symmetrized, distance can improve performance. This
observation paves the way for a new line of research on designing
index-specific graph-construction distance functions.
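The idea of searching a neighborhood graph directly with a non-symmetric distance can be sketched as follows. This is an illustrative toy, not the authors' implementation. It assumes KL divergence as the non-symmetric distance, builds an exact k-nearest-neighbor graph by brute force rather than an approximate one, and uses a standard best-first traversal; the function names and the `ef` candidate-list parameter are hypothetical.

```python
import heapq
from math import log

def kl_divergence(p, q):
    """Non-symmetric distance between strictly positive distributions."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def build_knn_graph(data, k, dist):
    """Brute-force graph: each node links to its k nearest nodes under dist.

    Passing a different (e.g. symmetrized) dist here than at query time
    mirrors the construction-distance idea mentioned in the abstract.
    """
    graph = []
    for i, x in enumerate(data):
        others = [j for j in range(len(data)) if j != i]
        others.sort(key=lambda j: dist(x, data[j]))
        graph.append(others[:k])
    return graph

def greedy_search(data, graph, query, dist, start=0, ef=3):
    """Best-first search; returns (distance, index) of the best node found."""
    d0 = dist(query, data[start])
    visited = {start}
    candidates = [(d0, start)]   # min-heap of nodes still to expand
    best = [(-d0, start)]        # max-heap holding the ef closest nodes seen
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) == ef and d > -best[0][0]:
            break                # nothing left can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, data[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return min((-neg, i) for neg, i in best)
```

On a sparse graph this search is approximate and may miss the true nearest neighbor; the argument order passed to `dist` matters because the distance is not symmetric.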
Anonymisation of geographical distance matrices via Lipschitz embedding
BACKGROUND: Anonymisation of spatially referenced data has received increasing attention in recent years. Whereas the research focus has been on the anonymisation of point locations, the disclosure risk arising from the publishing of inter-point distances and corresponding anonymisation methods have not been studied systematically.
METHODS: We propose a new anonymisation method for the release of geographical distances between records of a microdata file, for example patients in a medical database. We discuss a data release scheme in which microdata without coordinates and an additional distance matrix between the corresponding rows of the microdata set are released. In contrast to most other approaches, this method preserves small distances better than larger distances. The distances are modified by a variant of Lipschitz embedding.
RESULTS: The effects of the embedding parameters on the risk of data disclosure are evaluated by linkage experiments using simulated data. The results indicate small disclosure risks for appropriate embedding parameters.
CONCLUSION: The proposed method is useful if published distance information might be misused for the re-identification of records. The method can be used for publishing scientific-use-files and as an additional tool for record-linkage studies.
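For readers unfamiliar with the technique, the textbook form of a Lipschitz embedding can be sketched as below. The abstract says only that a "variant" is used, so this is an assumption-laden illustration rather than the published method: reference sets are drawn at random from the data, each point is mapped to its minimum distance to every reference set, and released distances are Chebyshev distances between the embedded vectors. The parameters `k`, `set_size`, and `seed` are hypothetical.

```python
import random
from math import hypot

def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def lipschitz_embed(points, reference_sets, dist):
    """Map each point to its minimum distance to every reference set."""
    return [[min(dist(p, r) for r in ref) for ref in reference_sets]
            for p in points]

def chebyshev(u, v):
    """Largest coordinate-wise difference between two embedded vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def embedded_distance_matrix(points, k=4, set_size=2, seed=0):
    """Distance matrix computed in the embedded space instead of on coordinates."""
    rng = random.Random(seed)
    reference_sets = [rng.sample(points, set_size) for _ in range(k)]
    embedded = lipschitz_embed(points, reference_sets, euclidean)
    return [[chebyshev(u, v) for v in embedded] for u in embedded]
```

By the triangle inequality, each embedded distance in this sketch is never larger than the original distance, so the released matrix distorts the true geometry; how much distortion is needed to keep disclosure risk low is what the paper's linkage experiments evaluate.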
Using metric space indexing for complete and efficient record linkage
Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
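The reason a metric space index can be "complete" is that the triangle inequality lets it discard records without ever missing a true match within the distance threshold. The sketch below illustrates that principle with a single pivot and Levenshtein distance. It is a simplified stand-in, not the index structure evaluated in the paper, and `PivotIndex` and its methods are hypothetical names.

```python
def levenshtein(a, b):
    """Edit distance between two strings; it satisfies the metric axioms."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class PivotIndex:
    """Stores each record's distance to one pivot, computed once at build time."""

    def __init__(self, records, pivot, dist=levenshtein):
        self.dist = dist
        self.pivot = pivot
        self.table = [(rec, dist(rec, pivot)) for rec in records]

    def range_query(self, query, threshold):
        """Return (matches, number of full distance computations performed)."""
        dq = self.dist(query, self.pivot)
        matches, computed = [], 0
        for rec, dp in self.table:
            # Triangle inequality: dist(query, rec) >= |dq - dp|, so a record
            # with |dq - dp| > threshold cannot be a match and is skipped.
            if abs(dq - dp) > threshold:
                continue
            computed += 1
            if self.dist(query, rec) <= threshold:
                matches.append(rec)
        return matches, computed
```

Because the pruning rule is exact, the result always equals what an exhaustive comparison would return; only the number of distance computations changes.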
Neuregulin 1 and susceptibility to schizophrenia
The cause of schizophrenia is unknown, but it has a significant genetic component. Pharmacologic studies, studies of gene expression in man, and studies of mouse mutants suggest involvement of glutamate and dopamine neurotransmitter systems. However, so far, strong association has not been found between schizophrenia and variants of the genes encoding components of these systems. Here, we report the results of a genomewide scan of schizophrenia families in Iceland; these results support previous work, done in five populations, showing that schizophrenia maps to chromosome 8p. Extensive fine-mapping of the 8p locus and haplotype-association analysis, supplemented by a transmission/disequilibrium test, identifies neuregulin 1 (NRG1) as a candidate gene for schizophrenia. NRG1 is expressed at central nervous system synapses and has a clear role in the expression and activation of neurotransmitter receptors, including glutamate receptors. Mutant mice heterozygous for either NRG1 or its receptor, ErbB4, show a behavioral phenotype that overlaps with mouse models for schizophrenia. Furthermore, NRG1 hypomorphs have fewer functional NMDA receptors than wild-type mice. We also demonstrate that the behavioral phenotypes of the NRG1 hypomorphs are partially reversible with clozapine, an atypical antipsychotic drug used to treat schizophrenia.
Fourteen sequence variants that associate with multiple sclerosis discovered by meta-analysis informed by genetic correlations
A meta-analysis of publicly available summary statistics on multiple sclerosis combined with three Nordic multiple sclerosis cohorts (21,079 cases, 371,198 controls) revealed seven sequence variants associating with multiple sclerosis, not reported previously. Using polygenic risk scores based on public summary statistics of variants outside the major histocompatibility complex region, we quantified genetic overlap between common autoimmune diseases in Icelanders and identified disease clusters characterized by autoantibody presence/absence. As multiple sclerosis polygenic risk scores capture the risk of primary biliary cirrhosis and vice versa (P = 1.6 × 10^-7 and 4.3 × 10^-9), we used primary biliary cirrhosis as a proxy-phenotype for multiple sclerosis, the idea being that variants conferring risk of primary biliary cirrhosis have a prior probability of conferring risk of multiple sclerosis. We tested 255 variants forming the primary biliary cirrhosis polygenic risk score and found seven multiple sclerosis-associating variants not correlated with any previously established multiple sclerosis variants. Most of the variants discovered are close to or within immune-related genes. One is a low-frequency missense variant in TYK2, another is a missense variant in MTHFR that reduces the function of the encoded enzyme affecting methionine metabolism, reported to be dysregulated in multiple sclerosis brain.
Swedish Research Council
Knut and Alice Wallenberg Foundation
AFA Foundation
Swedish Brain Foundation
Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
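Two of the quantities named in this abstract, tf-idf cosine similarity and the Jensen-Shannon divergence, have standard definitions that can be sketched briefly. The abstract does not give the exact term weighting or the coherence formula used in the study, so the code below assumes raw term frequency times log(N/df) for tf-idf and base-2 logarithms for the divergence; all function names are hypothetical.

```python
from collections import Counter
from math import log, log2, sqrt

def tfidf_vectors(docs):
    """One sparse tf-idf vector per tokenized document (assumed weighting)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: tf * log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def distribution(tokens):
    """Relative word frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, m):
    """KL divergence in bits; m must be nonzero wherever p is nonzero."""
    return sum(pv * log2(pv / m[w]) for w, pv in p.items() if pv > 0)

def jensen_shannon(p, q):
    """Symmetric divergence in [0, 1]: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

A lower divergence between a document's word distribution and that of its cluster indicates a more textually coherent cluster, which is the intuition behind the first comparison measure in the abstract.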