Local Intrinsic Dimensionality Measures for Graphs, with Applications to Graph Embeddings
The notion of local intrinsic dimensionality (LID) is an important
advancement in data dimensionality analysis, with applications in data mining,
machine learning and similarity search problems. Existing distance-based LID
estimators were designed for tabular datasets encompassing data points
represented as vectors in a Euclidean space. After discussing their limitations
for graph-structured data in the context of graph embeddings and graph distances, we
propose NC-LID, a novel LID-related measure for quantifying the discriminatory
power of the shortest-path distance with respect to natural communities of
nodes as their intrinsic localities. It is shown how this measure can be used
to design LID-aware graph embedding algorithms by formulating two LID-elastic
variants of node2vec with personalized hyperparameters that are adjusted
according to NC-LID values. Our empirical analysis of NC-LID on a large number
of real-world graphs shows that this measure is able to point to nodes with
high link reconstruction errors in node2vec embeddings better than node
centrality metrics. The experimental evaluation also shows that the proposed
LID-elastic node2vec extensions improve node2vec by better preserving graph
structure in the generated embeddings.
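As context for the distance-based estimators that the abstract contrasts with NC-LID, the standard maximum-likelihood LID estimator over nearest-neighbor distances (the tabular-data estimator, not the paper's own measure) can be sketched as follows; the sample data and neighborhood size are illustrative:

```python
import numpy as np

def lid_mle(knn_dists):
    """Maximum-likelihood LID estimate (Levina-Bickel / Amsaleg style)
    from a query point's distances to its k nearest neighbors,
    sorted ascending and strictly positive."""
    d = np.asarray(knn_dists, dtype=float)
    # LID ~ -( (1/k) * sum_i ln(r_i / r_k) )^(-1)
    return -1.0 / np.mean(np.log(d / d[-1]))

# Sanity check: points uniform on a 2-D patch should have LID close to 2.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(20000, 2))
dists = np.sort(np.linalg.norm(pts, axis=1))[:100]
print(f"estimated LID: {lid_mle(dists):.2f}")
```

The estimate fluctuates with the neighborhood size k (variance roughly LID/sqrt(k)), which is one reason graph-structured data with small, discrete distance ranges is poorly served by such estimators.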
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
This paper reconsiders common benchmarking approaches to nearest neighbor
search. It is shown that the concept of local intrinsic dimensionality (LID)
allows the selection of query sets spanning a wide range of difficulty for real-world
datasets. Moreover, the effect of different LID distributions on the running
time performance of implementations is empirically studied. To this end,
different visualization concepts are introduced that provide a more
fine-grained view of the inner workings of nearest neighbor search
principles. The paper closes with remarks about the diversity of datasets
commonly used for nearest neighbor search benchmarking. It is shown that such
real-world datasets are not diverse: results on a single dataset predict
results on all other datasets well.
Comment: Preprint of the paper accepted at SISAP 201
The role of local dimensionality measures in benchmarking nearest neighbor search
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that provide a more fine-grained view of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
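The use of a local dimensionality measure to grade query difficulty can be illustrated with a small sketch; the relative-contrast definition below (mean distance divided by nearest-neighbor distance) and the percentile cutoffs are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def relative_contrast(query, data):
    """Local relative contrast: mean distance from the query to the
    dataset divided by the distance to its nearest neighbor.
    Low RC means the nearest neighbor barely stands out (a hard query)."""
    d = np.sort(np.linalg.norm(data - query, axis=1))
    return d.mean() / d[0]

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 16))
queries = rng.normal(size=(200, 16))
rc = np.array([relative_contrast(q, data) for q in queries])

# Tier the workload by RC percentile: high RC -> easy, low RC -> hard.
easy = queries[rc >= np.quantile(rc, 0.8)]
hard = queries[rc <= np.quantile(rc, 0.2)]
print(f"{len(easy)} easy and {len(hard)} hard queries selected")
```

Selecting benchmark queries from the tails of such a distribution, rather than uniformly, is the kind of difficulty-controlled workload construction the paper argues for.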
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article. Copyright @ 2010 Sage Publications.
Previous work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural detail in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods for geodesic distance estimation are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion.
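The Isomap-versus-MDS comparison can be reproduced in miniature with scikit-learn, using trustworthiness as a stand-in for the article's cluster-based measures of local structure preservation; the swiss-roll data, neighborhood sizes, and the use of metric (rather than non-metric) MDS here are simplifying assumptions:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, trustworthiness

# Swiss roll: a non-linear 2-D manifold in 3-D, standing in for the
# high-dimensional, non-linear document feature spaces discussed above.
X, _ = make_swiss_roll(n_samples=600, random_state=0)

emb_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
emb_mds = MDS(n_components=2, random_state=0).fit_transform(X)

# Trustworthiness in [0, 1]: how well small neighborhoods survive the map.
t_iso = trustworthiness(X, emb_iso, n_neighbors=10)
t_mds = trustworthiness(X, emb_mds, n_neighbors=10)
print(f"trustworthiness  Isomap: {t_iso:.3f}  MDS: {t_mds:.3f}")
```

Because Isomap replaces straight-line dissimilarities with geodesic distances along a neighborhood graph, it unrolls the manifold and keeps local neighborhoods intact, which is exactly the property the article finds important for cluster growing.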
Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques
Due to the rapid development of text generation models, people increasingly
often encounter texts that may start out as written by a human but then
continue as machine-generated results of large language models. Detecting the
boundary between human-written and machine-generated parts of such texts is a
very challenging problem that has not received much attention in the
literature. In this work, we consider and compare a number of approaches to
this artificial text boundary detection problem, evaluating several predictors
built on features of different kinds. We show that supervised fine-tuning of the
RoBERTa model works well for this task in general but fails to generalize in
important cross-domain and cross-generator settings, demonstrating a tendency
to overfit to spurious properties of the data. Then, we propose novel
approaches based on features extracted from a frozen language model's
embeddings that are able to outperform both the human accuracy level and
previously considered baselines on the Real or Fake Text benchmark. Moreover,
we adapt perplexity-based approaches for the boundary detection task and
analyze their behaviour. We analyze the robustness of all proposed classifiers
in cross-domain and cross-model settings, discovering important properties of
the data that can negatively influence the performance of artificial text
boundary detection algorithms.
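A perplexity-based boundary detector of the kind adapted above can be sketched as a change-point search over per-token scores; the synthetic negative log-likelihoods below stand in for scores a real language model would produce, and the window size is an illustrative choice:

```python
import numpy as np

def detect_boundary(nll, window=20):
    """Return the token index where the mean per-token negative
    log-likelihood (log-perplexity under a scoring LM) shifts most
    sharply, comparing a window before each candidate index against
    a window after it."""
    nll = np.asarray(nll, dtype=float)
    best_i, best_gap = None, -np.inf
    for i in range(window, len(nll) - window):
        gap = abs(nll[i:i + window].mean() - nll[i - window:i].mean())
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Synthetic scores: the human-written prefix surprises the model more
# (higher NLL) than the machine-generated continuation that follows it.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(4.0, 0.5, 150),   # human-written part
                         rng.normal(2.0, 0.5, 100)])  # generated part
print(f"detected boundary near token {detect_boundary(scores)}")
```

In practice the per-token scores would come from a frozen language model, and the window size trades localization precision against robustness to score noise.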
Applications of Continuous Spatial Models in Multiple Antenna Signal Processing
This thesis covers the investigation and application of continuous spatial models for multiple antenna signal processing. The use of antenna arrays for advanced sensing and communications systems has been facilitated by the rapid increase in the capabilities of digital signal processing systems. The wireless communications channel will vary across space as different signal paths from the same source combine and interfere. This creates a level of spatial diversity that can be exploited to improve the robustness and overall capacity of the wireless channel. Conventional approaches to using spatial diversity have centered on smart, adaptive antennas and spatial beam forming. Recently, the more general theory of multiple input, multiple output (MIMO) systems has been developed to utilise the independent spatial communication modes offered in a scattering environment. …
A geothermal aquifer in the dilation zones on the southern margin of the Dublin Basin
This is the author accepted manuscript. The final version is available from Oxford University Press via the DOI in this record.
We present modelling of the geophysical data from the Newcastle area, west of Dublin, Ireland, within the framework of the IRETHERM project. IRETHERM's overarching objective was to facilitate a more thorough strategic understanding of Ireland's geothermal energy potential through integrated modelling of new and existing geophysical, geochemical and geological data. The Newcastle area, one of the target localities, is situated at the southern margin of the Dublin Basin, close to the largest conurbation on the island of Ireland in the City of Dublin and surrounds. As part of IRETHERM, magnetotelluric (MT) soundings were carried out in the highly urbanized Dublin suburb in 2011 and 2012, and we describe the MT data acquisition, processing methods, multi-dimensional geoelectrical models and porosity modelling with other geophysical data. The MT time series were heavily noise-contaminated and distorted due to electromagnetic noise from nearby industry and Dublin City tram/railway systems. Time series processing was performed using several modern robust codes to obtain reasonably reliable and interpretable MT impedance and geomagnetic transfer function ‘tipper’ estimates at most of the survey locations. The most ‘quiet’ 3-hour subsets of data during the night time, when the DC ‘LUAS’ tram system was not operating, were used in multi-site and multivariate processing. The final 2-D models were examined using a stability technique, and two final 2-D profiles were derived, with reliability estimates expressed through conductance and resistivity. In the final stage of this study, 3-D modelling of all magnetotelluric data in the Newcastle area was also undertaken.
Comparison of the MT models and their interpretation with existing seismic profiles in the area reveals that the Blackrock to Newcastle Fault (BNF) zone is visible in the models as a conductive feature down to depths of 4 km. The investigated area below Newcastle can be divided into two depth domains. The first zone, from the surface down to 1–2 km, is dominated by NE-SW oriented conductors connected with shallow faults or folds probably filled with less saline waters. The conductors also cross the surface trace of the BNF. The second depth domain can be identified from depths of 2 km to 4 km, where structures are oriented along the BNF and the observed conductivity is lower. The deeper conductive layers are interpreted as geothermal-fluid-bearing rocks. Porosity and permeability estimations from the lithological borehole logs indicate the geothermal potential of the bedrock to deliver warm water to the surface. The fluid permeability estimation, based on Archie's law for porous structures and synthetic studies of fractured zones, suggests a permeability in the range 100 mD–100 D in the study area, which is prospective for geothermal energy exploitation.
Funders: Science Foundation Ireland (SFI), Slovak Academy of Sciences (SAS), European Union FP7, APVV, Slovak Grant Agency VEG
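The porosity side of such an estimation can be illustrated with Archie's law, which the abstract invokes; the tortuosity factor, cementation exponent, and fluid resistivity below are generic illustrative values, not those fitted in the study:

```python
# Archie's law: rho_bulk = a * rho_w * phi**(-m), so the porosity implied
# by an MT-derived bulk resistivity is phi = (a * rho_w / rho_bulk)**(1/m).
def archie_porosity(rho_bulk, rho_w, a=1.0, m=2.0):
    """Porosity under Archie's law; a (tortuosity) and m (cementation
    exponent) default to generic clean-sandstone values (assumed)."""
    return (a * rho_w / rho_bulk) ** (1.0 / m)

rho_w = 0.5  # ohm-m, a saline formation water (illustrative)
for rho_bulk in (10.0, 50.0, 200.0):  # ohm-m, bulk values from a model
    phi = archie_porosity(rho_bulk, rho_w)
    print(f"rho_bulk = {rho_bulk:6.1f} ohm-m -> porosity ~ {phi:.3f}")
```

The inversion shows why low resistivity anomalies in the MT models are read as fluid-bearing zones: with saline pore water, a modest drop in bulk resistivity implies a substantial increase in connected porosity.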
Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
Data management is becoming increasingly important in many applications, in particular in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support for both data uncertainty and query uncertainty.
When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing the dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques; the dynamic query planner achieves over 50% performance gains over a state-of-the-art threshold query optimizer in most cases and comes very close to the optimal plan in all cases.
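The core existence-probability test behind such threshold selections can be sketched for a Gaussian-modeled uncertain attribute; the tuples, cutoff, and threshold below are hypothetical:

```python
from statistics import NormalDist

def passes_threshold(mean, std, cutoff, tau):
    """Does P(X > cutoff) >= tau for X ~ Normal(mean, std)?
    This is the existence-probability test behind threshold
    selection over a continuous uncertain attribute."""
    p = 1.0 - NormalDist(mean, std).cdf(cutoff)
    return p >= tau

# Hypothetical uncertain tuples: (id, mean, std) for one attribute.
tuples = [(1, 9.0, 1.0), (2, 10.5, 0.2), (3, 8.0, 3.0)]
answers = [tid for tid, mu, sd in tuples
           if passes_threshold(mu, sd, cutoff=10.0, tau=0.5)]
print(answers)  # -> [2]
```

For one-dimensional Gaussians the probability is a closed-form CDF evaluation; the thesis's contribution is making the analogous test cheap when the predicate requires multi-dimensional integration.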
Next I address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. I define the formal semantics of array operations on uncertain data, involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple belongs in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines by 1 to 2 orders of magnitude while incurring only small storage overhead.
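The replica-placement idea for position uncertainty can be sketched as follows; the cell width, variance cutoff, and spread factor are illustrative parameters, not those evaluated in the thesis:

```python
# Position uncertainty in an array store: a tuple's uncertain dimension
# attribute ~ Normal(mean, std) decides which cell(s) hold it. Low-variance
# tuples get a single placement; high-variance tuples are replicated into
# every cell the attribute plausibly reaches, bounding query-time misses.
CELL_WIDTH = 1.0

def cells_for(mean, std, var_cutoff=0.5, spread=2.0):
    """Array cells that should store (a replica of) this tuple."""
    home = int(mean // CELL_WIDTH)
    if std <= var_cutoff:
        return {home}                      # single, most-likely cell
    lo = int((mean - spread * std) // CELL_WIDTH)
    hi = int((mean + spread * std) // CELL_WIDTH)
    return set(range(lo, hi + 1))          # replicate across reachable cells

print(cells_for(3.4, 0.1))  # -> {3}
print(cells_for(3.4, 1.0))  # -> {1, 2, 3, 4, 5}
```

Because only the few high-variance tuples are replicated, the storage overhead stays small while range queries over dimension attributes no longer miss probable placements.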
Finally, to bridge the growing gap between the rapid expansion of data and the limited human ability to comprehend it, and to help users retrieve high-value content more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving the desired efficiency for interactive exploration (2 to 5 seconds per iteration).
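An explore-by-example loop of the kind described can be sketched with a simulated user predicate and SVM uncertainty sampling; the predicate, kernel, and query budget here are all illustrative assumptions, not the thesis's configuration:

```python
import numpy as np
from sklearn.svm import SVC

# Simulated "explore-by-example": a hidden predicate plays the user, and
# each round the SVM asks about the unlabeled point nearest its boundary.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(1000, 2))

def interest(P):
    return (P[:, 0] > 4.0) & (P[:, 1] < 6.0)   # hidden user interest

# Seed with one relevant and one irrelevant example.
labeled = [int(np.flatnonzero(interest(X))[0]),
           int(np.flatnonzero(~interest(X))[0])]
for _ in range(30):
    clf = SVC(kernel="rbf", gamma="scale").fit(
        X[labeled], interest(X[labeled]).astype(int))
    rest = np.setdiff1d(np.arange(len(X)), labeled)
    margins = np.abs(clf.decision_function(X[rest]))
    labeled.append(int(rest[np.argmin(margins)]))  # most uncertain point

acc = (clf.predict(X) == interest(X)).mean()
print(f"accuracy with {len(labeled)} labeled examples: {acc:.2f}")
```

Querying the point closest to the decision boundary concentrates the user's labeling effort where the model is least certain, which is why such loops converge with far fewer examples than random labeling.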