Local Intrinsic Dimensionality Measures for Graphs, with Applications to Graph Embeddings
The notion of local intrinsic dimensionality (LID) is an important
advancement in data dimensionality analysis, with applications in data mining,
machine learning and similarity search problems. Existing distance-based LID
estimators were designed for tabular datasets encompassing data points
represented as vectors in a Euclidean space. After discussing their limitations
for graph-structured data in the context of graph embeddings and graph distances, we
propose NC-LID, a novel LID-related measure for quantifying the discriminatory
power of the shortest-path distance with respect to natural communities of
nodes as their intrinsic localities. It is shown how this measure can be used
to design LID-aware graph embedding algorithms by formulating two LID-elastic
variants of node2vec with personalized hyperparameters that are adjusted
according to NC-LID values. Our empirical analysis of NC-LID on a large number
of real-world graphs shows that this measure is able to point to nodes with
high link reconstruction errors in node2vec embeddings better than node
centrality metrics. The experimental evaluation also shows that the proposed
LID-elastic node2vec extensions improve node2vec by better preserving graph
structure in the generated embeddings.
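As context for the distance-based estimators that the abstract contrasts with NC-LID, the standard maximum-likelihood LID estimator over nearest-neighbor distances (the tabular-data estimator, not the paper's own measure) can be sketched as follows; the sample data and neighborhood size are illustrative:

```python
import numpy as np

def lid_mle(knn_dists):
    """Maximum-likelihood LID estimate (Levina-Bickel / Amsaleg style)
    from a query point's distances to its k nearest neighbors,
    sorted ascending and strictly positive."""
    d = np.asarray(knn_dists, dtype=float)
    # LID ~ -( (1/k) * sum_i ln(r_i / r_k) )^(-1)
    return -1.0 / np.mean(np.log(d / d[-1]))

# Sanity check: points uniform on a 2-D patch should have LID close to 2.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(20000, 2))
dists = np.sort(np.linalg.norm(pts, axis=1))[:100]
print(f"estimated LID: {lid_mle(dists):.2f}")
```

The estimate fluctuates with the neighborhood size k (variance roughly LID/sqrt(k)), which is one reason graph-structured data with small, discrete distance ranges is poorly served by such estimators.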
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
This paper reconsiders common benchmarking approaches to nearest neighbor
search. It is shown that the concept of local intrinsic dimensionality (LID)
allows the selection of query sets spanning a wide range of difficulty for real-world
datasets. Moreover, the effect of different LID distributions on the running
time performance of implementations is empirically studied. To this end,
different visualization concepts are introduced that provide a more
fine-grained view of the inner workings of nearest neighbor search
principles. The paper closes with remarks about the diversity of datasets
commonly used for nearest neighbor search benchmarking. It is shown that such
real-world datasets are not diverse: results on a single dataset predict
results on all other datasets well.
Comment: Preprint of the paper accepted at SISAP 201
The role of local dimensionality measures in benchmarking nearest neighbor search
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that provide a more fine-grained view of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
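The use of a local dimensionality measure to grade query difficulty can be illustrated with a small sketch; the relative-contrast definition below (mean distance divided by nearest-neighbor distance) and the percentile cutoffs are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def relative_contrast(query, data):
    """Local relative contrast: mean distance from the query to the
    dataset divided by the distance to its nearest neighbor.
    Low RC means the nearest neighbor barely stands out (a hard query)."""
    d = np.sort(np.linalg.norm(data - query, axis=1))
    return d.mean() / d[0]

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 16))
queries = rng.normal(size=(200, 16))
rc = np.array([relative_contrast(q, data) for q in queries])

# Tier the workload by RC percentile: high RC -> easy, low RC -> hard.
easy = queries[rc >= np.quantile(rc, 0.8)]
hard = queries[rc <= np.quantile(rc, 0.2)]
print(f"{len(easy)} easy and {len(hard)} hard queries selected")
```

Selecting benchmark queries from the tails of such a distribution, rather than uniformly, is the kind of difficulty-controlled workload construction the paper argues for.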
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article. Copyright @ 2010 Sage Publications.
Previous work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural detail in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods for geodesic distance estimation are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion.
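The Isomap-versus-MDS comparison can be reproduced in miniature with scikit-learn, using trustworthiness as a stand-in for the article's cluster-based measures of local structure preservation; the swiss-roll data, neighborhood sizes, and the use of metric (rather than non-metric) MDS here are simplifying assumptions:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, trustworthiness

# Swiss roll: a non-linear 2-D manifold in 3-D, standing in for the
# high-dimensional, non-linear document feature spaces discussed above.
X, _ = make_swiss_roll(n_samples=600, random_state=0)

emb_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
emb_mds = MDS(n_components=2, random_state=0).fit_transform(X)

# Trustworthiness in [0, 1]: how well small neighborhoods survive the map.
t_iso = trustworthiness(X, emb_iso, n_neighbors=10)
t_mds = trustworthiness(X, emb_mds, n_neighbors=10)
print(f"trustworthiness  Isomap: {t_iso:.3f}  MDS: {t_mds:.3f}")
```

Because Isomap replaces straight-line dissimilarities with geodesic distances along a neighborhood graph, it unrolls the manifold and keeps local neighborhoods intact, which is exactly the property the article finds important for cluster growing.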
Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques
Due to the rapid development of text generation models, people increasingly
often encounter texts that may start out as written by a human but then
continue as machine-generated results of large language models. Detecting the
boundary between human-written and machine-generated parts of such texts is a
very challenging problem that has not received much attention in the
literature. In this work, we consider and compare a number of approaches to
this artificial text boundary detection problem, evaluating several predictors
built on features of different kinds. We show that supervised fine-tuning of the
RoBERTa model works well for this task in general but fails to generalize in
important cross-domain and cross-generator settings, demonstrating a tendency
to overfit to spurious properties of the data. Then, we propose novel
approaches based on features extracted from a frozen language model's
embeddings that are able to outperform both the human accuracy level and
previously considered baselines on the Real or Fake Text benchmark. Moreover,
we adapt perplexity-based approaches for the boundary detection task and
analyze their behaviour. We analyze the robustness of all proposed classifiers
in cross-domain and cross-model settings, discovering important properties of
the data that can negatively influence the performance of artificial text
boundary detection algorithms.
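A perplexity-based boundary detector of the kind adapted above can be sketched as a change-point search over per-token scores; the synthetic negative log-likelihoods below stand in for scores a real language model would produce, and the window size is an illustrative choice:

```python
import numpy as np

def detect_boundary(nll, window=20):
    """Return the token index where the mean per-token negative
    log-likelihood (log-perplexity under a scoring LM) shifts most
    sharply, comparing a window before each candidate index against
    a window after it."""
    nll = np.asarray(nll, dtype=float)
    best_i, best_gap = None, -np.inf
    for i in range(window, len(nll) - window):
        gap = abs(nll[i:i + window].mean() - nll[i - window:i].mean())
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Synthetic scores: the human-written prefix surprises the model more
# (higher NLL) than the machine-generated continuation that follows it.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(4.0, 0.5, 150),   # human-written part
                         rng.normal(2.0, 0.5, 100)])  # generated part
print(f"detected boundary near token {detect_boundary(scores)}")
```

In practice the per-token scores would come from a frozen language model, and the window size trades localization precision against robustness to score noise.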
Applications of Continuous Spatial Models in Multiple Antenna Signal Processing
This thesis covers the investigation and application of continuous spatial models for multiple antenna signal processing. The use of antenna arrays for advanced sensing and communications systems has been facilitated by the rapid increase in the capabilities of digital signal processing systems. The wireless communications channel will vary across space as different signal paths from the same source combine and interfere. This creates a level of spatial diversity that can be exploited to improve the robustness and overall capacity of the wireless channel. Conventional approaches to using spatial diversity have centered on smart, adaptive antennas and spatial beam forming. Recently, the more general theory of multiple input, multiple output (MIMO) systems has been developed to utilise the independent spatial communication modes offered in a scattering environment. …
A geothermal aquifer in the dilation zones on the southern margin of the Dublin Basin
This is the author accepted manuscript. The final version is available from Oxford University Press via the DOI in this record.
We present modelling of the geophysical data from the Newcastle area, west of Dublin, Ireland, within the framework of the IRETHERM project. IRETHERM's overarching objective was to facilitate a more thorough strategic understanding of Ireland's geothermal energy potential through integrated modelling of new and existing geophysical, geochemical and geological data. The Newcastle area, one of the target localities, is situated at the southern margin of the Dublin Basin, close to the largest conurbation on the island of Ireland in the City of Dublin and surrounds. As part of IRETHERM, magnetotelluric (MT) soundings were carried out in the highly urbanized Dublin suburb in 2011 and 2012, and we describe the MT data acquisition, processing methods, multi-dimensional geoelectrical models and porosity modelling with other geophysical data. The MT time series were heavily noise-contaminated and distorted due to electromagnetic noise from nearby industry and Dublin City tram/railway systems. Time series processing was performed using several modern robust codes to obtain reasonably reliable and interpretable MT impedance and geomagnetic transfer function ‘tipper’ estimates at most of the survey locations. The most ‘quiet’ 3-hour subsets of data during the night time, when the DC ‘LUAS’ tram system was not operating, were used in multi-site and multivariate processing. The final 2-D models were examined using a stability technique, and two final 2-D profiles were derived, with reliability estimates expressed through conductance and resistivity. In the final stage of this study, 3-D modelling of all magnetotelluric data in the Newcastle area was also undertaken.
Comparison of the MT models and their interpretation with existing seismic profiles in the area reveals that the Blackrock to Newcastle Fault (BNF) zone is visible in the models as a conductive feature down to depths of 4 km. The investigated area below Newcastle can be divided into two depth domains. The first zone, from the surface down to 1–2 km, is dominated by NE-SW oriented conductors connected with shallow faults or folds probably filled with less saline waters. The conductors also cross the surface trace of the BNF. The second depth domain can be identified from depths of 2 km to 4 km, where structures are oriented along the BNF and the observed conductivity is lower. The deeper conductive layers are interpreted as geothermal-fluid-bearing rocks. Porosity and permeability estimations from the lithological borehole logs indicate the geothermal potential of the bedrock to deliver warm water to the surface. The fluid permeability estimation, based on Archie's law for porous structures and synthetic studies of fractured zones, suggests a permeability in the range 100 mD–100 D in the study area, which is prospective for geothermal energy exploitation.
Funders: Science Foundation Ireland (SFI), Slovak Academy of Sciences (SAS), European Union FP7, APVV, Slovak Grant Agency VEG
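The porosity side of such an estimation can be illustrated with Archie's law, which the abstract invokes; the tortuosity factor, cementation exponent, and fluid resistivity below are generic illustrative values, not those fitted in the study:

```python
# Archie's law: rho_bulk = a * rho_w * phi**(-m), so the porosity implied
# by an MT-derived bulk resistivity is phi = (a * rho_w / rho_bulk)**(1/m).
def archie_porosity(rho_bulk, rho_w, a=1.0, m=2.0):
    """Porosity under Archie's law; a (tortuosity) and m (cementation
    exponent) default to generic clean-sandstone values (assumed)."""
    return (a * rho_w / rho_bulk) ** (1.0 / m)

rho_w = 0.5  # ohm-m, a saline formation water (illustrative)
for rho_bulk in (10.0, 50.0, 200.0):  # ohm-m, bulk values from a model
    phi = archie_porosity(rho_bulk, rho_w)
    print(f"rho_bulk = {rho_bulk:6.1f} ohm-m -> porosity ~ {phi:.3f}")
```

The inversion shows why low resistivity anomalies in the MT models are read as fluid-bearing zones: with saline pore water, a modest drop in bulk resistivity implies a substantial increase in connected porosity.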
Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
Data management is becoming increasingly important in many applications, in particular in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support for both data uncertainty and query uncertainty.
When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing the dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques; the dynamic query planner achieves over 50% performance gains over a state-of-the-art threshold query optimizer in most cases and comes very close to the optimal plan in all cases.
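The core existence-probability test behind such threshold selections can be sketched for a Gaussian-modeled uncertain attribute; the tuples, cutoff, and threshold below are hypothetical:

```python
from statistics import NormalDist

def passes_threshold(mean, std, cutoff, tau):
    """Does P(X > cutoff) >= tau for X ~ Normal(mean, std)?
    This is the existence-probability test behind threshold
    selection over a continuous uncertain attribute."""
    p = 1.0 - NormalDist(mean, std).cdf(cutoff)
    return p >= tau

# Hypothetical uncertain tuples: (id, mean, std) for one attribute.
tuples = [(1, 9.0, 1.0), (2, 10.5, 0.2), (3, 8.0, 3.0)]
answers = [tid for tid, mu, sd in tuples
           if passes_threshold(mu, sd, cutoff=10.0, tau=0.5)]
print(answers)  # -> [2]
```

For one-dimensional Gaussians the probability is a closed-form CDF evaluation; the thesis's contribution is making the analogous test cheap when the predicate requires multi-dimensional integration.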
Next I address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. I define the formal semantics of array operations on uncertain data, involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple belongs in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines by 1 to 2 orders of magnitude while incurring only small storage overhead.
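The replica-placement idea for position uncertainty can be sketched as follows; the cell width, variance cutoff, and spread factor are illustrative parameters, not those evaluated in the thesis:

```python
# Position uncertainty in an array store: a tuple's uncertain dimension
# attribute ~ Normal(mean, std) decides which cell(s) hold it. Low-variance
# tuples get a single placement; high-variance tuples are replicated into
# every cell the attribute plausibly reaches, bounding query-time misses.
CELL_WIDTH = 1.0

def cells_for(mean, std, var_cutoff=0.5, spread=2.0):
    """Array cells that should store (a replica of) this tuple."""
    home = int(mean // CELL_WIDTH)
    if std <= var_cutoff:
        return {home}                      # single, most-likely cell
    lo = int((mean - spread * std) // CELL_WIDTH)
    hi = int((mean + spread * std) // CELL_WIDTH)
    return set(range(lo, hi + 1))          # replicate across reachable cells

print(cells_for(3.4, 0.1))  # -> {3}
print(cells_for(3.4, 1.0))  # -> {1, 2, 3, 4, 5}
```

Because only the few high-variance tuples are replicated, the storage overhead stays small while range queries over dimension attributes no longer miss probable placements.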
Finally, to bridge the growing gap between the rapid expansion of data and the limited human ability to comprehend it, and to help users retrieve high-value content more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving the desired efficiency for interactive exploration (2 to 5 seconds per iteration).
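An explore-by-example loop of the kind described can be sketched with a simulated user predicate and SVM uncertainty sampling; the predicate, kernel, and query budget here are all illustrative assumptions, not the thesis's configuration:

```python
import numpy as np
from sklearn.svm import SVC

# Simulated "explore-by-example": a hidden predicate plays the user, and
# each round the SVM asks about the unlabeled point nearest its boundary.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(1000, 2))

def interest(P):
    return (P[:, 0] > 4.0) & (P[:, 1] < 6.0)   # hidden user interest

# Seed with one relevant and one irrelevant example.
labeled = [int(np.flatnonzero(interest(X))[0]),
           int(np.flatnonzero(~interest(X))[0])]
for _ in range(30):
    clf = SVC(kernel="rbf", gamma="scale").fit(
        X[labeled], interest(X[labeled]).astype(int))
    rest = np.setdiff1d(np.arange(len(X)), labeled)
    margins = np.abs(clf.decision_function(X[rest]))
    labeled.append(int(rest[np.argmin(margins)]))  # most uncertain point

acc = (clf.predict(X) == interest(X)).mean()
print(f"accuracy with {len(labeled)} labeled examples: {acc:.2f}")
```

Querying the point closest to the decision boundary concentrates the user's labeling effort where the model is least certain, which is why such loops converge with far fewer examples than random labeling.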