
    Local Intrinsic Dimensionality Measures for Graphs, with Applications to Graph Embeddings

    The notion of local intrinsic dimensionality (LID) is an important advancement in data dimensionality analysis, with applications in data mining, machine learning and similarity search problems. Existing distance-based LID estimators were designed for tabular datasets whose data points are represented as vectors in a Euclidean space. After discussing their limitations for graph-structured data with respect to graph embeddings and graph distances, we propose NC-LID, a novel LID-related measure for quantifying the discriminatory power of the shortest-path distance with respect to natural communities of nodes as their intrinsic localities. It is shown how this measure can be used to design LID-aware graph embedding algorithms by formulating two LID-elastic variants of node2vec with personalized hyperparameters that are adjusted according to NC-LID values. Our empirical analysis of NC-LID on a large number of real-world graphs shows that this measure points to nodes with high link reconstruction errors in node2vec embeddings better than node centrality metrics do. The experimental evaluation also shows that the proposed LID-elastic node2vec extensions improve node2vec by better preserving graph structure in the generated embeddings.
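
For context, distance-based LID estimators of the kind the abstract above refers to typically work from the distances between a point and its nearest neighbours. The following is a minimal sketch of one widely used estimator of this type, the maximum-likelihood (Hill-type) estimator. It is not the NC-LID measure proposed in the paper, and the function name `lid_mle` is an illustrative assumption.

```python
import math


def lid_mle(distances):
    """Maximum-likelihood (Hill-type) LID estimate.

    `distances` holds the distances from a query point to its k nearest
    neighbours. Zero distances (duplicates of the query) are ignored.
    """
    dists = sorted(d for d in distances if d > 0)
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")
    r_max = dists[-1]  # distance to the k-th (farthest) neighbour
    log_sum = sum(math.log(d / r_max) for d in dists)
    if log_sum == 0:
        raise ValueError("all distances are equal; estimate is undefined")
    # LID = -( (1/k) * sum_i ln(r_i / r_k) )^(-1)
    return -len(dists) / log_sum
```

A low estimate suggests that distances around the point discriminate well between neighbours, while a high estimate suggests that neighbours are hard to tell apart by distance alone.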

    The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

    This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) makes it possible to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well. Comment: Preprint of the paper accepted at SISAP 201

    The role of local dimensionality measures in benchmarking nearest neighbor search

    This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.

    Visualising the structure of document search results: A comparison of graph theoretic approaches

    This is the post-print of the article. Copyright © 2010 Sage Publications. Previous work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structure in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods for geodesic distance estimation are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion.
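
The step of transforming dissimilarities into geodesic distances, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration of the traditional K-neighbour variant only: it prunes the full dissimilarity graph to each item's k nearest neighbours and then takes shortest paths. It does not implement the minimum-cost pruning criterion studied in the article, and the function name `geodesic_distances` is an assumption made for this example.

```python
import math


def geodesic_distances(dissim, k):
    """Approximate geodesic distances from a symmetric dissimilarity matrix.

    Builds a symmetric k-nearest-neighbour graph and runs Floyd-Warshall
    shortest paths over it. Pairs with no connecting path stay at math.inf.
    """
    n = len(dissim)
    geo = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Keep only the edges from i to its k nearest neighbours.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dissim[i][j])[:k]
        for j in neighbours:
            geo[i][j] = geo[j][i] = dissim[i][j]
    # Floyd-Warshall: relax every pair through every intermediate node.
    for via in range(n):
        for i in range(n):
            for j in range(n):
                through = geo[i][via] + geo[via][j]
                if through < geo[i][j]:
                    geo[i][j] = through
    return geo
```

On points arranged along a U-shaped curve, for instance, the two ends are close in straight-line distance but far apart along the curve, and the geodesic estimate reflects the latter. The cubic-time shortest-path step is chosen here for brevity rather than efficiency.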

    Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques

    Due to the rapid development of text generation models, people increasingly often encounter texts that may start out as written by a human but then continue as machine-generated results of large language models. Detecting the boundary between human-written and machine-generated parts of such texts is a very challenging problem that has not received much attention in the literature. In this work, we consider a number of different approaches to this artificial text boundary detection problem, comparing several predictors over features of different nature. We show that supervised fine-tuning of the RoBERTa model works well for this task in general but fails to generalize in important cross-domain and cross-generator settings, demonstrating a tendency to overfit to spurious properties of the data. Then, we propose novel approaches based on features extracted from a frozen language model's embeddings that are able to outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark. Moreover, we adapt perplexity-based approaches for the boundary detection task and analyze their behaviour. We analyze the robustness of all proposed classifiers in cross-domain and cross-model settings, discovering important properties of the data that can negatively influence the performance of artificial text boundary detection algorithms.

    Applications of Continuous Spatial Models in Multiple Antenna Signal Processing

    This thesis covers the investigation and application of continuous spatial models for multiple antenna signal processing. The use of antenna arrays for advanced sensing and communications systems has been facilitated by the rapid increase in the capabilities of digital signal processing systems. The wireless communications channel will vary across space as different signal paths from the same source combine and interfere. This creates a level of spatial diversity that can be exploited to improve the robustness and overall capacity of the wireless channel. Conventional approaches to using spatial diversity have centered on smart, adaptive antennas and spatial beam forming. Recently, the more general theory of multiple input, multiple output (MIMO) systems has been developed to utilise the independent spatial communication modes offered in a scattering environment. ¶ ..

    A geothermal aquifer in the dilation zones on the southern margin of the Dublin Basin

    This is the author accepted manuscript. The final version is available from Oxford University Press via the DOI in this record. We present modelling of the geophysical data from the Newcastle area, west of Dublin, Ireland within the framework of the IRETHERM project. IRETHERM's overarching objective was to facilitate a more thorough strategic understanding of Ireland's geothermal energy potential through integrated modelling of new and existing geophysical, geochemical and geological data. The Newcastle area, one of the target localities, is situated at the southern margin of the Dublin Basin, close to the largest conurbation on the island of Ireland in the City of Dublin and surrounds. As part of IRETHERM, magnetotelluric (MT) soundings were carried out in the highly urbanized Dublin suburb in 2011 and 2012, and a description of MT data acquisition, processing methods, multi-dimensional geoelectrical models and porosity modelling with other geophysical data is presented. The MT time series were heavily noise-contaminated and distorted due to electromagnetic noise from nearby industry and Dublin City tram/railway systems. Time series processing was performed using several modern robust codes to obtain reasonably reliable and interpretable MT impedance and geomagnetic transfer function ‘tipper’ estimates at most of the survey locations. The quietest 3-hour subsets of data during the night time, when the DC ‘LUAS’ tram system was not operating, were used in multi-site and multivariate processing. The final 2-D models underwent examination using a stability technique, and the final two 2-D profiles, with reliability estimations expressed through conductance and resistivity, were derived. In the final stage of this study, 3-D modelling of all magnetotelluric data in the Newcastle area was also undertaken.
    Comparison of the MT models and their interpretation with existing seismic profiles in the area reveals that the Blackrock to Newcastle Fault (BNF) zone is visible in the models as a conductive feature down to depths of 4 km. The investigated area below Newcastle can be divided into two depth domains. The first, from the surface down to 1–2 km, is dominated by NE-SW oriented conductors connected with shallow faults or folds probably filled with less saline waters. These conductors also cross the surface trace of the BNF. The second depth domain can be identified from depths of 2 km to 4 km, where structures are oriented along the BNF and the observed conductivity is lower. The deeper conductive layers are interpreted as geothermal-fluid-bearing rocks. Porosity and permeability estimations from the lithological borehole logs indicate the geothermal potential of the bedrock to deliver warm water to the surface. The fluid permeability estimation, based on Archie's law for porous structures and synthetic studies of fractured zones, suggests a permeability in the range 100 mD–100 D in the study area, which is prospective for geothermal energy exploitation. Funding: Science Foundation Ireland (SFI); Slovak Academy of Sciences (SAS); European Union FP7; APVV; Slovak Grant Agency VEG
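
As a rough illustration of how Archie's law, mentioned in the abstract above, links a resistivity model to porosity, the sketch below inverts the law for a fully water-saturated rock. The function name `archie_porosity`, the default tortuosity factor `a` and cementation exponent `m`, and the numbers in the usage note are illustrative assumptions and are not taken from the study. The further step from porosity to permeability used by the authors is not reproduced here.

```python
def archie_porosity(bulk_resistivity, fluid_resistivity, a=1.0, m=2.0):
    """Porosity of a fully saturated rock from Archie's law.

    Archie's law:  rho_bulk = a * rho_fluid * phi ** (-m)
    Solved for phi: phi = (a * rho_fluid / rho_bulk) ** (1 / m)

    Resistivities are in ohm-metres; `a` and `m` are empirical constants
    (the defaults here are common textbook values, assumed for illustration).
    """
    if bulk_resistivity <= 0 or fluid_resistivity <= 0:
        raise ValueError("resistivities must be positive")
    return (a * fluid_resistivity / bulk_resistivity) ** (1.0 / m)
```

With these assumed defaults, a bulk resistivity of 100 ohm-metres and a pore-fluid resistivity of 1 ohm-metre correspond to a porosity of about 0.1, so more conductive rock maps to higher inferred porosity for the same fluid.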

    Spatially indirect excitons in coupled quantum wells
