Pruning based Distance Sketches with Provable Guarantees on Random Graphs
Measuring the distances between vertices on graphs is one of the most
fundamental components in network analysis. Since finding shortest paths
requires traversing the graph, it is challenging to obtain distance information
on large graphs very quickly. In this work, we present a preprocessing
algorithm that is able to create landmark based distance sketches efficiently,
with strong theoretical guarantees. When evaluated on a diverse set of social
and information networks, our algorithm significantly improves over existing
approaches by reducing the number of landmarks stored, preprocessing time, or
stretch of the estimated distances.
On Erdős-Rényi graphs and random power law graphs with a given degree distribution exponent, our algorithm outputs an exact distance data structure whose space requirement lies between two polynomial bounds in the number of vertices, depending on the value of the exponent. We complement the algorithm with tight lower bounds for Erdős-Rényi graphs and for the case when the exponent is close to two.

Comment: Full version of the conference paper to appear in The Web Conference.
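To make the landmark idea concrete, here is a generic sketch (not the paper's pruning-based construction): the data structure stores each vertex's distance to a few landmark vertices, and a query is answered by routing through the best landmark via the triangle inequality. The toy graph and landmark choice are invented for the example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Single-source shortest-path distances on an unweighted graph via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_sketch(adj, landmarks):
    """Preprocessing: store distances from every vertex to each landmark."""
    return {l: bfs_distances(adj, l) for l in landmarks}

def estimate_distance(sketch, u, v):
    """Upper-bound d(u, v) by the best landmark detour:
    min over landmarks l of d(u, l) + d(l, v) (triangle inequality)."""
    return min(d[u] + d[v] for d in sketch.values() if u in d and v in d)

# Toy graph: path 0-1-2-3-4 plus a chord 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
sketch = build_sketch(adj, landmarks=[1, 3])
print(estimate_distance(sketch, 0, 4))  # → 3, here equal to the true distance
```

The estimate is always an upper bound on the true distance; the quality of the landmark set determines the stretch.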
Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors has
progressed independently in a bifurcated literature for decades; the latter is
better known as top-k retrieval in Information Retrieval. This duality exists
because sparse and dense vectors serve different end goals, despite the fact
that they are manifestations of the same mathematical problem. In this work,
we ask whether algorithms for dense vectors can be applied effectively to
sparse vectors, particularly those that violate the assumptions underlying
top-k retrieval methods. We study IVF-based retrieval, where vectors are
partitioned into clusters and only a fraction of clusters are searched during
retrieval. We conduct a comprehensive analysis of dimensionality reduction for
sparse vectors, and examine standard and spherical KMeans for partitioning. Our
experiments demonstrate that IVF serves as an efficient solution for sparse
MIPS. As byproducts, we identify two research opportunities and demonstrate
their potential. First, we cast the IVF paradigm as a dynamic pruning technique
and turn that insight into a novel organization of the inverted index for
approximate MIPS for general sparse vectors. Second, we offer a unified regime
for MIPS over vectors that have dense and sparse subspaces, and show its
robustness to query distributions.
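The IVF pipeline described above can be sketched in a few lines. This is a generic illustration, not the paper's implementation; the tiny k-means with naive initialization and the toy vectors are assumptions for brevity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(centroids, v):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(centroids[i], v)))

def kmeans(vectors, k, iters=10):
    """Minimal Lloyd's k-means; naive init uses the first k points."""
    centroids = vectors[:k]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(centroids, v)].append(v)
        centroids = [
            [sum(col) / len(b) for col in zip(*b)] if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids

def build_ivf(vectors, k):
    """Partition vectors into k clusters; keep an inverted list per cluster."""
    centroids = kmeans(vectors, k)
    lists = [[] for _ in range(k)]
    for idx, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(idx)
    return centroids, lists

def ivf_mips(query, vectors, centroids, lists, nprobe):
    """Probe only the nprobe clusters whose centroids score highest against
    the query, then exhaustively scan those inverted lists."""
    order = sorted(range(len(centroids)),
                   key=lambda i: dot(centroids[i], query), reverse=True)
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return max(candidates, key=lambda i: dot(vectors[i], query))

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, lists = build_ivf(vectors, k=2)
print(ivf_mips([1.0, 0.0], vectors, centroids, lists, nprobe=1))  # → 0
```

With `nprobe` well below `k`, only a fraction of the index is scanned per query, which is the efficiency the abstract refers to.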
Graph enabled cross-domain knowledge transfer
The world has never been more connected, led by the information technology revolution of the past decades that has fundamentally changed the way people interact with each other through social networks. Consequently, enormous amounts of human activity data are collected from the business world, and machine learning techniques are widely adopted to aid our decision processes. Despite the success of machine learning in various application scenarios, many questions still need to be answered well, such as how to optimize machine learning outcomes when the desired knowledge cannot be extracted from the available data. This naturally drives us to ponder whether one can leverage side information to populate the knowledge domain of interest, such that the problems within that knowledge domain can be better tackled.
In this work, such problems are investigated and practical solutions are proposed. To leverage machine learning in any decision-making process, one must convert the given knowledge (for example, natural language or unstructured text) into representation vectors that can be understood and processed by machine learning models in a compatible language and data format. The frequently encountered difficulty, however, is that the given knowledge is not rich or reliable enough in the first place. In such cases, one seeks to fuse side information from a separate domain to bridge the gap between good representation learning and the scarce knowledge in the domain of interest. This approach is named Cross-Domain Knowledge Transfer. The problem is crucial to study because scarce knowledge is common in many scenarios, from online healthcare platform analyses to financial market risk quantification, and it stands as an obstacle to benefiting from automated decision making. From the machine learning perspective, the paradigm of semi-supervised learning takes advantage of large amounts of data without ground truth and achieves impressive improvements in learning performance. It is adopted in this dissertation for cross-domain knowledge transfer.
Furthermore, graph learning techniques are indispensable given that networks commonly exist in the real world, such as taxonomy networks and scholarly article citation networks. These networks contain additional useful knowledge and ought to be incorporated into the learning process, where they serve as an important lever in solving the problem of cross-domain knowledge transfer. This dissertation proposes graph-based learning solutions and demonstrates their practical usage via empirical studies on real-world applications. Another line of effort in this work lies in leveraging the rich capacity of neural networks to improve learning outcomes, as we are in the era of big data.
In contrast to many graph neural networks that directly iterate on the graph adjacency matrix to approximate graph convolution filters, this work also proposes an efficient eigenvalue learning method that directly optimizes the graph convolution in the spectral space. This work articulates the importance of the network spectrum and provides detailed analyses of the spectral properties of the proposed EigenLearn method, which aligns well with a series of CNN models that attempt to give a meaningful spectral interpretation to the design of graph neural networks. The dissertation also addresses efficiency, in two respects. First, by adopting approximate solutions it mitigates the complexity concerns of graph-related algorithms, which are naturally quadratic in most cases and do not scale to large datasets. Second, it mitigates the storage and computation overhead of deep neural networks, so that they can be deployed on many lightweight devices, significantly broadening their applicability. Finally, the dissertation concludes with future endeavors.
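The spectral view of graph convolution can be illustrated with a tiny worked example. This is a generic spectral-filtering sketch, not the dissertation's EigenLearn code; in an EigenLearn-style method the per-eigenvalue responses g(λ) would be learned parameters rather than the fixed low-pass function used here. The eigenpairs of the 3-vertex path graph's Laplacian are known in closed form, which keeps the example dependency-free.

```python
import math

# Laplacian eigenpairs of the 3-vertex path graph (closed form):
# L = [[1,-1,0],[-1,2,-1],[0,-1,1]] has eigenvalues 0, 1, 3.
eigvals = [0.0, 1.0, 3.0]
eigvecs = [
    [1 / math.sqrt(3)] * 3,                                   # lambda = 0
    [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)],               # lambda = 1
    [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)],  # lambda = 3
]

def spectral_filter(x, g):
    """Graph convolution y = U diag(g(lambda)) U^T x, where the scalar
    function g acts on each Laplacian eigenvalue independently."""
    y = [0.0, 0.0, 0.0]
    for lam, v in zip(eigvals, eigvecs):
        coeff = g(lam) * sum(vi * xi for vi, xi in zip(v, x))
        for i in range(3):
            y[i] += coeff * v[i]
    return y

# A low-pass filter g(lambda) = 1/(1 + lambda) attenuates high frequencies.
x = [1.0, 0.0, -1.0]
print(spectral_filter(x, lambda lam: 1.0 / (1.0 + lam)))  # ≈ [0.5, 0.0, -0.5]
```

Learning in the spectral space means optimizing the values g(λ) directly, instead of approximating the filter with polynomials of the adjacency matrix.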
Toward certifiable optimal motion planning for medical steerable needles
Medical steerable needles can follow 3D curvilinear trajectories to avoid anatomical obstacles and reach clinically significant targets inside the human body. Automating steerable needle procedures can enable physicians and patients to harness the full potential of steerable needles by maximally leveraging their steerability to safely and accurately reach targets for medical procedures such as biopsies. For the automation of medical procedures to be clinically accepted, it is critical from a patient care, safety, and regulatory perspective to certify the correctness and effectiveness of the planning algorithms involved in procedure automation. In this paper, we take an important step toward creating a certifiable optimal planner for steerable needles. We present an efficient, resolution-complete motion planner for steerable needles based on a novel adaptation of multi-resolution planning. This is the first motion planner for steerable needles that is guaranteed to compute, in finite time, an obstacle-avoiding plan (or to notify the user that no such plan exists) under clinically appropriate assumptions. Based on this planner, we then develop the first resolution-optimal motion planner for steerable needles, which further provides theoretical guarantees on the quality of the computed motion plan, that is, global optimality, in finite time. Compared to state-of-the-art steerable needle motion planners, we demonstrate with clinically realistic simulations that our planners not only provide theoretical guarantees but also achieve higher success rates, lower computation times, and higher-quality plans.
Algorithms for learning from spatial and mobility data
Data from the numerous mobile devices, location-based applications, and data-collection sensors in use today can provide important insights about human and natural processes. These insights can inform decision making in designing and optimising infrastructure such as transportation or energy. However, extracting patterns related to spatial properties is challenging due to the large quantity of the data produced and the complexity of the processes it describes. We propose scalable, multi-resolution approximation and heuristic algorithms that make use of spatial proximity properties to solve fundamental data mining and optimisation problems with better running time and accuracy. We observe that abstracting from individual data points and working with units of neighbouring points, grouped by various measures of similarity, improves computational efficiency and diminishes the effects of noise and overfitting. We consider applications in mobility data compression, transit network planning, and solar power output prediction.
Firstly, in order to understand transportation needs, it is essential to have efficient ways
to represent large amounts of travel data. In analysing spatial trajectories (for example
taxis travelling in a city), one of the main challenges is computing distances between
trajectories efficiently; due to their size and complexity this task is computationally
expensive. We build data structures and algorithms to sketch trajectory data that make
queries such as distance computation, nearest neighbour search and clustering, which
are key to finding mobility patterns, more computationally efficient. We use locality-sensitive hashing, a technique that maps similar objects to the same hash.
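The locality-sensitive-hashing idea can be sketched with a simple grid-snapping hash; this is a minimal illustration of the principle, not the thesis's actual scheme, and the routes below are made up for the example. Each trajectory is hashed to the sequence of coarse grid cells it visits, so nearby trajectories tend to collide and can be found by bucketing on the hash instead of computing pairwise distances.

```python
def grid_cell(point, cell_size):
    """Snap a 2-D point to the coarse grid cell containing it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def trajectory_hash(trajectory, cell_size):
    """Hash a trajectory to the sequence of grid cells it visits,
    collapsing consecutive duplicates. Similar trajectories tend to
    share the hash, so candidate pairs come from shared buckets."""
    cells = []
    for p in trajectory:
        c = grid_cell(p, cell_size)
        if not cells or cells[-1] != c:
            cells.append(c)
    return tuple(cells)

a = [(0.1, 0.1), (0.9, 0.2), (1.6, 0.3)]   # two near-identical routes
b = [(0.2, 0.3), (0.8, 0.1), (1.7, 0.4)]
c = [(5.0, 5.0), (6.0, 5.5)]               # a distant route
print(trajectory_hash(a, 1.0) == trajectory_hash(b, 1.0))  # True
print(trajectory_hash(a, 1.0) == trajectory_hash(c, 1.0))  # False
```

In practice several such hashes at different offsets or resolutions are combined, so that near-duplicates split by a cell boundary still collide in at least one table.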
Secondly, to build efficient infrastructure it is necessary to satisfy travel demand by
placing resources optimally. This is difficult due to external constraints (such as limits
on budget) and the complexity of existing road networks that allow for a large number
of candidate locations. For this purpose, we present heuristic algorithms for efficient
transit network design with a case study on cycling lane placement. The heuristic is
based on a new type of clustering by projection, which is computationally efficient and gives good results in practice.
Lastly, we devise a novel method to forecast solar power output based on numerical weather predictions, clear-sky predictions, and persistence data. An ensemble of a multivariate linear regression model, a support vector machine model, and an artificial neural network gives more accurate predictions than any of the individual models.
Analysing the performance of the models in a suite of frameworks reveals that building
separate models for each self-contained area based on weather patterns gives a better
accuracy than a single model that predicts the total. The ensemble can be further improved by giving performance-based weights to the individual models. This suggests
that the models identify different patterns in the data, which motivated the choice of an
ensemble architecture.
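The performance-weighted combination described above amounts to a convex combination of the per-model forecasts. The sketch below is illustrative only; the forecast values and weights are invented, with weights standing in for scores derived from each model's validation error.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model forecasts with performance-based weights,
    normalised so the weights sum to one."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return [sum(w * p[i] for w, p in zip(norm, predictions))
            for i in range(len(predictions[0]))]

# Hypothetical hourly solar-output forecasts (kW) from three models.
linreg = [10.0, 12.0, 8.0]
svm    = [11.0, 11.0, 9.0]
nn     = [12.0, 13.0, 7.0]
# Assumed weights, e.g. inversely proportional to validation error.
combined = weighted_ensemble([linreg, svm, nn], weights=[0.2, 0.3, 0.5])
print(combined)
```

Equal weights recover a plain average; skewing the weights toward the historically stronger model is the improvement the abstract reports.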
Big Networks: Analysis and Optimal Control
The study of networks has seen a tremendous surge of research due to the explosive spectrum of practical problems that involve networks as the access point. Those problems range widely, from detecting functionally correlated proteins in biology to finding people to give discounts to in order to maximize the popularity of a product in economics. Thus, understanding, and further being able to manipulate or control, the development and evolution of networks has become a critical task for network scientists. Despite the vast research effort put toward these studies, the present state of the art largely either lacks high-quality solutions or requires an excessive amount of time under real-world 'Big Data' requirements.
This research aims at decisively boosting modern algorithmic efficiency to meet practical requirements; that is, developing a ground-breaking class of algorithms that simultaneously provide both provably good solution quality and low time and space complexity. Specifically, I target important yet challenging problems in three main areas:
Information Diffusion: Analyzing and maximizing the influence in networks and extending results for different variations of the problems.
Community Detection: Finding communities from multiple sources of information.
Security and Privacy: Assessing organization vulnerability under targeted cyber attacks via social networks.
Worst-Case to Average-Case Reductions for the SIS Problem: Tightness and Security
We present a framework for evaluating the concrete security assurances of cryptographic constructions given by the worst-case SIVP_γ to average-case SIS_{n,m,q,β} reductions. As part of this analysis, we present the tightness gaps for three worst-case SIVP_γ to average-case SIS_{n,m,q,β} reductions. We also analyze the hardness of worst-case SIVP_γ instances.
We apply our methodology to two SIS-based signature schemes, and compute the security guarantees that these systems get from reductions to worst-case SIVP_γ. We find that most of the presented reductions do not apply to the chosen parameter sets for the signature schemes. We propose modifications to the schemes to make the reductions applicable, and find that the worst-case security assurances of the (modified) signature schemes are, for both signature schemes, significantly lower than the amount of security previously claimed.