A Survey of Source Code Search: A 3-Dimensional Perspective
(Source) code search has attracted wide attention from software engineering
researchers because it can improve the productivity and quality of software development.
Given a functionality requirement usually described in a natural language
sentence, a code search system can retrieve code snippets that satisfy the
requirement from a large-scale code corpus, e.g., GitHub. To realize effective
and efficient code search, many techniques have been proposed successively.
These techniques improve code search performance mainly by optimizing three
core components, including query understanding component, code understanding
component, and query-code matching component. In this paper, we provide a
3-dimensional perspective survey for code search. Specifically, we categorize
existing code search studies into query-end optimization techniques, code-end
optimization techniques, and match-end optimization techniques according to the
specific components they optimize. Considering that each end can be optimized
independently and contributes to the code search performance, we treat each end
as a dimension. Therefore, this survey is 3-dimensional in nature, and it
provides a comprehensive summary of each dimension in detail. To understand the
research trends of the three dimensions in existing code search studies, we
systematically review 68 relevant studies. Unlike existing code search surveys
that focus only on the query end or the code end, or that cover various aspects
(codebase, evaluation metrics, modeling techniques, etc.) only shallowly, our
survey provides a more nuanced analysis and review of how the underlying
techniques used at the three ends have evolved and developed.
Based on a systematic review and summary of existing work, we outline several
open challenges and opportunities at the three ends that remain to be addressed
in future work.
Comment: submitted to ACM Transactions on Software Engineering and Methodology.
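To make the three-end decomposition concrete, here is a minimal, hypothetical sketch (not taken from the survey or from any surveyed system) of how a query-understanding component, a code-understanding component, and a query-code matching component might compose into a simple retrieval pipeline. All names and the bag-of-words/Jaccard choices are illustrative assumptions, not techniques endorsed by the paper.

```python
# Illustrative three-component code search pipeline (assumption-based sketch).
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    tokens: list

def understand_query(query: str) -> set:
    """Query end: normalize the natural-language query into a token set."""
    return set(query.lower().split())

def understand_code(snippet: Snippet) -> set:
    """Code end: extract a lexical representation of the snippet."""
    return set(t.lower() for t in snippet.tokens)

def match(query_repr: set, code_repr: set) -> float:
    """Match end: score query/code similarity (Jaccard, for brevity)."""
    if not query_repr or not code_repr:
        return 0.0
    return len(query_repr & code_repr) / len(query_repr | code_repr)

def search(query: str, corpus: list, k: int = 5) -> list:
    """Retrieve the top-k snippets for a functionality requirement."""
    q = understand_query(query)
    scored = [(match(q, understand_code(s)), s) for s in corpus]
    return [s for _, s in sorted(scored, key=lambda x: -x[0])[:k]]

corpus = [Snippet("sort.py", ["def", "quick_sort", "items"]),
          Snippet("http.py", ["def", "fetch", "url", "requests"])]
print(search("quick sort a list of items", corpus, k=1))
```

Each of the three functions corresponds to one "end" of the survey's taxonomy and could be optimized independently (query expansion, code representation learning, learned matching), which is the point the 3-dimensional framing makes.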
Query-driven learning for predictive analytics of data subspace cardinality
Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection query types: multi-dimensional range queries and distance-nearest-neighbor (radius) queries. Our function estimation model (i) quantizes the vectorial query space by learning the analysts’ access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
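The query-driven idea can be illustrated with a toy sketch, under the assumption that queries are encoded as numeric vectors: quantize observed query vectors into prototypes, attach a running cardinality estimate to each prototype, and answer an unseen query from its most similar prototype. This is a loose, LVQ-style simplification for illustration only; the paper's actual learning rules, similarity abstraction, decentralization, and optimal-stopping adaptation are not reproduced here.

```python
# Assumption-based sketch of query-driven cardinality estimation.
import numpy as np

class QueryDrivenCardinalityEstimator:
    def __init__(self, n_prototypes=16, lr=0.05, seed=0):
        self.k = n_prototypes
        self.lr = lr
        self.rng = np.random.default_rng(seed)
        self.prototypes = None     # quantized query vectors
        self.cardinalities = None  # per-prototype cardinality estimates

    def fit(self, queries, cardinalities):
        queries = np.asarray(queries, dtype=float)
        cardinalities = np.asarray(cardinalities, dtype=float)
        idx = self.rng.choice(len(queries), min(self.k, len(queries)),
                              replace=False)
        self.prototypes = queries[idx].copy()
        self.cardinalities = cardinalities[idx].copy()
        # Online updates: move the winning prototype toward each observed
        # query and nudge its cardinality toward the observed answer.
        for q, y in zip(queries, cardinalities):
            j = np.argmin(np.linalg.norm(self.prototypes - q, axis=1))
            self.prototypes[j] += self.lr * (q - self.prototypes[j])
            self.cardinalities[j] += self.lr * (y - self.cardinalities[j])
        return self

    def predict(self, query):
        """Predict the cardinality of an unseen query from its nearest prototype."""
        j = np.argmin(np.linalg.norm(self.prototypes - np.asarray(query), axis=1))
        return self.cardinalities[j]
```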
Kernelized Cost-Sensitive Listwise Ranking
This thesis studies a cost-sensitive listwise approach to learning to rank.
Learning to rank is an application area of machine learning, typically supervised, for building ranking models for information retrieval systems. The training data consists of lists of items with a partial order induced by an ordinal score or a binary judgment (relevant/not relevant). The purpose of the model is to produce a permutation of the items in such a list that is close to the rankings in the training data. This technique has been
successfully applied to ranking, and several approaches have since been proposed, including the listwise approach.
A cost-sensitive version is an adaptation of this framework that treats the documents within a list with different probabilities, i.e., attempts to impose weights on the documents with higher cost. We then take this algorithm further by kernelizing the loss and exploring the optimization in different spaces.
Among the existing likelihood-based algorithms, we choose ListMLE as the primary focus of experimentation, since it has been shown to be the approach with the best empirical performance. The theoretical framework is given along with its mathematical properties.
Experimentation is done on the benchmark LETOR datasets, which contain queries, features of the retrieved documents, and human judgments of the relevance of the documents to the queries.
Based on these, we show how Kernel Cost-Sensitive ListMLE performs compared to the baselines Plain Cost-Sensitive ListMLE, ListNet, and RankSVM, and we examine different aspects of the proposed loss function with different families of kernels.
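As a point of reference, the following is a small sketch of a position-weighted (cost-sensitive) ListMLE loss under the Plackett-Luce model. The per-position weighting shown is one illustrative choice, and the kernelized variant discussed in the thesis would additionally compute document scores in a kernel-induced feature space, which is omitted here.

```python
# Sketch of a cost-sensitive ListMLE loss (Plackett-Luce negative log-likelihood
# with per-position weights); weighting scheme is an illustrative assumption.
import numpy as np

def cost_sensitive_listmle_loss(scores, true_order, weights=None):
    """scores: model scores per document; true_order: document indices from
    most to least relevant; weights: per-position costs (defaults to 1)."""
    scores = np.asarray(scores, dtype=float)
    order = np.asarray(true_order)
    if weights is None:
        weights = np.ones(len(order))
    loss = 0.0
    for i in range(len(order)):
        tail = scores[order[i:]]               # documents not yet placed
        m = tail.max()                         # log-sum-exp stabilizer
        log_z = m + np.log(np.exp(tail - m).sum())
        loss += weights[i] * (log_z - scores[order[i]])
    return loss

# Example: three documents, true relevance order doc2 > doc0 > doc1,
# with higher cost on the top positions.
print(cost_sensitive_listmle_loss([1.2, -0.3, 2.0], [2, 0, 1],
                                  weights=[3.0, 2.0, 1.0]))
```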
Fine-Grained Complexity Analysis of Two Classic TSP Variants
We analyze two classic variants of the Traveling Salesman Problem using the
toolkit of fine-grained complexity. Our first set of results is motivated by
the Bitonic TSP problem: given a set of $n$ points in the plane, compute a
shortest tour consisting of two monotone chains. It is a classic
dynamic-programming exercise to solve this problem in $O(n^2)$ time. While the
near-quadratic dependency of similar dynamic programs for Longest Common
Subsequence and Discrete Fréchet Distance has recently been proven to be
essentially optimal under the Strong Exponential Time Hypothesis, we show that
bitonic tours can be found in subquadratic time. More precisely, we present an
algorithm that solves bitonic TSP in $O(n \log^2 n)$ time and its bottleneck
version in $O(n \log^3 n)$ time. Our second set of results concerns the popular
$k$-OPT heuristic for TSP in the graph setting. More precisely, we study the
$k$-OPT decision problem, which asks whether a given tour can be improved by a
$k$-OPT move that replaces $k$ edges in the tour by $k$ new edges. A simple
algorithm solves $k$-OPT in $O(n^k)$ time for fixed $k$. For 2-OPT, this is
easily seen to be optimal. For $k=3$ we prove that an algorithm with a runtime
of the form $\tilde{O}(n^{3-\epsilon})$ exists if and only if All-Pairs
Shortest Paths in weighted digraphs has such an algorithm. The results for
$k=2,3$ may suggest that the actual time complexity of $k$-OPT is
$\Theta(n^k)$. We show that this is not the case, by presenting an algorithm
that finds the best $k$-move in $O(n^{\lfloor 2k/3 \rfloor + 1})$ time for
fixed $k \geq 3$. This implies that 4-OPT can be solved in $O(n^3)$ time,
matching the best-known algorithm for 3-OPT. Finally, we show how to beat the
quadratic barrier for $k=2$ in two important settings, namely for points in the
plane and when we want to solve 2-OPT repeatedly.
Comment: Extended abstract appears in the Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016).
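For context, the "classic dynamic-programming exercise" the abstract refers to is the textbook $O(n^2)$ algorithm for bitonic tours; a straightforward sketch of that quadratic baseline follows. The paper's subquadratic algorithm is considerably more involved and is not reproduced here.

```python
# Textbook O(n^2) dynamic program for bitonic TSP (the quadratic baseline).
# After sorting by x, b[i][j] (i < j) is the minimum total length of two
# disjoint x-monotone paths that end at points i and j and together visit
# points 0..j.
from math import hypot, inf

def bitonic_tour_length(points):
    pts = sorted(points)
    n = len(pts)
    if n < 2:
        return 0.0
    d = lambda a, b: hypot(pts[a][0] - pts[b][0], pts[a][1] - pts[b][1])
    b = [[inf] * n for _ in range(n)]
    b[0][1] = d(0, 1)
    for j in range(2, n):
        for i in range(j - 1):
            # Extend the path that currently ends at point j-1 to point j.
            b[i][j] = b[i][j - 1] + d(j - 1, j)
        # Point j becomes the end of the other chain: connect it to some
        # earlier point k that was the previous end of that chain.
        b[j - 1][j] = min(b[k][j - 1] + d(k, j) for k in range(j - 1))
    # Close the tour by joining the two path endpoints n-2 and n-1.
    return b[n - 2][n - 1] + d(n - 2, n - 1)

# Example: the four corners of a unit square give a bitonic tour of length 4.
print(bitonic_tour_length([(0, 0), (0, 1), (1, 0), (1, 1)]))
```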
Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications
We present Chameleon, a novel hybrid (mixed-protocol) framework for secure
function evaluation (SFE) which enables two parties to jointly compute a
function without disclosing their private inputs. Chameleon combines the best
aspects of generic SFE protocols with the ones that are based upon additive
secret sharing. In particular, the framework performs linear operations in the
ring $\mathbb{Z}_{2^\ell}$ using additively secret shared values and nonlinear
operations using Yao's Garbled Circuits or the Goldreich-Micali-Wigderson
protocol. Chameleon departs from the common assumption of additive or linear
secret sharing models where three or more parties need to communicate in the
online phase: the framework allows two parties with private inputs to
communicate in the online phase under the assumption of a third node generating
correlated randomness in an offline phase. Almost all of the heavy
cryptographic operations are precomputed in an offline phase which
substantially reduces the communication overhead. Chameleon is both scalable
and significantly more efficient than the ABY framework (NDSS'15) it is based
on. Our framework supports signed fixed-point numbers. In particular,
Chameleon's vector dot product of signed fixed-point numbers improves the
efficiency of mining and classification of encrypted data for algorithms based
upon heavy matrix multiplications. Our evaluation of Chameleon on a 5 layer
convolutional deep neural network shows 133x and 4.2x faster executions than
Microsoft CryptoNets (ICML'16) and MiniONN (CCS'17), respectively.
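As a refresher on the building block the abstract leans on, here is a minimal textbook sketch of two-party additive secret sharing over the ring Z_{2^l} and its local (communication-free) linear operations. This is the generic scheme, not Chameleon's implementation; the garbled-circuit/GMW path for nonlinear operations and the offline correlated-randomness generation by the third party are omitted.

```python
# Textbook two-party additive secret sharing over Z_{2^l} (illustrative only).
import secrets

L = 32
MOD = 1 << L  # arithmetic in the ring Z_{2^l}

def share(x):
    """Split x into two additive shares: x = x0 + x1 (mod 2^l)."""
    x0 = secrets.randbelow(MOD)
    x1 = (x - x0) % MOD
    return x0, x1

def reconstruct(x0, x1):
    return (x0 + x1) % MOD

def add_shares(a, b):
    """Linear operations are local: each party adds its own shares."""
    return ((a[0] + b[0]) % MOD, (a[1] + b[1]) % MOD)

# Example: shared addition. A shared multiplication or dot product would
# additionally consume multiplication triples (the offline-phase correlated
# randomness mentioned in the abstract), which is not shown here.
a, b = share(7), share(35)
assert reconstruct(*add_shares(a, b)) == 42
```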
Efficient Scalable Accurate Regression Queries in In-DBMS Analytics
Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable, and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined by the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and near-perfect approximations of the underlying relationships among data attributes.
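To pin down the two query types, the sketch below computes them exactly over an in-memory table with numpy/pandas as a reference; the paper's contribution is approximating these answers efficiently inside the DBMS, which this reference computation does not attempt. The function names, predicate interface, and sample data are illustrative assumptions.

```python
# Reference (exact) computation of the two query types from the abstract.
import numpy as np
import pandas as pd

def mean_value_query(df, predicate, target):
    """(i) Mean of `target` within the data subspace selected by `predicate`."""
    return df.loc[predicate(df), target].mean()

def regression_query(df, predicate, features, target):
    """(ii) OLS coefficients of `target` on `features` within the subspace."""
    sub = df.loc[predicate(df)]
    X = np.column_stack([np.ones(len(sub)), sub[features].to_numpy()])
    y = sub[target].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, slopes...]

# Example: mean price and a price ~ (size, age) regression for region 7 only.
df = pd.DataFrame({"region": [7, 7, 7, 3], "size": [50, 80, 60, 70],
                   "age": [10, 3, 5, 4], "price": [200, 390, 260, 300]})
in_region_7 = lambda d: d["region"] == 7
print(mean_value_query(df, in_region_7, "price"))
print(regression_query(df, in_region_7, ["size", "age"], "price"))
```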
ABC Analysis in an Internet Shop: A New Set of Criteria
This article presents a model of ABC analysis tailored for internet shops. The standard set of criteria is expanded to cover e-commerce-specific characteristics, such as the number of product views, search engine rankings, and product links via a recommendation system. The proposed new methodology is applied to real data from an internet bookstore in Poland. A comparison with the results of a standard, non-internet-oriented ABC analysis shows the advantage of using the new set of criteria.
Keywords: ABC analysis, internet shop, inventory control
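For readers unfamiliar with the mechanics, a toy sketch of a multi-criteria ABC classification follows: items are ranked by a weighted composite of the criteria (here revenue plus product views, standing in for the e-commerce signals) and split into classes by cumulative-share cutoffs. The weights, the 80%/95% cutoffs, and the sample data are illustrative assumptions, not the article's actual criteria set or weighting.

```python
# Illustrative multi-criteria ABC classification (assumption-based sketch).
def abc_classify(items, weights, cutoffs=(0.80, 0.95)):
    """items: {name: {criterion: value}}; weights: {criterion: weight}."""
    score = lambda v: sum(weights[c] * v[c] for c in weights)
    ranked = sorted(items.items(), key=lambda kv: -score(kv[1]))
    total = sum(score(v) for _, v in ranked) or 1.0
    classes, cum = {}, 0.0
    for name, v in ranked:
        cum += score(v) / total  # cumulative share of the composite score
        classes[name] = "A" if cum <= cutoffs[0] else "B" if cum <= cutoffs[1] else "C"
    return classes

books = {
    "bestseller": {"revenue": 500, "views": 2000},
    "midlist":    {"revenue": 300, "views": 1500},
    "longtail":   {"revenue": 50,  "views": 100},
}
print(abc_classify(books, {"revenue": 1.0, "views": 0.01}))
```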
Sparse Modelling and Multi-exponential Analysis
The research fields of harmonic analysis, approximation theory and computer algebra are seemingly different domains and are studied by seemingly separated research communities. However, all of these are connected to each other in many ways. The connection between harmonic analysis and approximation theory is not accidental: several constructions, among which wavelets and Fourier series, provide major insights into central problems in approximation theory. And the intimate connection between approximation theory and computer algebra has existed even longer: polynomial interpolation is a long-studied and important problem in both symbolic and numeric computing, in the former to counter expression swell and in the latter to construct a simple data model. A common underlying problem statement in many applications is that of determining the number of components, and for each component the value of the frequency, damping factor, amplitude and phase in a multi-exponential model. It occurs, for instance, in magnetic resonance and infrared spectroscopy, vibration analysis, seismic data analysis, electronic odour recognition, keystroke recognition, nuclear science, music signal processing, transient detection, motor fault diagnosis, electrophysiology, drug clearance monitoring and glucose tolerance testing, to name just a few. The general technique of multi-exponential modeling is closely related to what is commonly known as the Padé-Laplace method in approximation theory, and to the technique of sparse interpolation in the field of computer algebra. The problem statement is also solved using a stochastic perturbation method in harmonic analysis. The problem of multi-exponential modeling is an inverse problem and therefore may be severely ill-posed, depending on the relative location of the frequencies and phases. Besides the reliability of the estimated parameters, the sparsity of the multi-exponential representation has become important. A representation is called sparse if it is a combination of only a few elements instead of all available generating elements. In sparse interpolation, the aim is to determine all the parameters from only a small number of data samples, and with a complexity proportional to the number of terms in the representation. Despite the close connections between these fields, there is a clear lack of communication in the scientific literature. The aim of this seminar is to bring researchers from the three mentioned fields together with scientists from the various application domains.
Output type: Meeting report
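Since the abstract names the Padé-Laplace and sparse-interpolation family of techniques only in passing, the following is a small sketch of the classical Prony method, the textbook entry point to multi-exponential parameter recovery from uniform samples. It assumes exact, noise-free data and a known number of terms; the noise robustness, model-order selection, and sparsity aspects discussed in the abstract are not addressed.

```python
# Classical Prony's method: recover nodes z_j and coefficients c_j from 2n
# uniform samples of f(k) = sum_j c_j * z_j**k (exact-data sketch).
import numpy as np

def prony(samples, n_terms):
    f = np.asarray(samples, dtype=complex)
    n = n_terms
    # 1) Solve the Hankel system for the predictor polynomial coefficients a:
    #    sum_m a_m * f[k+m] = -f[k+n] for k = 0..n-1.
    H = np.array([f[k:k + n] for k in range(n)])  # n x n Hankel matrix
    a = np.linalg.solve(H, -f[n:2 * n])
    # 2) The exponential nodes are the roots of z^n + a_{n-1} z^{n-1} + ... + a_0.
    z = np.roots(np.concatenate(([1.0], a[::-1])))
    # 3) Solve a (least-squares) Vandermonde system for the amplitudes c.
    V = np.vander(z, N=len(f), increasing=True).T  # rows are powers 0..2n-1
    c, *_ = np.linalg.lstsq(V, f, rcond=None)
    return z, c

# Example with two decaying exponentials and exact, noise-free samples.
true_z, true_c = np.array([0.9, 0.5]), np.array([2.0, -1.0])
samples = [np.sum(true_c * true_z**k) for k in range(4)]
z, c = prony(samples, 2)
print(z, c)
```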