1,198 research outputs found

    A Survey of Source Code Search: A 3-Dimensional Perspective

    Full text link
    (Source) code search is widely concerned by software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement usually described in a natural language sentence, a code search system can retrieve code snippets that satisfy the requirement from a large-scale code corpus, e.g., GitHub. To realize effective and efficient code search, many techniques have been proposed successively. These techniques improve code search performance mainly by optimizing three core components, including query understanding component, code understanding component, and query-code matching component. In this paper, we provide a 3-dimensional perspective survey for code search. Specifically, we categorize existing code search studies into query-end optimization techniques, code-end optimization techniques, and match-end optimization techniques according to the specific components they optimize. Considering that each end can be optimized independently and contributes to the code search performance, we treat each end as a dimension. Therefore, this survey is 3-dimensional in nature, and it provides a comprehensive summary of each dimension in detail. To understand the research trends of the three dimensions in existing code search studies, we systematically review 68 relevant literatures. Different from existing code search surveys that only focus on the query end or code end or introduce various aspects shallowly (including codebase, evaluation metrics, modeling technique, etc.), our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used in the three ends. Based on a systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work.Comment: submitted to ACM Transactions on Software Engineering and Methodolog

    Query-driven learning for predictive analytics of data subspace cardinality

    Get PDF
    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and in query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodating the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches

    Kernelized Cost-Sensitive Listwise Ranking

    Get PDF
    This thesis research aims to conduct a study on a cost-sensitive listwise approach to learning to rank. Learning to Rank is an area of application in machine learning, typically supervised, to build ranking models for Information Retrieval systems. The training data consists of lists of items with some partial order specified induced by an ordinal score or a binary judgment (relevant/not relevant). The model purpose is to produce a permutation of the items in this list in a way which is close to the rankings in the training data. This technique has been successfully applied to ranking, and several approaches have been proposed since then, including the listwise approach. A cost-sensitive version of that is an adaptation of this framework which treats the documents within a list with different probabilities, i.e. attempt to impose weights for the documents with higher cost. We then take this algorithm to the next level by kernelizing the loss and exploring the optimization in different spaces. Among the different existing likelihood algorithms, we choose ListMLE as primary focus of experimentation, since it has been shown to be the approach with the best empirical performance. The theoretical framework is given along with its mathematical properties. Experimentation is done on the benchmark LETOR dataset. They contain queries and some characteristics of the retrieved documents and its human judgments on the relevance of the documents on the queries. Based on that we will show how the Kernel Cost-Sensitive ListMLE performs compared to the baseline Plain Cost-Sensitive ListMLE, ListNet, and RankSVM and show different aspects of the proposed loss function within different families of kernels

    Fine-Grained Complexity Analysis of Two Classic TSP Variants

    Get PDF
    We analyze two classic variants of the Traveling Salesman Problem using the toolkit of fine-grained complexity. Our first set of results is motivated by the Bitonic TSP problem: given a set of nn points in the plane, compute a shortest tour consisting of two monotone chains. It is a classic dynamic-programming exercise to solve this problem in O(n2)O(n^2) time. While the near-quadratic dependency of similar dynamic programs for Longest Common Subsequence and Discrete Frechet Distance has recently been proven to be essentially optimal under the Strong Exponential Time Hypothesis, we show that bitonic tours can be found in subquadratic time. More precisely, we present an algorithm that solves bitonic TSP in O(nlog2n)O(n \log^2 n) time and its bottleneck version in O(nlog3n)O(n \log^3 n) time. Our second set of results concerns the popular kk-OPT heuristic for TSP in the graph setting. More precisely, we study the kk-OPT decision problem, which asks whether a given tour can be improved by a kk-OPT move that replaces kk edges in the tour by kk new edges. A simple algorithm solves kk-OPT in O(nk)O(n^k) time for fixed kk. For 2-OPT, this is easily seen to be optimal. For k=3k=3 we prove that an algorithm with a runtime of the form O~(n3ϵ)\tilde{O}(n^{3-\epsilon}) exists if and only if All-Pairs Shortest Paths in weighted digraphs has such an algorithm. The results for k=2,3k=2,3 may suggest that the actual time complexity of kk-OPT is Θ(nk)\Theta(n^k). We show that this is not the case, by presenting an algorithm that finds the best kk-move in O(n2k/3+1)O(n^{\lfloor 2k/3 \rfloor + 1}) time for fixed k3k \geq 3. This implies that 4-OPT can be solved in O(n3)O(n^3) time, matching the best-known algorithm for 3-OPT. Finally, we show how to beat the quadratic barrier for k=2k=2 in two important settings, namely for points in the plane and when we want to solve 2-OPT repeatedly.Comment: Extended abstract appears in the Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016

    36th International Symposium on Theoretical Aspects of Computer Science: STACS 2019, March 13-16, 2019, Berlin, Germany

    Get PDF

    Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications

    Get PDF
    We present Chameleon, a novel hybrid (mixed-protocol) framework for secure function evaluation (SFE) which enables two parties to jointly compute a function without disclosing their private inputs. Chameleon combines the best aspects of generic SFE protocols with the ones that are based upon additive secret sharing. In particular, the framework performs linear operations in the ring Z2l\mathbb{Z}_{2^l} using additively secret shared values and nonlinear operations using Yao's Garbled Circuits or the Goldreich-Micali-Wigderson protocol. Chameleon departs from the common assumption of additive or linear secret sharing models where three or more parties need to communicate in the online phase: the framework allows two parties with private inputs to communicate in the online phase under the assumption of a third node generating correlated randomness in an offline phase. Almost all of the heavy cryptographic operations are precomputed in an offline phase which substantially reduces the communication overhead. Chameleon is both scalable and significantly more efficient than the ABY framework (NDSS'15) it is based on. Our framework supports signed fixed-point numbers. In particular, Chameleon's vector dot product of signed fixed-point numbers improves the efficiency of mining and classification of encrypted data for algorithms based upon heavy matrix multiplications. Our evaluation of Chameleon on a 5 layer convolutional deep neural network shows 133x and 4.2x faster executions than Microsoft CryptoNets (ICML'16) and MiniONN (CCS'17), respectively

    Efficient Scalable Accurate Regression Queries in In-DBMS Analytics

    Get PDF
    Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and nearperfect approximations of the underlying relationships among data attributes

    ABC Analysis in an Internet Shop: A New Set of Criteria

    Get PDF
    This article presents a model of ABC analysis tailored for internet shops. The standard set of criteria is expanded to cover e-commerce specific characteristics, such as the number of product views, search engine rankings and product links via a recommendation system..The proposed new methodology is applied to real data from an internet bookstore in Poland. A comparison with the results of a standard, not internet-oriented ABC analysis shows the advantage of using the new set of criteria.ABC analysis, internet shop, inventory control

    Sparse Modelling and Multi-exponential Analysis

    Get PDF
    The research fields of harmonic analysis, approximation theory and computer algebra are seemingly different domains and are studied by seemingly separated research communities. However, all of these are connected to each other in many ways. The connection between harmonic analysis and approximation theory is not accidental: several constructions among which wavelets and Fourier series, provide major insights into central problems in approximation theory. And the intimate connection between approximation theory and computer algebra exists even longer: polynomial interpolation is a long-studied and important problem in both symbolic and numeric computing, in the former to counter expression swell and in the latter to construct a simple data model. A common underlying problem statement in many applications is that of determining the number of components, and for each component the value of the frequency, damping factor, amplitude and phase in a multi-exponential model. It occurs, for instance, in magnetic resonance and infrared spectroscopy, vibration analysis, seismic data analysis, electronic odour recognition, keystroke recognition, nuclear science, music signal processing, transient detection, motor fault diagnosis, electrophysiology, drug clearance monitoring and glucose tolerance testing, to name just a few. The general technique of multi-exponential modeling is closely related to what is commonly known as the Padé-Laplace method in approximation theory, and the technique of sparse interpolation in the field of computer algebra. The problem statement is also solved using a stochastic perturbation method in harmonic analysis. The problem of multi-exponential modeling is an inverse problem and therefore may be severely ill-posed, depending on the relative location of the frequencies and phases. Besides the reliability of the estimated parameters, the sparsity of the multi-exponential representation has become important. A representation is called sparse if it is a combination of only a few elements instead of all available generating elements. In sparse interpolation, the aim is to determine all the parameters from only a small amount of data samples, and with a complexity proportional to the number of terms in the representation. Despite the close connections between these fields, there is a clear lack of communication in the scientific literature. The aim of this seminar is to bring researchers together from the three mentioned fields, with scientists from the varied application domains.Output Type: Meeting Repor
    corecore