
    From average case complexity to improper learning complexity

    The basic problem in the PAC model of computational learning theory is to determine which hypothesis classes are efficiently learnable. There is presently a dearth of results showing hardness of learning problems, and the existing lower bounds fall short of the best known algorithms. The biggest challenge in proving complexity results is to establish hardness of improper learning (a.k.a. representation-independent learning). The difficulty in proving lower bounds for improper learning is that the standard reductions from NP-hard problems do not seem to apply in this context. There is essentially only one known approach to proving lower bounds on improper learning; it was initiated by Kearns and Valiant (1989) and relies on cryptographic assumptions. We introduce a new technique for proving hardness of improper learning, based on reductions from problems that are hard on average. We put forward a (fairly strong) generalization of Feige's assumption (Feige 2002) about the complexity of refuting random constraint satisfaction problems. Combining this assumption with our new technique yields far-reaching implications. In particular: 1. Learning DNFs is hard. 2. Agnostically learning halfspaces with a constant approximation ratio is hard. 3. Learning an intersection of ω(1) halfspaces is hard. Comment: 34 pages

    Fake View Analytics in Online Video Services

    Online video-on-demand (VoD) services invariably maintain a view count for each video they serve, and this count has become an important currency for various stakeholders, from viewers to content owners, advertisers, and the online service providers themselves. There is often a significant financial incentive to use a robot (or a botnet) to artificially create fake views. How can we detect fake views? Can we detect them (and stop them) with online algorithms, as they occur? What is the extent of fake views with current VoD service providers? These are the questions we study in this paper. We develop several algorithms and show that they are quite effective for this problem. Comment: 25 pages, 15 figures
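
    The abstract does not spell out its detectors, so the following is only an illustrative sketch of the kind of online algorithm it alludes to: scoring each new per-minute view count against a running estimate of a video's typical traffic (Welford's online mean/variance). The class name, warm-up length, and z-score threshold are all hypothetical.

```python
# Hedged sketch (not the paper's algorithm): flag suspicious view bursts
# online by scoring each new per-minute count against running statistics
# maintained with Welford's algorithm.

class OnlineViewAnomalyDetector:
    def __init__(self, z_threshold=6.0):  # hypothetical threshold
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.z_threshold = z_threshold

    def update(self, views_this_minute):
        """Return True if the new count looks anomalous, then absorb it."""
        anomalous = False
        if self.n >= 10:  # hypothetical warm-up period before scoring
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0:
                z = (views_this_minute - self.mean) / std
                anomalous = z > self.z_threshold
        # Welford update of mean and variance
        self.n += 1
        delta = views_this_minute - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (views_this_minute - self.mean)
        return anomalous

detector = OnlineViewAnomalyDetector()
stream = [12, 9, 14, 11, 10, 13, 12, 9, 11, 10, 12, 480, 510]  # toy data
print([detector.update(v) for v in stream])  # the jump to 480 is flagged
```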

    Second-Generation Objects in the Universe: Radiative Cooling and Collapse of Halos with Virial Temperatures Above 10^4 Kelvin

    The first generation of protogalaxies likely formed out of primordial gas via H2 cooling in cosmological minihalos with virial temperatures of a few thousand Kelvin. However, their abundance is likely to have been severely limited by feedback processes that suppressed H2 formation. The formation of the protogalaxies responsible for reionization and metal enrichment of the intergalactic medium then had to await the collapse of larger halos. Here we investigate the radiative cooling and collapse of gas in halos with virial temperatures Tvir > 10^4 K. In these halos, efficient atomic line radiation allows rapid cooling of the gas to 8000 K; subsequently the gas can contract nearly isothermally at this temperature. Without an additional coolant, the gas would likely settle into a locally gravitationally stable disk; only disks with unusually low spin would be unstable. However, we find that the initial atomic line cooling leaves a large, out-of-equilibrium residual free-electron fraction. This allows the molecular fraction to build up to a universal value of about x(H2) = 10^-3, almost independently of the initial density and temperature. We show that this is a non-equilibrium freezeout value that can be understood in terms of timescale arguments. Furthermore, unlike in less massive halos, H2 formation is largely impervious to feedback from external UV fields, owing to the high initial densities achieved by atomic cooling. The H2 molecules cool the gas further to about 100 K and allow the gas to fragment on scales of a few hundred solar masses. We investigate the importance of various feedback effects, such as H2 photodissociation by internal UV fields and radiation pressure due to Ly-alpha photon trapping, which are likely to regulate the efficiency of star formation. Comment: Revised version accepted by ApJ; some reorganization for clarity
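
    To make the Tvir > 10^4 K threshold concrete, the sketch below translates it into a rough halo mass scale using the standard virial-temperature approximation of Barkana & Loeb (2001); this formula is not from the paper above, and the cosmological parameters (mu, h) are illustrative defaults.

```python
# Hedged sketch: convert the abstract's Tvir > 10^4 K threshold into a
# halo mass scale using the standard approximation of Barkana & Loeb
# (2001), dropping order-unity overdensity factors.

def virial_temperature(mass_msun, z, mu=0.6, h=0.7):
    """Approximate virial temperature [K] of a halo of mass M at redshift z."""
    return 1.98e4 * (mu / 0.6) * (mass_msun * h / 1e8) ** (2.0 / 3.0) * (1 + z) / 10.0

# Roughly, halos above ~10^8 Msun at z ~ 10 cross the atomic-cooling threshold.
for m in (1e6, 1e7, 1e8, 1e9):
    print(f"M = {m:.0e} Msun -> Tvir ~ {virial_temperature(m, z=10):.2e} K")
```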

    Detecting Sockpuppets in Deceptive Opinion Spam

    This paper explores the problem of sockpuppet detection in deceptive opinion spam using authorship attribution and verification approaches. Two methods are explored. The first is a feature subsampling scheme that uses the KL divergence on stylistic language models of an author to find discriminative features. The second is a transduction scheme, spy induction, that leverages the diversity of authors in the unlabeled test set by sending a set of spies (positive samples) from the training set to retrieve hidden samples in the unlabeled test set using nearest and farthest neighbors. Experiments on ground-truth sockpuppet data show the effectiveness of the proposed schemes. Comment: 18 pages. Accepted at CICLing 2017, the 18th International Conference on Intelligent Text Processing and Computational Linguistics
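
    As a hedged illustration of the first idea (not the authors' exact pipeline), the sketch below scores features by their pointwise KL-divergence contribution between two authors' Laplace-smoothed unigram language models; the helper names and toy data are invented for the example.

```python
# Hedged sketch: rank features by how much they contribute to the KL
# divergence between two authors' smoothed unigram language models.

import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Laplace-smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def per_feature_kl(p, q):
    """Pointwise KL contributions p(w) log(p(w)/q(w)); large = discriminative."""
    return {w: p[w] * math.log(p[w] / q[w]) for w in p}

a = "yeah yeah I totally agree great great product".split()   # toy author A
b = "the evaluation methodology is described in section three".split()  # toy author B
vocab = set(a) | set(b)
p, q = unigram_model(a, vocab), unigram_model(b, vocab)
top = sorted(per_feature_kl(p, q).items(), key=lambda kv: -kv[1])[:3]
print(top)  # features most characteristic of author A relative to B
```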

    Near-optimal Linear Decision Trees for k-SUM and Related Problems

    We construct near-optimal linear decision trees for a variety of decision problems in combinatorics and discrete geometry. For example, for any constant k, we construct linear decision trees that solve the k-SUM problem on n elements using O(n log^2 n) linear queries. Moreover, the queries we use are comparison queries, which compare the sums of two k-subsets; when viewed as linear queries, comparison queries are 2k-sparse and have only {−1, 0, 1} coefficients. We give similar constructions for sorting sumsets A+B and for solving the SUBSET-SUM problem, both with an optimal number of queries up to poly-logarithmic terms. Our constructions are based on the notion of “inference dimension,” recently introduced by the authors in the context of active classification with comparison queries. This can be viewed as another contribution to the fruitful link between machine learning and discrete geometry, which goes back to the discovery of the VC dimension.
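
    The decision tree itself is far more involved, but the query primitive the abstract describes is simple enough to state in code: a comparison query evaluates the sign of a 2k-sparse linear form with coefficients in {−1, 0, 1}. The sketch below just illustrates that primitive; the function name and toy input are invented.

```python
# Hedged sketch of a single comparison query: compare the sums of two
# k-subsets A and B, i.e. take the sign of a 2k-sparse {-1,0,1} linear form.

def comparison_query(x, subset_a, subset_b):
    """Sign of sum(x[i] for i in A) - sum(x[j] for j in B)."""
    s = sum(x[i] for i in subset_a) - sum(x[j] for j in subset_b)
    return (s > 0) - (s < 0)  # -1, 0, or +1

# Viewed as a linear query <c, x>: c has k coefficients +1 and k coefficients -1.
x = [3.0, -1.5, 2.5, -4.0, 1.0]
print(comparison_query(x, subset_a=(0, 2), subset_b=(1, 4)))  # sign of 5.5 - (-0.5)
```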

    Subsampling in Smoothed Range Spaces

    We consider smoothed versions of geometric range spaces, so that an element of the ground set (e.g. a point) can be contained in a range with a non-binary value in [0,1]. Similar notions have been considered for kernels; we extend them to more general types of ranges. We then consider approximations of these range spaces through ε-nets and ε-samples (a.k.a. ε-approximations). We characterize when size bounds for ε-samples on kernels can be extended to these more general smoothed range spaces. We also describe new generalizations of ε-nets for these range spaces and show when results from binary range spaces can carry over to the smoothed ones. Comment: This is the full version of the paper which appeared in ALT 2015. 16 pages, 3 figures. In Algorithmic Learning Theory, pp. 224-238. Springer International Publishing, 2015
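
    For intuition, one simple instance of a smoothed range (not taken from the paper, which treats more general kernels and ranges) is a disk whose binary membership is replaced by a linear ramp of width w across the boundary:

```python
# Hedged sketch of one smoothed range: a disk where membership is a value
# in [0, 1] that decays linearly across a band of width w at the boundary.

import math

def smoothed_disk(point, center, radius, w=0.5):
    """1 well inside the disk, 0 well outside, linear ramp near the boundary."""
    d = math.dist(point, center)
    if d <= radius - w / 2:
        return 1.0
    if d >= radius + w / 2:
        return 0.0
    return (radius + w / 2 - d) / w

for p in [(0.0, 0.0), (0.9, 0.0), (1.1, 0.0), (2.0, 0.0)]:
    print(p, smoothed_disk(p, center=(0.0, 0.0), radius=1.0))
```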

    Optimal estimation for Large-Eddy Simulation of turbulence and application to the analysis of subgrid models

    The tools of optimal estimation are applied to the study of subgrid models for Large-Eddy Simulation of turbulence. The concept of an optimal estimator is introduced and its properties are analyzed in the context of a priori tests of subgrid models. Attention is focused on the Cook and Riley model in the case of a scalar field in isotropic turbulence. Using DNS data, the relevance of the beta assumption is estimated by computing (i) generalized optimal estimators and (ii) the error introduced by this assumption alone. Optimal estimators are computed for the subgrid variance using various sets of variables and various techniques (histograms and neural networks). It is shown that optimal estimators allow a thorough exploration of models. Neural networks prove to be relevant and very efficient in this framework, and further uses are suggested
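
    For a quadratic error, the optimal estimator of a subgrid quantity y from resolved variables x is the conditional mean E[y | x], which the abstract says is approximated with histograms and neural networks. The sketch below shows the 1-D histogram version on synthetic stand-in data; the function name and data are invented for illustration.

```python
# Hedged sketch of the optimal-estimator idea: approximate E[y | x] by
# binning x and averaging y within each bin (the histogram technique).

import numpy as np

def histogram_optimal_estimator(x, y, n_bins=32):
    """Binned estimate of E[y | x]; returns bin edges and per-bin means."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    means = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return edges, means

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)             # stand-in for a resolved variable
y = x**2 + 0.1 * rng.normal(size=x.size)   # stand-in for a subgrid quantity
edges, means = histogram_optimal_estimator(x, y)
# The irreducible error of ANY model built on x alone is E[(y - E[y|x])^2].
```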

    A preliminary approach to the multilabel classification problem of Portuguese juridical documents

    Portuguese juridical documents from Supreme Courts and the Attorney General’s Office are manually classified by juridical experts into a set of classes belonging to a taxonomy of concepts. In this paper, a preliminary approach to developing techniques for automatically classifying these juridical documents is proposed. The basic strategy is to integrate natural language processing techniques with machine learning ones. Support Vector Machines (SVMs) are used as the learning algorithm, and the results obtained are presented and compared with those of other approaches, such as C4.5 and Naive Bayes.
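
    A minimal sketch of the basic strategy, assuming the standard one-vs-rest reduction of multilabel classification to binary SVMs over TF-IDF features (the abstract does not specify the features or reduction; the toy documents and labels below are invented):

```python
# Hedged sketch: TF-IDF features plus one linear SVM per taxonomy class
# (one-vs-rest), the usual reduction of multilabel classification.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["contract breach damages", "tax appeal ruling",
        "contract tax dispute"]                        # toy documents
labels = [{"civil"}, {"fiscal"}, {"civil", "fiscal"}]  # toy taxonomy labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # multilabel targets as a 0/1 matrix
X = TfidfVectorizer().fit_transform(docs)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))  # predicted label sets
```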

    Learning from Minimum Entropy Queries in a Large Committee Machine

    In supervised learning, the redundancy contained in random examples can be avoided by learning from queries. Using statistical mechanics, we study learning from minimum entropy queries in a large tree-committee machine. The generalization error decreases exponentially with the number of training examples, providing a significant improvement over the algebraic decay for random examples. The connection between entropy and generalization error in multi-layer networks is discussed, and a computationally cheap algorithm for constructing queries is suggested and analysed. Comment: 4 pages, REVTeX, multicol, epsf, two PostScript figures. To appear in Physical Review E (Rapid Communications)
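
    A generic illustration of the query-selection principle (not the paper's tree-committee analysis): among candidate inputs, query the one on which a committee of hypotheses disagrees most, i.e. whose vote split has maximum entropy. The committee of random perceptrons and all sizes below are invented for the example.

```python
# Hedged sketch: pick the next query as the candidate input that maximizes
# the entropy of a hypothesis committee's vote split (maximum disagreement).

import math
import random

random.seed(0)
dim, committee_size = 8, 15
committee = [[random.gauss(0, 1) for _ in range(dim)]
             for _ in range(committee_size)]  # toy perceptron committee

def vote_entropy(x):
    """Binary entropy of the committee's +/- vote split on input x."""
    votes = sum(sum(w * xi for w, xi in zip(member, x)) > 0
                for member in committee)
    p = votes / committee_size
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

candidates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
query = max(candidates, key=vote_entropy)  # most informative next example
print(round(vote_entropy(query), 3))       # ~1.0 for a near-even split
```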