
    Measuring efficiency in high-accuracy, broad-coverage statistical parsing

    Very little attention has been paid to comparing the efficiency of high-accuracy statistical parsers. This paper proposes a machine-independent metric that is general enough to allow comparisons across very different parsing architectures. This metric, which we call "events considered", measures the number of "events", however they are defined for a particular parser, for which a probability must be calculated in order to find the parse. It is applicable to single-pass or multi-stage parsers. We discuss the advantages of the metric and demonstrate its usefulness by using it to compare two parsers which differ in several fundamental ways.
    Comment: 8 pages, 4 figures, 2 tables
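
    The metric is architecture-agnostic because it only requires instrumenting the point where a parser scores an event. A minimal sketch of that idea, assuming a parser whose probability model is a plain Python callable (the names below are illustrative, not from the paper):

```python
# Sketch of the "events considered" metric: wrap whatever callable
# computes event probabilities with a counter, so parsers with very
# different architectures can report a machine-independent work measure.

class EventCounter:
    """Counts every event whose probability the parser evaluates."""

    def __init__(self, prob_model):
        self.prob_model = prob_model      # callable: event -> probability
        self.events_considered = 0

    def prob(self, event):
        self.events_considered += 1       # one more "event considered"
        return self.prob_model(event)

# A parser (chart-based, shift-reduce, multi-stage, ...) calls
# counter.prob(...) wherever it scores an event; after parsing,
# counter.events_considered is comparable across parsers.
```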

    Optimal Information Retrieval with Complex Utility Functions

    Existing retrieval models all attempt to optimize a single utility function, which is often based on the topical relevance of a document with respect to a query. In real applications, retrieval involves more complex utility functions that may encode preferences on several different dimensions. In this paper, we present a general optimization framework for retrieval with complex utility functions. A query language is designed according to this framework to enable users to submit complex queries. We propose an efficient algorithm for retrieval with complex utility functions based on the Apriori algorithm. As a case study, we apply our algorithm to a complex utility retrieval problem in distributed IR. Experimental results show that our algorithm allows for a flexible tradeoff between multiple retrieval criteria. Finally, we study the efficiency of our algorithm on simulated data.
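
    To make the setting concrete, here is an illustrative sketch (not the paper's query language or Apriori-based algorithm) of retrieval under a utility with preferences on several dimensions; the dimension names and weights are invented for the example:

```python
# Illustrative only: ranking documents by a multi-dimensional utility
# instead of topical relevance alone.

def utility(scores, weights):
    # Aggregate per-dimension scores (e.g. relevance, recency) into a
    # single utility value for one document.
    return sum(weights[dim] * scores[dim] for dim in weights)

def retrieve(docs, weights, k):
    # Rank all documents by the complex utility and return the top k.
    ranked = sorted(docs, key=lambda d: utility(docs[d], weights), reverse=True)
    return ranked[:k]

docs = {
    "d1": {"relevance": 0.9, "recency": 0.2},
    "d2": {"relevance": 0.6, "recency": 0.8},
}
print(retrieve(docs, {"relevance": 0.7, "recency": 0.3}, k=1))  # -> ['d1']
```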

    Incremental construction of classifier and discriminant ensembles

    We discuss approaches to incrementally construct an ensemble. The first constructs an ensemble of classifiers, choosing a subset from a larger set; the second constructs an ensemble of discriminants, where a classifier is used for some classes only. We investigate criteria including accuracy, significant improvement, diversity, correlation, and the role of search direction. For discriminant ensembles, we test subset selection and trees. Fusion is by voting or by a linear model. Using 14 classifiers on 38 data sets, incremental search finds small, accurate ensembles in polynomial time. The discriminant ensemble uses a subset of discriminants and is simpler, interpretable, and accurate. We see that an incremental ensemble has higher accuracy than bagging and the random subspace method, and accuracy comparable to AdaBoost, but with fewer classifiers.
    We would like to thank the three anonymous referees and the editor for their constructive comments, pointers to related literature, and pertinent questions which allowed us to better situate our work as well as organize the manuscript and improve the presentation. This work has been supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program (EA-TUBA-GEBIP/2001-1-1), Bogazici University Scientific Research Project 05HA101, and Turkish Scientific Technical Research Council TUBITAK EEEAG 104EO79.
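
    A minimal sketch of the forward-selection variant of this idea, assuming scikit-learn-style classifiers with integer class labels and plain validation accuracy as the selection criterion (the paper also studies diversity, correlation, and significance-based criteria, which are not reproduced here):

```python
import numpy as np

def majority_vote(members, X):
    # Fuse member predictions by voting; assumes integer class labels.
    votes = np.stack([m.predict(X) for m in members])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def incremental_ensemble(candidates, X_val, y_val):
    # Greedy forward selection: repeatedly add the candidate classifier
    # that most improves validation accuracy of the voted ensemble.
    ensemble, best_acc = [], 0.0
    improved = True
    while improved and candidates:
        improved = False
        for clf in candidates:
            acc = np.mean(majority_vote(ensemble + [clf], X_val) == y_val)
            if acc > best_acc:
                best_clf, best_acc, improved = clf, acc, True
        if improved:
            ensemble.append(best_clf)
            candidates.remove(best_clf)
    return ensemble
```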

    Mathematical statistics vs machine learning: towards an intelligent modeling framework for soil and plant growth processes

    Dual-degree master's dissertation with Kuban State Agrarian University. The work described in this dissertation focuses on methods of mathematical statistics (MS) and machine learning (ML) used in precision farming (PF), in particular for crop-yield prediction and soil treatment. The purpose of the work is to investigate the practical application of these methods to a specific set of data. In the course of the work, the following tasks were completed: the current state of the field of PF was surveyed, and the theoretical foundations of the MS and ML methods were studied and then subjected to practical tests on a specific data set. Conclusions were drawn about the advantages and disadvantages of these methods. A selection of scientific works on the introduction of a specific set of nutrients into the soil was also reviewed. The most important contributions of this work are the practical application of various methods of analysis, as well as the design of a decision support tool (DST) intended to help farmers integrate PF into their farms.

    IO-Top-k: index-access optimized top-k query processing

    Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled scheduling methods based on a Knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index-list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms.
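
    For orientation, a compact sketch of the basic threshold-algorithm pattern that this work optimizes is given below. The paper's Knapsack-based sequential-access scheduling and random-access cost model are not reproduced; the round-robin scanning with immediate random accesses shown here is the textbook baseline:

```python
import heapq

def top_k(index_lists, lookup, k):
    # index_lists: one list per query condition, sorted by descending score,
    #   each entry (doc_id, score); lookup[i] maps doc_id -> score in list i
    #   (the "random access").
    seen, topk = set(), []                 # topk: min-heap of (total, doc)
    for pos in range(max(map(len, index_lists))):
        threshold = 0.0                    # best total still possible for unseen docs
        for i, lst in enumerate(index_lists):
            if pos >= len(lst):
                continue
            doc, score = lst[pos]
            threshold += score
            if doc not in seen:
                seen.add(doc)
                total = sum(l.get(doc, 0.0) for l in lookup)  # random accesses
                heapq.heappush(topk, (total, doc))
                if len(topk) > k:
                    heapq.heappop(topk)
        if len(topk) == k and topk[0][0] >= threshold:
            break                          # early termination: scans can stop
    return sorted(topk, reverse=True)
```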

    Case study of Hyperparameter Optimization framework Optuna on a Multi-column Convolutional Neural Network

    To observe the condition of flower growth during the blooming period and estimate the harvest forecast of Canola crops, the ‘Flower Counter’ application has been developed by the researchers of P2IRC at the University of Saskatchewan. The model has been developed using a Deep Learning based Multi-column Convolutional Neural Network (MCNN) algorithm and the TensorFlow framework, in order to count the Canola flowers from images based on learning from a given set of training images. To ensure a better accuracy score with respect to flower prediction, proper training of the model is essential, involving appropriate values of hyperparameters. Among the numerous possible values of these hyperparameters, selecting suitable ones is a time-consuming and tedious task for humans. Ongoing research on Automated Hyperparameter Optimization (HPO) frameworks has attracted researchers and practitioners to develop and utilize such frameworks to give directions towards finding better hyperparameters for their applications. The primary goal of this research work is to apply the Automated HPO framework Optuna to the Flower Counter application, with the purpose of directing researchers towards the best observed hyperparameter configurations for good overall performance in terms of prediction accuracy and resource utilization. This work would help researchers and plant scientists gain knowledge about the practicality of Optuna while treating it as a black box, and apply it to this application as well as other similar applications. To achieve this goal, three essential hyperparameters, batch size, learning rate and number of epochs, have been chosen for assessing their individual and combined impacts. Since the training of the model depends on datasets collected during diverse weather conditions, there could be factors that impact Optuna's functionality and performance. The analysis of the results of the current work and comparison of the accuracy scores with the previous work have yielded almost equal scores while testing the model's performance on different test populations. Moreover, for the tuned version of the model, the current work has shown the potential for achieving that result with substantially lower resource utilization. The findings have provided useful concepts about making better use of Optuna; the search space can be restricted or more complicated objective functions can be implemented to ensure better stability of the models obtained when the chosen parameters are used in training.
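
    A hedged sketch of how the three hyperparameters studied here can be exposed to Optuna. train_and_evaluate() is a stand-in for training the MCNN Flower Counter and returning validation accuracy, not code from this work, and the search ranges are invented for illustration:

```python
import random

import optuna

def train_and_evaluate(batch_size, lr, epochs):
    # Stand-in for the real MCNN training loop; returns a synthetic score
    # so the sketch runs end to end. Replace with actual validation accuracy.
    return random.random()

def objective(trial):
    # Optuna samples one configuration of the three hyperparameters per trial.
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    epochs = trial.suggest_int("epochs", 5, 50)
    return train_and_evaluate(batch_size, lr, epochs)

study = optuna.create_study(direction="maximize")  # maximize accuracy
study.optimize(objective, n_trials=50)             # Optuna as a black box
print(study.best_params)
```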

    Discrete optimization algorithms for marker-assisted plant breeding


    Learning to predict under a budget

    Prediction-time budgets in machine learning applications can arise due to monetary or computational costs associated with acquiring information; they also arise due to latency and power-consumption costs in evaluating increasingly complex models. The goal in such budgeted prediction problems is to learn decision systems that maintain high prediction accuracy while meeting average cost constraints at prediction time. Such decision systems can potentially adapt to the input examples, predicting most of them at low cost while allocating more budget for the few "hard" examples. In this thesis, I will present several learning methods to better trade off cost and error during prediction. The conceptual contribution of this thesis is to develop a new paradigm of bottom-up approach instead of the traditional top-down approach. A top-down approach attempts to build out the model by selectively adding the most cost-effective features to improve accuracy. In contrast, a bottom-up approach first learns a highly accurate model and then prunes or adaptively approximates it to trade off cost and error. Training top-down models in the presence of feature acquisition costs leads to fundamental combinatorial issues in multi-stage search over all feature subsets. In contrast, we show that the bottom-up methods bypass many of these issues. To develop this theme, we first propose two top-down methods and then two bottom-up methods. The first top-down method uses margin information from training data in the partial feature neighborhood of a test point to either select the next best feature in a greedy fashion or to stop and make a prediction. The second top-down method is a variant of the random forest (RF) algorithm: we grow decision trees with low acquisition cost and high strength based on greedy mini-max cost-weighted impurity splits. Theoretically, we establish near-optimal acquisition cost guarantees for our algorithm. The first bottom-up method we propose is based on pruning RFs to optimize expected feature cost and accuracy. Given an RF as input, we pose pruning as a novel 0-1 integer program and show that it can be solved exactly via LP relaxation. We further develop a fast primal-dual algorithm that scales to large datasets. The second bottom-up method is adaptive approximation, which significantly generalizes RF pruning to accommodate more models and other types of costs besides feature acquisition cost. We first train a high-accuracy, high-cost model. We then jointly learn a low-cost gating function together with a low-cost prediction model to adaptively approximate the high-cost model. The gating function identifies the regions of the input space where the low-cost model suffices for making highly accurate predictions. We demonstrate the empirical performance of these methods and compare them to state-of-the-art approaches. Finally, we study adaptive approximation in the online setting to obtain regret guarantees and discuss future work.
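
    A minimal sketch of the adaptive-approximation idea from the bottom-up approach: a cheap gate routes "easy" inputs to a low-cost model and sends only the rest to the pre-trained high-cost model. The confidence-threshold gate below is an illustrative stand-in for the jointly learned gating function in the thesis:

```python
import numpy as np

def adaptive_predict(x, gate, low_cost_model, high_cost_model, tau=0.8):
    # gate(x) scores how safe the cheap path is on input x; the threshold
    # tau trades off average prediction cost against error.
    if gate(x) >= tau:
        return low_cost_model(x), "low-cost"
    return high_cost_model(x), "high-cost"

# Toy demo: the gate is the cheap model's top class probability.
low_proba = lambda x: np.array([0.9, 0.1]) if x[0] > 0 else np.array([0.55, 0.45])
gate = lambda x: float(np.max(low_proba(x)))
low_model = lambda x: int(np.argmax(low_proba(x)))
high_model = lambda x: 0  # stand-in for the expensive, accurate model

print(adaptive_predict(np.array([1.0]), gate, low_model, high_model))   # cheap path
print(adaptive_predict(np.array([-1.0]), gate, low_model, high_model))  # costly path
```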