2,172 research outputs found

    Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

    Full text link
    In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F1-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50.000 LSTM-networks for five sequence tagging tasks, we present network architectures that produce both superior performance as well as are more stable with respect to the remaining hyperparameters.Comment: Accepted at EMNLP 201

    Distribution matching for transduction

    Get PDF
    Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach.

    Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

    Full text link
    Selecting optimal parameters for a neural network architecture can often make the difference between mediocre and state-of-the-art performance. However, little is published which parameters and design choices should be evaluated or selected making the correct hyperparameter optimization often a "black art that requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate the importance of different network design choices and hyperparameters for five common linguistic sequence tagging tasks (POS, Chunking, NER, Entity Recognition, and Event Detection). We evaluated over 50.000 different setups and found, that some parameters, like the pre-trained word embeddings or the last layer of the network, have a large impact on the performance, while other parameters, for example the number of LSTM layers or the number of recurrent units, are of minor importance. We give a recommendation on a configuration that performs well among different tasks.Comment: 34 pages. 9 page version of this paper published at EMNLP 201

    Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

    Full text link
    Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing it can come along with a lot of visibility. The question arises how reliable are our evaluation methodologies to compare approaches? One common methodology to identify the state-of-the-art is to partition data into a train, a development and a test set. Researchers can train and tune their approach on some part of the dataset and then select the model that worked best on the development set for a final evaluation on unseen test data. Test scores from different approaches are compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistical significance in this type of evaluation is not due to a superior learning approach. Instead, there is a high risk that the difference is due to chance. For example for the CoNLL 2003 NER dataset we observed in up to 26% of the cases type I errors (false positives) with a threshold of p < 0.05, i.e., falsely concluding a statistically significant difference between two identical approaches. We prove that this evaluation setup is unsuitable to compare learning approaches. We formalize alternative evaluation setups based on score distributions

    Comparing knowledge sources for nominal anaphora resolution

    Get PDF
    We compare two ways of obtaining lexical knowledge for antecedent selection in other-anaphora and definite noun phrase coreference. Specifically, we compare an algorithm that relies on links encoded in the manually created lexical hierarchy WordNet and an algorithm that mines corpora by means of shallow lexico-semantic patterns. As corpora we use the British National Corpus (BNC), as well as the Web, which has not been previously used for this task. Our results show that (a) the knowledge encoded in WordNet is often insufficient, especially for anaphor-antecedent relations that exploit subjective or context-dependent knowledge; (b) for other-anaphora, the Web-based method outperforms the WordNet-based method; (c) for definite NP coreference, the Web-based method yields results comparable to those obtained using WordNet over the whole dataset and outperforms the WordNet-based method on subsets of the dataset; (d) in both case studies, the BNC-based method is worse than the other methods because of data sparseness. Thus, in our studies, the Web-based method alleviated the lexical knowledge gap often encountered in anaphora resolution, and handled examples with context-dependent relations between anaphor and antecedent. Because it is inexpensive and needs no hand-modelling of lexical knowledge, it is a promising knowledge source to integrate in anaphora resolution systems

    Machine learning and its applications in reliability analysis systems

    Get PDF
    In this thesis, we are interested in exploring some aspects of Machine Learning (ML) and its application in the Reliability Analysis systems (RAs). We begin by investigating some ML paradigms and their- techniques, go on to discuss the possible applications of ML in improving RAs performance, and lastly give guidelines of the architecture of learning RAs. Our survey of ML covers both levels of Neural Network learning and Symbolic learning. In symbolic process learning, five types of learning and their applications are discussed: rote learning, learning from instruction, learning from analogy, learning from examples, and learning from observation and discovery. The Reliability Analysis systems (RAs) presented in this thesis are mainly designed for maintaining plant safety supported by two functions: risk analysis function, i.e., failure mode effect analysis (FMEA) ; and diagnosis function, i.e., real-time fault location (RTFL). Three approaches have been discussed in creating the RAs. According to the result of our survey, we suggest currently the best design of RAs is to embed model-based RAs, i.e., MORA (as software) in a neural network based computer system (as hardware). However, there are still some improvement which can be made through the applications of Machine Learning. By implanting the 'learning element', the MORA will become learning MORA (La MORA) system, a learning Reliability Analysis system with the power of automatic knowledge acquisition and inconsistency checking, and more. To conclude our thesis, we propose an architecture of La MORA

    Herbert Simon's decision-making approach: Investigation of cognitive processes in experts

    Get PDF
    This is a post print version of the article. The official published can be obtained from the links below - PsycINFO Database Record (c) 2010 APA, all rights reserved.Herbert Simon's research endeavor aimed to understand the processes that participate in human decision making. However, despite his effort to investigate this question, his work did not have the impact in the “decision making” community that it had in other fields. His rejection of the assumption of perfect rationality, made in mainstream economics, led him to develop the concept of bounded rationality. Simon's approach also emphasized the limitations of the cognitive system, the change of processes due to expertise, and the direct empirical study of cognitive processes involved in decision making. In this article, we argue that his subsequent research program in problem solving and expertise offered critical tools for studying decision-making processes that took into account his original notion of bounded rationality. Unfortunately, these tools were ignored by the main research paradigms in decision making, such as Tversky and Kahneman's biased rationality approach (also known as the heuristics and biases approach) and the ecological approach advanced by Gigerenzer and others. We make a proposal of how to integrate Simon's approach with the main current approaches to decision making. We argue that this would lead to better models of decision making that are more generalizable, have higher ecological validity, include specification of cognitive processes, and provide a better understanding of the interaction between the characteristics of the cognitive system and the contingencies of the environment
    • …
    corecore