1,466 research outputs found

    Children as Models for Computers: Natural Language Acquisition for Machine Learning

    No full text
    This paper focuses on a subfield of machine learning, so-called grammatical inference. Roughly speaking, grammatical inference deals with the problem of inferring a grammar that generates a given set of sample sentences, where the inference is carried out by some algorithm. We discuss how the analysis and formalization of the main features of the process of human natural language acquisition may improve results in the area of grammatical inference.
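
    A minimal sketch of the basic setup behind grammatical inference (not the approach discussed in this paper): state-merging learners such as RPNI typically start from a prefix tree acceptor built from positive sample sentences and then generalize it by merging compatible states. The sample strings below are made up for illustration.

    # Minimal sketch: build a prefix tree acceptor (PTA) from positive samples,
    # the usual starting point of state-merging grammatical inference algorithms.

    def build_pta(samples):
        """Return (transitions, accepting): transitions maps (state, symbol) -> state."""
        transitions = {}
        accepting = set()
        next_state = 1                      # state 0 is the root (empty prefix)
        for word in samples:
            state = 0
            for symbol in word:
                if (state, symbol) not in transitions:
                    transitions[(state, symbol)] = next_state
                    next_state += 1
                state = transitions[(state, symbol)]
            accepting.add(state)
        return transitions, accepting

    def accepts(transitions, accepting, word):
        """Check whether the PTA accepts a word."""
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                return False
            state = transitions[(state, symbol)]
        return state in accepting

    samples = ["ab", "aabb", "aaabbb"]      # toy positive sample set
    trans, acc = build_pta(samples)
    print(accepts(trans, acc, "aabb"))      # True: seen during construction
    print(accepts(trans, acc, "aaaabbbb"))  # False: the PTA does not generalize yet

    A state-merging step would then collapse compatible states to obtain a smaller automaton that generalizes beyond the exact sample set.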

    DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage

    Full text link
    Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because the mathematical structure of the Levenshtein distance remains largely uncharted, designing an IDS-correcting code is still a challenge. In this paper, we propose an approach that uses deep Levenshtein distance embedding to bypass these mathematical challenges. By representing the Levenshtein distance between two sequences as a conventional distance between their corresponding embedding vectors, the inherent structure of the Levenshtein distance is revealed in a more tractable embedding space. Leveraging this embedding space, we introduce DoDo-Code, an IDS-correcting code that combines deep embedding of the Levenshtein distance, embedding-based codeword search, and embedding-based segment correction. To address the requirements of DNA storage, we also present a preliminary algorithm for long-sequence decoding. To the best of our knowledge, DoDo-Code is the first IDS-correcting code designed using deep learning methodologies, potentially paving the way for a new direction in error-correcting code research. It is also the first IDS code that exhibits characteristics of being 'optimal' in terms of redundancy, significantly outperforming the mainstream IDS-correcting codes of the Varshamov-Tenengolts code family in code rate.
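
    A hedged sketch of the core idea only: the paper learns a deep embedding f so that a conventional distance between f(x) and f(y) approximates the Levenshtein distance between x and y. Below, a crude hand-crafted embedding (not the paper's network) stands in for f and is compared against the exact dynamic-programming Levenshtein distance; the sequences and the surrogate embedding are illustrative assumptions.

    # Sketch of the embedding idea: replace the expensive Levenshtein distance
    # with a distance between fixed-size vectors. The crude count-based embedding
    # below is only a stand-in for the learned deep embedding.

    def levenshtein(a, b):
        """Exact edit distance via the standard dynamic program."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def crude_embedding(seq, alphabet="ACGT"):
        """Toy embedding: per-symbol counts plus length (a weak Levenshtein surrogate)."""
        return [seq.count(s) for s in alphabet] + [len(seq)]

    def embedding_distance(x, y):
        ex, ey = crude_embedding(x), crude_embedding(y)
        return sum(abs(a - b) for a, b in zip(ex, ey)) / 2   # L1 / 2 lower-bounds Lev

    for x, y in [("ACGT", "AGT"), ("AAAA", "TTTT"), ("ACCGT", "ACGGT")]:
        print(x, y, "Lev =", levenshtein(x, y),
              "embedding approx =", embedding_distance(x, y))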

    Computers in Support of Musical Expression

    Get PDF

    Structure Preserving Encoding of Non-euclidean Similarity Data

    Get PDF
    Domain-specific proximity measures, like divergence measures in signal processing or alignment scores in bioinformatics, often lead to non-metric, indefinite similarities or dissimilarities. However, many classical learning algorithms like kernel machines assume metric properties and struggle with such metric violations. For example, the classical support vector machine is no longer able to converge to an optimum. One possible direction to solve the indefiniteness problem is to transform the non-metric (dis-)similarity data into positive (semi-)definite matrices. For this purpose, many approaches have been proposed that adapt the eigenspectrum of the given data such that positive definiteness is ensured. Unfortunately, most of these approaches modify the eigenspectrum so strongly that valuable information is removed or noise is added to the data. In particular, the shift operation has attracted a lot of interest in the past few years despite its frequently recurring disadvantages. In this work, we propose a modified advanced shift correction method that preserves the eigenspectrum structure of the data by means of a low-rank approximated nullspace correction. We compare our advanced shift to classical eigenvalue corrections like eigenvalue clipping, flipping, squaring, and shifting on several benchmark data sets. The impact of a low-rank approximation on the data's eigenspectrum is analyzed.
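
    A small numpy sketch of the classical eigenvalue corrections the paper compares against (clipping, flipping, squaring, shifting); the proposed low-rank advanced shift correction itself is not reproduced here, and the random indefinite similarity matrix is made up for illustration.

    # Classical eigenspectrum corrections for a symmetric indefinite similarity
    # matrix S. Each returns a positive semi-definite surrogate of S.
    import numpy as np

    def correct_eigenspectrum(S, mode="clip"):
        S = (S + S.T) / 2                     # enforce symmetry
        vals, vecs = np.linalg.eigh(S)
        if mode == "clip":                    # set negative eigenvalues to zero
            vals = np.maximum(vals, 0.0)
        elif mode == "flip":                  # take absolute eigenvalues
            vals = np.abs(vals)
        elif mode == "square":                # square the eigenvalues
            vals = vals ** 2
        elif mode == "shift":                 # shift spectrum by the most negative eigenvalue
            vals = vals - min(vals.min(), 0.0)
        else:
            raise ValueError(f"unknown mode: {mode}")
        return vecs @ np.diag(vals) @ vecs.T

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    S = (A + A.T) / 2                         # toy indefinite similarity matrix
    for mode in ("clip", "flip", "square", "shift"):
        corrected = correct_eigenspectrum(S, mode)
        print(mode, "min eigenvalue:", round(np.linalg.eigvalsh(corrected).min(), 6))

    Clipping and flipping discard or distort the negative part of the spectrum, while shifting moves every eigenvalue by the same offset, which is exactly the kind of strong global modification the paper's low-rank correction tries to avoid.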

    Automating data preparation with statistical analysis

    Get PDF
    Data preparation is the process of transforming raw data into a clean and consumable format. It is widely known as the bottleneck to extracting value and insights from data, due to the number of possible tasks in the pipeline and the factors that can largely affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners have devised a great variety of techniques and tools over the decades, yet many of them still place a significant burden on the human side to configure suitable input rules and parameters. In this thesis, with the goal of reducing human manual effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends in data and deducing properties of an underlying probability distribution from a sample, for example, by testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically figures out the queries for data enrichment via web API data by estimating the potential benefit of issuing a certain query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end users get results with no effort. Finally, we introduce AutoML-EM, which aims to automate the entity matching model development process. Entity matching is the task of finding records that refer to the same real-world entity. Our work provides powerful angles for automating various data preparation steps, and we conclude this thesis by discussing future directions.
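
    To make the entity matching subtask concrete, here is a minimal baseline sketch (not AutoML-EM): string-similarity features on candidate record pairs fed to a simple classifier. The record fields, the tiny labeled set, and the helper names are hypothetical.

    # Baseline entity matching sketch: featurize a record pair with string
    # similarities, then train a small classifier on labeled match/non-match pairs.
    from difflib import SequenceMatcher
    from sklearn.linear_model import LogisticRegression

    def pair_features(a, b):
        """Similarity features for a candidate record pair (name and city fields)."""
        name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        city_sim = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
        return [name_sim, city_sim]

    # Toy labeled pairs: 1 = same real-world entity, 0 = different.
    pairs = [
        ({"name": "Acme Corp.", "city": "Berlin"}, {"name": "ACME Corporation", "city": "Berlin"}, 1),
        ({"name": "Acme Corp.", "city": "Berlin"}, {"name": "Apex Ltd.", "city": "Munich"}, 0),
        ({"name": "Globex GmbH", "city": "Hamburg"}, {"name": "Globex", "city": "Hamburg"}, 1),
        ({"name": "Globex GmbH", "city": "Hamburg"}, {"name": "Initech", "city": "Vienna"}, 0),
    ]
    X = [pair_features(a, b) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]

    model = LogisticRegression().fit(X, y)
    query = ({"name": "ACME corp", "city": "berlin"}, {"name": "Acme Corporation", "city": "Berlin"})
    print("match probability:", model.predict_proba([pair_features(*query)])[0][1])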

    Voice Recognition and Mobility in the Legal Industry

    Get PDF
    In a typical legal work environment, attorneys work with their staff to generate and send case-related legal documents and communications. Traditionally, the attorney dictates to a device capable of recording audio, and the legal assistant transcribes the audio directly from the source. In the early days of recorded dictation, audio was saved to analog tape. Once the technology became available, dictation was saved digitally to flash memory and transferred to hard disk for playback by the legal assistant. It has been this way for years, but due to advances in voice recognition technology and computer processing, there are now alternative options to the traditional dictation/transcription process. The focus of this paper is to examine the traditional dictation/transcription process and how it compares to the process of using voice recognition software. An analysis of each process as well as an evaluation of voice recognition software will be performed. The document generation process will also be examined as it relates to transcription and creating a document, regardless of the content. The most efficient solution for a small to medium-sized law firm will be recommended. According to Understanding How Law Offices Do Business, a small law firm has between one and ten lawyers and a mid-size law firm has up to 50 lawyers; these firms are the target audience. The goal of this paper is to determine whether the use of voice recognition software can help attorneys and their staff be more efficient, and if so, which voice recognition software and methods work best. Tests will be performed analyzing both Dragon Naturally Speaking 12 Professional and Windows 7 voice recognition software on the desktop. The software with the higher accuracy rate in our tests will be used to evaluate voice recognition processes throughout this paper.

    Volume CXXII, Number 3, October 1, 2004

    Get PDF

    Comprehensive Evaluation of Machine Learning Experiments: Algorithm Comparison, Algorithm Performance and Inferential Reproducibility

    Get PDF
    This doctoral thesis addresses critical methodological aspects of machine learning experimentation, focusing on enhancing the evaluation and analysis of algorithm performance. The established "train-dev-test paradigm" commonly guides machine learning practitioners, involving nested optimization processes to tune model parameters and meta-parameters and benchmarking against test data. However, this paradigm overlooks crucial aspects, such as algorithm variability and the intricate relationship between algorithm performance and meta-parameters. This work introduces a comprehensive framework that employs statistical techniques to bridge these gaps, advancing the methodological standards in empirical machine learning research. The foundational premise of this thesis lies in differentiating between algorithms and classifiers, recognizing that an algorithm may yield multiple classifiers due to inherent stochasticity or design choices. Consequently, algorithm performance becomes inherently probabilistic and cannot be captured by a single metric. The contributions of this work are structured around three core themes.

    Algorithm Comparison: A fundamental aim of empirical machine learning research is algorithm comparison. To this end, the thesis proposes utilizing Linear Mixed Effects Models (LMEMs) for analyzing evaluation data. LMEMs offer distinct advantages by accommodating complex data structures beyond the typical independent and identically distributed (iid) assumption. Thus, LMEMs enable a holistic analysis of algorithm instances and facilitate the construction of nuanced conditional models of expected risk, supporting algorithm comparisons based on diverse data properties.

    Algorithm Performance Analysis: Contemporary evaluation practices often treat algorithms and classifiers as black boxes, hindering insights into their performance and parameter dependencies. Leveraging LMEMs, specifically by implementing Variance Component Analysis, the thesis introduces methods from psychometrics to quantify algorithm performance homogeneity (reliability) and assess the influence of meta-parameters on performance. The flexibility of LMEMs allows a granular analysis of this relationship and extends these techniques to analyze data annotation processes linked to algorithm performance.

    Inferential Reproducibility: Building upon the preceding chapters, this section showcases a unified approach to analyzing machine learning experiments comprehensively. By leveraging the full range of generated model instances, the analysis provides a nuanced understanding of competing algorithms. The outcomes offer implementation guidelines for algorithmic modifications and consolidate incongruent findings across diverse datasets, contributing to a coherent empirical perspective on algorithmic effects.

    This work underscores the significance of addressing algorithmic variability, meta-parameter impact, and the probabilistic nature of algorithm performance. By introducing robust statistical methodologies that facilitate extensive empirical analysis, the thesis aims to enhance the transparency, reproducibility, and interpretability of machine learning experiments. It extends beyond conventional guidelines, offering a principled approach to advance the understanding and evaluation of algorithms in the evolving landscape of machine learning and data science.
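
    A hedged illustration of the LMEM idea described above (not the thesis's full framework): fitting a linear mixed effects model to per-run evaluation scores with statsmodels, treating the algorithm as a fixed effect and the resampled data split as a random effect. The column names and simulated scores are made up for this sketch.

    # Fit an LMEM to repeated evaluation runs: fixed effect = algorithm, random
    # intercept = data split. This relaxes the iid assumption a plain t-test over
    # all runs would implicitly make.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(42)
    split_effects = rng.normal(0, 0.02, size=10)     # shared split difficulty
    rows = []
    for algorithm, base in [("baseline", 0.80), ("proposed", 0.83)]:
        for split in range(10):                      # 10 resampled train/test splits
            for _ in range(5):                       # 5 repeated runs per split
                score = base + split_effects[split] + rng.normal(0, 0.01)
                rows.append({"algorithm": algorithm, "split": split, "score": score})
    data = pd.DataFrame(rows)

    model = smf.mixedlm("score ~ algorithm", data, groups=data["split"]).fit()
    print(model.summary())

    The fixed-effect coefficient for the algorithm factor then estimates the performance difference while the random intercepts absorb split-level variability, which is the kind of conditional model of expected risk the thesis builds on.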