1,466 research outputs found

    Children as Models for Computers: Natural Language Acquisition for Machine Learning

    No full text
    This paper focuses on a subfield of machine learning, so-called grammatical inference. Roughly speaking, grammatical inference deals with the problem of inferring a grammar that generates a given set of sample sentences, where the inference is carried out by some algorithm. We discuss how the analysis and formalization of the main features of the process of human natural language acquisition may improve results in the area of grammatical inference.
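
    A minimal sketch of the basic setup behind grammatical inference (not the approach discussed in this paper): state-merging learners such as RPNI typically start from a prefix tree acceptor built from positive sample sentences and then generalize it by merging compatible states. The sample strings below are made up for illustration.

    # Minimal sketch: build a prefix tree acceptor (PTA) from positive samples,
    # the usual starting point of state-merging grammatical inference algorithms.

    def build_pta(samples):
        """Return (transitions, accepting): transitions maps (state, symbol) -> state."""
        transitions = {}
        accepting = set()
        next_state = 1                      # state 0 is the root (empty prefix)
        for word in samples:
            state = 0
            for symbol in word:
                if (state, symbol) not in transitions:
                    transitions[(state, symbol)] = next_state
                    next_state += 1
                state = transitions[(state, symbol)]
            accepting.add(state)
        return transitions, accepting

    def accepts(transitions, accepting, word):
        """Check whether the PTA accepts a word."""
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                return False
            state = transitions[(state, symbol)]
        return state in accepting

    samples = ["ab", "aabb", "aaabbb"]      # toy positive sample set
    trans, acc = build_pta(samples)
    print(accepts(trans, acc, "aabb"))      # True: seen during construction
    print(accepts(trans, acc, "aaaabbbb"))  # False: the PTA does not generalize yet

    A state-merging step would then collapse compatible states to obtain a smaller automaton that generalizes beyond the exact sample set.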

    DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage

    Full text link
    Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because the mathematical structure of the Levenshtein distance remains largely uncharted, designing an IDS-correcting code is still a challenge. In this paper, we propose an approach that uses deep Levenshtein distance embedding to bypass these mathematical challenges. By representing the Levenshtein distance between two sequences as a conventional distance between their corresponding embedding vectors, the inherent structure of the Levenshtein distance is revealed in a more tractable embedding space. Leveraging this embedding space, we introduce DoDo-Code, an IDS-correcting code that combines deep embedding of the Levenshtein distance, embedding-based codeword search, and embedding-based segment correction. To address the requirements of DNA storage, we also present a preliminary algorithm for long-sequence decoding. To the best of our knowledge, DoDo-Code is the first IDS-correcting code designed using deep learning methodologies, potentially paving the way for a new direction in error-correcting code research. It is also the first IDS code that exhibits characteristics of being 'optimal' in terms of redundancy, significantly outperforming the mainstream IDS-correcting codes of the Varshamov-Tenengolts code family in code rate.
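
    A hedged sketch of the core idea only: the paper learns a deep embedding f so that a conventional distance between f(x) and f(y) approximates the Levenshtein distance between x and y. Below, a crude hand-crafted embedding (not the paper's network) stands in for f and is compared against the exact dynamic-programming Levenshtein distance; the sequences and the surrogate embedding are illustrative assumptions.

    # Sketch of the embedding idea: replace the expensive Levenshtein distance
    # with a distance between fixed-size vectors. The crude count-based embedding
    # below is only a stand-in for the learned deep embedding.

    def levenshtein(a, b):
        """Exact edit distance via the standard dynamic program."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def crude_embedding(seq, alphabet="ACGT"):
        """Toy embedding: per-symbol counts plus length (a weak Levenshtein surrogate)."""
        return [seq.count(s) for s in alphabet] + [len(seq)]

    def embedding_distance(x, y):
        ex, ey = crude_embedding(x), crude_embedding(y)
        return sum(abs(a - b) for a, b in zip(ex, ey)) / 2   # L1 / 2 lower-bounds Lev

    for x, y in [("ACGT", "AGT"), ("AAAA", "TTTT"), ("ACCGT", "ACGGT")]:
        print(x, y, "Lev =", levenshtein(x, y),
              "embedding approx =", embedding_distance(x, y))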

    Computers in Support of Musical Expression

    Get PDF

    Structure Preserving Encoding of Non-euclidean Similarity Data

    Get PDF
    Domain-specific proximity measures, like divergence measures in signal processing or alignment scores in bioinformatics, often lead to non-metric, indefinite similarities or dissimilarities. However, many classical learning algorithms like kernel machines assume metric properties and struggle with such metric violations. For example, the classical support vector machine is no longer able to converge to an optimum. One possible direction to solve the indefiniteness problem is to transform the non-metric (dis-)similarity data into positive (semi-)definite matrices. For this purpose, many approaches have been proposed that adapt the eigenspectrum of the given data such that positive definiteness is ensured. Unfortunately, most of these approaches modify the eigenspectrum so strongly that valuable information is removed or noise is added to the data. In particular, the shift operation has attracted a lot of interest in the past few years despite its frequently recurring disadvantages. In this work, we propose a modified advanced shift correction method that preserves the eigenspectrum structure of the data by means of a low-rank approximated nullspace correction. We compare our advanced shift to classical eigenvalue corrections like eigenvalue clipping, flipping, squaring, and shifting on several benchmark data sets. The impact of a low-rank approximation on the data's eigenspectrum is analyzed.
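
    A small numpy sketch of the classical eigenvalue corrections the paper compares against (clipping, flipping, squaring, shifting); the proposed low-rank advanced shift correction itself is not reproduced here, and the random indefinite similarity matrix is made up for illustration.

    # Classical eigenspectrum corrections for a symmetric indefinite similarity
    # matrix S. Each returns a positive semi-definite surrogate of S.
    import numpy as np

    def correct_eigenspectrum(S, mode="clip"):
        S = (S + S.T) / 2                     # enforce symmetry
        vals, vecs = np.linalg.eigh(S)
        if mode == "clip":                    # set negative eigenvalues to zero
            vals = np.maximum(vals, 0.0)
        elif mode == "flip":                  # take absolute eigenvalues
            vals = np.abs(vals)
        elif mode == "square":                # square the eigenvalues
            vals = vals ** 2
        elif mode == "shift":                 # shift spectrum by the most negative eigenvalue
            vals = vals - min(vals.min(), 0.0)
        else:
            raise ValueError(f"unknown mode: {mode}")
        return vecs @ np.diag(vals) @ vecs.T

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    S = (A + A.T) / 2                         # toy indefinite similarity matrix
    for mode in ("clip", "flip", "square", "shift"):
        corrected = correct_eigenspectrum(S, mode)
        print(mode, "min eigenvalue:", round(np.linalg.eigvalsh(corrected).min(), 6))

    Clipping and flipping discard or distort the negative part of the spectrum, while shifting moves every eigenvalue by the same offset, which is exactly the kind of strong global modification the paper's low-rank correction tries to avoid.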

    Automating data preparation with statistical analysis

    Get PDF
    Data preparation is the process of transforming raw data into a clean and consumable format. It is widely known as the bottleneck to extracting value and insights from data, due to the number of possible tasks in the pipeline and the factors that can largely affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners have devised a great variety of techniques and tools over the decades, yet many of them still place a significant burden on the human side to configure suitable input rules and parameters. In this thesis, with the goal of reducing human manual effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends in data and deducing properties of an underlying probability distribution from a sample, for example, by testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically figures out the queries for data enrichment via web API data by estimating the potential benefit of issuing a certain query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end users get results with no effort. Finally, we introduce AutoML-EM, which aims to automate the entity matching model development process. Entity matching is the task of finding records that refer to the same real-world entity. Our work provides powerful angles for automating various data preparation steps, and we conclude this thesis by discussing future directions.
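
    To make the entity matching subtask concrete, here is a minimal baseline sketch (not AutoML-EM): string-similarity features on candidate record pairs fed to a simple classifier. The record fields, the tiny labeled set, and the helper names are hypothetical.

    # Baseline entity matching sketch: featurize a record pair with string
    # similarities, then train a small classifier on labeled match/non-match pairs.
    from difflib import SequenceMatcher
    from sklearn.linear_model import LogisticRegression

    def pair_features(a, b):
        """Similarity features for a candidate record pair (name and city fields)."""
        name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        city_sim = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
        return [name_sim, city_sim]

    # Toy labeled pairs: 1 = same real-world entity, 0 = different.
    pairs = [
        ({"name": "Acme Corp.", "city": "Berlin"}, {"name": "ACME Corporation", "city": "Berlin"}, 1),
        ({"name": "Acme Corp.", "city": "Berlin"}, {"name": "Apex Ltd.", "city": "Munich"}, 0),
        ({"name": "Globex GmbH", "city": "Hamburg"}, {"name": "Globex", "city": "Hamburg"}, 1),
        ({"name": "Globex GmbH", "city": "Hamburg"}, {"name": "Initech", "city": "Vienna"}, 0),
    ]
    X = [pair_features(a, b) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]

    model = LogisticRegression().fit(X, y)
    query = ({"name": "ACME corp", "city": "berlin"}, {"name": "Acme Corporation", "city": "Berlin"})
    print("match probability:", model.predict_proba([pair_features(*query)])[0][1])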

    Voice Recognition and Mobility in the Legal Industry

    Get PDF
    In a typical legal work environment, attorneys work with their staff to generate and send case-related legal documents and communications. Traditionally, the attorney dictates to a device capable of recording audio, and the legal assistant transcribes the audio directly from the source. In the early days of recorded dictation, audio was saved to analog tape. Once the technology became available, dictation was saved digitally to flash memory and transferred to hard disk for playback by the legal assistant. It has been this way for years, but due to advances in voice recognition technology and computer processing, there are now alternative options to the traditional dictation/transcription process. The focus of this paper is to examine the traditional dictation/transcription process and how it compares to the process of using voice recognition software. An analysis of each process as well as an evaluation of voice recognition software will be performed. The document generation process will also be examined as it relates to transcription and creating a document, regardless of the content. The most efficient solution for a small to medium-sized law firm will be recommended. According to Understanding How Law Offices Do Business, a small law firm has between one and ten lawyers and a mid-size law firm has up to 50 lawyers; these firms are the target audience. The goal of this paper is to determine whether the use of voice recognition software can help attorneys and their staff be more efficient, and if so, which voice recognition software and methods work best. Tests will be performed analyzing both Dragon Naturally Speaking 12 Professional and Windows 7 voice recognition software on the desktop. The software with the higher accuracy rate in our tests will be used to evaluate voice recognition processes throughout this paper.

    Volume CXXII, Number 3, October 1, 2004

    Get PDF

    Comprehensive Evaluation of Machine Learning Experiments: Algorithm Comparison, Algorithm Performance and Inferential Reproducibility

    Get PDF
    This doctoral thesis addresses critical methodological aspects of machine learning experimentation, focusing on enhancing the evaluation and analysis of algorithm performance. The established "train-dev-test paradigm" commonly guides machine learning practitioners, involving nested optimization processes to tune model parameters and meta-parameters and benchmarking against test data. However, this paradigm overlooks crucial aspects, such as algorithm variability and the intricate relationship between algorithm performance and meta-parameters. This work introduces a comprehensive framework that employs statistical techniques to bridge these gaps, advancing the methodological standards in empirical machine learning research. The foundational premise of this thesis lies in differentiating between algorithms and classifiers, recognizing that an algorithm may yield multiple classifiers due to inherent stochasticity or design choices. Consequently, algorithm performance becomes inherently probabilistic and cannot be captured by a single metric. The contributions of this work are structured around three core themes.

    Algorithm Comparison: A fundamental aim of empirical machine learning research is algorithm comparison. To this end, the thesis proposes utilizing Linear Mixed Effects Models (LMEMs) for analyzing evaluation data. LMEMs offer distinct advantages by accommodating complex data structures beyond the typical independent and identically distributed (iid) assumption. Thus, LMEMs enable a holistic analysis of algorithm instances and facilitate the construction of nuanced conditional models of expected risk, supporting algorithm comparisons based on diverse data properties.

    Algorithm Performance Analysis: Contemporary evaluation practices often treat algorithms and classifiers as black boxes, hindering insights into their performance and parameter dependencies. Leveraging LMEMs, specifically by implementing Variance Component Analysis, the thesis introduces methods from psychometrics to quantify algorithm performance homogeneity (reliability) and assess the influence of meta-parameters on performance. The flexibility of LMEMs allows a granular analysis of this relationship and extends these techniques to analyze data annotation processes linked to algorithm performance.

    Inferential Reproducibility: Building upon the preceding chapters, this section showcases a unified approach to analyzing machine learning experiments comprehensively. By leveraging the full range of generated model instances, the analysis provides a nuanced understanding of competing algorithms. The outcomes offer implementation guidelines for algorithmic modifications and consolidate incongruent findings across diverse datasets, contributing to a coherent empirical perspective on algorithmic effects.

    This work underscores the significance of addressing algorithmic variability, meta-parameter impact, and the probabilistic nature of algorithm performance. By introducing robust statistical methodologies that facilitate extensive empirical analysis, the thesis aims to enhance the transparency, reproducibility, and interpretability of machine learning experiments. It extends beyond conventional guidelines, offering a principled approach to advance the understanding and evaluation of algorithms in the evolving landscape of machine learning and data science.
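
    A hedged illustration of the LMEM idea described above (not the thesis's full framework): fitting a linear mixed effects model to per-run evaluation scores with statsmodels, treating the algorithm as a fixed effect and the resampled data split as a random effect. The column names and simulated scores are made up for this sketch.

    # Fit an LMEM to repeated evaluation runs: fixed effect = algorithm, random
    # intercept = data split. This relaxes the iid assumption a plain t-test over
    # all runs would implicitly make.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(42)
    split_effects = rng.normal(0, 0.02, size=10)     # shared split difficulty
    rows = []
    for algorithm, base in [("baseline", 0.80), ("proposed", 0.83)]:
        for split in range(10):                      # 10 resampled train/test splits
            for _ in range(5):                       # 5 repeated runs per split
                score = base + split_effects[split] + rng.normal(0, 0.01)
                rows.append({"algorithm": algorithm, "split": split, "score": score})
    data = pd.DataFrame(rows)

    model = smf.mixedlm("score ~ algorithm", data, groups=data["split"]).fit()
    print(model.summary())

    The fixed-effect coefficient for the algorithm factor then estimates the performance difference while the random intercepts absorb split-level variability, which is the kind of conditional model of expected risk the thesis builds on.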