Machine Reading the Primeros Libros

Author: Hannah Alpert-Abrams
Publication venue: 'Modern Language Association'
Publication date: 01/01/2016

Early modern printed books pose particular challenges for automatic transcription: uneven inking, irregular orthographies, radically multilingual texts. As a result, modern efforts to transcribe these documents tend to produce the textual gibberish commonly known as "dirty OCR" (Optical Character Recognition). This noisy output is most frequently seen as a barrier to access for scholars interested in the computational analysis or digital display of transcribed documents. This article, however, proposes that a closer analysis of dirty OCR can reveal both historical and cultural factors at play in the practice of automatic transcription. To make this argument, it focuses on tools developed for the automatic transcription of the Primeros Libros collection of sixteenth-century Mexican printed books. By bringing together the history of the collection with that of the OCR tool, it illustrates how the colonial history of these documents is embedded in, and transformed by, the statistical models used for automatic transcription. It argues that automatic transcription, itself a mechanical and practical tool, also has an interpretive effect on transcribed texts that can have practical consequences for scholarly work.

Humanities Commons

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Author: Blase Jennifer
Chu Xu
Li Peng
Rao Xi
Zhang Ce
Zhang Yue
Publication date: 01/01/2020

Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of exactly how cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.

Comment: published in ICDE 202
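
The abstract names the Benjamini-Yekutieli (BY) procedure for controlling the false discovery rate across many hypothesis tests. As a rough illustration of the textbook step-up rule only, and not the CleanML code itself, the sketch below rejects the hypotheses with the k smallest p-values, where k is the largest rank satisfying p_(k) <= k * alpha / (m * c(m)) and c(m) = 1 + 1/2 + ... + 1/m. The function name, the default `alpha`, and the sample p-values are assumptions made for this example.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Generic BY step-up rule; names and defaults are illustrative
    assumptions, not taken from the CleanML study.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic-sum correction that makes the rule valid under
    # arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is under its threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_rank = rank

    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Made-up p-values, purely for demonstration.
    print(benjamini_yekutieli([0.001, 0.8, 0.01, 0.04]))
```

Compared with the Benjamini-Hochberg rule, the extra factor c(m) makes each threshold stricter, which is the price of not assuming independence between tests.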

arXiv.org e-Print Archive

Repository for Publications and Research Data

A Scalable and Extensible Framework for Superposition-Structured Models

Author: Xie Cong
Zhang Zhihua
Zhao Shenjian
Publication date: 02/03/2016

In many learning tasks, structural models usually lead to better interpretability and higher generalization performance. In recent years, however, simple structural models such as the lasso have frequently proved insufficient. Accordingly, there has been a lot of work on "superposition-structured" models, in which multiple structural constraints are imposed. To solve these "superposition-structured" statistical models efficiently, we develop a framework based on a proximal Newton-type method. Employing the smoothed conic dual approach with the LBFGS updating formula, we propose a scalable and extensible proximal quasi-Newton (SEP-QN) framework. Empirical analysis on various datasets shows that our framework is potentially powerful and achieves a super-linear convergence rate when optimizing some popular "superposition-structured" statistical models such as the fused sparse group lasso.

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Particle-Hole Symmetry and the Bose Glass to Superfluid Transition

Author: Mukhopadhyay Ranjan
Weichman Peter B.
Publication venue: 'American Physical Society (APS)'
Publication date: 15/04/1996

The generic Hamiltonian describing the zero temperature transition between the insulating Bose glass phase and the superfluid phase lacks particle-hole symmetry, but a statistical version of this symmetry is believed to be restored at the critical point. We show that the renormalization group relevance of particle-hole asymmetry may be explored in a controlled fashion only for small time dimensions, ετ ≪ 1, where we find a stable particle-hole asymmetric and an unstable particle-hole symmetric fixed point, but we provide evidence that the two merge for some finite ετ ≈ 2/3, which tends to confirm symmetry restoration at the physical ετ = 1.

Caltech Authors

Relational Approach to Knowledge Engineering for POMDP-based Assistance Systems as a Translation of a Psychological Model

Author: Czarnuch Stephen
Grzes Marek
Hoey Jesse
Jackson Dan
Khan Shehroz
Mihailidis Alex
Monk Andrew
Publication venue: 'Elsevier BV'
Publication date: 25/06/2012

Assistive systems for persons with cognitive disabilities (e.g. dementia) are difficult to build due to the wide range of different approaches people can take to accomplishing the same task, and the significant uncertainties that arise both from the unpredictability of clients' behaviours and from noise in sensor readings. Partially observable Markov decision process (POMDP) models have been used successfully as the reasoning engine behind such assistive systems for small multi-step tasks such as hand washing. POMDP models are a powerful yet flexible framework for modelling assistance that can deal with uncertainty and utility. Unfortunately, POMDPs usually require a very labour-intensive, manual procedure for their definition and construction. Our previous work has described a knowledge-driven method for automatically generating POMDP activity recognition and context-sensitive prompting systems for complex tasks. We call the resulting POMDP a SNAP (SyNdetic Assistance Process). The spreadsheet-like result of the analysis does not correspond to the POMDP model directly, so a translation to a formal POMDP representation is required. To date, this translation had to be performed manually by a trained POMDP expert. In this paper, we formalise and automate this translation process using a probabilistic relational model (PRM) encoded in a relational database. We demonstrate the method by eliciting three assistance tasks from non-experts. We validate the resulting POMDP models using case-based simulations to show that they are reasonable for the domains. We also show a complete case study of a designer specifying one database, including an evaluation in a real-life experiment with a human actor.
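
For readers unfamiliar with how a POMDP copes with noisy sensor readings, the standard Bayesian belief update can be sketched as follows. This is the generic textbook update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s), and not the SNAP or PRM translation described in the abstract. The two-state hand-washing model, its probabilities, and all names in the code are hypothetical and chosen only for illustration.

```python
def belief_update(belief, transition, observation, action, obs):
    """Generic POMDP belief update (illustrative sketch).

    belief[s]                      : current probability of state s
    transition[action][s][s_next]  : T(s_next | s, action)
    observation[action][s_next][o] : O(o | s_next, action)
    """
    n = len(belief)
    unnormalized = []
    for s_next in range(n):
        # Prediction step: where could the action have taken us?
        predicted = sum(
            transition[action][s][s_next] * belief[s] for s in range(n)
        )
        # Correction step: weight by how well s_next explains the reading.
        unnormalized.append(observation[action][s_next][obs] * predicted)

    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [value / total for value in unnormalized]


if __name__ == "__main__":
    # Hypothetical toy model: state 0 = hands not yet washed,
    # state 1 = hands washed; action 0 = issue a prompt;
    # observation 1 = sensor reports water running.
    T = [[[0.3, 0.7], [0.0, 1.0]]]
    O = [[[0.8, 0.2], [0.1, 0.9]]]
    print(belief_update([1.0, 0.0], T, O, action=0, obs=1))
```

The update keeps a probability distribution over hidden states rather than a single guess, which is what lets a prompting system act sensibly when sensors are unreliable.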

arXiv.org e-Print Archive

Crossref

Kent Academic Repository

Multibaseline gravitational wave radiometry

Author: C. W. Helstrom
Dipongkar Talukder
Sanjit Mitra
Sukanta Bose
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2011

We present a statistic for the detection of stochastic gravitational wave backgrounds (SGWBs) using radiometry with a network of multiple baselines. We also quantitatively compare the sensitivities of existing baselines and their network to SGWBs. We assess how the measurement accuracy of signal parameters, e.g., the sky position of a localized source, can improve when using a network of baselines, as compared to any of the single participating baselines. The search statistic itself is derived from the likelihood ratio of the cross correlation of the data across all possible baselines in a detector network and is optimal in Gaussian noise. Specifically, it is the likelihood ratio maximized over the strength of the SGWB, and is called the maximized-likelihood ratio (MLR). One of the main advantages of using the MLR over past search strategies for inferring the presence or absence of a signal is that the former does not require the deconvolution of the cross correlation statistic. Therefore, it does not suffer from errors inherent to the deconvolution procedure and is especially useful for detecting weak sources. In the limit of a single baseline, it reduces to the detection statistic studied by Ballmer [Class. Quant. Grav. 23, S179 (2006)] and Mitra et al. [Phys. Rev. D 77, 042002 (2008)]. Unlike past studies, here the MLR statistic enables us to compare quantitatively the performances of a variety of baselines searching for an SGWB signal in (simulated) data. Although we use simulated noise and SGWB signals for making these comparisons, our method can be straightforwardly applied to real data.

Comment: 17 pages and 19 figures

arXiv.org e-Print Archive

Crossref

Caltech Authors

MPG.PuRe