
Delete or merge regressors for linear model selection

Author: Maj-Kańska Aleksandra
Pokarowski Piotr
Prochenka Agnieszka
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2015

We consider the problem of linear model selection in the presence of both continuous and categorical predictors. Feasible models consist of subsets of numerical variables and partitions of the levels of factors. We present a new algorithm called delete or merge regressors (DMR), a stepwise backward procedure that ranks the predictors according to squared t-statistics and chooses the final model by minimizing BIC. In the article we prove consistency of DMR when the number of predictors tends to infinity with the sample size and describe a simulation study using the accompanying R package. The results indicate a significant advantage in time complexity and selection accuracy of our algorithm over Lasso-based methods described in the literature. Moreover, a version of DMR for generalized linear models is proposed.
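The last step of the procedure, choosing the BIC-minimizing model along a nested path of candidates, can be sketched as follows. This is only an illustration, not the DMR implementation from the R package: the Gaussian linear-model form of BIC, n·ln(RSS/n) + k·ln(n), the function names, and the toy path of residual sums of squares are all assumptions made for the example.

```python
import math

def bic(rss, n, k):
    """BIC of a Gaussian linear model, up to an additive constant.

    rss: residual sum of squares, n: sample size, k: number of parameters.
    """
    return n * math.log(rss / n) + k * math.log(n)

def select_by_bic(path, n):
    """Return the (label, rss, k) entry with the smallest BIC."""
    return min(path, key=lambda model: bic(model[1], n, model[2]))

# Hypothetical nested path: each step deletes a variable or merges two
# factor levels, so k drops by one while RSS can only grow.
path = [
    ("full model", 100.0, 6),
    ("delete x5", 100.5, 5),
    ("merge levels a and b", 101.0, 4),
    ("delete x2", 180.0, 3),
]

best = select_by_bic(path, n=200)
print(best[0])
```

In this toy path the first two simplifications cost almost nothing in fit, so the penalty term favours them, while the last deletion raises the RSS too much to be worth the saved parameter.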

arXiv.org e-Print Archive

An Evaluation of the X10 Programming Language

Author: Guo Xiu
Publication venue: SJSU ScholarWorks
Publication date: 01/10/2012

As predicted by Moore's law, the number of transistors on a chip has doubled approximately every two years. For many years, the extra transistors massively benefited the whole computer industry, because they were used to increase CPU clock speed and thus boost performance. However, due to the heat wall and power constraints, the clock speed cannot be increased without limit. Hardware vendors now have to take a different path: using the transistors to increase the number of processor cores on each chip. This structural change in hardware presents inevitable challenges to software structure, since software targeted at a single thread will not benefit from newer chips and may even suffer from lower clock speeds. The two fundamental challenges are: 1. How to deal with the stagnation of single-core clock speed and cache memory. 2. How to utilize the additional computing power provided by more cores on a chip. Most programming languages nowadays, such as C and Java, have distributed computing support [1]. Meanwhile, some new programming languages were designed from scratch to take advantage of the more distributed hardware structures. The X10 Programming Language is one of them. The goal of this project is to evaluate X10 in terms of performance, programmability and tool support.

SJSU ScholarWorks

Merging Sorted Lists of Similar Strings

Author: Myers Gene
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Publication date: 01/01/2023

Dagstuhl Research Online Publication Server

Iterative Optimization and Simplification of Hierarchical Clusterings

Author: Fisher D.
Publication date: 01/01/1995

Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a 'tentative' clustering for initial examination, followed by iterative optimization, which continues to search in the background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been constructed it is judged by analysts, often according to task-specific criteria. Several authors have abstracted these criteria and posited a generic performance task akin to pattern completion, where the error rate over completed patterns is used to 'externally' judge clustering utility. Given this performance task, we adapt resampling-based pruning strategies used by supervised learning systems to the task of simplifying hierarchical clusterings, thus promising to ease post-clustering analysis. Finally, we propose a number of objective functions, based on attribute-selection measures for decision-tree induction, that might perform well on the error rate and simplicity dimensions.

arXiv.org e-Print Archive

CiteSeerX

Colouring flags with Dafny & Idris

Author: de Muijnck-Hughes Jan
Noble James
Publication date: 14/01/2024

Dafny and Idris are two verification-aware programming languages that support two different styles of fine-grained reasoning about software programs. Dafny is an imperative design-by-contract language that provides a clear separation between specifications and code, while Idris is a dependently-typed functional language in which specifications are code. Each of these approaches supports a different style of verification (Hoare Logic in Dafny versus Dependent Type Theory in Idris). In this paper, we examine how Dafny and Idris express The Problem of the Dutch National Flag from Dijkstra's A Discipline of Programming and note the differences and similarities between the two approaches.
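For readers unfamiliar with the problem, the classic three-way partition that solves it can be sketched in plain Python. This is neither the Dafny nor the Idris encoding discussed in the paper; the function name and the colour strings are assumptions made for the example, and the loop invariant is only stated in a comment, not verified.

```python
def dutch_flag(colours):
    """Rearrange a list of 'red', 'white' and 'blue' in place, by swaps only."""
    lo, mid, hi = 0, 0, len(colours) - 1
    # Invariant: colours[:lo] are red, colours[lo:mid] are white,
    # colours[hi + 1:] are blue, and colours[mid:hi + 1] are unexamined.
    while mid <= hi:
        if colours[mid] == "red":
            colours[lo], colours[mid] = colours[mid], colours[lo]
            lo += 1
            mid += 1
        elif colours[mid] == "white":
            mid += 1
        else:
            # Anything else is treated as blue and moved to the right end.
            colours[mid], colours[hi] = colours[hi], colours[mid]
            hi -= 1
    return colours

flag = dutch_flag(["blue", "white", "red", "white", "blue", "red"])
print(flag)
```

A verification-aware language would turn the commented invariant into a checked specification (a loop invariant in Dafny, a type-level property in Idris), which is the contrast the paper explores.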

Hierarchical Entity Resolution using an Oracle

Author: Barna Saha
Divesh Srivastava
Donatella Firmani
Sainyam Galhotra
Publication place: New York
Publication date: 01/01/2022

In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.
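The triplet comparison query described in the abstract, which takes three records and returns the most similar pair(s), can be illustrated with a simulated oracle. This is a sketch under stated assumptions, not the HierER querying strategy: the ground-truth table `TRUTH`, the record names, and the notion of similarity as shared depth in a two-level hierarchy are all hypothetical.

```python
from itertools import combinations

# Hypothetical ground truth: record -> (type, entity) path in the hierarchy.
TRUTH = {
    "r1": ("phone", "iphone-12"),
    "r2": ("phone", "iphone-12"),
    "r3": ("phone", "galaxy-s21"),
    "r4": ("laptop", "thinkpad-x1"),
}

def shared_depth(a, b):
    """Number of hierarchy levels two records share, counted from the root."""
    depth = 0
    for x, y in zip(TRUTH[a], TRUTH[b]):
        if x != y:
            break
        depth += 1
    return depth

def triplet_oracle(a, b, c):
    """Return the most similar pair(s) among three records (ties allowed)."""
    pairs = list(combinations((a, b, c), 2))
    best = max(shared_depth(p, q) for p, q in pairs)
    return [pair for pair in pairs if shared_depth(*pair) == best]

print(triplet_oracle("r1", "r2", "r3"))
print(triplet_oracle("r1", "r3", "r4"))
```

Each answer reveals which two of the three records sit closest in the tree: duplicates beat same-type records, which in turn beat records of different types.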

Archivio della ricerca- Università di Roma La Sapienza