LexFindR: A fast, simple, and extensible R package for finding similar words in a lexicon
Published 30 September 2021

Language scientists often need to generate lists of related words, such as potential competitors. They may do this for purposes
of experimental control (e.g., selecting items matched on lexical neighborhood but varying in word frequency), or to test
theoretical predictions (e.g., hypothesizing that a novel type of competitor may impact word recognition). Several online
tools are available, but most are constrained to a fixed lexicon and fixed sets of competitor definitions, and may not give the
user full access to or control of source data. We present LexFindR, an open-source R package that can be easily modified
to include additional, novel competitor types. LexFindR is easy to use. Because it can leverage multiple CPU cores and
uses vectorized code when possible, it is also extremely fast. In this article, we present an overview of LexFindR usage,
illustrated with examples. We also explain the details of how we implemented several standard lexical competitor types used
in spoken word recognition research (e.g., cohorts, neighbors, embeddings, rhymes), and show how “lexical dimensions”
(e.g., word frequency, word length, uniqueness point) can be integrated into LexFindR workflows (for example, to calculate
“frequency-weighted competitor probabilities”), for both spoken and visual word recognition research.

This work was supported in part by U.S. National
Science Foundation grants PAC 1754284 (JM, PI) and IGE NRT
1747486 (JM, PI). The authors are solely responsible for the content
of this article. This work was also supported in part by the Basque
Government through the BERC 2018-2021 program, and by the
Agencia Estatal de Investigación through BCBL Severo Ochoa
excellence accreditation SEV-2015-0490.
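To illustrate one of the standard competitor types mentioned above, here is a minimal sketch of the “neighbor” definition (words one substitution, addition, or deletion away from a target, i.e., edit distance 1). LexFindR itself is an R package; the function names below are hypothetical illustrations, not its actual API.

```python
# Hypothetical sketch of the "neighbor" competitor type (edit distance 1).
# Not the actual LexFindR implementation, which is written in R.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def neighbors(target: str, lexicon: list[str]) -> list[str]:
    """Words exactly one edit away from the target."""
    return [w for w in lexicon
            if w != target and edit_distance(target, w) == 1]

lexicon = ["cat", "cab", "bat", "cart", "dog", "at"]
print(neighbors("cat", lexicon))  # ['cab', 'bat', 'cart', 'at']
```

The same scaffold extends naturally to other competitor types, e.g., cohorts (shared onset) or rhymes (shared offset), by swapping the predicate inside `neighbors`.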
FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image Segmentation against Heterogeneous Annotation Noise
Federated learning (FL) has emerged as a promising paradigm for training
segmentation models on decentralized medical data, owing to its
privacy-preserving property. However, existing research overlooks the prevalent
annotation noise encountered in real-world medical datasets, which limits the
performance ceilings of FL. In this paper, we, for the first time, identify and
tackle this problem. For problem formulation, we propose a contour evolution
for modeling non-independent and identically distributed (Non-IID) noise across
pixels within each client and then extend it to the case of multi-source data
to form a heterogeneous noise model (i.e., Non-IID annotation noise across
clients). For robust learning from annotations with such two-level Non-IID
noise, we emphasize the importance of data quality in model aggregation,
allowing high-quality clients to have a greater impact on FL. To achieve this,
we propose Federated learning with Annotation quAlity-aware AggregatIon, named
FedA3I, by introducing a quality factor based on client-wise noise estimation.
Specifically, noise estimation at each client is accomplished through the
Gaussian mixture model and then incorporated into model aggregation in a
layer-wise manner to up-weight high-quality clients. Extensive experiments on
two real-world medical image segmentation datasets demonstrate the superior
performance of FedA3I against state-of-the-art approaches in dealing
with cross-client annotation noise. The code is available at
https://github.com/wnn2000/FedAAAI.

Comment: Accepted at AAAI'2
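The quality-aware aggregation idea can be sketched as follows. The quality factor used here (one minus an estimated noise rate) and the flat, non-layer-wise weighting are simplifying assumptions for illustration; the paper derives its quality factor from a Gaussian mixture model and applies it layer-wise.

```python
# Hedged sketch of quality-weighted federated averaging: clients whose
# annotations are estimated to be cleaner receive larger aggregation
# weights. Illustrative only, not the FedA3I formulation.

def quality_weighted_average(client_params, noise_rates):
    """Average per-client parameter vectors, up-weighting low-noise clients."""
    qualities = [1.0 - r for r in noise_rates]   # assumed quality factor
    total = sum(qualities)
    weights = [q / total for q in qualities]     # normalize to sum to 1
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params))
            for i in range(dim)]

params = [[1.0, 2.0], [3.0, 4.0]]
noise = [0.1, 0.5]   # client 0 has cleaner annotations, so more influence
print(quality_weighted_average(params, noise))
```

With equal noise rates this reduces to plain FedAvg; as one client's estimated noise grows, the aggregate shifts toward the cleaner clients.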
Efficient k-means++ approximation with MapReduce
Published · Journal Article

k-means is undoubtedly one of the most popular clustering algorithms, owing to its simplicity and efficiency. However, the algorithm is highly sensitive to the chosen initial centers, so a proper initialization is crucial for obtaining a good solution. To address this problem, k-means++ chooses the centers sequentially so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of the data increases. To improve its scalability and efficiency, this paper presents a MapReduce k-means++ method that drastically reduces the number of MapReduce jobs: a single job suffices to obtain the k centers. The k-means++ initialization algorithm is executed in the Mapper phase, and the weighted k-means++ initialization algorithm is run in the Reducer phase. Because this new MapReduce k-means++ method replaces iterations among multiple machines with computation on a single machine, it significantly reduces communication and I/O costs. We also prove that the proposed MapReduce k-means++ method obtains an O(α²) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that avoids a large number of redundant distance computations. Extensive experiments on real and synthetic data indicate that the proposed MapReduce k-means++ method is much more efficient while achieving a good approximation.

This work was supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 61225010, the National Natural Science Foundation of China (Nos. 61173162, 61173165, 61370199, 61300187, 61300189, and 61370198), the New Century Excellent Talents program (No. NCET-10-0095), and the Fundamental Research Funds for the Central Universities (Nos. 2013QN044 and 2012TD008).
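For reference, the sequential k-means++ seeding that the paper parallelizes can be sketched as follows: each new center is sampled with probability proportional to its squared distance from the nearest center already chosen. This is a sketch of the classic algorithm (Arthur and Vassilvitskii, 2007), not the authors' MapReduce implementation.

```python
# Illustrative k-means++ seeding: sample each new center with probability
# proportional to D(x)^2, the squared distance to the nearest chosen center.
import random

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]          # first center: uniform at random
    while len(centers) < k:
        # D(x)^2 for every point against the centers chosen so far
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # weighted sampling: walk the cumulative distribution
        r = rng.random() * sum(d2)
        cum = 0.0
        for p, w in zip(points, d2):
            cum += w
            if cum > r:                     # zero-weight points never selected
                centers.append(p)
                break
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(kmeans_pp_init(pts, 2))
```

In the paper's scheme, Mappers run this seeding on local partitions and the Reducer reruns a weighted variant of it over the Mappers' candidate centers, so only one MapReduce job is needed.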