354 research outputs found

A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"

Authors: Jia Ruoxi; Wang Jiachen T.
Publication date: 09/04/2023

Data valuation is a growing research field that studies the influence of individual data points on machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers/regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality-sensitive hashing (LSH). Our experimental results demonstrate that soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.

arXiv.org e-Print Archive
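
To illustrate why the KNN case mentioned in the abstract above is cheap to compute, here is a minimal sketch of the recursion usually attributed to the original hard-label KNN-Shapley of Jia et al. (2019), for a single test point. It is not the soft-label variant proposed in the note, whose utility function is not spelled out in this listing. The function name, argument names, and the use of squared Euclidean distance are assumptions made for illustration.

```python
def knn_shapley_single_test(train_x, train_y, test_x, test_y, k):
    """Hard-label KNN-Shapley values for one test point (illustrative sketch).

    train_x: list of feature vectors (lists of floats); train_y: list of labels.
    Returns one value per training point, in the original training order.
    """
    n = len(train_x)

    # Rank training points from nearest to farthest (squared Euclidean distance).
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], test_x))

    order = sorted(range(n), key=sq_dist)

    # match[r] is 1.0 when the r-th nearest point has the test label, else 0.0.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest point.
    s = match[-1] / n
    values[order[-1]] = s

    # Walk from the second-farthest point back to the nearest one.
    for rank in range(n - 2, -1, -1):
        pos = rank + 1  # 1-indexed position in the sorted order
        s = s + (match[rank] - match[rank + 1]) / k * min(k, pos) / pos
        values[order[rank]] = s
    return values


# Tiny usage example with three 1-D training points and K = 1.
example_values = knn_shapley_single_test(
    [[0.0], [1.0], [2.0]], [1, 0, 1], [0.1], 1, k=1
)
```

After the sort, a single linear pass yields all values, so the cost per test point is dominated by sorting. The values sum to the utility of the full training set, which is a quick sanity check on any implementation.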

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Authors: Jia Ruoxi; Wang Jiachen T.
Publication date: 18/12/2023

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
Comment: AISTATS 2023 Oral

arXiv.org e-Print Archive
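
The Maximum Sample Reuse idea named in the abstract above can be sketched as follows: every sampled subset is used to update the estimate for every data point, by comparing the average utility of subsets that contain the point with the average utility of subsets that do not. This is a minimal sketch under that reading, not the authors' implementation. The names `banzhaf_msr` and `sample_subsets` and the `utility` callback are hypothetical, and in practice `utility` would train and score a model on the given subset.

```python
import random


def sample_subsets(n, m, seed=0):
    """Draw m subsets of {0, ..., n-1}; each index is included with probability 1/2."""
    rng = random.Random(seed)
    return [
        frozenset(i for i in range(n) if rng.random() < 0.5) for _ in range(m)
    ]


def banzhaf_msr(n, utility, subsets):
    """Estimate Banzhaf values, reusing every evaluated subset for every point.

    utility: callable taking a frozenset of indices and returning a float.
    subsets: iterable of frozensets, e.g. from sample_subsets.
    """
    sum_in = [0.0] * n   # total utility of subsets containing point i
    cnt_in = [0] * n
    sum_out = [0.0] * n  # total utility of subsets not containing point i
    cnt_out = [0] * n

    for s in subsets:
        u = utility(s)  # one utility evaluation per subset
        for i in range(n):
            if i in s:
                sum_in[i] += u
                cnt_in[i] += 1
            else:
                sum_out[i] += u
                cnt_out[i] += 1

    estimates = []
    for i in range(n):
        mean_in = sum_in[i] / cnt_in[i] if cnt_in[i] else 0.0
        mean_out = sum_out[i] / cnt_out[i] if cnt_out[i] else 0.0
        estimates.append(mean_in - mean_out)
    return estimates


# Usage with a toy additive utility (an assumption for illustration only).
weights = [1.0, 2.0, -0.5]
estimates = banzhaf_msr(
    3, lambda s: sum(weights[i] for i in s), sample_subsets(3, 200)
)
```

With an additive toy utility the true Banzhaf value of each point equals its weight, so the estimates should approach `weights` as the number of sampled subsets grows.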

Host-guest interaction in P2P accommodation under the epidemic: Motivations, behavior and influences

Authors: Duan Ruoxi; Shi Musha; Wang Sujie; Wang Xinke
Publication venue: ScholarWorks@UMass Amherst
Publication date: 10/07/2022

The epidemic has reshaped tourists' demands and changed the way they interact with others. Using qualitative grounded theory and taking the tourists' perspective, this study develops a framework that illustrates the driving factors, other external factors, the new content of host-guest interaction, and the outcomes of such interaction in the context of the epidemic. The results show that seeking meaningful interpersonal relationships and relieving depression and anxiety are the driving factors for guests to engage in face-to-face interactions. Furthermore, guests' perceptions of the hygienic attributes of P2P accommodation during their stay influenced their infection risk perceptions. Guests with high risk perceptions preferred contactless interaction or no interaction with the host, while those with lower risk perceptions chose face-to-face interaction. Both face-to-face and contactless interaction enhance guests' psychological capital and lead to tourist citizenship behaviors. The theoretical and practical implications are discussed as well.

ScholarWorks@UMass Amherst

Justifying a privacy guardian in discourse and behaviour: the People’s Republic of China’s strategic framing in data governance

Authors: Lei Yaxiong; Wang Ruoxi; Zhang Chi
Publication date: 19/02/2024

The People’s Republic of China’s (PRC) approach to data governance, centred on data sovereignty, is much debated in academic literature. However, it remains unclear how the PRC’s different state actors justify this approach. Based on an analysis of the discourse and behaviour of the PRC’s state actors through strategic framing theory, their role as a privacy guardian can arguably be described as strategically constructed. The Chinese government and legislative bodies have tailored their communications to present themselves as champions of individual privacy, aiming to secure support for state policies. This strategic framing encompasses four mechanisms: the reframing of privacy threats through political narratives; legal ambiguities; selective framing; and the implementation of censorship to influence public discourse. An examination of how the Chinese government responded differently to data breaches in the cases of Didi and the Shanghai National Police Database leak highlights the Chinese government’s efforts in maintaining framing consistency to construct itself as a guardian, rather than a violator, of individual privacy.
Peer reviewed

University of St. Andrews - Pure

St Andrews Research Repository

Étude de la réponse dosimétrique du Nitrure de Gallium (GaN) : modélisation, simulation et caractérisation pour la radiothérapie

Author: Wang Ruoxi
Publication venue: HAL CCSD
Publication date: 27/05/2015

This thesis aims to improve the measurement precision of dosimetry based on the Gallium Nitride (GaN) transducer and to develop its applications in radiotherapy. The study covers the modeling, simulation and characterization of the transducer's response in external radiotherapy and brachytherapy. In modeling, we propose two approaches to model the GaN transducer’s response in external radiotherapy. In the first approach, a model is built from experimental data while separating the primary and scattered components of the beam. In the second approach, we adapt a response model initially developed for silicon diodes to the GaN radioluminescent transducer. We also propose an original concept of bi-media dosimetry, which evaluates the dose in tissue from the different responses of two media without prior information on the irradiation conditions. This concept has been demonstrated by Monte Carlo simulation. Moreover, for High Dose Rate (HDR) brachytherapy, the response of the GaN transducer irradiated by iridium 192 and cobalt 60 sources has been evaluated by Monte Carlo simulation and confirmed by measurements. Studies characterizing the properties of the GaN radioluminescent transducer have been carried out with these sources as well. An instrumented phantom prototype with GaN probes has been developed for HDR brachytherapy quality control. It allows real-time verification of the physical parameters of a treatment (source dwell position, source dwell time, source activity).

Thèses en Ligne

HAL Descartes

Hal-Diderot