
20 research outputs found

The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development

Authors: Fournier-Tombs Eleonore, Marwala Tshilidzi, Stinckwich Serge
Publication date: 31/08/2023

In the current data-driven era, synthetic data, artificially generated data that resembles the characteristics of real-world data without containing actual personal information, is gaining prominence. This is due to its potential to safeguard privacy, increase the availability of data for research, and reduce bias in machine learning models. This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data. Synthetic data can be a powerful instrument for protecting the privacy of individuals, but it also presents challenges, such as ensuring its quality and authenticity. A well-crafted synthetic data policy must strike a balance between privacy concerns and the utility of data, ensuring that it can be utilized effectively without compromising ethical or legal standards. Organizations and institutions must develop standardized guidelines and best practices in order to capitalize on the benefits of synthetic data while addressing its inherent challenges.

arXiv.org e-Print Archive

Improving the performance of the Ripper in insurance risk classification: a comparative study using feature selection

Authors: Duma Mlungisi, Marwala Tshilidzi, Twala Bhekisipho
Publication venue: arXiv.org
Publication date: 01/01/2011

The Ripper algorithm is designed to generate rule sets for large datasets with many features. However, its classification performance has been shown to deteriorate as data quality declines with an increasing proportion of missing data. In this paper, feature selection is used to help improve the classification performance of the Ripper model. Two techniques are applied: principal component analysis and evidence automatic relevance determination. They are compared to determine which helps the algorithm improve the most. Training datasets with completely observable data were used to construct the model, and testing datasets with missing values were used to measure accuracy. The results showed that principal component analysis is the better feature selection technique for improving the classification performance of the Ripper.

University of Johannesburg Institutional Repository

CADAQUES: Metodika pro komplexní řízení kvality dat a informací [CADAQUES: A Methodology for Comprehensive Data and Information Quality Management]

Author: David
Publication venue: University of Economics, Prague
Publication date: 01/06/2014

The present era is characterized by an ever-increasing volume of data being acquired and processed. The aim of this article is to point out the diversity of the data sources currently in use, to show their specifics from a quality management perspective, and to present a concrete methodology that enables data and information quality to be managed across these sources. The main component of the methodology is a set of basic principles and activities that can be applied universally. One of its key recommendations is to focus on a relatively small set of data properties that can be managed effectively. The methodology also includes a data source maturity model, which serves to assess the level of risk associated with using a particular source.

Directory of Open Access Journals