A classification of data quality assessment and improvement methods
Data quality (DQ) assessment and improvement in larger
information systems would often not be feasible without using suitable “DQ
methods”, which are algorithms that can be automatically executed by
computer systems to detect and/or correct problems in datasets. Currently, these
methods are already essential, and they will be of even greater importance as
the quantity of data in organisational systems grows. This paper provides a
review of existing methods for both DQ assessment and improvement and
classifies them according to the DQ problem and problem context. Six gaps
have been identified in the classification, where no current DQ methods exist,
and these show where new methods are required as a guide for future research
and DQ tool development. This is the accepted manuscript. It's currently embargoed pending publication by Inderscience.
Towards Sweetness Classification of Orange Cultivars Using Short‑Wave NIR Spectroscopy
The global orange industry constantly faces new technical challenges to meet consumer demands for quality fruits. Interest in the horticulture industry has shifted from traditional subjective fruit quality assessment towards objective, quantitative, and non-destructive methods. Oranges have a thick peel, which makes their non-destructive quality assessment challenging. This paper evaluates the potential of short-wave NIR spectroscopy and a direct sweetness classification approach for Pakistani cultivars of orange, i.e., Red-Blood, Mosambi, and Succari. The correlation between quality indices, i.e., Brix, titratable acidity (TA), Brix:TA, and BrimA (Brix minus acids), sensory assessment of the fruit, and short-wave NIR spectra is analysed. Mixed-cultivar oranges are classified as sweet, mixed, and acidic based on short-wave NIR spectra. Short-wave NIR spectral data were obtained using the industry standard F-750 fruit quality meter (310–1100 nm). Reference Brix and TA measurements were taken using standard destructive testing methods. Reference taste labels, i.e., sweet, mixed, and acidic, were acquired through sensory evaluation of samples. For indirect fruit classification, partial least squares regression models were developed for Brix, TA, Brix:TA, and BrimA estimation, with correlation coefficients of 0.57, 0.73, 0.66, and 0.55, respectively, on independent test data. For direct fruit classification, the ensemble classifier achieved 81.03% accuracy for three-class (sweet, mixed, and acidic) classification on independent test data. NIR spectra correlate better with sensory assessment than with the quality indices. A direct classification approach is more suitable for machine-learning-based orange sweetness classification using NIR spectroscopy than the estimation of quality indices.
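As a rough illustration of the quality indices named in this abstract, the sketch below computes Brix:TA and BrimA for one sample. The abstract only describes BrimA as "Brix minus acids"; the weighted form Brix - k * TA and the value of k used here are assumptions, as are the sample values and function names.

```python
def brix_ta_ratio(brix: float, ta: float) -> float:
    """Sugar-to-acid ratio (Brix:TA)."""
    if ta <= 0:
        raise ValueError("titratable acidity must be positive")
    return brix / ta


def brima(brix: float, ta: float, k: float) -> float:
    """BrimA as Brix minus weighted acidity.

    The weight k is an assumption; the abstract does not state it.
    """
    return brix - k * ta


# Hypothetical sample values, for illustration only.
sample_brix = 12.0  # degrees Brix
sample_ta = 0.8     # titratable acidity, percent

ratio = brix_ta_ratio(sample_brix, sample_ta)
index = brima(sample_brix, sample_ta, k=4.0)  # k = 4.0 is an assumed weight
print(f"Brix:TA = {ratio:.2f}, BrimA = {index:.2f}")
```

Indices like these are the regression targets of the indirect approach, whereas the direct approach predicts the taste label from the spectra without computing them.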
Recent development in electronic nose data processing for beef quality assessment
Beef is a perishable food that decays easily. Hence, a rapid system for beef quality assessment is needed to guarantee the quality of beef. In the last few years, electronic noses (e-noses) have been developed for beef spoilage detection. In this paper, we discuss the challenges of applying e-noses to beef quality assessment, especially in e-nose data processing. We also provide a summary of our previous studies, which explain several methods for dealing with gas sensor noise, the sensor array optimization problem, beef quality classification, and prediction of the microbial population in beef samples. This paper might be useful for researchers and practitioners to understand the challenges and methods of e-nose data processing for beef quality assessment.
Large-scale nonlinear dimensionality reduction for network intrusion detection
Network intrusion detection (NID) is a complex classification problem. In this paper, we combine classification with recent and scalable nonlinear dimensionality reduction (NLDR) methods. Classification and DR are not necessarily adversarial, provided that adequate cluster magnification occurs, as in NLDR methods like t-SNE: DR mitigates the curse of dimensionality, while cluster magnification can maintain class separability. We demonstrate experimentally the effectiveness of the approach by analyzing and comparing results on the big KDD99 dataset, using both NLDR quality assessment and classification rate for SVMs and random forests. Since the data involve features of mixed types (numerical and categorical), the use of Gower's similarity coefficient as a metric further improves the results over the classical similarity metric.
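To make the mixed-type metric mentioned in this abstract concrete, here is a minimal sketch of Gower's similarity coefficient for two records. It assumes unweighted features and no missing values; the record layout and the feature ranges are illustrative and are not taken from the paper.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_similarity(
    a: Sequence[Value],
    b: Sequence[Value],
    ranges: Sequence[Optional[float]],
) -> float:
    """Unweighted Gower similarity between two records.

    ranges[k] is the range (max - min) of numerical feature k over the
    dataset, or None when feature k is categorical.
    """
    scores = []
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: exact match scores 1, otherwise 0.
            scores.append(1.0 if x == y else 0.0)
        elif r == 0:
            # Constant numerical feature: all records agree on it.
            scores.append(1.0)
        else:
            # Numerical feature: 1 minus the range-normalised distance.
            scores.append(1.0 - abs(x - y) / r)
    return sum(scores) / len(scores)


# Illustrative records: (duration, protocol type, source bytes).
rec_a = (0.0, "tcp", 200.0)
rec_b = (10.0, "tcp", 700.0)
feature_ranges = (100.0, None, 1000.0)  # assumed dataset-wide ranges

similarity = gower_similarity(rec_a, rec_b, feature_ranges)
print(round(similarity, 3))
```

The corresponding dissimilarity, 1 minus the similarity, is what a distance-based method would consume.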
Multisource and temporal variability in Portuguese hospital administrative datasets: Data quality implications
Background: Unexpected variability across healthcare datasets may indicate data quality issues and thereby affect the credibility of these data for reutilization. No gold-standard reference dataset or methods for variability assessment are usually available for these datasets. In this study, we aim to describe the process of discovering data quality implications by applying a set of methods for assessing variability between sources and over time in a large hospital database. Methods: We described and applied a set of multisource and temporal variability assessment methods in a large Portuguese hospitalization database, in which variation in condition-specific hospitalization ratios derived from clinically coded data was assessed between hospitals (sources) and over time. We identified condition-specific admissions using the Clinical Classifications Software (CCS), developed by the Agency for Healthcare Research and Quality. A Statistical Process Control (SPC) approach based on funnel plots of condition-specific standardized hospitalization ratios (SHR) was used to assess multisource variability, whereas temporal heat maps and Information-Geometric Temporal (IGT) plots were used to assess temporal variability by displaying abrupt temporal changes in data distributions. Results were presented for the 15 most common inpatient conditions (CCS) in Portugal. Main findings: Funnel plot assessment allowed the detection of several outlying hospitals whose SHRs were much lower or higher than expected. Adjusting SHR for hospital characteristics, beyond age and sex, considerably affected the degree of multisource variability for most diseases. Overall, probability distributions changed over time for most diseases, although heterogeneously.
Abrupt temporal changes in data distributions for acute myocardial infarction and congestive heart failure coincided with the periods comprising the transition to the International Classification of Diseases, 10th revision, Clinical Modification, whereas changes in the Diagnosis-Related Groups software seem to have driven changes in data distributions for both acute myocardial infarction and liveborn admissions. The analysis of heat maps also allowed the detection of several discontinuities at hospital level over time, in some cases also coinciding with the aforementioned factors. Conclusions: This paper described the successful application of a set of reproducible, generalizable and systematic methods for variability assessment, including visualization tools that can be useful for detecting abnormal patterns in healthcare data, also addressing some limitations of common approaches. The presented method for multisource variability assessment is based on SPC, which is an advantage considering the lack of a gold standard for such a process. Properly controlling for hospital characteristics and differences in case-mix when estimating SHR is critical for isolating data quality-related variability among data sources. The use of IGT plots provides an advantage over common methods for temporal variability assessment due to its suitability for multitype and multimodal data, which are common characteristics of healthcare data. The novelty of this work is the use of a set of methods to discover new data quality insights in healthcare data. The authors would like to thank the Central Authority for Health Services, I.P. (ACSS) for providing access to the data.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financed by FEDER-Fundo Europeu de Desenvolvimento Regional funds through the COMPETE 2020-Operacional Programme for Competitiveness and Internationalisation (POCI) and by Portuguese funds through FCT-Fundacao para a Ciencia e a Tecnologia in the framework of the project POCI-01-0145-FEDER-030766 ("1st.IndiQare-Quality indicators in primary health care: validation and implementation of quality indicators as an assessment and comparison tool"). In addition, we would like to thank the projects GEMA (SBPLY/17/180501/000293), Generation and Evaluation of Models for Data Quality, and ADAGIO (SBPLY/21/180501/000061), Alarcos Data Governance framework and systems generation, both funded by the Department of Education, Culture and Sports of the JCCM and FEDER; and the AETHER-UCLM project, A smart data holistic approach for context-aware data analytics focused on Quality and Security (Ministerio de Ciencia e Innovacion, PID2020-112540RB-C42). CSS thanks the Universitat Politecnica de Valencia contract no. UPV-SUB.2-1302 and FONDO SUPERA COVID-19 by CRUE-Santander Bank grant "Severity Subgroup Discovery and Classification on COVID-19 Real World Data through Machine Learning and Data Quality assessment (SUBCOVERWD-19)."
Souza, J.; Caballero, I.; Vasco Santos, J.; Lobo, M.; Pinto, A.; Viana, J.; Sáez Silvestre, C.... (2022). Multisource and temporal variability in Portuguese hospital administrative datasets: Data quality implications. Journal of Biomedical Informatics. 136:1-11. https://doi.org/10.1016/j.jbi.2022.104242
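The funnel-plot idea from the abstract above can be sketched as follows: each hospital's SHR (observed over expected admissions) is compared with control limits that narrow as the expected count grows. This is only a sketch under stated assumptions: it uses a normal approximation to Poisson counts and a 3-sigma width, neither of which is confirmed by the abstract, and the hospital names and counts are hypothetical.

```python
import math


def shr(observed: float, expected: float) -> float:
    """Standardized hospitalization ratio: observed / expected admissions."""
    return observed / expected


def funnel_limits(expected: float, z: float = 3.0) -> tuple:
    """Approximate control limits around SHR = 1.

    Assumes Poisson counts, so the standard error of the ratio is
    sqrt(1 / expected); z = 3.0 is an assumed limit width.
    """
    half_width = z * math.sqrt(1.0 / expected)
    return max(0.0, 1.0 - half_width), 1.0 + half_width


def is_outlying(observed: float, expected: float, z: float = 3.0) -> bool:
    """True when the hospital's SHR falls outside the funnel limits."""
    lower, upper = funnel_limits(expected, z)
    ratio = shr(observed, expected)
    return ratio < lower or ratio > upper


# Hypothetical hospitals: (observed, expected) admissions for one condition.
hospitals = {"A": (140, 100.0), "B": (110, 100.0), "C": (12, 10.0)}
flags = {name: is_outlying(obs, exp) for name, (obs, exp) in hospitals.items()}
print(flags)
```

Small hospitals get wide limits, so the same ratio that flags a large hospital may be unremarkable for a small one.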
COST 733 - WG4: Applications of weather type classification
The main objective of the COST Action 733 is to achieve a general numerical method for
assessing, comparing and classifying typical weather situations in the European regions. To
accomplish this goal, different workgroups are established, each with their specific aims:
WG1: Existing methods and applications (finished); WG2: Implementation and development of
weather types classification methods; WG3: Comparison of selected weather types
classifications; WG4: Testing methods for various applications.
The main task of Workgroup 4 (WG4) in COST 733 is to test the selected weather
type methods for various classifications. In more detail, WG4 focuses on the following topics:
• Selection of dedicated applications (using results from WG1),
• Performance of the selected applications using available weather types provided by WG2,
• Intercomparison of the application results as a result of different methods,
• Final assessment of the results and uncertainties,
• Presentation and release of results to the other WGs and external interested
• Recommend specifications for a new (common) method WG2
Introduction
In order to address these specific aims, various applications are selected and WG4 is divided into
subgroups accordingly:
1. Air quality
2. Hydrology (& Climatological mapping)
3. Forest fires
4. Climate change and variability
5. Risks and hazards
Simultaneously, special attention is paid to several broad topics concerning other
COST Actions, such as phenology (COST725), biometeorology (COST730), agriculture (COST734)
and mesoscale modelling and air pollution (COST728).
Sub-groups are established to find advantages and disadvantages of different classification
methods for different applications. Focus is given to data requirements, spatial and temporal
scale, domain area, specifi
- …