
    Analysing inconsistent information using distance-based measures

    There have been a number of proposals for measuring inconsistency in a knowledgebase (i.e. a set of logical formulae). These include measures that consider the minimally inconsistent subsets of the knowledgebase, and measures that consider the paraconsistent models (3 or 4 valued models) of the knowledgebase. In this paper, we present a new approach that considers the amount by which each formula has to be weakened in order for the knowledgebase to be consistent. This approach is based on ideas of knowledge merging by Konieczny and Pino-Perez. We show that this approach gives us measures that are different from existing measures, that have desirable properties, and that can take the significance of inconsistencies into account. The latter is useful when we want to differentiate inconsistencies of minor significance from those of major significance. We also show how our measures are potentially useful in applications such as evaluating violations of integrity constraints in databases and for deciding how to act on inconsistency.
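    To make the idea concrete, here is a minimal sketch, in Python, of one distance-based measure in the spirit of the Konieczny and Pino-Perez merging framework the abstract cites: each formula is charged the minimal number of atomic flips (Dalal distance) an interpretation needs to satisfy it, and the measure is the smallest total charge over all interpretations. The encoding of formulae as Python functions and the example knowledgebase are my own illustrative assumptions, not the paper's definitions.

```python
from itertools import product

# A formula is modeled as a function from an interpretation
# (a dict mapping variable names to booleans) to a boolean.

def dalal_distance(interp, formula, variables):
    """Minimal number of atoms to flip in `interp` so that `formula`
    holds (infinity if the formula is unsatisfiable)."""
    best = float("inf")
    for bits in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, bits))
        if formula(model):
            best = min(best, sum(interp[v] != model[v] for v in variables))
    return best

def inconsistency_measure(kb, variables):
    """Smallest total weakening (atomic flips summed over formulae)
    under which one interpretation satisfies every formula."""
    return min(
        sum(dalal_distance(dict(zip(variables, bits)), f, variables) for f in kb)
        for bits in product([False, True], repeat=len(variables))
    )

# Example: {p, not p, p or q} gets measure 1 -- weakening a single
# formula by one atomic flip restores consistency.
kb = [lambda m: m["p"], lambda m: not m["p"], lambda m: m["p"] or m["q"]]
print(inconsistency_measure(kb, ["p", "q"]))  # -> 1
```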

    Testing a methodology for identifying clustered allele loss using SNP array data

    The HumanHap550 Genotyping BeadChip provides a platform for genotyping single nucleotide polymorphisms (SNPs) at more than 550,000 loci. Such SNP genotyping array technology makes it possible to identify genetic variation in individuals and across populations, to profile somatic mutations in cancer and loss of heterozygosity (LOH) events, to detect deletions of regions of DNA, and potentially to evaluate germline mutations in individuals. This study focuses on the analysis of clusters of Mendelian inconsistencies (MIs) in the SNP array for six Russian radiation worker family trios, in order to identify the type of deletion variant in offspring: inherited parental deletion variants (PDVs), spontaneous mutations (SMs), and germline mutations (GMs). By applying Bayes' theorem in combination with the rules of heredity, this study presents a striking result: 96.15% of the genotypes in the six selected clusters under investigation could be identified as either PDVs or SMs/GMs, with two clusters perfectly identified as SMs/GMs. This opens an avenue for further investigation of whether external environmental exposures (e.g., ionizing radiation) can affect the frequency of deletion variants (i.e., germline mutations) occurring in the offspring of highly exposed nuclear workers. While the applied methodology provides a practical means to recognize genomic variation within the SNP array, some weaknesses of the study were observed; in particular, the control group of 112 individuals of Yoruba, Han Chinese, Japanese, and Mormon ancestry is deficient in sample size, differs from the case group in ethnicity and DNA processing, and retains potentially hemizygous SNPs (i.e., Mendelian inconsistencies) that were not cleaned out. Further statistical investigation and research are needed to overcome these weaknesses; the reliability and validity of the methodology could then be enhanced, making it more effective in application. Public health significance: the development of a reliable method to identify and count germline mutations in radiation workers could be generalized to exposure to any form of environmental mutagen (e.g., chemicals). Such a generalized marker could be used to measure the effects of various toxic environmental exposures on specific individuals and to predict genetically determined illness conditions.
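    As a toy illustration of the kind of Bayesian classification described above, the sketch below applies Bayes' theorem to decide whether a Mendelian inconsistency more likely reflects an inherited parental deletion variant or a spontaneous/germline mutation. The priors and likelihoods are hypothetical placeholders, not values estimated in the study.

```python
# Illustrative only: Bayes' theorem applied to classify a Mendelian
# inconsistency (MI) as an inherited parental deletion variant (PDV)
# versus a spontaneous/germline mutation (SM/GM). All numbers below are
# hypothetical placeholders, not values from the study.

def posterior_pdv(p_mi_given_pdv, p_mi_given_smgm, prior_pdv):
    """P(PDV | MI) over the two competing hypotheses."""
    prior_smgm = 1.0 - prior_pdv
    evidence = p_mi_given_pdv * prior_pdv + p_mi_given_smgm * prior_smgm
    return p_mi_given_pdv * prior_pdv / evidence

# If the observed MI pattern is ten times as likely under an inherited
# deletion, a flat prior yields a posterior of about 0.91 for PDV:
print(posterior_pdv(p_mi_given_pdv=0.10, p_mi_given_smgm=0.01, prior_pdv=0.5))
```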

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how the data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations, and we put forward multiple research directions for researchers. Comment: published in ICDE 202
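    For reference, the following is a minimal sketch of the Benjamini-Yekutieli step-up procedure mentioned above, which controls the false discovery rate under arbitrary dependence among hypotheses. It is the textbook method rather than CleanML's actual implementation, and the example p-values are made up.

```python
# A minimal sketch of the Benjamini-Yekutieli (BY) step-up procedure;
# the function name and example p-values are mine.

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean per hypothesis: True where it is rejected at
    FDR level `alpha` under arbitrary dependence (BY correction)."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k_max
    return reject

print(benjamini_yekutieli([0.001, 0.02, 0.04, 0.3]))  # [True, False, False, False]
```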

    Evaluating the Variability of Urban Land Surface Temperatures Using Drone Observations

    Urbanization and climate change are driving increases in urban land surface temperatures that pose a threat to human and environmental health. To address this challenge, we must be able to observe land surface temperatures within spatially complex urban environments. However, many existing remote sensing studies are based upon satellite or aerial imagery recorded at coarse resolutions, which fail to capture the spatial complexity of urban land surfaces that can change at a sub-meter scale. This study seeks to fill this gap by evaluating the spatial variability of land surface temperatures through drone thermal imagery captured at high resolution (13 cm). In this study, flights were conducted using a quadcopter drone and thermal camera at two case study locations, in Milwaukee, Wisconsin, and El Paso, Texas. Results indicate that land use types exhibit significant variability in their surface temperatures (3.9–15.8 °C) and that this variability is influenced by surface material properties, traffic, weather, and urban geometry. Air temperature and solar radiation were statistically significant predictors of land surface temperature (R² = 0.37–0.84), but the predictive power of the models was lower for land use types heavily impacted by pedestrian or vehicular traffic. The findings from this study ultimately elucidate factors that contribute to land surface temperature variability in the urban environment, which can be applied to develop better temperature mitigation practices to protect human and environmental health.
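    As a sketch of the regression form behind the reported R² values, the snippet below fits land surface temperature on air temperature and solar radiation by ordinary least squares. Only the choice of predictors comes from the abstract; the data arrays are invented for illustration.

```python
# Hedged sketch: land surface temperature (LST) predicted from air
# temperature and solar radiation via ordinary least squares.
import numpy as np

air_temp = np.array([28.1, 30.4, 32.7, 35.2])       # deg C (hypothetical)
solar_rad = np.array([610.0, 700.0, 815.0, 930.0])  # W/m^2 (hypothetical)
lst = np.array([39.5, 43.0, 48.2, 53.1])            # deg C (hypothetical)

X = np.column_stack([np.ones_like(air_temp), air_temp, solar_rad])
beta, *_ = np.linalg.lstsq(X, lst, rcond=None)      # intercept + 2 slopes

pred = X @ beta
r2 = 1 - np.sum((lst - pred) ** 2) / np.sum((lst - lst.mean()) ** 2)
print(f"coefficients: {beta}, R^2 = {r2:.2f}")
```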

    The value added statement: bastion of social reporting or dinosaur of financial reporting?

    South Africa currently has the highest reported incidence of value added statement publication anywhere in the world. In addition, research investigating the predictive ability of value added information has been conducted in the USA since 1990, even though the value added statement is not published there. The research reported in this paper sets out to establish whether the value added statement is a disclosure worth considering by companies around the world, by investigating the South African experience with it. The social accounting theories of organisational legitimacy and political costs were found to be best suited to explaining why the value added statement is published. Surveys among the companies publishing the value added statement indicated that management had employees in mind when publishing this information. However, a survey among users indicated that very little use has been made of the value added statement. The main reason seems to be that its unregulated nature allows for inconsistencies in disclosure, which eventually caused users to suspect bias in the reports. The USA evidence that the information has additional predictive power is not confirmed by a South African study, and is complicated by the limited additional information contained in the value added statement. The South African experience with the value added statement does not make a convincing case for publication. Rather, it highlights the need for unbiased and verified social disclosures that will be useful to all the stakeholders of the company. It also has implications for other voluntary social and environmental disclosures.

    Accounting for the Uncertainty in the Evaluation of Percentile Ranks

    In a recent paper entitled "Inconsistencies of Recently Proposed Citation Impact Indicators and How to Avoid Them," Schreiber (2012, at arXiv:1202.3861) proposed (i) a method for assessing tied ranks consistently and (ii) fractional attribution to percentile ranks in the case of relatively small samples (e.g., for n < 100). Schreiber's solution to the problem of how to handle tied ranks is convincing, in my opinion (cf. Pudovkin & Garfield, 2009). The fractional attribution, however, is computationally intensive and cannot be done manually for even moderately large batches of documents. Schreiber attributed scores fractionally to the six percentile rank classes used in the Science and Engineering Indicators of the U.S. National Science Board, and thus missed, in my opinion, the point that fractional attribution at the level of hundred percentiles (or, equivalently, quantiles as the continuous random variable) is only a linear, and therefore much less complex, problem. Given the quantile values, the non-linear attribution to the six classes or any other evaluation scheme is then a question of aggregation. A new routine based on these principles (including Schreiber's solution for tied ranks) is made available as software for the assessment of documents retrieved from the Web of Science (at http://www.leydesdorff.net/software/i3). Comment: Journal of the American Society for Information Science and Technology (in press)
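    To illustrate why the quantile step is the simple, linear one, here is a minimal sketch that assigns each document a quantile value, handling tied citation counts by counting half of the tied documents as ranked below the focal one. This tie convention and the sample counts are my own illustration, not the i3 routine itself; aggregation into the six NSB classes (or any other scheme) would then be applied on top of these values.

```python
# Illustration of the linear quantile step: each document's quantile is
# the percentage of documents ranked below it, with half of any ties
# counted as below (one common tie convention; sample counts are made up).

def quantiles(citation_counts):
    """Quantile value (0-100) for each document in the batch."""
    n = len(citation_counts)
    values = []
    for c in citation_counts:
        below = sum(1 for x in citation_counts if x < c)
        tied = sum(1 for x in citation_counts if x == c) - 1
        values.append(100.0 * (below + tied / 2.0) / n)
    return values

# The two documents tied at 2 citations share the same quantile value.
print(quantiles([0, 2, 2, 5, 10]))  # -> [0.0, 30.0, 30.0, 60.0, 80.0]
```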

    Active Latitude Oscillations Observed on the Sun

    We investigate periodicities in the mean heliographic latitudes of sunspot groups, called active latitudes, for the last six complete solar cycles (1945-2008). For this purpose, the Multi Taper Method and Morlet wavelet analysis were used. We found the following: 1) Solar rotation periodicities (26-38 days) are present in the active latitudes of both hemispheres for all the investigated cycles (18 to 23). 2) In both the northern and southern hemispheres, active latitudes drifted towards the equator from the beginning to the end of each cycle, following an oscillating path. These motions are well described by a second-order polynomial. 3) There are no meaningful periods between 55 and about 300 days in either hemisphere for any cycle. 4) A 300 to 370 day periodicity appears in both hemispheres for Cycle 23, in the northern hemisphere for Cycle 20, and in the southern hemisphere for Cycle 18. Comment: Accepted for publication in Solar Physics
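    As an illustration of finding 2) above, the snippet below fits a second-order polynomial to a synthetic equatorward drift of active latitudes. The latitude series is fabricated to mimic a cycle-long drift toward the equator; only the choice of a degree-2 model comes from the abstract.

```python
# Sketch of finding 2): a second-order polynomial fit to the
# equatorward drift of active latitudes over one cycle (synthetic data).
import numpy as np

years = np.linspace(0.0, 11.0, 12)                # time since cycle start
latitude = 28.0 - 3.5 * years + 0.12 * years**2   # synthetic mean latitudes
latitude += np.random.default_rng(0).normal(0.0, 0.5, years.size)  # scatter

coeffs = np.polyfit(years, latitude, 2)           # second-order fit
print("a2, a1, a0 =", coeffs)
print("fitted latitude at cycle end:", np.polyval(coeffs, years[-1]))
```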