22,628 research outputs found

    A Rule Based Taxonomy of Dirty Data

    There is a growing awareness that high-quality data is key to today’s business success and that dirty data within data sources is one of the causes of poor data quality. To ensure high-quality data, enterprises need processes, methodologies and resources to monitor, analyze and maintain the quality of data. Nevertheless, research shows that many enterprises do not pay adequate attention to the existence of dirty data and have not applied useful methodologies to ensure high-quality data for their applications. One of the reasons is a lack of appreciation of the types and extent of dirty data. In practice, detecting and cleaning all the dirty data that exists in all data sources is quite expensive and unrealistic. The cost of cleaning dirty data needs to be considered by most enterprises. This problem has not attracted enough attention from researchers. In this paper, a rule-based taxonomy of dirty data is developed. The proposed taxonomy not only provides a mechanism to deal with this problem but also includes more dirty data types than any existing taxonomy

    Data Quality in Very Large, Multiple-Source, Secondary Datasets for Data Mining Applications

    The data mining research community is increasingly addressing data quality issues, including problems of dirty data. Hand, Blunt, Kelly and Adams (2000) have identified high-level and low-level quality issues in data mining. Kim, Choi, Hong, Kim and Lee (2003) have compiled a useful, complete taxonomy of dirty data that provides a starting point for research in effective techniques and fast algorithms for preprocessing data, and ways to approach the problems of dirty data. In this study we create a classification scheme for data errors by transforming their general taxonomy to apply to very large multiple-source secondary datasets. These types of datasets are increasingly being compiled by organizations for use in their data mining applications. We contribute this classification scheme to the body of research addressing quality issues in the very large multiple-source secondary datasets that are being built through today’s global organizations’ massive data collection from the Internet

    Robust Recommender System: A Survey and Future Directions

    With the rapid growth of information, recommender systems have become integral for providing personalized suggestions and overcoming information overload. However, their practical deployment often encounters "dirty" data, where noise or malicious information can lead to abnormal recommendations. Research on improving recommender systems' robustness against such dirty data has thus gained significant attention. This survey provides a comprehensive review of recent work on recommender systems' robustness. We first present a taxonomy to organize current techniques for withstanding malicious attacks and natural noise. We then explore state-of-the-art methods in each category, including fraudster detection, adversarial training, certifiable robust training against malicious attacks, and regularization, purification, self-supervised learning against natural noise. Additionally, we summarize evaluation metrics and common datasets used to assess robustness. We discuss robustness across varying recommendation scenarios and its interplay with other properties like accuracy, interpretability, privacy, and fairness. Finally, we delve into open issues and future research directions in this emerging field. Our goal is to equip readers with a holistic understanding of robust recommender systems and spotlight pathways for future research and development

    A Holistic View of Identity Theft Tax Refund Fraud

    This thesis attempts to explain what identity theft tax refund fraud is and how the issue has developed over the years. It presents a holistic, historic view of the problem as well as how it has been addressed. It primarily relies on reports from the Internal Revenue Service (IRS), Treasury Inspector General for Tax Administration (TIGTA), Government Accountability Office (GAO) and National Taxpayer Advocate (NTA) in its assessment. It does not examine foreign tax administrations’ methods of dealing with identity theft refund fraud or the extent of the issue in other principalities, and therefore this is an area in need of further research. This thesis does not attempt to make an argument for the efficacy of funding for the IRS either, which is an area that could be further studied. It also does not deal with employment-related identity fraud, which some relate to identity theft refund fraud

    Multi-GPU maximum entropy image synthesis for radio astronomy

    The maximum entropy method (MEM) is a well-known deconvolution technique in radio interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, and it has better resolution and better image quality under certain conditions. This work presents a high-performance GPU version of non-gridding MEM, which is tested using real and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to be exploited transparently and efficiently. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup of 1000 to 5000 times over a sequential version can be achieved, depending on data and image size. This allows the HD142527 CO(6-5) short baseline data set to be reconstructed in 2.1 minutes, instead of the 2.5 days taken by a sequential version on CPU. Comment: 11 pages, 13 figure

    A new golden frog species of the genus Diasporus (Amphibia, Eleutherodactylidae) from the Cordillera Central, western Panama

    We describe the frog species Diasporus citrinobapheus sp. n. from the Cordillera Central of western Panama. The new species differs from all other species in its genus in coloration, disk cover and disk pad shape, skin texture, advertisement call, and size. It is most similar to Diasporus tigrillo, from which it differs in dorsal skin texture, relative tibia length, number of vomerine teeth, ventral coloration, dorsal markings, and relative tympanum size, and to Diasporus gularis, from which it can be distinguished by the lack of membranes between the toes, adult size, posterior thigh coloration, and position of the choanae. We provide data on morphology, vocalization, and distribution of the new species, as well as brief information on its natural history

    THE KEYBOARD WARRIORS: EXPRESSING HATRED AND JUDGEMENT ON “ANOTHER” WOMAN THROUGH HATERS’ INSTAGRAM ACCOUNT

    Nowadays, many celebrities use Instagram to connect with their fans. Unfortunately, for some celebrities, popularity does not necessarily mean that they are liked by the public. The keyboard warriors, i.e. haters, can freely hit the keyboard and leave hate comments, as cyber communication does not require face-to-face interaction. Some of them even go so far as to create haters’ accounts for certain public figures, as can be found on @mulanjameelaqueen, created by the haters of Mulan Jameela, an Indonesian singer known for her affairs and unregistered marriage with her friend’s husband. This paper explores how being “another” woman is perceived in Indonesia. Mateo and Yus’ (2013) pragmatic taxonomy of insults was used as the framework of analysis. The data were taken from the captions and comments of the 10 most commented posts of @mulanjameelaqueen. They were processed using AntConc to obtain the most frequently used words, their collocations, and the word clusters. The results show that the lexical items most commonly used to refer to Mulan are: cireng ‘traditional snack’, lonte ‘whore’, Jamilonte or Mulonte (coined from Mulan Jameela and lonte ‘prostitute’), and iblis ‘devil’. The malicious comments are mostly related to Mulan’s physical appearance, death threats against Mulan, divorce, and nikah siri ‘unregistered marriage’. The comments may also reflect the negative perception and judgement that most of the haters (mostly female) hold of unregistered marriage, divorced women, and “another” woman

    ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning

    We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment"). We propose nine if-then relation types to distinguish causes vs. effects, agents vs. themes, voluntary vs. involuntary events, and actions vs. mental states. By generatively training on the rich inferential knowledge described in ATOMIC, we show that neural models can acquire simple commonsense capabilities and reason about previously unseen events. Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation. Comment: AAAI 2019 C