10,795 research outputs found

    Hashing-based delayed duplicate detection as an approach to improve the scalability of optimal Bayesian network structure learning with external memory frontier breadth-first branch and bound search

    Bayesian networks are graphical models used to represent the joint probability distribution of all variables in a data set. A Bayesian network can be constructed by an expert, but learning the network from data is another option. This thesis is centered on exact score-based Bayesian network structure learning: the idea is to optimize a scoring function measuring the fit of the network to the data, and the solution represents an optimal network structure. The thesis adds to the earlier literature by extending the external memory frontier breadth-first branch and bound algorithm of Malone et al. [MYHB11], which searches the space of candidate networks using dynamic programming in a layered fashion. To detect duplicates during candidate solution generation, the algorithm makes efficient use of both semiconductor memory and magnetic disk. In-memory duplicate detection is performed with a hash table, while a delayed duplicate detection strategy is employed when resorting to the disk. Delayed duplicate detection is designed to cope with long disk latencies, because hash tables are still infeasible to maintain directly on disk. It allows the algorithm to scale beyond search spaces of candidate solutions that fit in the comparatively expensive and limited semiconductor memory available. The sorting-based delayed duplicate detection strategy employed by the original algorithm has been found to be inferior to a hashing-based strategy in other application domains [Kor08]. This thesis presents an approach for using hashing-based delayed duplicate detection in Bayesian network structure learning and compares it to the sorting-based method. The central problem in hashing candidate solutions to disk is dividing the candidate solutions of a given stage of the search into files on disk in an efficient and scalable manner. The division presented in this thesis distributes the candidate solutions into files unevenly, but takes into account the maximum number of candidate solutions that can be held in semiconductor memory. The method works in theory as long as the operating system has free space and inodes to allocate for new files. Although the hashing-based method should in principle be faster than the sorting-based one, the benchmarks presented in this thesis show mixed results for the two methods. However, the analysis also highlights that the hashing-based method could be further improved, for example by making the distribution of hashed candidate solutions to different files more efficient.
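    To make the hashing-based strategy concrete, the following is a minimal sketch (with hypothetical names, not the thesis implementation) of two-phase hashing-based delayed duplicate detection: candidate solutions generated in one layer are streamed into bucket files on disk according to a hash of their state, and each bucket, sized to fit in memory, is later de-duplicated with an ordinary in-memory hash table.

```python
# Minimal sketch of hashing-based delayed duplicate detection (hypothetical
# names; not the thesis implementation).
import os
import pickle
import hashlib

NUM_BUCKETS = 64  # assumption: chosen so that each bucket fits in available RAM

def bucket_of(state: bytes) -> int:
    """Map a serialized candidate solution to a bucket-file index."""
    digest = hashlib.blake2b(state, digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BUCKETS

def write_candidates(candidates, workdir):
    """Phase 1: stream the candidates of the current layer into bucket files."""
    files = [open(os.path.join(workdir, f"bucket_{i}.bin"), "ab")
             for i in range(NUM_BUCKETS)]
    try:
        for state, score in candidates:          # state: bytes key, score: float
            pickle.dump((state, score), files[bucket_of(state)])
    finally:
        for f in files:
            f.close()

def deduplicate_bucket(path):
    """Phase 2: load one bucket and keep the best-scoring copy of each state."""
    best = {}
    with open(path, "rb") as f:
        while True:
            try:
                state, score = pickle.load(f)
            except EOFError:
                break
            if state not in best or score < best[state]:  # assuming lower score is better
                best[state] = score
    return best
```

    Because duplicates of the same state always hash to the same bucket file, each bucket can be processed independently, which is what lets the search frontier exceed the size of main memory.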

    Dynamic Bayesian Combination of Multiple Imperfect Classifiers

    Classifier combination methods need to make the best use of the outputs of multiple, imperfect classifiers to enable more accurate classifications. In many situations, such as when human decisions need to be combined, the base decisions can vary enormously in reliability. A Bayesian approach to such uncertain combination allows us to infer the differences in performance between individuals and to incorporate any available prior knowledge about their abilities when training data is sparse. In this paper we explore Bayesian classifier combination, using the computationally efficient framework of variational Bayesian inference. We apply the approach to real data from a large citizen science project, Galaxy Zoo Supernovae, and show that our method far outperforms other established approaches to imperfect decision combination. We go on to analyse the putative community structure of the decision makers, based on their inferred decision-making strategies, and show that natural groupings are formed. Finally, we present a dynamic Bayesian classifier combination approach and investigate the changes in base classifier performance over time.
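    As a rough illustration of the Bayesian combination idea (a simplified independent combination with point estimates, not the variational inference used in the paper), each base classifier can be described by a confusion matrix learned with a Dirichlet prior, and the decisions on a new item fused by multiplying the corresponding likelihoods with the class prior:

```python
# Simplified IBCC-style combination sketch (assumed structure, not the paper's
# variational method): per-classifier confusion matrices with Dirichlet smoothing,
# fused by multiplying likelihoods with the class prior.
import numpy as np

def estimate_confusions(labels, decisions, n_classes, alpha=1.0):
    """decisions: (n_items, n_classifiers) observed labels; labels: (n_items,) truth."""
    n_clf = decisions.shape[1]
    counts = np.full((n_clf, n_classes, n_classes), alpha)  # Dirichlet pseudo-counts
    for k in range(n_clf):
        for true, obs in zip(labels, decisions[:, k]):
            counts[k, true, obs] += 1
    return counts / counts.sum(axis=2, keepdims=True)       # row-normalised confusion matrices

def combine(decisions_row, confusions, class_prior):
    """Posterior over the true class given one decision per classifier."""
    log_post = np.log(class_prior)
    for k, obs in enumerate(decisions_row):
        log_post = log_post + np.log(confusions[k, :, obs])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Toy example: three noisy annotators, two classes.
labels = np.array([0, 1, 1, 0, 1])
decisions = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]])
pi = estimate_confusions(labels, decisions, n_classes=2)
print(combine(np.array([1, 1, 0]), pi, class_prior=np.array([0.5, 0.5])))
```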

    A systematic review of data quality issues in knowledge discovery tasks

    Large volumes of data are accumulating because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; nevertheless, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Measuring the similarity of PML documents with RFID-based sensors

    The Electronic Product Code (EPC) Network is an important part of the Internet of Things. The Physical Mark-Up Language (PML) is used to represent and describe data related to objects in the EPC Network. The PML documents that the components of an EPC Network system exchange are XML documents based on the PML Core schema. To manage the huge number of PML documents for tags captured by Radio Frequency Identification (RFID) readers, high-performance techniques for filtering and integrating these tag data are indispensable. In this paper, we therefore propose a Bayesian-network-based approach for measuring the similarity of PML documents collected from several sensors. Taking the characteristics of PML into account, we first reduce redundant data, retaining only the EPC information, before measuring similarity. On this basis, a Bayesian network model derived from the structure of the PML documents being compared is constructed.
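    A simplified sketch of the pre-processing and comparison steps (hypothetical tag names; not the paper's Bayesian-network model) could strip everything except EPC-related content from two PML documents and compare the remaining element structure:

```python
# Simplified stand-in for the described pipeline: keep only subtrees that carry
# EPC information, then compare the retained tag paths of two PML documents.
# The tag name "epc" and the Jaccard measure are assumptions for illustration.
import xml.etree.ElementTree as ET

def epc_paths(xml_text, keep_tag="epc"):
    """Collect tag paths, discarding subtrees that contain no EPC information."""
    paths = set()

    def walk(node, prefix):
        tag = node.tag.split('}')[-1]                 # strip an XML namespace, if any
        path = f"{prefix}/{tag}"
        child_hits = [walk(child, path) for child in node]
        keep = tag.lower() == keep_tag or any(child_hits)
        if keep:
            paths.add(path)
        return keep

    walk(ET.fromstring(xml_text), "")
    return paths

def pml_similarity(doc_a, doc_b):
    """Jaccard similarity of the retained tag paths of two PML documents."""
    a, b = epc_paths(doc_a), epc_paths(doc_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```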

    Fame for sale: efficient detection of fake Twitter followers

    Fake followers are Twitter accounts specifically created to inflate the number of followers of a target account. Fake followers are dangerous for the social platform and beyond, since they may alter concepts like popularity and influence in the Twittersphere, hence impacting economy, politics, and society. In this paper, we contribute along different dimensions. First, we review some of the most relevant existing features and rules (proposed by Academia and Media) for detecting anomalous Twitter accounts. Second, we create a baseline dataset of verified human and fake follower accounts. This baseline dataset is publicly available to the scientific community. Then, we exploit the baseline dataset to train a set of machine-learning classifiers built over the reviewed rules and features. Our results show that most of the rules proposed by Media provide unsatisfactory performance in revealing fake followers, while features proposed in the past by Academia for spam detection provide good results. Building on the most promising features, we revise the classifiers both in terms of reduction of overfitting and of the cost of gathering the data needed to compute the features. The final result is a novel Class A classifier, general enough to thwart overfitting, lightweight thanks to the use of the less costly features, and still able to correctly classify more than 95% of the accounts of the original training set. We ultimately perform an information fusion-based sensitivity analysis to assess the global sensitivity of each of the features employed by the classifier. The findings reported in this paper, other than being supported by a thorough experimental methodology and interesting on their own, also pave the way for further investigation on the novel issue of fake Twitter followers.
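    The classification step can be illustrated with a minimal sketch (illustrative feature names and toy data, not the paper's Class A feature set or baseline dataset): train a standard classifier over lightweight per-account features and evaluate it with cross-validation.

```python
# Minimal feature-based fake-follower classification sketch (illustrative
# features and toy labels; not the paper's Class A classifier or dataset).
# Requires pandas and scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-account features; real experiments would load the labelled
# baseline of verified human and fake follower accounts.
accounts = pd.DataFrame({
    "friends_followers_ratio": [0.8, 52.0, 1.1, 37.5, 0.6, 44.2],
    "statuses_count":          [5400, 3, 980, 12, 2300, 7],
    "account_age_days":        [2100, 40, 900, 25, 1500, 60],
    "has_default_profile_img": [0, 1, 0, 1, 0, 1],
    "is_fake":                 [0, 1, 0, 1, 0, 1],
})

X = accounts.drop(columns="is_fake")
y = accounts["is_fake"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)   # tiny toy split; real data is needed
print("cross-validated accuracy:", scores.mean())
```

    On real data, the choice of features would also be driven by how costly each one is to collect, which is the trade-off the lightweight Class A classifier addresses.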