
    Descriptive analysis of online roulette gamblers: segmentation of different gamblers based on their behavior using data mining algorithms

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. The popularity of gambling activities has increased over the last decades, with online-based gambling a key driver of this growth due to the ease of accessing online platforms. Consequently, there is serious concern about the negative social impact of gambling, and regulatory agencies are working to identify and manage those effects. In this context, a potential way to address them is the concept of 'Responsible Gambling', which means playing consciously, with complete control of time and money. The present study aims to segment online gamblers based on their playing behaviors, differentiating the groups as much as possible and ultimately identifying a cluster of players of concern. This is achieved using unsupervised learning algorithms such as K-Means, Hierarchical Clustering, and Self-Organizing Maps. The data on which this project is based reflect activity on some of the Portuguese online gambling platforms during 2019. The available data cover multiple aspects such as the gambling institution, type of gambling, player identification, each player's total bets, and the corresponding outcomes
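As a rough illustration of the kind of behavioural segmentation described above, the sketch below runs a minimal k-means on two invented gambler features (total amount bet and number of sessions). The feature names, data, and k-means-from-scratch implementation are illustrative assumptions, not the dissertation's actual pipeline.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # initialise centroids on k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two synthetic behaviour groups: casual vs heavy play (hypothetical units)
rng = np.random.default_rng(1)
casual = rng.normal([50, 5], [10, 2], size=(100, 2))
heavy = rng.normal([500, 40], [50, 5], size=(100, 2))
X = np.vstack([casual, heavy])
labels, centroids = kmeans(X, k=2)
```

A cluster of "players of concern" would then be identified by inspecting the centroid with the highest betting intensity.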

    Determination of the requirement for transportation and technological machines by clusterization of oil and gas production departments

    The article analyses production indicators of oil and gas production departments with the aim of clustering them to subsequently determine their need for automobiles and technological machines. The departments differ in size, capacity, operating conditions, and performance indicators, yet they are equipped with vehicles according to the same standards. This leads to problems in ensuring uninterrupted transport and technological service of the main production. In a number of departments, the planned number of transport and technological machines is not enough to perform technological operations for the repair or maintenance of wells. In such cases, vehicles are sent from another subdivision, which limits that subdivision's own transport service capabilities. Fleet planning often relies on the historical conditions of a department, which is generally applicable for old departments with an established well stock but practically does not work for newly formed departments with large volumes of newly commissioned wells and complicated production conditions. These subdivisions are equipped with vehicles by analogy with existing workshops with similar indicators, which most often leads to an insufficient number of machines and downtime of the main production. It is therefore necessary to identify and justify the production indicators that differentiate the departments. The aim of the paper is to increase the efficiency of transport and technological service of oil and gas production facilities by determining how the production indicators of oil and gas production shops influence the need for transport and technological machines, and by developing, on this basis, differentiated standards for equipping units with vehicles.
Using machine learning methods, the production units were clustered, and the factors that determine the distribution of the departments into four groups were identified. The main factors are the department's well stock and the degree of complexity of that stock; the groups are distinguished by the degree of change in these factors. The presented approach and the resulting distribution can serve as a basis for more efficient standardization of the departments' needs for automobiles and technological machines, and also as part of decision-support systems for vehicle fleet management
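To make the two-factor grouping concrete, here is a toy sketch that sorts invented departments into four groups by well stock and the share of complicated wells. The department names, numbers, and thresholds are all assumptions for illustration; the article derives its four groups via machine-learning clustering, not fixed cut-offs.

```python
# hypothetical departments described by the two factors named in the abstract
departments = {
    "dept_a": {"wells": 1200, "complicated_share": 0.10},
    "dept_b": {"wells": 1500, "complicated_share": 0.45},
    "dept_c": {"wells": 300,  "complicated_share": 0.05},
    "dept_d": {"wells": 250,  "complicated_share": 0.50},
}

def group(dept, wells_cut=800, complexity_cut=0.30):
    big = dept["wells"] >= wells_cut
    complicated = dept["complicated_share"] >= complexity_cut
    # two binary factors -> four groups
    return {
        (True, True):   "large, complicated stock",
        (True, False):  "large, simple stock",
        (False, True):  "small, complicated stock",
        (False, False): "small, simple stock",
    }[(big, complicated)]

groups = {name: group(d) for name, d in departments.items()}
```

Each group could then carry its own vehicle-equipment standard instead of the single common one criticized above.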

    An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints

    Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of conventional unsupervised clustering algorithms. They have been demonstrated to achieve better performance by integrating prior knowledge during the clustering process, which enables uncovering relevant, useful information from the data being clustered. However, the development of semi-supervised hierarchical clustering techniques is still an open and active area of investigation. The majority of current semi-supervised clustering algorithms are developed as partitional clustering (PC) methods, and only a few research efforts have been made on developing semi-supervised hierarchical clustering methods. The aim of this research is to enhance hierarchical clustering (HC) algorithms based on prior knowledge, by adopting novel methodologies. [Continues.]
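The prior knowledge mentioned above is commonly expressed as pairwise constraints. The sketch below shows the standard must-link / cannot-link form and a check for which constraints a given assignment violates; the labels and constraint pairs are invented examples, not taken from the thesis.

```python
def violated(labels, must_link, cannot_link):
    """Return the constraint pairs that a clustering assignment breaks."""
    bad = []
    for i, j in must_link:
        if labels[i] != labels[j]:          # must-link: same cluster required
            bad.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:          # cannot-link: same cluster forbidden
            bad.append(("cannot-link", i, j))
    return bad

labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2), (3, 4)]
print(violated(labels, must_link, cannot_link))  # → []
```

A constrained hierarchical algorithm would consult such checks while deciding which clusters to merge or split.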

    Automated machine learning plankton taxonomy pipeline

    Plankton taxonomy is considered a multi-class classification problem. The current state-of-the-art developments in machine learning and phytoplankton taxonomy, such as MorphoCluster, include using a convolutional neural network as a feature extractor and Hierarchical Density-Based Clustering for the classification of plankton and identification of outliers. These convolutional feature extraction algorithms achieved accuracies of 0.78 during the classification process. However, these feature extraction models are trained on clean datasets. They perform very well when analysing previously encountered and well-defined classes but do not perform well when tested on raw datasets expected in field deployment. Raw plankton datasets are unbalanced; whereas some classes only have one or two samples, others can have thousands. They also exhibit many inter-class similarities with significant size differences. The data can also be in the form of low-resolution, noisy images. Phytoplankton species are also highly biodiverse, meaning that there is always a higher chance of a network encountering unknown sample types. Some samples, such as the various body parts of organisms, are easily confused with the species itself. Marine experts classifying plankton tend to group ambiguous samples according to the highest order to which they are confident they belong. This system leads to a dataset containing conflicting classes and forces the feature extraction network to overfit when training. This research aims to address these spatial issues and present a feature extraction methodology built upon existing research and novel concepts. The proposed algorithm uses feature extraction methods designed around real-world sample sets and offers an alternative approach to optimizing the features extracted and supplied to the clustering algorithm. The proposed feature extraction methods achieved scores of 0.821 when tested on the same datasets as the general feature extractor. 
The algorithm also includes auxiliary SoftMax classification branches that indicate the class prediction obtained by the feature extraction models. These branches allow autonomous labelling of the clusters formed when the HDBSCAN algorithm is performed on the extracted features. The result is a fully automated semi-supervised plankton taxonomy pipeline that achieves a classification score of 0.775 on a real-life sample set. Thesis (MA) -- Faculty of Engineering, the Built Environment, and Technology, 202
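The auto-labelling step can be pictured as a majority vote: once HDBSCAN has grouped the extracted features, each cluster takes the most common class predicted by the auxiliary SoftMax branch for its members. The cluster ids and class names below are invented for illustration; the only detail carried over from HDBSCAN's convention is that outliers get cluster id -1.

```python
from collections import Counter

def label_clusters(cluster_ids, predicted_classes):
    clusters = {}
    for cid, cls in zip(cluster_ids, predicted_classes):
        if cid == -1:                      # skip HDBSCAN outliers
            continue
        clusters.setdefault(cid, []).append(cls)
    # the majority predicted class inside each cluster becomes its label
    return {cid: Counter(classes).most_common(1)[0][0]
            for cid, classes in clusters.items()}

cluster_ids = [0, 0, 0, 1, 1, -1]
predicted = ["copepod", "copepod", "diatom", "diatom", "diatom", "detritus"]
print(label_clusters(cluster_ids, predicted))  # → {0: 'copepod', 1: 'diatom'}
```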


    Efficient Point Clustering for Visualization

    The visualization of large spatial point data sets constitutes a problem with respect to runtime and quality. A visualization of raw data often leads to occlusion and clutter and thus a loss of information. Furthermore, mobile devices in particular have problems displaying millions of data items. Often, thinning via sampling is not the optimal choice because users want to see distributional patterns, cardinalities and outliers. In particular for visual analytics, an aggregation of this type of data is very valuable for providing an interactive user experience. This thesis defines the problem of visual point clustering that leads to proportional circle maps. It furthermore introduces a set of quality measures that assess different aspects of the resulting circle representations. The Circle Merging Quadtree constitutes a novel and efficient method to produce visual point clusterings via aggregation. It outperforms comparable methods in terms of runtime and also with respect to the aforementioned quality measures. Moreover, the introduction of a preprocessing step leads to further substantial performance improvements and a guaranteed stability of the Circle Merging Quadtree. This thesis furthermore addresses the incorporation of miscellaneous attributes into the aggregation. It discusses means to provide statistical values for numerical and textual attributes that are suitable for side views such as plots and data tables. The incorporation of multiple data sets, or data sets that contain class attributes, poses another problem for aggregation and visualization. This thesis provides methods for extending the Circle Merging Quadtree to output pie chart maps or maps that contain circle packings. For the latter variant, this thesis provides the results of a user study that investigates the methods and the introduced quality criteria. 
In the context of providing methods for interactive data visualization, this thesis finally presents the VAT System, where VAT stands for visualization, analysis and transformation. This system constitutes an exploratory geographical information system that implements principles of visual analytics for working with spatio-temporal data. This thesis details the user interface concept for facilitating exploratory analysis and provides the results of two user studies that assess the approach
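The core aggregation idea behind proportional circle maps can be illustrated with a toy area-preserving merge: two overlapping circles are replaced by one circle whose area is their sum, placed at the area-weighted centre. This is a hand-rolled illustration of that one step, not the Circle Merging Quadtree itself.

```python
import math

def overlaps(c1, c2):
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return math.hypot(x1 - x2, y1 - y2) < r1 + r2

def merge(c1, c2):
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    a1, a2 = r1 * r1, r2 * r2        # areas up to the common pi factor
    a = a1 + a2
    # area-weighted centroid keeps the merged circle near the heavier one
    x = (x1 * a1 + x2 * a2) / a
    y = (y1 * a1 + y2 * a2) / a
    return (x, y, math.sqrt(a))      # radius chosen so total area is preserved

c1, c2 = (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)
if overlaps(c1, c2):
    print(merge(c1, c2))             # merged circle at (0.5, 0.0), radius sqrt(2)
```

A quadtree would speed up the `overlaps` search from pairwise comparisons to local neighbourhood lookups, which is where the thesis's data structure comes in.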

    Efficient Partitioning and Allocation of Data for Workflow Compositions

    Our aim is to provide efficient partitioning and allocation of data for web service compositions. Web service compositions are represented as partial-order database transactions. We accommodate a variety of transaction types, such as read-only and write-oriented transactions, to support workloads in cloud environments. We introduce an approach that partitions and allocates small units of data, called micropartitions, to multiple database nodes. Each database node stores only the data needed to support a specific workload. Transactions are routed directly to the appropriate data nodes. Our approach guarantees serializability and efficient execution. In Phase 1, we cluster transactions based on data requirements. We associate each cluster with an abstract query definition. An abstract query represents the minimal data requirement that would satisfy all the queries that belong to a given cluster. A micropartition is generated by executing the abstract query on the original database. We show that our abstract query definition is complete and minimal. Intuitively, completeness means that all queries of the corresponding cluster can be correctly answered using the micropartition generated from the abstract query. The minimality property means that no smaller partition of the data can satisfy all of the queries in the cluster. We also aim to support efficient web service execution. Our approach reduces the number of accesses to distributed data, and we also aim to limit the number of replica updates. Our empirical results show that the partitioning approach improves data access efficiency over standard partitioning of data. In Phase 2, we investigate the performance improvement via parallel execution. Based on the data allocation achieved in Phase 1, we develop a scheduling approach. Our approach guarantees serializability while efficiently exploiting parallel execution of web services. 
We achieve conflict serializability by scheduling conflicting operations in a predefined order. This order is based on the calculation of a minimal delay requirement. We use this delay to schedule services to preserve serializability without the traditional locking mechanisms
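The delay-based idea above can be sketched as follows: operations on the same item conflict when at least one of them is a write, and a later conflicting operation is delayed until the earlier one finishes, while non-conflicting operations start immediately. The service names, durations, and this particular bookkeeping are illustrative assumptions, not the dissertation's actual algorithm.

```python
def schedule(ops):
    """ops: list of (service, action, item, duration) in predefined serial order."""
    last_write_end = {}   # item -> end time of the last write on it
    last_read_end = {}    # item -> latest end time among reads on it
    starts = {}
    for service, action, item, dur in ops:
        if action == "read":
            # a read only waits for the preceding write on the item
            start = last_write_end.get(item, 0.0)
            last_read_end[item] = max(last_read_end.get(item, 0.0), start + dur)
        else:
            # a write waits for all earlier reads and writes on the item
            start = max(last_write_end.get(item, 0.0), last_read_end.get(item, 0.0))
            last_write_end[item] = start + dur
        starts[service] = start
    return starts

ops = [
    ("s1", "write", "x", 2.0),
    ("s2", "read",  "x", 1.0),
    ("s3", "read",  "x", 1.0),   # s2 and s3 don't conflict: they run in parallel
    ("s4", "write", "x", 1.0),
    ("s5", "read",  "y", 1.0),   # different item: no delay at all
]
print(schedule(ops))
```

The computed start times are exactly the minimal delays that keep conflicting operations in the predefined order without any locks.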

    Neuromodulatory Supervised Learning
