
    Attribute-Based Access Control Policy Generation Approach from Access Logs Based on CatBoost

    Attribute-based access control (ABAC) offers higher flexibility and better scalability than traditional access control and can be used for fine-grained access control in large-scale information systems. Although ABAC can express dynamic, complex access control policies, defining them manually is costly, tedious, and error-prone, so it is worth studying how to construct ABAC policies efficiently and accurately. This paper proposes an ABAC policy generation approach based on the CatBoost algorithm that automatically learns policies from historical access logs. First, we perform a weighted reconstruction of the attributes of the policy to be mined. Second, we provide an ABAC rule extraction algorithm, a rule pruning algorithm, and a rule optimization algorithm; the pruning and optimization algorithms improve the accuracy of the generated policies. In addition, we present a new policy quality indicator that measures the accuracy and simplicity of the generated policies. Finally, an experiment conducted to validate the approach confirms its feasibility and effectiveness.
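The core learning step described here can be sketched as training a gradient-boosted classifier on (subject attributes, resource attributes, action) tuples labeled permit/deny. This is a minimal sketch: scikit-learn's gradient boosting stands in for CatBoost, and the attribute names and toy log below are invented for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy access log; every attribute name and value here is illustrative.
log = [
    ("doctor", "cardiology", "record",  "read",  1),
    ("doctor", "cardiology", "record",  "write", 1),
    ("nurse",  "cardiology", "record",  "read",  1),
    ("nurse",  "cardiology", "record",  "write", 0),
    ("clerk",  "billing",    "record",  "read",  0),
    ("clerk",  "billing",    "invoice", "write", 1),
    ("doctor", "billing",    "invoice", "read",  0),
    ("nurse",  "billing",    "invoice", "write", 0),
]
X_raw = [row[:4] for row in log]
y = np.array([row[4] for row in log])

# CatBoost consumes categorical attributes natively; for scikit-learn's
# gradient boosting we ordinal-encode them first.
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X, y)

# Query the learned "policy": may a nurse read a cardiology record?
query = enc.transform([("nurse", "cardiology", "record", "read")])
print("permit" if clf.predict(query)[0] == 1 else "deny")
```

In the paper's pipeline, human-readable rules would then be extracted from the trained model and refined by the pruning and optimization steps; the sketch stops at the learned decision function.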

    Classification algorithms for Big Data with applications in the urban security domain

    A classification algorithm is a versatile tool that can serve as a predictor of the future or as an analytical tool for understanding the past. Several obstacles prevent classification from scaling to large Volume, Velocity, Variety, or Value. The aim of this thesis is to scale distributed classification algorithms beyond current limits, assess the state of practice of Big Data machine learning frameworks, and validate the effectiveness of a data science process in improving urban safety. We found that massive datasets with many large-domain categorical features pose a difficult challenge for existing classification algorithms. We propose associative classification as a possible answer and develop several novel techniques to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. The experiments, run on a real large-scale dataset with more than 4 billion records, confirmed the quality of the approach. To assess the state of practice of Big Data machine learning frameworks and streamline the integration and fine-tuning of their building blocks, we developed a generic, self-tuning tool to extract knowledge from network traffic measurements. The result is a system that offers human-readable models of the data with minimal user intervention, validated by experiments on large collections of real-world passive network measurements. A good portion of this dissertation is dedicated to the study of a data science process to improve urban safety. First, we shed some light on the feasibility of a system that monitors social messages from a city for emergency relief. We then propose a methodology to mine temporal patterns in social issues, such as crimes. Finally, we propose a system that integrates the findings of data science on the citizenry's perception of safety and communicates its results to decision makers in a timely manner. We applied and tested the system in a real Smart City scenario, set in Turin, Italy.
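The associative-classification idea mentioned above can be illustrated in miniature: mine class association rules (itemsets of attribute=value pairs implying a label) with minimum support and confidence, then classify by the first matching rule. This is a single-machine toy sketch, not the thesis's distributed implementation, and the incident records are invented.

```python
from collections import Counter
from itertools import combinations

def mine_rules(rows, labels, min_sup=2, min_conf=0.7, max_len=2):
    """Mine class association rules {attr=value, ...} -> label."""
    itemset_count = Counter()
    itemset_label = Counter()
    for row, lab in zip(rows, labels):
        items = sorted(row.items())
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                itemset_count[combo] += 1
                itemset_label[(combo, lab)] += 1
    rules = []
    for (combo, lab), hits in itemset_label.items():
        sup = itemset_count[combo]
        if sup >= min_sup and hits / sup >= min_conf:
            rules.append((dict(combo), lab, hits / sup, sup))
    # Prefer confident, well-supported rules when classifying.
    rules.sort(key=lambda r: (-r[2], -r[3]))
    return rules

def classify(rules, row, default):
    for cond, lab, _, _ in rules:
        if all(row.get(a) == v for a, v in cond.items()):
            return lab
    return default

# Invented incident records, in the urban-security spirit of the thesis.
rows = [
    {"district": "center", "hour": "night"},
    {"district": "center", "hour": "night"},
    {"district": "center", "hour": "day"},
    {"district": "suburb", "hour": "night"},
    {"district": "suburb", "hour": "day"},
]
labels = ["theft", "theft", "none", "none", "none"]
rules = mine_rules(rows, labels)
print(classify(rules, {"district": "center", "hour": "night"}, "none"))
```

The distributed setting in the thesis partitions this rule-mining work across parallel workers; the classification-by-rules step is the same in spirit.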

    Incorporating a Machine Learning Model into a Web-Based Administrative Decision Support Tool for Predicting Workplace Absenteeism

    Productivity losses caused by absenteeism at work cost U.S. employers billions of dollars each year. In addition, employers typically spend a considerable amount of time managing employees who perform poorly. By using predictive analytics and machine learning algorithms, organizations can make better decisions, thereby increasing organizational productivity, reducing costs, and improving efficiency. Thus, in this paper we propose hybrid optimization methods to find the most parsimonious model for absenteeism classification. We utilized data from a Brazilian courier company. To categorize absenteeism classes, we preprocessed the data, selected attributes via multiple methods, balanced the dataset using the synthetic minority over-sampling technique (SMOTE), and then employed four machine learning classification methods: Support Vector Machine (SVM), Multinomial Logistic Regression (MLR), Artificial Neural Network (ANN), and Random Forest (RF). We selected the best model based on several validation scores and compared its performance against the existing model. Furthermore, project managers may lack experience in machine learning or may not have the time to develop machine learning algorithms. Thus, we propose a web-based interactive tool supported by cognitive analytics management (CAM) theory. The web-based decision tool enables managers to make more informed decisions and can be used without any prior knowledge of machine learning. Understanding absenteeism patterns can assist managers in revising policies or creating new arrangements to reduce absences in the workplace, financial losses, and the probability of economic insolvency.
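The balancing-then-classifying pipeline can be sketched with a minimal hand-rolled SMOTE (synthesizing minority samples by interpolating between a minority sample and one of its nearest minority neighbors) feeding one of the four classifiers. The data below are synthetic stand-ins, not the Brazilian courier dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier

def smote(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE: interpolate between minority samples and
    one of their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a random true neighbor
        gap = rng.random()
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# Illustrative imbalanced data (features invented, not the HR dataset).
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(95, 4))   # "low absenteeism"
X_min = rng.normal(2.0, 1.0, size=(12, 4))   # "high absenteeism"
X_syn = smote(X_min, n_new=len(X_maj) - len(X_min))

X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * len(X_maj) + [1] * (len(X_min) + len(X_syn)))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```

In practice the paper's comparison would hold out a validation set and score all four classifiers; the sketch shows only the oversampling mechanics and one classifier.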

    Identifying hazardous patterns in MSHA data using random forests

    Mining safety and health in the US can be better understood by applying machine learning techniques to data collected by the Mine Safety and Health Administration (MSHA). Identifying hazardous conditions before they lead to accidents can yield valuable insights for MSHA, mining operators, and miners. In this study, we propose using a Random Forest machine learning model to predict whether a given mining violation will lead to an accident and, if so, whether it will be fatal or non-fatal. To achieve this, the model is trained on MSHA violation data and the sum of scheduled accident charges within 35 days of the violation. We experiment with different predictive models using varying data columns, training set sizes, prediction classes, and hyperparameters to achieve a reliable prediction. One of the challenges in generating these models is accurately predicting the sparse class of accidents, as opposed to the abundant class of no accidents. To address this, we propose sample minimizing to balance the false negative and false positive rates and create a more accurate predictive model. Our results demonstrate, with a high degree of confidence, the potential for machine learning to improve mine safety and health by identifying hazardous conditions and mitigating the risk of accidents.
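One way to read "sample minimizing" is undersampling the abundant no-accident class before training the Random Forest, which trades some false positives for fewer missed accidents. The sketch below assumes that interpretation and uses synthetic stand-in features rather than the MSHA violation columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)

# Synthetic stand-ins for violation features (severity, history, etc.).
n_acc, n_none = 60, 940                       # sparse accident class
X_acc  = rng.normal(1.5, 1.0, size=(n_acc, 5))
X_none = rng.normal(0.0, 1.0, size=(n_none, 5))
X = np.vstack([X_acc, X_none])
y = np.array([1] * n_acc + [0] * n_none)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the abundant no-accident class so training is balanced.
maj = np.flatnonzero(y_tr == 0)
keep = rng.choice(maj, size=(y_tr == 1).sum(), replace=False)
sel = np.concatenate([np.flatnonzero(y_tr == 1), keep])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr[sel], y_tr[sel])
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```

The confusion matrix makes the false-negative/false-positive trade-off directly visible, which is the quantity the balancing step is tuned against.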

    AI in Education

    Artificial intelligence (AI) is changing the world as we know it. Recent advances are enabling people, companies, and governments to envision and experiment with new methods of interacting with computers and to modify how virtual and physical processes are carried out. One of the fields in which this transformation is taking place is education. After years of witnessing the incorporation of technological innovations into learning and teaching processes, we can now observe many new research works involving AI. Interest in this research area has also increased since the COVID-19 pandemic, which drove efforts to foster digital education. Recent research in this field includes AI applications that enhance educational experiences, studies of the interaction between AI and humans during learning, analyses of educational data (including with machine learning techniques), and proposals for new paradigms mediated by intelligent agents. This book, entitled “AI in Education”, aims to highlight recent research in the field of AI and education. The included works discuss new advances in methods, applications, and procedures to enhance educational processes via artificial intelligence and its subfields (machine learning, neural networks, deep learning, cognitive computing, natural language processing, computer vision, etc.).

    Assessing dimensions of the city’s reputation.

    In social psychology, reputation has been studied with reference to different objects (individuals, brands, cities, etc.) and measured methodologically by discerning between its subdimensions. In this article, city reputation is operationally defined using the validated City Reputation Indicators scale, an empirical tool for evaluating the separate dimensions of city reputation independently. Data obtained from a survey administered in the city of Naples were analysed using the classification tree, a non-parametric procedure widely used in supervised classification. We also used the Spearman rank correlation to assess the degree of association between overall citizen satisfaction and overall city reputation. The classification tree made it possible to identify the key path that best characterizes people who consider Naples a city with a good reputation. Results also show the main constituents of city reputation.
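The two analyses described, a classification tree whose path identifies respondents with a positive reputation judgment, plus a Spearman rank correlation between satisfaction and reputation, can be sketched as follows. The survey items and the decision rule are invented for illustration; they are not the validated scale's items or the Naples data.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 300
# Invented survey items scored 1-5, standing in for reputation dimensions.
services = rng.integers(1, 6, n)
safety   = rng.integers(1, 6, n)
culture  = rng.integers(1, 6, n)
# Synthetic ground truth: "good reputation" follows a simple rule.
good_rep = ((safety >= 4) & (services >= 3)).astype(int)

X = np.column_stack([services, safety, culture])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, good_rep)
# The printed tree exposes the key path leading to a positive judgment.
print(export_text(tree, feature_names=["services", "safety", "culture"]))

# Rank-based association between overall satisfaction and reputation.
satisfaction = services + safety + rng.integers(0, 3, n)
rho, p = spearmanr(satisfaction, good_rep)
print(round(rho, 2), p < 0.05)
```

Reading the exported tree top-down reproduces exactly the "key path" interpretation used in the article: each branch condition is one reputation dimension threshold.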

    THREE ESSAYS ON THE APPLICATION OF MACHINE LEARNING METHODS IN ECONOMICS

    Over the last decades, economics as a field has experienced a profound transformation from theoretical work toward an emphasis on empirical research (Hamermesh, 2013). One common constraint of empirical studies is access to data, the quality of the data, and the time span it covers. In general, applied studies rely on surveys or on administrative or private-sector data. These data are limited and rarely have universal or near-universal population coverage. The growth of the internet has made available a vast amount of digital information. These big digital data are generated through social networks, sensors, and online platforms and account for an increasing part of economic activity; yet for economists, their availability also raises many new challenges related to the techniques needed to collect, manage, and derive knowledge from them. The data are in general unstructured, complex, and voluminous, and the traditional software used for economic research is not always effective in dealing with them. Machine learning is a branch of computer science that uses statistics to deal with big data. The objective of this dissertation is to reconcile machine learning and economics. It uses three case studies to demonstrate how data freely available online can be harvested and used in economics. The dissertation uses web scraping to collect a large volume of unstructured data online, applies machine learning methods to derive information from the unstructured data, and shows how this information can be used to answer economic questions or address econometric issues. The first essay shows how machine learning can be used to derive sentiments from reviews and, using the sentiments as a measure of quality, examines an old economic theory: price competition in oligopolistic markets. The essay confirms the economic theory that agents compete on price. It also confirms that the quality measure derived from sentiment analysis of the reviews is a valid proxy for quality and influences price. The second essay uses a random forest algorithm to show that reviews can be harnessed to predict consumers' preferences. The third essay shows how property descriptions can be used to address a long-standing problem in hedonic pricing models: omitted variable bias. Using the Least Absolute Shrinkage and Selection Operator (LASSO), it shows that pricing errors in hedonic models can be reduced by including the description of the properties in the models.

    Advances in Computational Intelligence Applications in the Mining Industry

    This book captures advancements in the applications of computational intelligence (artificial intelligence, machine learning, etc.) to problems in the mineral and mining industries. The papers present the state of the art in four broad categories: mine operations, mine planning, mine safety, and advances in the sciences, primarily in image-processing applications. The authors include both researchers and industry practitioners.

    Ruumiandmete harmoniseerimine ja masinõpe veekvaliteedi modelleerimiseks (Spatial data harmonization and machine learning for water quality modeling)

    The electronic version of this thesis does not include the publications. The state of freshwater quality continues to deteriorate worldwide due to agricultural pollution. Water quality modeling can support more effective management of water resources, but large-scale water quality models depend on input datasets with good spatial coverage. The aim of the thesis was to improve and harmonize datasets for water quality modeling purposes and to create a machine learning framework for national-scale modeling. We created EstSoil-EH, a new numerical soil database for Estonia, by converting the text-based soil properties in the Estonian Soil Map to machine-readable values. We used it to predict soil organic carbon content with the random forest machine learning method and found that the environmental conditions of the sampling locations affected prediction accuracy. We improved the global coverage of water quality data by producing the Global River Water Quality Archive (GRQA), compiled from five existing large-scale datasets. The compilation involved harmonizing the corresponding metadata, flagging outliers, calculating time series characteristics, and detecting duplicate observations. Based on lessons learned from predicting soil carbon content, we developed a framework suitable for national-scale water quality modeling. We used 82 environmental variables, including soil properties from EstSoil-EH, as features to predict nutrient concentrations in 242 Estonian river catchments. The resulting models achieved accuracy comparable to models previously used in the Baltic region. We found that catchment size influenced accuracy, since predictions were generally less accurate in smaller catchments. The models maintained reasonable accuracy even when the number of features was reduced by half, which shows that the relevance of features matters more than their number. This flexibility makes our models applicable in areas that otherwise lack the input data needed for extracting features.
    https://www.ester.ee/record=b552067
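The modeling pattern described, random forest regression on catchment features, with accuracy holding up after halving the feature set, can be sketched as follows. The data are synthetic stand-ins: the real framework uses 82 environmental variables over 242 catchments, which are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 400, 20                        # "catchments" x "environmental variables"
X = rng.normal(size=(n, p))
# Nutrient concentration driven by a handful of informative features
# (stand-ins for e.g. land use and soil properties), plus noise.
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
full = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Keep only the top half of features by importance and refit.
top = np.argsort(full.feature_importances_)[-p // 2:]
half = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    X_tr[:, top], y_tr)

print(round(full.score(X_te, y_te), 2),
      round(half.score(X_te, y_te[:0].size and 0 or None) if False else
            half.score(X_te[:, top], y_te), 2))
```

When most of the predictive signal is carried by a few relevant features, the halved model matches the full one, which mirrors the thesis's finding that feature relevance matters more than feature count.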