72 research outputs found

    S3Mining: A model-driven engineering approach for supporting novice data miners in selecting suitable classifiers

    Data mining has proven very useful for extracting information from data in many different contexts. However, due to the complexity of data mining techniques, the know-how of an expert in this field is required to select and use them. In practice, adequately applying data mining is out of the reach of novice users who have expertise in their own area of work but lack the skills to employ these techniques. In this paper, we use both model-driven engineering and scientific workflow standards and tools to develop the S3Mining framework, which supports novice users in selecting the data mining classification algorithm that best fits their data and goal. To this aim, the selection process uses the past experiences of expert data miners in applying classification techniques to their own datasets. The contributions of our S3Mining framework are as follows: (i) an approach to create a knowledge base which stores the past experiences of expert users, (ii) a process that provides expert users with utilities for the construction of classifier recommenders based on the existing knowledge base, (iii) a system that allows novice data miners to use these recommenders to discover the classifiers that best fit the problem at hand, and (iv) a public implementation of the framework's workflows. Finally, an experimental evaluation has been conducted to show the feasibility of our framework.
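    The recommendation idea described above can be sketched as a simple meta-learning lookup. This is an illustrative sketch only, not S3Mining's actual algorithm: the meta-features, knowledge-base entries, and the 1-nearest-neighbour matching rule are invented assumptions.

```python
import math

# Hypothetical knowledge base of expert experiences: dataset meta-features
# mapped to the classifier that performed best on that dataset (invented data).
KNOWLEDGE_BASE = [
    ({"n_instances": 150,   "n_attributes": 4,   "class_entropy": 1.58}, "decision_tree"),
    ({"n_instances": 60000, "n_attributes": 784, "class_entropy": 3.32}, "neural_network"),
    ({"n_instances": 800,   "n_attributes": 20,  "class_entropy": 0.99}, "naive_bayes"),
]

def recommend_classifier(meta, kb=KNOWLEDGE_BASE):
    """Return the classifier used on the most similar past dataset
    (1-nearest neighbour over log-scaled meta-features)."""
    def distance(a, b):
        # log-scale the features so large counts do not dominate the metric
        return math.sqrt(sum(
            (math.log1p(a[k]) - math.log1p(b[k])) ** 2 for k in a
        ))
    _, best = min(kb, key=lambda entry: distance(meta, entry[0]))
    return best
```

    A novice user would only supply the meta-features of their own dataset, e.g. `recommend_classifier({"n_instances": 200, "n_attributes": 5, "class_entropy": 1.5})`.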

    A Deep-dive into Cryptojacking Malware: From an Empirical Analysis to a Detection Method for Computationally Weak Devices

    Cryptojacking is the act of using a victim's computation power without his/her consent. Unauthorized mining incurs extra electricity consumption and dramatically decreases the victim host's computational efficiency. In this thesis, we perform extensive research on cryptojacking malware from every aspect. First, we present a systematic overview of cryptojacking malware based on information obtained from a combination of academic research papers, two large datasets of cryptojacking samples, and numerous major attack instances. Second, we create a dataset of 6269 websites containing cryptomining scripts in their source code to characterize the in-browser cryptomining ecosystem by differentiating permissioned and permissionless cryptomining samples. Third, we introduce an accurate and efficient IoT cryptojacking detection mechanism based on network traffic features that achieves an accuracy of 99%. Finally, we believe this thesis will greatly expand the scope of research and facilitate other novel solutions in the cryptojacking domain.
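    The intuition behind network-traffic-based detection can be illustrated with a toy flow-level heuristic. The thesis trains a machine-learning classifier on traffic features; the features and thresholds below are invented stand-ins. Mining traffic tends to show long-lived connections with small, highly regular packets (pool keep-alives and share submissions).

```python
from statistics import mean, pstdev

def flow_features(packet_times, packet_sizes):
    """Derive simple flow-level features from packet timestamps (s) and sizes (bytes)."""
    gaps = [b - a for a, b in zip(packet_times, packet_times[1:])]
    return {
        "mean_gap": mean(gaps),
        "gap_cv": pstdev(gaps) / mean(gaps),  # coefficient of variation: low = periodic
        "mean_size": mean(packet_sizes),
        "duration": packet_times[-1] - packet_times[0],
    }

def looks_like_mining(f):
    """Flag flows that are long-lived, periodic, and small-packet (invented thresholds)."""
    return f["duration"] > 300 and f["gap_cv"] < 0.2 and f["mean_size"] < 200

# Example: a flow sending one 90-byte packet every 5 seconds for ~8 minutes
mining_flow = flow_features([i * 5.0 for i in range(100)], [90] * 100)
```

    A real detector would feed such features into a trained classifier rather than fixed thresholds.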

    Large Scale Data Mining for IT Service Management

    More than ever, businesses heavily rely on IT service delivery to meet their current and frequently changing business requirements. Optimizing the quality of service delivery improves customer satisfaction and continues to be a critical driver for business growth. The routine maintenance procedure plays a key role in IT service management, which typically involves problem detection, determination, and resolution for the service infrastructure. Many IT service providers adopt partial automation for incident diagnosis and resolution, where the operations of the system administrators and the automation are intertwined. Often the system administrators' role is limited to triaging tickets to the processing teams for problem resolution. The processing teams are responsible for performing a complex root cause analysis based on the system statistics, event, and ticket data. The large scale of system statistics, event, and ticket data aggravates the burden of problem diagnosis on both the system administrators and the processing teams during routine maintenance procedures. Alleviating the human effort involved in IT service management dictates intelligent and efficient solutions to maximize the automation of routine maintenance procedures. Three research directions are identified and considered helpful for IT service management optimization: (1) automatically determine problem categories according to the symptom description in a ticket; (2) intelligently discover interesting temporal patterns from system events; (3) instantly identify temporal dependencies among system performance statistics data. Provided with ticket, event, and system performance statistics data, the three directions can be effectively addressed with a data-driven solution, improving the quality of IT service delivery in an efficient and effective way. The dissertation addresses the research topics outlined above.
Concretely, we design and develop data-driven solutions to help system administrators better manage the system and alleviate the human effort involved in IT service management, including (1) a knowledge-guided hierarchical multi-label classification method for IT problem category determination based on both the symptom description in a ticket and the domain knowledge from the system administrators; (2) an efficient expectation-maximization approach for temporal event pattern discovery based on a parametric model; and (3) an online inference approach for discovering time-varying temporal dependencies from large-scale time series data.
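    Temporal dependency discovery between two performance metrics can be illustrated with a basic lagged-correlation scan; this is a simple stand-in, not the dissertation's online inference method, and the guard values are illustrative.

```python
from statistics import mean, pstdev

def lagged_correlation(x, y, max_lag):
    """Return the lag L (0..max_lag) at which x best predicts y, i.e. where
    Pearson correlation between x[t] and y[t+L] is strongest in magnitude."""
    def corr(a, b):
        ma, mb, sa, sb = mean(a), mean(b), pstdev(a), pstdev(b)
        if sa == 0 or sb == 0:          # constant series carry no signal
            return 0.0
        return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) * sa * sb)
    return max(range(max_lag + 1),
               key=lambda lag: abs(corr(x[:len(x) - lag], y[lag:])))
```

    For example, if a CPU metric spikes three samples before an I/O metric, the function recovers the lag of 3, which is exactly the kind of dependency useful for root cause analysis.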

    Application-Driven Big Data Mining

    We argue that the core and essence of big data mining is the tight integration of four elements: applications, data, algorithms, and platforms. Starting from the characteristics of big data and drawing on big data mining case studies, we propose that the platform architecture, the data acquisition and preprocessing, and the selection and integration of algorithms in big data mining are all application-driven. We emphasize that the goals of big data mining come from the real needs of practical applications; only by combining application-specific data with algorithms suited to the application, leveraging the support of an efficient processing platform, and applying the mined patterns or knowledge in practice can the true value of big data mining be realized.

    Intelligent Data Mining Techniques for Automatic Service Management

    Today, as more and more industries enter the artificial intelligence era, business enterprises constantly explore innovative ways to expand their outreach and fulfill the high requirements of customers, with the purpose of gaining a competitive advantage in the marketplace. However, the success of a business relies heavily on its IT services. Value-creating activities of a business cannot be accomplished without solid and continuous delivery of IT services, especially in an increasingly intricate and specialized world. Driven by both the growing complexity of IT environments and rapidly changing business needs, service providers are urgently seeking intelligent data mining and machine learning techniques to build a cognitive "brain" in IT service management, capable of automatically understanding, reasoning, and learning from operational data collected from human engineers and virtual engineers during IT service maintenance. The ultimate goal of IT service management optimization is to maximize the automation of IT routine procedures such as problem detection, determination, and resolution. However, fully automating the entire IT routine procedure without any human intervention is still a challenging task. In real IT systems, both step-wise resolution descriptions and scripted resolutions are often logged with their corresponding problematic incidents, and they typically contain abundant valuable human domain knowledge. Hence, modeling, gathering, and utilizing the domain knowledge from IT system maintenance logs play an extremely crucial role in IT service management optimization.
To optimize IT service management from the perspective of intelligent data mining techniques, three research directions are identified and considered greatly helpful for automatic service management: (1) efficiently extract and organize the domain knowledge from IT system maintenance logs; (2) collect and update the existing domain knowledge online by interactively recommending possible resolutions; (3) automatically discover the latent relations among scripted resolutions and intelligently suggest proper scripted resolutions for IT problems. My dissertation addresses the challenges mentioned above by designing and implementing a set of intelligent data-driven solutions, including (1) constructing a domain knowledge base for problem resolution inference; (2) recommending resolutions online in light of the explicit hierarchical resolution categories provided by domain experts; and (3) interactively recommending resolutions using the latent resolution relations learned through a collaborative filtering model.
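    The collaborative-filtering idea can be sketched as factorising a ticket-resolution relevance matrix into latent factors with stochastic gradient descent. This is a minimal sketch under invented data and dimensions, not the dissertation's exact model.

```python
import random

def factorise(ratings, n_tickets, n_res, k=2, steps=2000, lr=0.05, reg=0.02):
    """Learn latent factors P (tickets) and Q (resolutions) from observed
    (ticket, resolution, relevance) triples via SGD with L2 regularisation."""
    random.seed(0)
    P = [[random.random() for _ in range(k)] for _ in range(n_tickets)]
    Q = [[random.random() for _ in range(k)] for _ in range(n_res)]
    for _ in range(steps):
        for t, r, val in ratings:
            err = val - sum(P[t][f] * Q[r][f] for f in range(k))
            for f in range(k):
                p, q = P[t][f], Q[r][f]       # update both with the old values
                P[t][f] += lr * (err * q - reg * p)
                Q[r][f] += lr * (err * p - reg * q)
    return P, Q

# Invented observations: resolution 0 fixed tickets 0 and 1; resolution 1 fixed tickets 1 and 2
ratings = [(0, 0, 1.0), (1, 0, 1.0), (1, 1, 1.0), (2, 1, 1.0)]
P, Q = factorise(ratings, n_tickets=3, n_res=2)
score = lambda t, r: sum(P[t][f] * Q[r][f] for f in range(2))
```

    Unobserved ticket-resolution pairs can then be ranked by `score`, which is how latent relations among resolutions surface as recommendations.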

    Geospatial Data Indexing Analysis and Visualization via Web Services with Autonomic Resource Management

    With the exponential growth of web-based map services, web GIS applications have become more and more popular. Spatial data indexing, search, analysis, visualization, and the resource management of such services are increasingly important for delivering the user-desired quality of service. First, spatial indexing is typically time-consuming and is not available to end users. To address this, we introduce TerraFly sksOpen, an open-source online indexing and querying system for big geospatial data. Integrated with the TerraFly geospatial database [1-9], sksOpen is an efficient indexing and query engine for processing Top-k Spatial Boolean Queries. Further, we provide ergonomic visualization of query results on interactive maps to facilitate the user's data analysis. Second, due to the highly complex and dynamic nature of GIS systems, it is quite challenging for end users to quickly understand and analyze spatial data and to efficiently share their own data and analysis results with others. Built on the TerraFly geospatial database, TerraFly GeoCloud is an extra layer running upon the TerraFly map that can efficiently support many different visualization functions and spatial data analysis models. Furthermore, users can create unique URLs to visualize and share the analysis results. TerraFly GeoCloud also provides the MapQL technology to customize map visualization using SQL-like statements [10]. Third, map systems often serve dynamic web workloads and involve multiple CPU- and I/O-intensive tiers, which makes it challenging to meet the response time targets of map requests while using resources efficiently. Virtualization facilitates the deployment of web map services and improves their resource utilization through encapsulation and consolidation. Autonomic resource management allows resources to be automatically provisioned to a map service and its internal tiers on demand.
v-TerraFly is a set of techniques to predict the demand of map workloads online and optimize resource allocations, considering both response time and data freshness as the QoS targets. The proposed v-TerraFly system is prototyped on TerraFly, a production web map service, and evaluated using real TerraFly workloads. The results show that v-TerraFly can predict workload demands 18.91% more accurately than traditional methods and efficiently allocate resources to meet the QoS target, improving QoS by 26.19% and saving 20.83% in resource usage compared to traditional peak-load-based resource allocation.
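    The semantics of a Top-k Spatial Boolean Query can be sketched in a few lines: return the k objects nearest a query point whose keyword sets satisfy a boolean predicate. The object data and parameter names are invented; a production engine like sksOpen prunes candidates through a spatial index rather than scanning, but the result contract is the same.

```python
import heapq
import math

def topk_spatial_boolean(objects, q, k, must=(), must_not=()):
    """objects: iterable of (name, (x, y), keyword_set).
    Returns names of the k nearest objects whose keywords contain every
    term in `must` and none of the terms in `must_not`."""
    def satisfies(kw):
        return all(m in kw for m in must) and not any(m in kw for m in must_not)
    def dist(p):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    matches = [(dist(loc), name) for name, loc, kw in objects if satisfies(kw)]
    return [name for _, name in heapq.nsmallest(k, matches)]

objects = [
    ("cafe_a", (1, 1), {"coffee", "wifi"}),
    ("cafe_b", (5, 5), {"coffee"}),
    ("bar_c",  (0, 1), {"beer", "wifi"}),
]
```

    For instance, `topk_spatial_boolean(objects, (0, 0), 2, must={"coffee"})` ranks the two coffee-serving objects by distance from the origin.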

    Relevance is in the Eye of the Beholder: Design Principles for the Extraction of Context-Aware Information

    Since the 1970s, many approaches to representing domains have been suggested. Each approach maintains the assumption that the information about the objects represented in the Information System (IS) is specified and verified by domain experts and potential users. Yet, as more ISs are developed to support a larger diversity of users, such as customers, suppliers, and members of the general public (as in many multi-user online systems), analysts can no longer rely on a stable single group of people for a complete specification of domains, to the extent that prior research has questioned the efficacy of conceptual modeling in these heterogeneous settings. We formulated principles for identifying basic classes in a domain. These classes can guide conceptual modeling, database design, and user interface development in a wide variety of traditional and emergent domains. Moreover, we used a case study of a large foster organization to study how unstructured data entry practices result in differences in how information is collected across organizational units. We used institutional theory to show how institutional elements enacted by individuals can generate new practices that can be adopted over time as best practices. We analyzed free-text notes to prioritize potential cases of psychotropic drug use, our tactical need. We showed that too much flexibility in how data can be entered into the system results in different styles, which tend to be homogeneous within organizational units but not across them. Theories in psychology help explain the implications of the level of specificity and the inferential utility of the text encoded in the unstructured notes.

    Temporal Mining for Distributed Systems

    Many systems and applications continuously produce events. These events record the status of a system and trace its behaviors. By examining these events, system administrators can check for potential problems in the system. If the temporal dynamics of the system are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict future system behaviors or to mitigate potential risks. Moreover, system administrators can utilize the temporal patterns to set up event management rules that make the system more intelligent. With the popularity of data mining techniques in recent years, these events have gradually become more and more useful. Despite recent advances in data mining techniques, their application to system event mining is still in a rudimentary stage. Most existing works focus on episode mining or frequent pattern discovery. These methods are unable to provide a brief yet comprehensible summary that reveals the valuable information from a high-level perspective. Moreover, they provide little actionable knowledge to help system administrators better manage the systems. To make better use of the recorded events, more practical techniques are required. From the perspective of data mining, three correlated directions are considered helpful for system management: (1) provide concise yet comprehensive summaries of the running status of the systems; (2) make the systems more intelligent and autonomous; (3) effectively detect the abnormal behaviors of the systems. Thanks to the richness of the event logs, all these directions can be pursued in a data-driven manner. In this way, the robustness of the systems can be enhanced and the goal of autonomous management can be approached.
This dissertation mainly focuses on the foregoing directions, leveraging temporal mining techniques to facilitate system management. More specifically, three concrete topics are discussed: event summarization, resource demand prediction, and streaming anomaly detection. Besides the theoretical contributions, experimental evaluations are also presented to demonstrate the effectiveness and efficiency of the corresponding solutions.
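    One of the simplest forms of streaming anomaly detection is a rolling z-score over a sliding window; the sketch below illustrates the general idea only (window size, warm-up, and threshold are invented, and the dissertation's actual method is more sophisticated).

```python
from collections import deque
from statistics import mean, pstdev

class StreamingAnomalyDetector:
    """Flag a point if it deviates more than `threshold` standard deviations
    from the mean of the last `window` observations."""
    def __init__(self, window=20, threshold=3.0):
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x):
        anomalous = False
        if len(self.buf) >= 5:                 # warm-up before scoring
            mu, sigma = mean(self.buf), pstdev(self.buf)
            anomalous = sigma > 0 and abs(x - mu) > self.threshold * sigma
        self.buf.append(x)                     # the window then absorbs x
        return anomalous
```

    Because state is one bounded deque per metric, the detector processes each event in O(window) time, which is what makes it suitable for streams.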

    Data Mining Techniques to Understand Textual Data

    More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task covering broad research topics, and it contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, conversational bots, and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks derived from two domains, i.e., disaster management and IT service management, that mainly utilize textual data as an information carrier. Improving situation awareness in disaster management and alleviating the human effort involved in IT service management dictate more intelligent and efficient solutions for understanding the textual data that acts as the main information carrier in the two domains. From the perspective of data mining, four directions are identified: (1) intelligently generate a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommend resolutions according to the textual symptom description in a ticket; (3) gradually adapt the resolution recommendation system to time-correlated features derived from text; (4) efficiently learn distributed representations for short and noisy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in these four research directions successfully address our tasks of understanding and extracting valuable knowledge from the textual data. My dissertation addresses the research topics outlined above.
Concretely, I focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying temporally correlated features derived from text; and (4) a deep neural ranking model that not only successfully recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
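    The core of ticket resolution recommendation, matching a new symptom description against past tickets, can be sketched with a bag-of-words TF-IDF ranker. The dissertation uses a deep neural ranking model; this sketch, with invented example tickets, is only a stand-in for the retrieval step.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vector (dict word -> weight) for each whitespace-tokenised doc."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenised for w in set(toks))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenised]

def recommend(new_ticket, past_tickets, resolutions):
    """Return the resolution of the past ticket most similar to the new one."""
    vecs = tfidf_vectors(past_tickets + [new_ticket])
    q, past = vecs[-1], vecs[:-1]
    def cosine(a, b):
        num = sum(a[w] * b.get(w, 0.0) for w in a)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values()))) or 1.0
        return num / den
    best = max(range(len(past)), key=lambda i: cosine(q, past[i]))
    return resolutions[best]
```

    A neural ranker replaces the sparse TF-IDF vectors with learned dense representations, which is precisely what helps on short and noisy ticket text.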

    Advancing the use of geographic information systems, numerical and physical models for the planning of managed aquifer recharge schemes

    Global change is a major threat to local groundwater resources. Climate change and population growth are factors that directly or indirectly augment the increasing uptake of groundwater resources. To counterbalance the pressure on aquifers, managed aquifer recharge (MAR) schemes are increasingly being implemented. They enable the subsurface storage of surplus water for times of high demand. The complexity of MAR schemes makes their planning and implementation multifaceted and requires a comprehensive assessment of the local hydrogeological and hydrogeochemical conditions. Although MAR is a widely used technique, its implementation is not well regulated, and comprehensive planning and design guidelines are rare. The use of supporting tools, such as numerical and physical models or geographic information systems (GIS), is rising for MAR planning, but their scope and requirements for application are rarely reflected in the available MAR guidelines. To depict the application potential and the advantages and disadvantages of these tools for surface-infiltration MAR planning, this thesis comprises reviews of their past use as well as suggestions to improve their applicability for MAR planning. GIS is not mentioned as a planning tool by most MAR guidelines even though it is increasingly being used for MAR mapping. Through a review of GIS-based MAR suitability studies, this thesis shows that the MAR mapping process could be standardized by using the often-applied approach of constraint mapping, suitability mapping with pairwise comparison for weight assignment and weighted linear combination as the decision rule, and a subsequent sensitivity analysis. Standardizing the methodology would increase the reliability and comparability of MAR maps due to the common methodological approach. Thus, the proposed standard methodology was incorporated into a web GIS that simplifies MAR mapping through a pre-defined workflow.
Numerical models are widely used for the assessment of MAR schemes and are included in some MAR planning guidelines. However, only a few studies were found that utilized vadose zone models for the planning and design of MAR schemes. In this thesis, a review and a subsequent case study highlight that numerical modelling has many assets, such as monitoring network design or infiltration scenario planning, that make its use during the MAR planning phase worthwhile. Consequently, this study advocates the use of vadose zone models for MAR planning by showing their potential areas of application as well as the uncertainties that need to be considered carefully during modelling. Physical models used for MAR planning are typically field or pilot sites, as some MAR legislation requires pilot sites as part of the preliminary assessment. Laboratory experiments are used less often and are mostly restricted to the analysis of very specific issues, such as clogging. This thesis takes on the issue of scaling laboratory results to the field scale by comparing results from three physical models of different scales and dimensionality. The results indicate that preferential flow paths, air entrapment, and boundary influence limit the quantitative validity of laboratory experiments. The use of 3D tanks instead of 1D soil columns and the application of statistical indicators are means to increase the representativeness of laboratory measurements. Nevertheless, physical models have the potential to improve MAR planning in terms of detailed process assessment, scenario analyses, and sensitivity analyses. All tools discussed in this thesis have their merits for MAR scheme planning and should be advocated better in MAR guidelines by depicting their application potential, advantages, and disadvantages.
The information accumulated in this thesis is a step towards an advanced use of supporting tools for the planning and design of MAR schemes. Contents: 1 Introduction (1.1 Motivation; 1.2 Objectives; 1.3 Structure of the thesis); 2 Status quo of the planning process of MAR schemes (2.1 Guidance documents on general MAR planning; 2.2 Application of GIS, numerical and physical models for MAR planning; 2.3 Planning of surface infiltration schemes); 3 Using GIS for the planning of MAR schemes (3.1 Implications from GIS-MCDA studies for MAR mapping; 3.2 Development of web tools for MAR suitability mapping); 4 Using numerical models for the planning of MAR schemes (4.1 Review on the use of numerical models for the design and optimization of MAR schemes; 4.2 Planning a small-scale MAR scheme through vadose zone modelling); 5 Using physical models for the planning of MAR schemes (5.1 Design of the experimental study; 5.2 Comparison of three different physical models for MAR planning); 6 Discussion and research perspectives; 7 Bibliography; 8 Appendix
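    The standardized mapping workflow described above, pairwise comparison for weight assignment followed by weighted linear combination as the decision rule, can be sketched as follows. The criteria and the pairwise-comparison values are invented; the weight computation uses the common geometric-mean approximation of the AHP priority vector.

```python
import math

def ahp_weights(pairwise):
    """Approximate the AHP priority vector: geometric mean of each row of the
    pairwise-comparison matrix, normalised to sum to 1."""
    gm = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]

def weighted_linear_combination(weights, criteria_scores):
    """Suitability score of one map cell from its per-criterion scores (0-1)."""
    return sum(w * s for w, s in zip(weights, criteria_scores))

# Invented criteria: slope, aquifer thickness, distance to water source.
# Row i, column j holds how much more important criterion i is than j.
pairwise = [
    [1,     3,   5],   # slope judged 3x as important as thickness, 5x as distance
    [1 / 3, 1,   2],
    [1 / 5, 1 / 2, 1],
]
w = ahp_weights(pairwise)
suitability = weighted_linear_combination(w, [0.8, 0.5, 0.9])
```

    In a GIS, the combination step runs per raster cell over normalised criterion layers, and a sensitivity analysis would then vary the weights to test the stability of the resulting MAR suitability map.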