
    Dynamic Data Mining: Methodology and Algorithms

    Supervised data stream mining has become an important and challenging data mining task in modern organizations. The key challenges are threefold: (1) a potentially unbounded number of streaming examples under time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions. To address these three challenges, this thesis proposes the novel dynamic data mining (DDM) methodology, which applies supervised ensemble models to data stream mining. DDM can be loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired by the idea that although the underlying concepts in a data stream are time-varying, their distinctions can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in order to classify incoming examples of similar concepts.
    First, following the general paradigm of DDM, we examine the different concept-drifting stream mining scenarios and propose corresponding effective and efficient data mining algorithms.
    • To address concept drift caused merely by changes in variable distributions, which we term pseudo concept drift, base models built on categorized streaming data are organized and selected in line with their corresponding variable distribution characteristics.
    • To address concept drift caused by changes in the joint distribution of variables and class, which we term true concept drift, an effective data categorization scheme is introduced. A group of working models is dynamically organized and selected to react to the drifting concept.
    Second, we introduce an integration stream mining framework that makes the paradigm advocated by DDM widely applicable to other stream mining problems. With it, we are able to readily introduce six effective algorithms for mining data streams with skewed class distributions. In addition, we introduce a new ensemble model approach for batch learning that follows the same methodology. Both theoretical and empirical studies demonstrate its effectiveness. Future work will target improving the effectiveness and efficiency of the proposed algorithms. Meanwhile, we will explore the possibility of using the integration framework to solve other open stream mining research problems.
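The selection idea described for pseudo concept drift (picking the base model whose training data most resembles the incoming example) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the class name `ConceptModelPool`, the use of a per-chunk mean vector as the "variable distribution characteristic", and the squared Euclidean distance are all assumptions made for illustration.

```python
class ConceptModelPool:
    """Hypothetical pool of base models, each tagged with the centroid
    of the data chunk it was trained on (an assumed stand-in for the
    chunk's variable distribution characteristics)."""

    def __init__(self):
        self.entries = []  # list of (centroid, model) pairs

    def add(self, chunk, model):
        # Summarize the chunk by the mean of each feature.
        dims = len(chunk[0])
        centroid = tuple(sum(x[d] for x in chunk) / len(chunk) for d in range(dims))
        self.entries.append((centroid, model))

    def select(self, x):
        # Pick the model whose centroid is closest to x (squared Euclidean distance).
        def sq_dist(centroid):
            return sum((c - v) ** 2 for c, v in zip(centroid, x))
        return min(self.entries, key=lambda entry: sq_dist(entry[0]))[1]

    def predict(self, x):
        return self.select(x)(x)


# Toy usage: two "models" that are constant functions, trained on
# chunks drawn from clearly different regions of the feature space.
pool = ConceptModelPool()
pool.add([(0.0, 0.1), (0.2, 0.0)], lambda x: "A")
pool.add([(5.0, 5.1), (4.8, 5.0)], lambda x: "B")

near_low = pool.predict((0.1, 0.1))
near_high = pool.predict((4.9, 5.2))
```

A real system would need a richer distribution summary than a single mean vector, and true concept drift (where the class relationship itself changes) cannot be detected from feature distributions alone.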

    Economic baselines for current underground coal mining technology

    The cost of mining coal was calculated for a room-and-pillar method using a continuous miner and for a longwall mining system. Costs were calculated for the years 1975 and 2000 and are to be used as economic standards against which advanced mining concepts and systems will be compared. Some assumptions were changed, and some internally stored model data were altered from the original calculation procedure, to obtain a result that more closely represented what was considered to be a standard mine. Coal seam thicknesses were varied from one and one-half feet to eight feet to obtain the cost of mining coal over a wide range. Geologic conditions were selected to have minimal impact on mining productivity.

    The effect of student self-described learning styles within two models of teaching in an introductory data mining course

    This dissertation examines the roles of learning styles and teaching methodologies within a data mining educational program designed for non-Computer Science undergraduate college students. The experimental design is framed by a discussion of the history and development of data mining and education, as well as a vision for its future.
    Data mining is a relatively new discipline which has grown out of the fields of database management and data warehousing, statistics, logic, and decision sciences. Over the course of its approximately 15-year history, data mining has emerged from its genesis within the academic and commercial research and development arenas to become a widely accepted and utilized method of exploratory data analysis for management, strategic planning and decision support. Over the first several years of its development, data mining remained the province of computer scientists and professional statisticians at large corporations and research universities around the world. Beginning in about 1989, these data mining pioneers developed many of data mining's standards and methodologies on large datasets using mainframe computing systems. Throughout the 1990s, as both the hardware and software tools required for data mining became increasingly accessible, powerful and affordable, the pool of potential data miners expanded rapidly. Today, even individuals and small businesses can exploit the power of data mining using freely available open source software packages capable of running on personal computers.
    During the growth and development of data mining methodologies, however, little research has been dedicated specifically to the pedagogical approaches used in teaching data mining. Educational programs that have evolved have largely remained within Computer Science departments and have often targeted graduate students as an audience. This dissertation seeks to examine the possibility of successfully teaching data mining concepts and techniques to a non-Computer Science undergraduate audience. The study approached this research question by delivering a lesson on the data mining topic of Association Rules to 86 participants who are representative of the target audience. These participants were randomly assigned to receive the Association Rules lesson through either a Direct Instruction or a Concept Attainment teaching approach. The students completed Kolb's Learning Styles Inventory, participated in the data mining lesson, and then completed a quiz on the concepts and techniques of Association Rules. A t-test was used to determine if significant differences existed between the scores generated under the two teaching models, and an ANOVA was conducted to identify significant differences between the four learning style groups from Kolb's instrument. In addition to these two statistical tests, the data were also mined using Association Rules and Decision Tree methods.
    In both statistical tests, we failed to reject the null hypothesis, finding no significant differences in quiz scores between the two teaching models or among the four learning style groups. Further investigation into the differences among learning styles within teaching models, however, did reveal that the Assimilator learning style students who received their instruction via Direct Instruction scored significantly higher on the quiz than did their learning style counterparts who received the lesson via Concept Attainment. This finding suggests that although we cannot rely solely on one instructional approach as consistently more effective than the other, there may be instances where the correct instructional choice will benefit learners with certain learning styles. The results of the data mining activities also support this assertion. Association Rules mining yielded no strong relationships between teaching models, learning styles and quiz scores, but Decision Tree mining did reveal a similar pattern of higher scores earned by Assimilator learners within Direct Instruction.
    The findings of this study show that effectively teaching data mining concepts to undergraduate non-Computer Science students will not be as simple as choosing one teaching methodology over another or targeting a specific learning style group. Rather, designing instructional activities using teaching methodologies which closely align with predominant learning styles in a classroom should prove more effective. Perhaps the most significant finding of the study is that elementary data mining concepts and techniques can be effectively taught to the target audience. Finally, we recommend that additional teaching methodologies and perhaps different learning style assessments be tested in the same way as those selected for this study.

    The Impact of Directionality in Predications on Text Mining

    The number of publications in biomedicine is increasing enormously each year. To help researchers digest the information in these documents, text mining tools are being developed that present co-occurrence relations between concepts. Statistical measures are used to mine interesting subsets of relations. We demonstrate how the directionality of these relations affects interestingness. Support and confidence, two simple data mining statistics, are used as proxies for interestingness metrics. We first built a test bed of 126,404 directional relations extracted from biomedical abstracts, which we represent as graphs containing a central starting concept and two rings of associated relations. We manipulated directionality in four ways and randomly selected 100 starting concepts as a test sample for each graph type. Finally, we calculated the number of relations and their support and confidence. Variation in directionality significantly affected the number of relations as well as the support and confidence of the four graph types.
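To make the role of directionality concrete, here is a minimal sketch of support and confidence computed over directed relations. The definitions used (support as the share of all extracted relations, confidence as the share of relations leaving the same source concept) are the common association-rule conventions and are assumed here; the abstract does not state the exact formulas, and the concept names are invented toy data.

```python
from collections import Counter


def support_confidence(relations):
    """Return {(source, target): (support, confidence)} for directed relations.

    Assumed definitions (not taken from the paper):
      support    = count(source -> target) / total number of relations
      confidence = count(source -> target) / count(relations leaving source)
    """
    total = len(relations)
    pair_counts = Counter(relations)
    source_counts = Counter(source for source, _ in relations)
    return {
        (source, target): (n / total, n / source_counts[source])
        for (source, target), n in pair_counts.items()
    }


# Invented toy relations, one tuple per extracted occurrence.
relations = [
    ("aspirin", "headache"),
    ("aspirin", "headache"),
    ("aspirin", "fever"),
    ("aspirin", "fever"),
    ("ibuprofen", "headache"),
]

forward = support_confidence(relations)
# Reversing every edge changes which concept acts as the source.
backward = support_confidence([(target, source) for source, target in relations])

fwd_support, fwd_conf = forward[("aspirin", "headache")]
bwd_support, bwd_conf = backward[("headache", "aspirin")]
```

In this toy data, reversing the edges leaves support unchanged but shifts confidence, because the denominator depends on which concept is treated as the source.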

    Collaborative data stream mining in ubiquitous environments using dynamic classifier selection

    In ubiquitous data stream mining applications, different devices often aim to learn concepts that are similar to some extent. In these applications, such as spam filtering or news recommendation, the concept underlying the data stream (e.g., interesting mail/news) is likely to change over time. Therefore, the resulting model must be continuously adapted to such changes. This paper presents a novel Collaborative Data Stream Mining (Coll-Stream) approach that exploits similarities in the knowledge available from other devices to improve local classification accuracy. Coll-Stream integrates the community knowledge using an ensemble method in which the classifiers are selected and weighted based on their local accuracy for different partitions of the feature space. We evaluate the classification accuracy of Coll-Stream under varying concept drift, noise, partition granularity, and concept similarity in relation to the local underlying concept. The experimental results show that the resulting Coll-Stream model achieves stability and accuracy in a variety of situations using both synthetic and real-world datasets.
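The weighting scheme described above (classifiers weighted by their local accuracy within partitions of the feature space) can be sketched roughly as follows. This is an illustrative approximation, not the Coll-Stream implementation: the class name `LocalAccuracyEnsemble`, the callable-based classifier interface, and the uniform fallback weight for unseen regions are all assumptions.

```python
from collections import defaultdict


class LocalAccuracyEnsemble:
    """Hypothetical ensemble that weights each classifier by its
    observed accuracy inside the feature-space region of the query."""

    def __init__(self, classifiers, partition):
        self.classifiers = classifiers   # callables: x -> label
        self.partition = partition       # callable: x -> hashable region id
        self.correct = defaultdict(int)  # (classifier index, region) -> hits
        self.seen = defaultdict(int)     # region -> labelled examples seen

    def update(self, x, y):
        # Record, per region, which classifiers got the labelled example right.
        region = self.partition(x)
        self.seen[region] += 1
        for i, clf in enumerate(self.classifiers):
            if clf(x) == y:
                self.correct[(i, region)] += 1

    def weight(self, i, region):
        n = self.seen.get(region, 0)
        if n == 0:
            # Assumed fallback: uniform weight when the region is unseen.
            return 1.0 / len(self.classifiers)
        return self.correct.get((i, region), 0) / n

    def predict(self, x):
        # Weighted vote using each classifier's accuracy in x's region.
        region = self.partition(x)
        votes = defaultdict(float)
        for i, clf in enumerate(self.classifiers):
            votes[clf(x)] += self.weight(i, region)
        return max(votes, key=votes.get)


# Toy usage: the true concept is "label 1 iff x >= 0"; each
# community classifier is only right in one half of the space.
ensemble = LocalAccuracyEnsemble(
    classifiers=[lambda x: 1, lambda x: 0],
    partition=lambda x: "neg" if x < 0 else "pos",
)
for x, y in [(-1.0, 0), (-2.0, 0), (1.0, 1), (2.0, 1)]:
    ensemble.update(x, y)

pred_pos = ensemble.predict(3.0)
pred_neg = ensemble.predict(-3.0)
```

Under concept drift, the accuracy counts would need to decay or be windowed so that stale evidence does not dominate; that is omitted here for brevity.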

    Improving Knowledge-Based Systems with statistical techniques, text mining, and neural networks for non-technical loss detection

    Power distribution companies currently face several problems related to energy losses. For example, the energy used might not be billed due to illegal manipulation or a breakdown in the customer's measurement equipment. These types of losses are called non-technical losses (NTLs), and they are usually greater than the losses due to the distribution infrastructure (technical losses). A large number of studies have used data mining to detect NTLs, but to the best of our knowledge, no studies involve the use of a Knowledge-Based System (KBS) created from the knowledge and expertise of the inspectors. In the present study, a KBS was built that is based on the knowledge and expertise of the inspectors and that uses text mining, neural networks, and statistical techniques for the detection of NTLs. Text mining, neural networks, and statistical techniques were used to extract information from samples, and this information was translated into rules, which were combined with the rules generated from the knowledge of the inspectors. This system was tested with real samples extracted from Endesa databases. Endesa is one of the most important distribution companies in Spain, and it plays an important role in international markets in both Europe and South America, having more than 73 million customers.