19 research outputs found

    A Survey of Parallel Data Mining

    With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackling the problem of scalability in data mining. Recently there has been considerable research on parallel data mining. However, most projects focus on the parallelization of a single kind of data mining algorithm or paradigm. This paper surveys parallel data mining from a broader perspective. More precisely, we discuss the parallelization of data mining algorithms from four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.

    Dissimilarity learning for nominal data

    INFORMATION SYSTEM FOR OCCUPATIONAL NEUROINTOXICATIONS IDENTIFICATION: CAPABILITIES AND PROSPECTS

    The article presents the main results of an approach to developing an information system that supports the clinical diagnostic process of identifying occupational neurointoxications. The key features and capabilities of the system are reviewed, and the original algorithmic basis of its expert-analytical functions is shown. Directions for further development of the system and ways of its prospective scientific use are outlined.

    A petabyte size electronic library using the N-Gram memory engine

    A model library containing petabytes of data is proposed by Triada, Ltd., Ann Arbor, Michigan. The library uses the newly patented N-Gram Memory Engine (Neurex) for storage, compression, and retrieval. Neurex splits data into two parts: a hierarchical network of associative memories that store 'information' from data, and a permutation operator that preserves sequence. Neurex is expected to offer four advantages in mass storage systems. Neurex representations are dense and fully reversible, hence less expensive to store. Neurex becomes exponentially more stable with increasing data flow; thus its contents and the inverting algorithm may be mass produced for low cost distribution. Only a small permutation operator would be recalled from the library to recover data. Neurex may be enhanced to recall patterns using a partial pattern. Neurex nodes are measures of their pattern. Researchers might use nodes in statistical models to avoid costly sorting and counting procedures. Neurex subsumes a theory of learning and memory that the author believes extends information theory. Its first axiom is a symmetry principle: learning creates memory and memory evidences learning. The theory treats an information store that evolves from a null state to stationarity. A Neurex extracts information from data without a priori knowledge; i.e., unlike neural networks, neither feedback nor training is required. The model consists of an energetically conservative field of uniformly distributed events with variable spatial and temporal scale, and an observer walking randomly through this field. A bank of band-limited transducers (an 'eye'), each transducer in the bank being tuned to a sub-band, outputs signals upon registering events. Output signals are 'observed' by another transducer bank (a mid-brain), except that the band limit of the second bank is narrower than the band limit of the first bank. The banks are arrayed as n 'levels' or 'time domains' (td). The banks are the hierarchical network (a cortex) and the transducers are (associative) memories. A model Neurex was built and studied. Data were 50 MB to 10 GB samples of text, database, and images: black/white, grey scale, and high resolution in several spectral bands. Memories at td, $S(m_{td})$, were plotted against outputs of memories at td-1. $S(m_{td})$ was Boltzmann distributed, and memory frequencies exhibited self-organized criticality (SOC), i.e., $1/f^{\beta}$, after long exposures to data. Given that output signals from level n may be encoded with $B_{\text{output}} = O(-\log_2 f^{\beta})$ bits and input data with $B_{\text{input}} = O((S(td)/S(td-1))^{n})$ bits, $B_{\text{output}}/B_{\text{input}}$ is always much less than 1; the Neurex determines a canonical code for data and is a lossless data compressor. Further tests are underway to confirm these results with more data types and larger samples.

    Three Methods for Occupation Coding Based on Statistical Learning

    Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them by exact string matches.
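The hybrid idea in this abstract (use duplicates where they exist, otherwise fall back to a learned model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: character trigrams as the n-gram variables, equality of trigram sets as the duplicate criterion, a Jaccard nearest-neighbor fallback standing in for the statistical learning algorithm, and the made-up codes "C1", "N1" and "T1". The function names `char_ngrams`, `build_index` and `code_answer` are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace/case-normalized answer."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return frozenset([s])
    return frozenset(s[i:i + n] for i in range(len(s) - n + 1))


def build_index(training):
    """Map each n-gram set to a Counter of the codes observed with it."""
    index = defaultdict(Counter)
    for answer, code in training:
        index[char_ngrams(answer)][code] += 1
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def code_answer(answer, index):
    """Return (code, route) for an answer; assumes a non-empty index.

    route is "duplicate" when the n-gram set was seen in training,
    otherwise "fallback" (illustrative nearest neighbor by Jaccard).
    """
    grams = char_ngrams(answer)
    if grams in index:
        return index[grams].most_common(1)[0][0], "duplicate"
    best = max(index, key=lambda g: jaccard(grams, g))
    return index[best].most_common(1)[0][0], "fallback"


# Toy training data with hypothetical codes (not real occupation codes).
training = [
    ("Software developer", "C1"),
    ("software  developer", "C1"),
    ("Nurse", "N1"),
    ("School teacher", "T1"),
]
index = build_index(training)
print(code_answer("SOFTWARE developer", index))
print(code_answer("nurses", index))
```

In this toy run the first answer is routed through the duplicate path and the second through the fallback, which is the division of labor the abstract describes; the actual models and duplicate definition used by the authors may differ.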

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.
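The three problems named in this abstract (document representation, classifier construction, classifier evaluation) can be made concrete with a small sketch. It is one possible instance only, assuming a bag-of-words representation and a multinomial naive Bayes classifier with add-one smoothing; the survey covers many other choices. The helper names `tokenize`, `train`, `classify` and `accuracy`, and the toy documents, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(doc):
    """Document representation: lowercase bag of whitespace-separated words."""
    return doc.lower().split()


def train(labeled_docs):
    """Classifier construction: collect class and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in labeled_docs:
        class_counts[label] += 1
        for w in tokenize(doc):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab


def classify(doc, model):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(doc):
            if w in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_docs):
    """Classifier evaluation: fraction of held-out documents labeled correctly."""
    return sum(classify(d, model) == y for d, y in test_docs) / len(test_docs)


# Toy preclassified documents (illustrative only).
model = train([
    ("stock market prices fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the football match", "sport"),
    ("player scores in final match", "sport"),
])
held_out = [
    ("interest rates and stock prices", "finance"),
    ("football player scores", "sport"),
]
print(accuracy(model, held_out))
```

Accuracy is used here only because it is the simplest measure to compute on a toy set; other effectiveness measures could be swapped into `accuracy` without changing the rest of the sketch.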

    Editing Statistical Records by Neural Networks

    Editing records to ensure quality in statistical surveys is an expensive and time-consuming process. Since the introduction of computers in statistical processing, the development and use of automated editing have been important objectives. In this paper, editing is reformulated as a neural network problem. The use of such a network is demonstrated and a preliminary evaluation is presented.