A Survey of Parallel Data Mining
With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackling the problem of scalability in data mining. Recently there has been considerable research on parallel data mining; however, most projects focus on the parallelization of a single kind of data mining algorithm or paradigm. This paper surveys parallel data mining from a broader perspective. More precisely, we discuss the parallelization of data mining algorithms from four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.
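A recurring pattern in the parallelizations the survey covers is data parallelism: partition the records across workers, compute partial results independently, then merge. The following sketch is illustrative only (not taken from the survey) and shows that pattern for the frequency counting that underlies rule induction; the partitioning scheme and worker count are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_partition(records):
    # Each worker independently counts attribute values in its partition.
    counts = Counter()
    for record in records:
        counts.update(record)
    return counts

def parallel_counts(records, workers=4):
    # Split the records into roughly equal partitions, count each
    # partition concurrently, then merge the partial counts.
    size = max(1, len(records) // workers)
    parts = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_partition, parts)
    return reduce(lambda a, b: a + b, partials, Counter())
```

Because the merge is a simple sum of counters, the result is identical to a sequential count regardless of how the data is partitioned, which is what makes this decomposition safe.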
INFORMATION SYSTEM FOR OCCUPATIONAL NEUROINTOXICATIONS IDENTIFICATION: CAPABILITIES AND PROSPECTS
The article presents the main results of an approach to developing an information system to support the clinical diagnostic process of identifying occupational neurointoxications. The key features and capabilities of the system are reviewed, and the original algorithmic base of its expert-analytical functions is described. Directions for further development of the system and ways of its prospective scientific use are outlined.
A petabyte size electronic library using the N-Gram memory engine
A model library containing petabytes of data is proposed by Triada, Ltd., Ann Arbor, Michigan. The library uses the newly patented N-Gram Memory Engine (Neurex) for storage, compression, and retrieval. Neurex splits data into two parts: a hierarchical network of associative memories that stores 'information' from the data, and a permutation operator that preserves sequence. Neurex is expected to offer four advantages in mass storage systems. Neurex representations are dense and fully reversible, hence less expensive to store. Neurex becomes exponentially more stable with increasing data flow; thus its contents and the inverting algorithm may be mass-produced for low-cost distribution. Only a small permutation operator would be recalled from the library to recover data. Neurex may be enhanced to recall patterns from a partial pattern. Neurex nodes are measures of their pattern; researchers might use nodes in statistical models to avoid costly sorting and counting procedures. Neurex subsumes a theory of learning and memory that the author believes extends information theory. Its first axiom is a symmetry principle: learning creates memory and memory evidences learning. The theory treats an information store that evolves from a null state to stationarity. A Neurex extracts information from data without a priori knowledge; i.e., unlike neural networks, neither feedback nor training is required. The model consists of an energetically conservative field of uniformly distributed events with variable spatial and temporal scale, and an observer walking randomly through this field. A bank of band-limited transducers (an 'eye'), each transducer in a bank tuned to a sub-band, outputs signals upon registering events. Output signals are 'observed' by another transducer bank (a mid-brain), except that the band limit of the second bank is narrower than that of the first. The banks are arrayed as n 'levels' or 'time domains' (td).
The banks form the hierarchical network (a cortex), and the transducers are (associative) memories. A model Neurex was built and studied. Data were 50 MB to 10 GB samples of text, databases, and images: black/white, grey scale, and high resolution in several spectral bands. Memory sizes at td, S(m_td), were plotted against outputs of memories at td-1. S(m_td) was Boltzmann distributed, and memory frequencies exhibited self-organized criticality (SOC), i.e., 1/f^beta behaviour after long exposures to data. Whereas output signals from level n may be encoded with B_output = O(-log2 f^beta) bits, and input data encoded with B_input = O((S(td)/S(td-1))^n) bits, B_output/B_input is always much less than 1; the Neurex thus determines a canonical code for data and is a lossless data compressor. Further tests are underway to confirm these results with more data types and larger samples.
The impact of metadata on the accuracy of automated patent classification
During the last decade, advances in machine-learning tools and algorithms have resulted in tremendous progress in the automated classification of documents. However, many classifiers base their classification decisions solely on document text and ignore metadata (such as authors, publication date, and author affiliation). In this project, automated classifiers using the k-Nearest Neighbour algorithm were developed for the classification of patents into two different classification systems, and classifiers using metadata (in this case inventor names, applicant names and International Patent Classification codes) were compared with classifiers ignoring it. The use of metadata significantly improved the classification of patents in one classification system, raising classification accuracy from 70.8% to 75.4%, a highly statistically significant improvement. However, the results for the other classification system were inconclusive: while metadata could improve the quality of the classifier in some experiments (recall increased from 66.0% to 68.9%, a small but nonetheless significant improvement), experiments with different parameters showed that it could also lead to a deterioration in quality (recall dropping as low as 61.0%). The study shows that metadata can play an extremely useful role in the classification of patents. Nonetheless, it must not be used indiscriminately, but only after careful evaluation of its usefulness.
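As a rough illustration of the idea (not the authors' implementation), a k-Nearest Neighbour classifier can fold metadata into the document representation by adding prefixed metadata tokens alongside the text's bag-of-words; the prefix, feature scheme, and k value below are all assumptions for the sketch.

```python
import math
from collections import Counter

def features(text, metadata):
    # Bag-of-words from the patent text, plus metadata items (e.g.
    # inventor names, IPC codes) marked with a prefix so they occupy
    # separate feature dimensions from ordinary words.
    tokens = text.lower().split()
    tokens += ["META:" + m for m in metadata]
    return Counter(tokens)

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, train, k=3):
    # train: list of (feature_vector, class_label) pairs.
    # Majority vote over the k most similar training patents.
    ranked = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Dropping the metadata argument reproduces a text-only classifier, which is essentially the comparison the study performs.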
Three Methods for Occupation Coding Based on Statistical Learning
Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on ngram variables (a concept from text mining) is preferable to a definition based on exact string matches.
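The difference between the two duplicate definitions can be sketched as follows; this is a minimal illustration of ngram-based matching, not the paper's procedure, and the n-gram length and similarity threshold are assumptions.

```python
def char_ngrams(text, n=3):
    # Character n-grams of a normalized text answer.
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_duplicate(answer, reference, threshold=0.7):
    # Treat two free-text answers as duplicates when their character
    # n-gram sets overlap strongly (Jaccard similarity). Unlike an
    # exact string match, this tolerates minor spelling variants.
    a, b = char_ngrams(answer), char_ngrams(reference)
    if not a or not b:
        return answer == reference
    jaccard = len(a & b) / len(a | b)
    return jaccard >= threshold
```

An exact-match definition would reject "krankenpfleger" vs. "krankenpflegerin" outright, while the n-gram definition still links them, which is the kind of robustness the abstract credits to ngram variables.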
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. (Accepted for publication in ACM Computing Surveys.)
Editing Statistical Records by Neural Networks
Editing records to ensure quality in statistical surveys is an expensive and time-consuming process. Since the introduction of computers in statistical processing, the development and use of automated editing have been important objectives. In this paper, editing is reformulated as a neural network problem. The use of such a network is demonstrated and a preliminary evaluation is presented.
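Reformulating editing as a neural network problem can be pictured, in its simplest form, as supervised classification: train a network on records already labelled clean or erroneous, then use it to flag incoming records for review. The single-neuron sketch below is a toy stand-in for the paper's network; the features, labels, and hyperparameters are all assumptions.

```python
import math

def predict(weights, bias, record):
    # Logistic output in (0, 1): probability the record needs editing.
    z = bias + sum(w * x for w, x in zip(weights, record))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=500, lr=0.5):
    # Stochastic gradient descent on a single logistic neuron, using
    # records labelled 0 (clean) or 1 (flag for manual review).
    n = len(data[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for record, flagged in data:
            err = predict(weights, bias, record) - flagged
            weights = [w - lr * err * x for w, x in zip(weights, record)]
            bias -= lr * err
    return weights, bias
```

The appeal for editing is that the network induces the flagging rule from past manual decisions, instead of requiring edit rules to be specified by hand.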