978 research outputs found
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review
Gradient-Based Training and Pruning of Radial Basis Function Networks with an Application in Materials Physics
Many applications, especially in physics and other sciences, call for easily
interpretable and robust machine learning techniques. We propose a fully
gradient-based technique for training radial basis function networks with an
efficient and scalable open-source implementation. We derive novel closed-form
optimization criteria for pruning the models for continuous as well as binary
data which arise in a challenging real-world material physics problem. The
pruned models are optimized to provide compact and interpretable versions of
larger models based on informed assumptions about the data distribution.
Visualizations of the pruned models provide insight into the atomic
configurations that determine atom-level migration processes in solid matter;
these results may inform future research on designing more suitable descriptors
for use with machine learning algorithms
Exploring Data Mining Techniques for Tree Species Classification Using Co-Registered LiDAR and Hyperspectral Data
NASA Goddard’s LiDAR, Hyperspectral, and Thermal imager provides co-registered remote sensing data on experimental forests. Data mining methods were used to achieve a final tree species classification accuracy of 68% using a combined LiDAR and hyperspectral dataset, and show promise for addressing deforestation and carbon sequestration on a species-specific level
Machine Learning
Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience
Survey of data mining approaches to user modeling for adaptive hypermedia
The ability of an adaptive hypermedia system to create tailored environments depends mainly on the amount and accuracy of information stored in each user model. Some of the difficulties that user modeling faces are the amount of data available to create user models, the adequacy of the data, the noise within that data, and the necessity of capturing the imprecise nature of human behavior. Data mining and machine learning techniques have the ability to handle large amounts of data and to process uncertainty. These characteristics make these techniques suitable for automatic generation of user models that simulate human decision making. This paper surveys different data mining techniques that can be used to efficiently and accurately capture user behavior. The paper also presents guidelines that show which techniques may be used more efficiently according to the task implemented by the applicatio
A review of clustering techniques and developments
© 2017 Elsevier B.V. This paper presents a comprehensive study on clustering: exiting methods and developments made at various times. Clustering is defined as an unsupervised learning where the objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering the objects such as hierarchical, partitional, grid, density based and model based. The approaches used in these methods are discussed with their respective states of art and applicability. The measures of similarity as well as the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in some fields like image segmentation, object and character recognition and data mining are highlighted
- …