On the Qualitative Behavior of Impurity-Based Splitting Rules I: The Minima-Free Property
We show that all strictly convex ∩ impurity measures lead to splits at boundary points, and furthermore show that certain rational splitting rules, notably the information gain ratio, also have this property. A slightly weaker result is shown to hold for impurity measures that are only convex ∩, such as Inaccuracy.
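The boundary-point claim in this abstract can be checked on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes the Gini index as one example of a strictly concave impurity measure, assumes distinct attribute values already sorted, and treats a boundary point as a cut between two adjacent examples of different classes. The function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity(labels, cut):
    """Size-weighted impurity of the two parts produced by cutting before index `cut`."""
    left, right = labels[:cut], labels[cut:]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_cut(labels):
    """Cut position (1 .. len-1) with the lowest weighted impurity."""
    return min(range(1, len(labels)), key=lambda c: weighted_impurity(labels, c))


def is_boundary(labels, cut):
    """Assumed definition: the examples on either side of the cut differ in class."""
    return labels[cut - 1] != labels[cut]


# Class labels of examples sorted by a single numeric attribute (made-up data).
labels = list("aaabbabbb")
cut = best_cut(labels)
print(cut, is_boundary(labels, cut))
```

On this data the lowest weighted impurity is reached at a cut that separates an `a` from a `b`, which is consistent with the property the abstract states. A single example is not evidence for the general result.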
Finding Anomalous Periodic Time Series: An Application to Catalogs of Periodic Variable Stars
Catalogs of periodic variable stars contain large numbers of periodic
light-curves (photometric time series data from the astrophysics domain).
Separating anomalous objects from well-known classes is an important step
towards the discovery of new classes of astronomical objects. Most anomaly
detection methods for time series data assume either a single continuous time
series or a set of time series whose periods are aligned. Light-curve data
precludes the use of these methods as the periods of any given pair of
light-curves may be out of sync. One may use an existing anomaly detection
method if, prior to similarity calculation, one performs the costly act of
aligning two light-curves, an operation that scales poorly to massive data
sets. This paper presents PCAD, an unsupervised anomaly detection method for
large sets of unsynchronized periodic time-series data, that outputs a ranked
list of both global and local anomalies. It calculates its anomaly score for
each light-curve in relation to a set of centroids produced by a modified
k-means clustering algorithm. Our method is able to scale to large data sets
through the use of sampling. We validate our method on both light-curve data
and other time series data sets. We demonstrate its effectiveness at finding
known anomalies, and discuss the effect of sample size and number of centroids
on our results. We compare our method to naive solutions and existing time
series anomaly detection methods for unphased data, and show that PCAD's
reported anomalies are comparable to or better than all other methods. Finally,
astrophysicists on our team have verified that PCAD finds true anomalies that
might be indicative of novel astrophysical phenomena.
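The idea of scoring each light-curve against a set of centroids can be sketched as follows. This is not the PCAD implementation: the abstract does not specify the distance or the modified k-means step, so the sketch assumes equal-length curves, centroids that are already given, and a hypothetical phase-invariant distance taken as the minimum Euclidean distance over all circular shifts.

```python
import math


def shift_invariant_distance(x, y):
    """Smallest Euclidean distance between x and any circular shift of y.

    Assumption for illustration: both curves have the same number of samples.
    """
    best = math.inf
    for s in range(len(y)):
        shifted = y[s:] + y[:s]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, shifted)))
        best = min(best, d)
    return best


def anomaly_score(curve, centroids):
    """Distance from a curve to its nearest centroid; larger means more anomalous."""
    return min(shift_invariant_distance(curve, c) for c in centroids)


def rank_anomalies(curves, centroids):
    """Indices of the curves, ordered from most to least anomalous."""
    scores = [anomaly_score(c, centroids) for c in curves]
    return sorted(range(len(curves)), key=lambda i: scores[i], reverse=True)


# Made-up data: one centroid, one phase-shifted copy of it, and one outlier.
centroids = [[0, 1, 0, -1]]
curves = [[1, 0, -1, 0], [0, 1, 0, -1], [2, 2, 2, 2]]
ranking = rank_anomalies(curves, centroids)
```

The brute-force search over shifts costs quadratic time per pair of curves, which is the kind of alignment cost the abstract describes as scaling poorly. The sketch only shows why a phase-shifted copy of a centroid should not be ranked as an anomaly.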
Detecting the Abnormal: Machine Learning in Computer Security
Two problems of importance in computer security are to 1) detect the presence of an intruder masquerading as the valid user and 2) detect the perpetration of abusive actions on the part of an otherwise innocuous user. In this paper we present a machine learning approach to anomaly detection, designed to handle these two problems. Our system learns a user profile for each user account and subsequently employs it to detect anomalous behavior in that account. Based on sequences of actions (UNIX commands) of the current user's input stream, the system compares each fixed-length input sequence with a historical library of the account's command sequences using a similarity measure. The system must learn to classify current behavior as consistent or anomalous with past behavior using only positive examples of the account's valid user. Our empirical results demonstrate that in most cases it is possible to distinguish the legitimate user from an intruder and, furthermore, that an instance selection technique based on a memory page-replacement algorithm is capable of drastically reducing library size without hindering detection accuracy.
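Comparing a fixed-length command sequence against a historical library can be sketched as below. The abstract does not define the similarity measure, so the one used here is an assumption: positional matches are counted, and each match is weighted by the length of the run of consecutive matches it ends. The function names and the sample commands are hypothetical.

```python
def windows(stream, length):
    """All fixed-length, overlapping subsequences of a command stream."""
    return [tuple(stream[i:i + length]) for i in range(len(stream) - length + 1)]


def match_similarity(seq_a, seq_b):
    """Assumed measure: sum of run lengths over matching positions."""
    score, run = 0, 0
    for a, b in zip(seq_a, seq_b):
        run = run + 1 if a == b else 0
        score += run
    return score


def profile_similarity(window, library):
    """Similarity of a window to the most similar sequence in the library."""
    return max(match_similarity(window, seq) for seq in library)


# Made-up history for one account, split into length-3 sequences.
history = ["ls", "cd", "ls", "vi", "make", "ls"]
library = windows(history, 3)

familiar = profile_similarity(("cd", "ls", "vi"), library)
unfamiliar = profile_similarity(("rm", "nc", "wget"), library)
```

A detector built on this would still need a threshold separating consistent from anomalous scores, learned from the valid user's data alone. That step, and the library-pruning technique the abstract mentions, are not shown.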
The Need for Diagnostics for Classification Algorithms
Many machine learning researchers view the task of inductive generalization as beginning after the data is collected, assuming that the useful features have been identified and that representative data has been collected. This assumption has led researchers to focus, with considerable success, on algorithm development. As a result, little attention has been paid to applying machine learning algorithms. One problem that arises is that when classification performance does not meet expectations, inexperienced practitioners can find little guidance in the available literature to help them. This talk addresses this gap between research and applied machine learning and suggests areas of research that can help bridge this gap. 1 THE APPLICATION DEVELOPMENT PROCESS The first step in the application development process is to analyze the factors relevant to the application domain. Application factors include the overall objectives of the project, the amount of domain knowledge available, and th..
Dynamic Automatic Model Selection
The problem of how to learn from examples has been studied throughout the history of machine learning, and many successful learning algorithms have been developed. A problem that has received less attention is how to select which algorithm to use for a given learning task. The ability of a chosen algorithm to induce a good generalization depends on how appropriate the model class underlying the algorithm is for the given task. We define an algorithm's model class to be the representation language it uses to express a generalization of the examples. Supervised learning algorithms differ in their underlying model class and in how they search for a good generalization. Given this characterization, it is not surprising that some algorithms find better generalizations for some, but not all tasks. Therefore, in order to find the best generalization for each task, an automated learning system must search for the appropriate model class in addition to searching for the best generalization wit..
An Application of Machine Learning to Anomaly Detection
The anomaly detection problem has been widely studied in the computer security literature. In this paper we present a machine learning approach to anomaly detection. Our system builds user profiles based on command sequences and compares current input sequences to the profile using a similarity measure. The system must learn to classify current behavior as consistent or anomalous with past behavior using only positive examples of the account's valid user. Our empirical results demonstrate that this is a promising approach to distinguishing the legitimate user from an intruder.