Extremely fast decision tree mining for evolving data streams
Real-time industrial applications now generate huge amounts of data continuously. To process these large data streams, we need fast and efficient methodologies and systems. Data scientists and analysts also want machine learning models that are easy to visualize and understand. Decision trees are preferred in many real-time applications for this reason, and also because, combined in an ensemble, they are among the most powerful methods in machine learning.
In this paper, we present a new system called STREAMDM-C++ that implements decision trees for data streams in C++ and that has been used extensively at Huawei. Streaming decision trees adapt to changes in the stream, a major advantage, since standard decision trees are built from a snapshot of the data and cannot evolve over time. STREAMDM-C++ is easy to extend and contains more powerful ensemble methods and more efficient, easier-to-use adaptive decision trees. We compare our new implementation with VFML, the current state-of-the-art implementation in C, and show that our new system outperforms VFML in speed while using fewer resources.
Online structural damage classification methodology for offshore wind turbine foundations using data stream analysis
Structural health monitoring (SHM) of wind turbines is crucial to improve maintenance and extend their lifespan. This study develops an online data analysis methodology using data stream analysis to classify damage in the links of an offshore wind turbine foundation. The methodology is validated using a laboratory-scaled jacket-type wind turbine foundation structure. 2460 measurements of the healthy structure were acquired, and a 5 mm crack was applied to four different links to define the four unhealthy classes. 820 measurements were taken for each of the unhealthy structures, resulting in a dataset with 5740 instances. As this is an imbalanced multiclass classification problem, a random sampler approach was used to treat the data. The only data obtained came from eight triaxial accelerometers distributed throughout the structure. Three different tree-based stream data classifiers were compared: the Hoeffding Tree classifier, the Extremely Fast Decision Tree classifier, and the Hoeffding Adaptive Tree classifier. Each classification model underwent a parameter tuning procedure, and high values of the receiver operating characteristic area under the curve (ROC AUC) metric were achieved as a result. It is important to note that stream learning differs from batch learning. Peer Reviewed. Postprint (published version
Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
Recently, as the amount of data in society has grown, data stream mining, which targets large-scale data, has attracted attention. Data mining is the technology of discovering new knowledge and patterns from massive amounts of data; data stream mining is its counterpart for data that arrive as a stream. In this paper, we propose feature selection with an online decision tree. First, we construct an online decision tree by treating credit card transaction data as a data stream. Second, we select the attributes considered important for detecting illegal use. We apply the VFDT (Very Fast Decision Tree learner) algorithm to construct the online decision tree
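As a rough illustration of the split test that VFDT is generally described as using, the sketch below computes the Hoeffding bound and compares it with the gap between the two best split scores. The function names, the `tie_threshold` parameter, and its default value are assumptions made for this example and are not taken from the paper.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range: range R of the split metric (assumed 1.0 for a
                 two-class information gain in the usage below)
    delta:       allowed probability that the true mean lies outside
                 the bound
    n:           number of examples observed at the leaf
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n,
                 tie_threshold=0.05):
    """Hypothetical helper: split when the best attribute beats the
    runner-up by more than epsilon, or when epsilon is small enough
    that the two candidates are treated as tied."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    print(round(hoeffding_bound(1.0, 1e-7, 1000), 4))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
    print(should_split(0.30, 0.28, 1.0, 1e-7, 1000))
```

Because the bound shrinks as `n` grows, a leaf that cannot yet separate two close candidates simply waits for more examples before deciding.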
Non-uniform Feature Sampling for Decision Tree Ensembles
We study the effectiveness of non-uniform randomized feature selection in
decision tree classification. We experimentally evaluate two feature selection
methodologies, based on information extracted from the provided dataset:
leverage scores-based and norm-based feature selection.
Experimental evaluation of the proposed feature selection techniques indicates
that such approaches might be more effective than naive uniform feature
selection and, moreover, have comparable performance to the random forest
algorithm [3]. Comment: 7 pages, 7 figures, 1 tabl
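One plausible reading of norm-based feature selection is to sample each feature with probability proportional to the squared Euclidean norm of its column. The sketch below follows that reading; the exact weighting used in the paper is not stated in the abstract, so the squared-norm choice, the function names, and sampling with replacement are all assumptions.

```python
import random


def norm_based_probabilities(X):
    """Return one sampling probability per feature (column) of X,
    proportional to the squared Euclidean norm of that column.
    X is a list of equal-length rows; at least one entry must be
    non-zero, otherwise the normalisation divides by zero."""
    n_features = len(X[0])
    sq_norms = [sum(row[j] ** 2 for row in X) for j in range(n_features)]
    total = sum(sq_norms)
    return [s / total for s in sq_norms]


def sample_features(X, k, seed=0):
    """Draw k feature indices (with replacement) using the
    norm-based probabilities instead of a uniform distribution."""
    rng = random.Random(seed)
    probs = norm_based_probabilities(X)
    return rng.choices(range(len(probs)), weights=probs, k=k)


if __name__ == "__main__":
    X = [[3.0, 0.0, 1.0],
         [4.0, 0.0, 1.0]]
    print(norm_based_probabilities(X))
    print(sample_features(X, k=5))
```

In a tree ensemble, a draw like this would stand in for the uniform feature subset chosen at each tree or split.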
Fast Supervised Hashing with Decision Trees for High-Dimensional Data
Supervised hashing aims to map the original features to compact binary codes
that are able to preserve label based similarity in the Hamming space.
Non-linear hash functions have demonstrated the advantage over linear ones due
to their powerful generalization capability. In the literature, kernel
functions are typically used to achieve non-linearity in hashing, which achieve
encouraging retrieval performance at the price of slow evaluation and training
time. Here we propose to use boosted decision trees for achieving non-linearity
in hashing, which are fast to train and evaluate, hence more suitable for
hashing with high dimensional data. In our approach, we first propose
sub-modular formulations for the hashing binary code inference problem and an
efficient GraphCut based block search method for solving large-scale inference.
Then we learn hash functions by training boosted decision trees to fit the
binary codes. Experiments demonstrate that our proposed method significantly
outperforms most state-of-the-art methods in retrieval precision and training
time. Especially for high-dimensional data, our method is orders of magnitude
faster than many methods in terms of training time. Comment: Appearing in Proc. IEEE Conf. Computer Vision and Pattern
Recognition, 2014, Ohio, US