    Extremely fast decision tree mining for evolving data streams

    Real-time industrial applications now generate huge amounts of data continuously, every day. Processing these large data streams requires fast and efficient methodologies and systems. Data scientists and analysts also want machine learning models that are easy to visualize and understand. Decision trees are preferred in many real-time applications for this reason, and also because, combined in an ensemble, they are among the most powerful methods in machine learning. In this paper, we present a new system called STREAMDM-C++ that implements decision trees for data streams in C++ and that has been used extensively at Huawei. Streaming decision trees adapt to changes in streams, a major advantage, since standard decision trees are built from a snapshot of the data and cannot evolve over time. STREAMDM-C++ is easy to extend, and it contains more powerful ensemble methods and more efficient, easier-to-use adaptive decision trees. We compare our new implementation with VFML, the current state-of-the-art implementation in C, and show that our new system outperforms VFML in speed while using fewer resources.

    Online structural damage classification methodology for offshore wind turbine foundations using data stream analysis

    Structural health monitoring (SHM) of wind turbines is crucial to improve maintenance and extend their lifespan. This study develops an online methodology that uses data stream analysis to classify damage in the links of an offshore wind turbine foundation. The methodology is validated on a laboratory-scale jacket-type wind turbine foundation structure. A total of 2460 measurements of the healthy structure were acquired, and a 5 mm crack was applied to four different links to define the four unhealthy classes. For each unhealthy structure, 820 measurements were taken, resulting in a dataset of 5740 instances. As this is an imbalanced multiclass classification problem, a random sampler approach was used to treat the data. The only data obtained came from eight triaxial accelerometers distributed throughout the structure. Three tree-based stream data classifiers were compared: the Hoeffding Tree classifier, the Extremely Fast Decision Tree classifier, and the Hoeffding Adaptive Tree classifier. Each classification model underwent a parameter-tuning procedure, and high values of the receiver operating characteristic area under the curve (ROC AUC) metric were achieved as a result. It is important to note that stream learning differs from batch learning. Peer Reviewed. Postprint (published version)

    Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

    Recently, because of the increasing amount of data in society, data stream mining, which targets large-scale data, has attracted attention. Data mining is a technology for discovering new knowledge and patterns in massive amounts of data, and data stream mining is its counterpart for data that arrive as a stream. In this paper, we propose feature selection with an online decision tree. First, we construct an online decision tree, treating credit card transaction data as a data stream. Second, we select the attributes thought to be important for detecting illegal use. We apply the VFDT (Very Fast Decision Tree learner) algorithm to construct the online decision tree.
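
    The abstract above names VFDT but does not describe its split rule. As a minimal sketch, assuming the standard Hoeffding-bound test commonly associated with VFDT (not necessarily the exact procedure of this paper), a leaf splits once the gain gap between the two best attributes exceeds the bound. The function names, the tie threshold default, and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this distance of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf that has seen n examples should split.

    Split when the best attribute beats the runner-up by more than the
    bound, or when the bound is so small that the two are effectively
    tied (tie_threshold is an illustrative default, not a paper value).
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # Hypothetical numbers: gains measured at a leaf after 1000 examples.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.10, 1.0, 1e-7, 1000))
    print(should_split(0.30, 0.28, 1.0, 1e-7, 1000))
```

    Because the bound shrinks as n grows, a leaf that cannot yet separate two close attributes simply waits for more stream examples instead of storing them.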

    Non-uniform Feature Sampling for Decision Tree Ensembles

    We study the effectiveness of non-uniform randomized feature selection in decision tree classification. We experimentally evaluate two feature selection methodologies, based on information extracted from the provided dataset: (i) leverage scores-based and (ii) norm-based feature selection. Experimental evaluation of the proposed feature selection techniques indicates that such approaches might be more effective than naive uniform feature selection and, moreover, have performance comparable to the random forest algorithm [3]. Comment: 7 pages, 7 figures, 1 table

    Fast Supervised Hashing with Decision Trees for High-Dimensional Data

    Supervised hashing aims to map the original features to compact binary codes that preserve label-based similarity in the Hamming space. Non-linear hash functions have demonstrated an advantage over linear ones due to their powerful generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing; they achieve encouraging retrieval performance at the price of slow evaluation and training. Here we propose to use boosted decision trees to achieve non-linearity in hashing. They are fast to train and evaluate, and hence more suitable for hashing with high-dimensional data. In our approach, we first propose sub-modular formulations for the hashing binary code inference problem and an efficient GraphCut-based block search method for solving large-scale inference. Then we learn hash functions by training boosted decision trees to fit the binary codes. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time. Especially for high-dimensional data, our method is orders of magnitude faster than many methods in terms of training time. Comment: Appearing in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, Ohio, USA