Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach clusters sequences without supervision, using the normalized longest common subsequence (LCS) as a similarity measure, and then analyzes outliers in detail to detect anomalies. Because the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms with low time complexity, such as the Hunt-Szymanski algorithm. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed an outlier. These algorithms give an analyst a coherent description of how an anomalous sequence differs from more typical sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection
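The normalized-LCS similarity described in this abstract can be sketched as follows. This uses the standard O(mn) dynamic-programming LCS, not the paper's hybrid algorithm; function names and the choice of normalizing by the longer sequence are illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two symbol sequences,
    via the standard O(m*n) dynamic program (not the paper's hybrid)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def normalized_lcs(a, b):
    """Similarity in [0, 1]: LCS length divided by the longer sequence's
    length, so identical sequences score 1.0."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))
```

A pairwise matrix of such similarities can then feed any clustering method, with poorly clustered sequences flagged as outlier candidates.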
A novel model for hourly PM2.5 concentration prediction based on CART and EELM
Hourly PM2.5 concentrations exhibit multiple change patterns. For hourly PM2.5 concentration prediction, it is beneficial to split the whole dataset into several subsets with similar properties and to train a local prediction model for each subset. However, methods based on local models must resolve the global-local duality. In this study, a novel prediction model based on classification and regression tree (CART) and ensemble extreme learning machine (EELM) methods is developed to split the dataset into subsets hierarchically and build a prediction model for each leaf. First, CART is used to split the dataset by constructing a shallow hierarchical regression tree. Then, at each node of the tree, EELM models are built from the node's training samples, and the number of hidden neurons is selected to minimize the validation error on each leaf of the sub-tree rooted at that node. Finally, for each leaf of the tree, the global EELM and the local EELMs on the path from the root to the leaf are compared, and the one with the smallest validation error on the leaf is chosen. The method is evaluated with meteorological data from the Yancheng urban area and air pollutant concentration data from the City Monitoring Centre. The experimental results demonstrate that the method addresses the global-local duality, outperforming global models including random forest (RF), v-support vector regression (v-SVR), and EELM, as well as other local models based on season and k-means clustering. The new model improves the capability to handle multiple change patterns.
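The global-vs-local selection idea in this abstract can be sketched in miniature. Here a single median split stands in for the shallow CART tree and a constant mean predictor stands in for EELM; per leaf, whichever of the global or local model has the smaller validation error is kept. All names and simplifications are illustrative assumptions, not the paper's implementation.

```python
def mean_model(ys):
    """'Train' a constant predictor: the mean of the training targets
    (a stand-in for an EELM)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def mse(model, xs, ys):
    """Mean squared error of a model on a validation set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

def fit_global_local(x_tr, y_tr, x_val, y_val):
    """Split on the training median (a one-level stand-in for CART), then
    keep, per leaf, whichever of the global or local model has the smaller
    validation error on that leaf."""
    split = sorted(x_tr)[len(x_tr) // 2]
    global_m = mean_model(y_tr)
    chosen = {}
    for leaf, side in (("left", lambda x: x < split),
                       ("right", lambda x: x >= split)):
        tr = [(x, y) for x, y in zip(x_tr, y_tr) if side(x)]
        va = [(x, y) for x, y in zip(x_val, y_val) if side(x)]
        local_m = mean_model([y for _, y in tr]) if tr else global_m
        if va:
            xs, ys = zip(*va)
            chosen[leaf] = (local_m if mse(local_m, xs, ys) < mse(global_m, xs, ys)
                            else global_m)
        else:
            chosen[leaf] = global_m  # no validation data: fall back to global
    return split, chosen

def predict(split, chosen, x):
    """Route an input to its leaf and apply that leaf's chosen model."""
    return chosen["left" if x < split else "right"](x)
```

The paper's version repeats this comparison along every root-to-leaf path of a deeper tree, with EELMs in place of the constant predictors.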
Authority identification in online communities and social networks
As Internet communities such as question-answer (Q&A) forums and online social networks (OSNs) grow in prominence as knowledge sources, traditional editorial filters are unable to scale to their size and pace. This absence hinders the exchange of knowledge online by creating an understandable lack of trust in information. A forum can partially overcome this mistrust by consistently providing reliable information, thus establishing itself as a reliable source. This work investigates how algorithmic approaches can contribute to building such a community of voluntary experts willing to contribute authoritative information. It identifies two approaches: a) reducing the cost of participation for experts by matching user queries to experts (question recommendation), and b) identifying authoritative contributors for incentivization (authority estimation). The question recommendation problem is addressed by extending existing approaches with a new generative model that augments textual data with expert preference information among different questions. Another contribution to this domain is the introduction of a set of formalized metrics that account for the expert's experience as well as the questioner's. This is essential for expert retention in a voluntary community and has not been addressed by previous work. The authority estimation problem is addressed by observing that the global graph structure of user interactions results from two factors: a user's performance in local one-to-one interactions, and their activity levels. By positing an intrinsic authority 'strength' for each user node in the graph that governs the outcome of individual interactions via the Bradley-Terry model for pairwise comparison, this research establishes a relationship between intrinsic user authority and global measures of influence.
This approach overcomes many drawbacks of current measures of node importance in OSNs by naturally correcting for user activity levels and providing an explanation for the frequent disconnect between real-world reputation and online influence. Also, while existing research has been restricted to node ranking on a single OSN graph, this work demonstrates that co-ranking across multiple endorsement graphs drawn from the same OSN is a highly effective approach for aggregating complementary graph information. A new scalable co-ranking framework is introduced for this task. The resulting algorithms are evaluated on data from various online communities and empirically shown to outperform existing approaches by a large margin.
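The Bradley-Terry idea referenced in this abstract can be sketched as follows: infer a latent 'strength' per user from pairwise win counts, so that rankings reflect who wins interactions rather than raw activity volume. This uses the classic MM (Zermelo) fixed-point update for the Bradley-Terry likelihood; it is a generic textbook sketch, not the thesis's algorithm, and all names are illustrative.

```python
def bradley_terry(wins, n_iters=100):
    """MM (Zermelo) estimation of Bradley-Terry strengths.
    wins[(i, j)] = number of times i beat j.
    Returns a dict of strengths normalized to sum to 1."""
    players = {p for pair in wins for p in pair}
    p = {i: 1.0 for i in players}
    for _ in range(n_iters):
        new = {}
        for i in players:
            w_i = sum(w for (a, b), w in wins.items() if a == i)  # total wins of i
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # games between i, j
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v / total for i, v in new.items()}
    return p
```

Under this model, a user who wins 3 of 4 interactions against one peer ends up with three times that peer's strength, regardless of how many total interactions either participated in; this is the sense in which such measures correct for activity levels.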
Expertise modeling and recommendation in online question and answer forums
Question and answer (Q&A) forums, as a way of seeking expertise on the Internet, have seen rapid growth in popularity in recent years. The expertise available on most such forums is voluntary, provided by individuals willing to invest their resources for no monetary remuneration. While these forums provide easy access to expertise, the expertise available is often lacking in quality and depth. Two major reasons for this are the time investment required to participate in such forums and the lack of a mechanism for identifying experts for specialized questions. We believe a Q&A recommender engine can ameliorate this problem significantly. The two primary contributions of this work are: a) a hierarchical Bayesian model based Q&A recommender, and b) a discussion of metrics to measure the performance of such a Q&A recommender. Two new metrics, responder load and questioner satisfaction, are suggested based on this discussion. These metrics are used to evaluate the performance of the recommender system on datasets harvested from the Yahoo! Answers website.
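The metrics named in this abstract are defined in the thesis itself; purely as an illustration, here is one plausible simplified reading (not the authors' definitions): responder load as the number of questions routed to each expert, and questioner satisfaction as the fraction of routed questions that received an answer from the recommended expert. All names are hypothetical.

```python
from collections import Counter

def responder_load(assignments):
    """assignments: list of (question_id, expert_id) routing decisions.
    Returns how many questions each expert was asked to handle."""
    return Counter(expert for _, expert in assignments)

def questioner_satisfaction(assignments, answered):
    """answered: set of question_ids answered by the expert they were
    routed to. Returns the fraction of routed questions answered."""
    questions = {q for q, _ in assignments}
    if not questions:
        return 0.0
    return len(questions & answered) / len(questions)
```

The tension the thesis highlights is visible even in this toy form: routing every question to the best expert maximizes satisfaction but concentrates load, which drives voluntary experts away.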