11 research outputs found
AMANDA : density-based adaptive model for nonstationary data under extreme verification latency scenarios
Gradual concept drift is a smooth, gradual change over time in the relationship between input and output data in the underlying distribution. It makes models obsolete and consequently degrades prediction quality. Streams pose a further challenge: the extreme verification latency (EVL) in verifying the labels. For batch scenarios, state-of-the-art methods adapt a supervised model using the unconstrained least-squares importance fitting (uLSIF) algorithm, or use a semi-supervised approach together with a core support extraction (CSE) method. However, these methods do not properly address the problems above: they require high computational time on large data volumes, fail to select the samples that represent the drift, or have many parameters to tune. We therefore propose a density-based adaptive model for nonstationary data (AMANDA), which combines a semi-supervised classifier with a density-based CSE method. AMANDA has two variations: AMANDA with a fixed cutting percentage (AMANDA-FCP) and AMANDA with a dynamic cutting percentage (AMANDA-DCP). Our results indicate that both variations outperform the state-of-the-art methods on almost all synthetic and real datasets, with an improvement of up to 27.98% in average error. We found that AMANDA-FCP improved the results under gradual concept drift even with a small amount of initial labeled data. Moreover, our results indicate that semi-supervised learning (SSL) classifiers improve when they work together with our static or dynamic CSE methods. We therefore emphasize the importance of research directions based on this approach.
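The fixed cutting percentage idea in the AMANDA abstract above can be illustrated with a minimal sketch. The abstract does not say which density estimator is used, so the Gaussian kernel score below is an assumption, and the names `gaussian_kde_scores`, `extract_core_supports`, `cut_fraction` and `bandwidth` are hypothetical. In the setting the abstract describes, the labels of a new batch would come from the semi-supervised classifier; here they are simply passed in.

```python
import math


def gaussian_kde_scores(points, bandwidth=1.0):
    """Return an unnormalised Gaussian kernel density score for each point."""
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        scores.append(total / len(points))
    return scores


def extract_core_supports(points, labels, cut_fraction=0.5, bandwidth=1.0):
    """Keep the densest (1 - cut_fraction) share of the samples of each class.

    Assumption: density is scored per class, so every class keeps at
    least one representative.
    """
    kept = []
    for cls in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == cls]
        scores = gaussian_kde_scores(members, bandwidth)
        n_keep = max(1, int(len(members) * (1.0 - cut_fraction)))
        ranked = sorted(zip(scores, members), key=lambda sm: sm[0], reverse=True)
        kept.extend((p, cls) for _, p in ranked[:n_keep])
    return kept


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0),
           (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (20.0, 20.0)]
    lbl = [0, 0, 0, 0, 1, 1, 1, 1]
    for point, cls in extract_core_supports(pts, lbl, cut_fraction=0.25):
        print(cls, point)
```

A dynamic cutting percentage would replace the constant `cut_fraction` with a value recomputed for each batch; how AMANDA-DCP derives that value is not stated in the abstract, so it is not sketched here.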
Learning from textual data streams for detecting email spam
This master's thesis introduces a method for detecting email spam by translating the problem into incremental learning over a time series. Common spam detection systems mainly use supervised learning methods (naive Bayesian classifier, decision trees), whereas this thesis performs classification using data stream mining methods.
For the learning sets, we chose only attributes that contain no personal data and require no consent from the sender or the recipient (the attributes come from the envelope part of the email). Using algorithms for learning from data streams (VFDT, cVFDT), we treated the sequence of email messages as a textual data stream. The results were compared with traditional spam detection methods; they show that the traditional methods achieve higher accuracy than the data stream learning algorithms, which are therefore not suitable for detecting email spam.
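For readers unfamiliar with VFDT, named in the abstract above: it grows a decision tree from a stream by splitting a leaf only once the Hoeffding bound indicates that the best attribute is reliably better than the runner-up. The sketch below shows that test alone, not the thesis's implementation. The function names are hypothetical, and it assumes information gain as the split measure, so the range of the measure is log2 of the number of classes.

```python
import math


def hoeffding_bound(value_range, delta, n_samples):
    """Epsilon such that, with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n_samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))


def should_split(best_gain, second_gain, n_classes, delta, n_samples):
    """Split when the gap between the two best attributes exceeds the bound."""
    value_range = math.log2(n_classes)  # range of information gain
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n_samples)


if __name__ == "__main__":
    # Two classes (spam / not spam), 1000 messages seen at this leaf.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, 2, 1e-7, 1000))
    print(should_split(0.30, 0.25, 2, 1e-7, 1000))
```

The bound shrinks as more messages reach a leaf, so a small gap between two attributes eventually becomes sufficient to split.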
Evolving Neural Fuzzy Network With Adaptive Feature Selection
This paper introduces a neural fuzzy network approach for evolving system modeling. The approach uses neofuzzy neurons and a neural fuzzy structure equipped with an incremental learning algorithm that includes adaptive feature selection. The feature selection mechanism starts by considering one or more input variables from a given set of variables, and decides whether a new variable should be added, or whether an existing variable should be excluded or kept as an input. The decision process uses statistical tests and information about the current model performance. The incremental learning scheme simultaneously selects the input variables and updates the neural network weights. The weights are adjusted using a gradient-based scheme with an optimal learning rate. The performance of the models obtained with the neural fuzzy modeling approach is evaluated on weather temperature forecasting problems. Computational results show that the approach is competitive with alternatives reported in the literature, especially in on-line modeling situations where processing time and learning are critical. © 2012 IEEE.
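The gradient-based weight update with an optimal learning rate, mentioned in the neural fuzzy network abstract above, can be sketched for a single neofuzzy neuron. This is an illustration under stated assumptions, not the paper's algorithm. It assumes evenly spaced, complementary triangular membership functions on a fixed input range, and it takes the "optimal" rate to be the step size that drives the error on the current sample to zero, namely one over the sum of squared memberships. The class and method names are hypothetical, and feature selection is left out.

```python
class NeoFuzzyNeuron:
    """Single neofuzzy neuron: output = sum over inputs of weighted memberships."""

    def __init__(self, n_inputs, n_mfs, lo=0.0, hi=1.0):
        # n_mfs must be at least 2 so that neighbouring triangles overlap.
        self.lo, self.hi, self.n_mfs = lo, hi, n_mfs
        self.step = (hi - lo) / (n_mfs - 1)
        self.weights = [[0.0] * n_mfs for _ in range(n_inputs)]

    def memberships(self, x):
        """Complementary triangular memberships: at most two are non-zero
        and they sum to one."""
        x = min(max(x, self.lo), self.hi)
        mu = [0.0] * self.n_mfs
        pos = (x - self.lo) / self.step
        j = min(int(pos), self.n_mfs - 2)
        frac = pos - j
        mu[j] = 1.0 - frac
        mu[j + 1] = frac
        return mu

    def predict(self, xs):
        return sum(w * m
                   for x, ws in zip(xs, self.weights)
                   for w, m in zip(ws, self.memberships(x)))

    def update(self, xs, target):
        """One gradient step; returns the error measured before the update."""
        mus = [self.memberships(x) for x in xs]
        y = sum(w * m for ws, mu in zip(self.weights, mus) for w, m in zip(ws, mu))
        error = target - y
        # Step size that zeroes the error on this sample (assumed "optimal" rate).
        rate = 1.0 / sum(m * m for mu in mus for m in mu)
        for ws, mu in zip(self.weights, mus):
            for j, m in enumerate(mu):
                ws[j] += rate * error * m
        return error


if __name__ == "__main__":
    neuron = NeoFuzzyNeuron(n_inputs=2, n_mfs=3)
    print(neuron.update([0.2, 0.7], target=1.5))  # error before the step
    print(neuron.predict([0.2, 0.7]))             # output after the step
```

Because the memberships of each input sum to one, the denominator in `rate` is never zero, so no separate step-size parameter has to be tuned.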
Clustering and Classification of Email Contents
Information users depend heavily on email systems as one of their major means of communication. The importance and usage of email continue to grow despite the evolution of mobile applications, social networks, etc. Emails are used at both the personal and professional levels, and they can be considered official documents in communication among users. Email data mining and analysis can be conducted for several purposes, such as spam detection and classification, subject classification, etc. In this paper, a large set of personal emails is used for folder and subject classification. Algorithms are developed to perform clustering and classification on this large text collection. Classification based on N-grams is shown to be the best for such a large text collection, especially as the text is bilingual (i.e., with English and Arabic content)
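As a rough illustration of N-gram based folder classification on bilingual text, the sketch below builds one character trigram profile per folder and assigns a new message to the folder with the most similar profile. The abstract above does not specify whether word or character N-grams were used, nor the similarity measure. Character trigrams and cosine similarity are assumptions, and all function names and example messages are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples, n=3):
    """Build one aggregated n-gram profile per folder."""
    profiles = {}
    for text, folder in examples:
        profiles.setdefault(folder, Counter()).update(char_ngrams(text, n))
    return profiles


def classify(text, profiles, n=3):
    """Return the folder whose profile is most similar to the text."""
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda folder: cosine(grams, profiles[folder]))


if __name__ == "__main__":
    examples = [
        ("meeting agenda for the project review", "work"),
        ("project deadline and budget report", "work"),
        ("family dinner on friday evening", "personal"),
        ("عشاء العائلة يوم الجمعة", "personal"),
    ]
    profiles = train(examples)
    print(classify("budget review meeting", profiles))
    print(classify("عشاء يوم الجمعة", profiles))
```

Character N-grams need no language-specific tokenizer, which is one reason they are convenient for mixed English and Arabic content.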
A Survey on Concept Drift Adaptation
Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, discuss the most representative, distinct and popular techniques and algorithms, discuss the evaluation methodology of adaptive algorithms, and present a set of illustrative applications. This introduction to concept drift adaptation presents the state-of-the-art techniques and a collection of benchmarks for researchers, industry analysts and practitioners. The survey aims to cover the different facets of concept drift in an integrated way, reflecting on the existing scattered state of the art
GNUsmail: Open framework for on-line email classification
Real-time classification of massive email data is a challenging task that presents its own particular difficulties. Since email data presents an important temporal component, several problems arise: emails arrive continuously, and the criteria used to classify those emails can change, so the learning algorithms have to be able to deal with concept drift. Our problem is more general than spam detection, which has received much more attention in the literature.
In this paper we present GNUsmail, an open-source extensible framework for email classification, whose structure supports incremental and on-line learning. This framework enables the incorporation of algorithms developed by other researchers, such as those included in WEKA and MOA. We evaluate this framework, characterized by two overlapping phases (pre-processing and learning), using the ENRON dataset, and we compare the results achieved by WEKA and MOA algorithms