29,212 research outputs found

    A Semi-Supervised Feature Engineering Method for Effective Outlier Detection in Mixed Attribute Data Sets

    Get PDF
    Outlier detection is one of the crucial tasks in data mining which can lead to the finding of valuable and meaningful information within the data. An outlier is a data point that is notably dissimilar from other data points in the data set. As such, the methods for outlier detection play an important role in identifying and removing the outliers, thereby increasing the performance and accuracy of the prediction systems. Outlier detection is used in many areas like financial fraud detection, disease prediction, and network intrusion detection. Traditional outlier detection methods are founded on the use of different distance measures to estimate the similarity between the points and are confined to data sets that are purely continuous or categorical. These methods, though effective, lack in elucidating the relationship between outliers and known clusters/classes in the data set. We refer to this relationship as the context for any reported outlier. Alternate outlier detection methods establish the context of a reported outlier using underlying contextual beliefs of the data. Contextual beliefs are the established relationships between the attributes of the data set. Various studies have been recently conducted where they explore the contextual beliefs to determine outlier behavior. However, these methods do not scale in the situations where the data points and their respective contexts are sparse. Thus, the outliers reported by these methods tend to lose meaning. Another limitation of these methods is that they assume all features are equally important and do not consider nor determine subspaces among the features for identifying the outliers. Furthermore, determining subspaces is computationally exacerbated, as the number of possible subspaces increases with increasing dimensionality. This makes searching through all the possible subspaces impractical. In this thesis, we propose a Hybrid Bayesian Network approach to capture the underlying contextual beliefs to detect meaningful outliers in mixed attribute data sets. Hybrid Bayesian Networks utilize their probability distributions to encode the information of the data and outliers are those points which violate this information. To deal with the sparse contexts, we use an angle-based similarity method which is then combined with the joint probability distributions of the Hybrid Bayesian Network in a robust manner. With regards to the subspace selection, we employ a feature engineering method that consists of two-stage feature selection using Maximal Information Coefficient and Markov blankets of Hybrid Bayesian Networks to select highly correlated feature subspaces. This proposed method was tested on a real world medical record data set. The results indicate that the algorithm was able to identify meaningful outliers successfully. Moreover, we compare the performance of our algorithm with the existing baseline outlier detection algorithms. We also present a detailed analysis of the reported outliers using our method and demonstrate its efficiency when handling data points with sparse contexts

    Clustering South African households based on their asset status using latent variable models

    Full text link
    The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status. A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure - this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD). The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight to the different socio-economic strata within the Agincourt region.Comment: Published in at http://dx.doi.org/10.1214/14-AOAS726 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Application of mutual information-based sequential feature selection to ISBSG mixed data

    Full text link
    [EN] There is still little research work focused on feature selection (FS) techniques including both categorical and continuous features in Software Development Effort Estimation (SDEE) literature. This paper addresses the problem of selecting the most relevant features from ISBSG (International Software Benchmarking Standards Group) dataset to be used in SDEE. The aim is to show the usefulness of splitting the ranked list of features provided by a mutual information-based sequential FS approach in two, regarding categorical and continuous features. These lists are later recombined according to the accuracy of a case-based reasoning model. Thus, four FS algorithms are compared using a complete dataset with 621 projects and 12 features from ISBSG. On the one hand, two algorithms just consider the relevance, while the remaining two follow the criterion of maximizing relevance and also minimizing redundancy between any independent feature and the already selected features. On the other hand, the algorithms that do not discriminate between continuous and categorical features consider just one list, whereas those that differentiate them use two lists that are later combined. As a result, the algorithms that use two lists present better performance than those algorithms that use one list. Thus, it is meaningful to consider two different lists of features so that the categorical features may be selected more frequently. We also suggest promoting the usage of Application Group, Project Elapsed Time, and First Data Base System features with preference over the more frequently used Development Type, Language Type, and Development Platform.Fernández-Diego, M.; González-Ladrón-De-Guevara, F. (2018). Application of mutual information-based sequential feature selection to ISBSG mixed data. Software Quality Journal. 26(4):1299-1325. https://doi.org/10.1007/s11219-017-9391-5S12991325264Angelis, L., & Stamelos, I. (2000). A simulation tool for efficient analogy based cost estimation. Empirical Software Engineering, 5(1), 35–68. https://doi.org/10.1023/A:1009897800559 .Auer, M., Trendowicz, A., Graser, B., Haunschmid, E., & Biffl, S. (2006). Optimal project feature weights in analogy-based cost estimation: improvement and limitations. Software Engineering, IEEE Transactions on, 32(2), 83–92.Awada, W., Khoshgoftaar, T. M., Dittman, D., Wald, R., Napolitano, A. (2012). A review of the stability of feature selection techniques for bioinformatics data. In 2012 I.E. 13th International Conference on Information Reuse and Integration (IRI) (pp. 356–363). Presented at the 2012 I.E. 13th International Conference on Information Reuse and Integration (IRI). https://doi.org/10.1109/IRI.2012.6303031 .Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions, 5(4), 537–550.Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42(22), 8520–8532. https://doi.org/10.1016/j.eswa.2015.07.007 .Bibi, S., Tsoumakas, G., Stamelos, I., & Vlahavas, I. (2008). Regression via classification applied on software defect estimation. Expert Systems with Applications, 34(3), 2091–2101. https://doi.org/10.1016/j.eswa.2007.02.012 .Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28.Chatzipetrou, P., Papatheocharous, E., Angelis, L., Andreou, A. S. (2012). An investigation of software effort phase distribution using compositional data analysis. In 2012 38th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) (pp. 367–375). Presented at the 2012 38th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). https://doi.org/10.1109/SEAA.2012.50 .Chen, Z., Menzies, T., Port, D., & Boehm, B. (2005). Feature subset selection can improve software cost estimation accuracy. In Proceedings of the 2005 workshop on predictor models in software engineering (pp. 1–6). New York: ACM. https://doi.org/10.1145/1082983.1083171 .Chiu, N.-H., & Huang, S.-J. (2007). The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software, 80(4), 628–640.Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1), 155–176.Dejaeger, K., Verbeke, W., Martens, D., & Baesens, B. (2012). Data mining techniques for software effort estimation: a comparative study. Software Engineering, IEEE Transactions on, 38(2), 375–397. https://doi.org/10.1109/TSE.2011.55 .Deng, K., & MacDonell, S. G. (2008). Maximising data retention from the ISBSG repository. In Proceedings of the 12th international conference on evaluation and assessment in software engineering (pp. 21–30). Swinton: British Computer Society http://dl.acm.org/citation.cfm?id=2227115.2227118 . Accessed 21 Jan 2014.Doquire, G., & Verleysen, M. (2011). An hybrid approach to feature selection for mixed categorical and continuous data. In International Conference on Knowledge Discovery and Information Retrieval. http://hdl.handle.net/2078.1/90765 . Accessed 2 Nov 2015.Dudani, S. A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, SMC, 6(4), 325–327. https://doi.org/10.1109/TSMC.1976.5408784 .Estévez, P. A., Tesmer, M., Perez, C. A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2), 189–201. https://doi.org/10.1109/TNN.2008.2005601 .Fayyad, U.M., & Irani, K.B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the International Joint Conference on Uncertainty in AI (pp. 1022–1027). Presented at the International Joint Conference on Uncertainty in AI. https://www.researchgate.net/publication/220815890_Multi-Interval_Discretization_of_Continuous-Valued_Attributes_for_Classification_Learning . Accessed 22 June 2016.Fernández-Diego, M., & González-Ladrón-de-Guevara, F. (2014). Potential and limitations of the ISBSG dataset in enhancing software engineering research: a mapping review. Information and Software Technology, 56(6), 527–544. https://doi.org/10.1016/j.infsof.2014.01.003 .Ferreira, A., & Figueiredo, M. (2011). Unsupervised joint feature discretization and selection. In J. Vitrià, J. M. Sanches, & M. Hernández (Eds.), Pattern recognition and image analysis (Vol. 6669, pp. 200–207). Berlin, Heidelberg: Springer Berlin Heidelberg http://link.springer.com/10.1007/978-3-642-21257-4_25 . Accessed 4 Mar 2016.Fleuret, F. (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5, 1531–1555.González-Ladrón-de-Guevara, F., Fernández-Diego, M., & Lokan, C. (2016). The usage of ISBSG data fields in software effort estimation: a systematic mapping study. Journal of Systems and Software, 113, 188–215. https://doi.org/10.1016/j.jss.2015.11.040 .Gupta, P., Jain, S., & Jain, A. (2014). A review of fast clustering-based feature subset selection algorithm. International Journal of Scientific & Technology Research, 3(11), 86–91.Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.Hall, M. A., & Holmes, G. (2003). Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1437–1447. https://doi.org/10.1109/TKDE.2003.1245283 .Hausser, J., & Strimmer, K. (2009). Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research, 10(Jul), 1469–1484.Hill, P. (2010). Practical software project estimation: a toolkit for estimating software development effort & duration. McGraw Hill Professional.Hsu, H.-H., Hsieh, C.-W., & Lu, M.-D. (2011). Hybrid feature selection by combining filters and wrappers. Expert Systems with Applications, 38(7), 8144–8150.Huang, S.-J., & Chiu, N.-H. (2006). Optimization of analogy weights by genetic algorithm for software effort estimation. Information and Software Technology, 48(11), 1034–1045. https://doi.org/10.1016/j.infsof.2005.12.020 .Huang, S.-J., Chiu, N.-H., & Liu, Y.-J. (2008). A comparative evaluation on the accuracies of software effort estimates from clustered data. Information and Software Technology, 50(9–10), 879–888. https://doi.org/10.1016/j.infsof.2008.02.005 .Huang, J., Li, Y.-F., & Xie, M. (2015). An empirical analysis of data preprocessing for machine learning-based software cost estimation. Information and Software Technology, 67, 108–127. https://doi.org/10.1016/j.infsof.2015.07.004 .ISBSG. (2013a). ISBSG Dataset Release 12. ISBSG. http://isbsg.org/ . Accessed 1 Mar 2016.ISBSG. (2013b). ISBSG Guidelines Release 12.ISBSG. (2013c). ISBSG Data Demographics Release 12.Jeffery, R., Ruhe, M., Wieczorek, I. (2001). Using public domain metrics to estimate software development effort. In Software Metrics Symposium, 2001. METRICS 2001. Proceedings. Seventh International (pp. 16–27). https://doi.org/10.1109/METRIC.2001.915512 .Jiang, Z., & Comstock, C. (2007). The factors significant to software development productivity. In C. Ardil (Ed.), Proceedings of World Academy of Science, Engineering and Technology, Vol 19 (Vol. 19, pp. 160–164). Presented at the Conference of the World-Academy-of-Science-Engineering-and-Technology, Bangkok: World Acad Sci, Eng & Tech-Waset.Jørgensen, M., Indahl, U., & Sjøberg, D. (2003). Software effort estimation by analogy and ‘regression toward the mean’. Journal of Systems and Software, 68(3), 253–262. https://doi.org/10.1016/S0164-1212(03)00066-9 .Kabir, M. M., Shahjahan, M., & Murase, K. (2011). A new local search based hybrid genetic algorithm for feature selection. Neurocomputing, 74(17), 2914–2928.Kadoda, G., Cartwright, M., Chen, L., Shepperd, M. (2000). Experiences using case-based reasoning to predict software project effort. In EASE 2000 (pp. 2–3). Presented at the EASE 2000, Staffordshire, UK.Keung, J., Kocaguneli, E., & Menzies, T. (2012). Finding conclusion stability for selecting the best effort predictor in software effort estimation. Automated Software Engineering, 20(4), 543–567. https://doi.org/10.1007/s10515-012-0108-5 .Kirsopp, C., Shepperd, M. J., Hart, J. (2002). Search heuristics, case-based reasoning and software project effort prediction. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 9–13). New York, USA. http://v-scheiner.brunel.ac.uk/handle/2438/1554 . Accessed 27 Jan 2016.Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324. https://doi.org/10.1016/S0004-3702(97)00043-X .Kwak, N., & Choi, C.-H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13(1), 143–159. https://doi.org/10.1109/72.977291 .Langdon, W. B., Dolado, J., Sarro, F., & Harman, M. (2016). Exact mean absolute error of baseline predictor, MARP0. Information and Software Technology, 73, 16–18. https://doi.org/10.1016/j.infsof.2016.01.003 .Li, Y. F., Xie, M., & Goh, T. N. (2009). A study of mutual information based feature selection for case based reasoning in software cost estimation. Expert Systems with Applications, 36(3), 5921–5931.Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol. 454). Springer Science & Business Media. https://books.google.es/books?hl=en&lr=&id=aaDbBwAAQBAJ&oi=fnd&pg=PP10&dq=Feature+selection+for+knowledge+discovery+and+data+mining&ots=iuMhcWZGcf&sig=KlmNEIcsBdDVs-m1HUuICfpYZiM . Accessed 25 Jan 2016.Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491–502. https://doi.org/10.1109/TKDE.2005.66 .Liu, H., Wei, R., & Jiang, G. (2013). A hybrid feature selection scheme for mixed attributes data. Computational and Applied Mathematics, 32(1), 145–161. https://doi.org/10.1007/s40314-013-0019-5 .Liu, Q., Wang, J., Xiao, J., Zhu, H. (2014). Mutual information based feature selection for symbolic interval data. In International Conference on Software Intelligence Technologies and Applications International Conference on Frontiers of Internet of Things 2014 (pp. 62–69). Presented at the International Conference on Software Intelligence Technologies and Applications International Conference on Frontiers of Internet of Things 2014. https://doi.org/10.1049/cp.2014.1537 .Lokan, C. (2005). What should you optimize when building an estimation model? In Software Metrics, 2005. 11th IEEE International Symposium (pp. 1–10). https://doi.org/10.1109/METRICS.2005.55 .Lokan, C., & Mendes, E. (2009a). Investigating the use of chronological split for software effort estimation. Software, IET, 3(5), 422–434. https://doi.org/10.1049/iet-sen.2008.0107 .Lokan, C., & Mendes, E. (2009b). Applying moving windows to software effort estimation. In Proceedings of the 2009 3rd international symposium on empirical software engineering and measurement (pp. 111–122). Washington, DC: IEEE Computer Society. https://doi.org/10.1109/ESEM.2009.5316019 .Lokan, C., & Mendes, E. (2012). Investigating the use of duration-based moving windows to improve software effort prediction. In Software Engineering Conference (APSEC), 2012 19th Asia-Pacific (Vol. 1, pp. 818–827). Presented at the Software Engineering Conference (APSEC), 2012 19th Asia-Pacific. https://doi.org/10.1109/APSEC.2012.74 .Lustgarten, J.L., Visweswaran, S., Grover, H., Gopalakrishnan, V. (2008). An evaluation of discretization methods for learning rules from biomedical datasets. In BIOCOMP (pp. 527–532).Mandal, M., & Mukhopadhyay, A. (2013). An improved minimum redundancy maximum relevance approach for feature selection in gene expression data. Procedia Technology, 10, 20–27. https://doi.org/10.1016/j.protcy.2013.12.332 .Mendes, E., Watson, I., Triggs, C., Mosley, N., & Counsell, S. (2003). A comparative study of cost estimation models for web hypermedia applications. Empirical Software Engineering, 8(2), 163–196.Mendes, E., Lokan, C., Harrison, R., Triggs, C. (2005). A replicated comparison of cross-company and within-company effort estimation models using the ISBSG database. In Software Metrics, 2005. 11th IEEE International Symposium (pp. 1–10). https://doi.org/10.1109/METRICS.2005.4 .Moses, J., Farrow, M., Parrington, N., & Smith, P. (2006). A productivity benchmarking case study using Bayesian credible intervals. Software Quality Journal, 14(1), 37–52. https://doi.org/10.1007/s11219-006-6000-4 .Núñez, H., Sànchez-Marrè, M., Cortés, U., Comas, J., Martínez, M., Rodríguez-Roda, I., & Poch, M. (2004). A comparative study on the use of similarity measures in case-based reasoning to improve the classification of environmental system situations. Environmental Modelling & Software, 19(9), 809–819. https://doi.org/10.1016/j.envsoft.2003.03.003 .Oh, I.-S., Lee, J.-S., & Moon, B.-R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424–1437.Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238. https://doi.org/10.1109/TPAMI.2005.159 .R Core Team. (2015). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing https://www.R-project.org/ .Romanski, P., & Kotthoff, L. (2014). FSelector: Selecting attributes. R package version 0.20. https://CRAN.R-project.org/package=FSelector .Shannon, C. E. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.Shepperd, M., & MacDonell, S. (2012). Evaluating prediction systems in software project estimation. Information and Software Technology, 54(8), 820–827.Shepperd, M., & Schofield, C. (1997). Estimating software project effort using analogies. Software Engineering, IEEE Transactions on, 23(11), 736–743.Somol, P., Pudil, P., & Kittler, J. (2004). Fast branch & bound algorithms for optimal feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(7), 900–912.Song, Q., & Shepperd, M. (2007). A new imputation method for small software project data sets. Journal of Systems and Software, 80(1), 51–62.Top, O. O., Ozkan, B., Nabi, M., Demirors, O. (2011). Internal and External Software Benchmark Repository Utilization for Effort Estimation. In Software Measurement, 2011 Joint Conference of the 21st Int’l Workshop on and 6th Int’l Conference on Software Process and Product Measurement (IWSM-MENSURA) (pp. 302–307). https://doi.org/10.1109/IWSM-MENSURA.2011.41 .Vinh, L.T., Thang, N.D., Lee, Y.-K. (2010). An improved maximum relevance and minimum redundancy feature selection algorithm based on normalized mutual information. In 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet (SAINT) (pp. 395–398). Presented at the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet (SAINT). https://doi.org/10.1109/SAINT.2010.50 .Witten, I.H., Frank, E., Hall, M.A., Pal, C.J. (2011). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann

    Automatic Bayesian Density Analysis

    Full text link
    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for {exploratory data analysis} are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data.Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19
    • …
    corecore