170 research outputs found

    Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

    We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria, nor did they (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing. Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201

    Machine learning approaches for detecting tropical cyclone formation using satellite data

    This study compared detection skill for tropical cyclone (TC) formation using models based on three different machine learning (ML) algorithms, namely decision trees (DT), random forest (RF), and support vector machines (SVM), and a model based on Linear Discriminant Analysis (LDA). Eight predictors were derived from WindSat satellite measurements of ocean surface wind and precipitation over the western North Pacific for 2005-2009. All of the ML approaches performed better, with significantly higher hit rates ranging from 94 to 96% compared with LDA (~77%), although the false alarm rate of the ML models was slightly higher (21-28%) than that of LDA (~13%). In addition, the ML models could detect TC formation as early as 26-30 h before the system was first diagnosed as a tropical depression in the JTWC best track, which was also 5 to 9 h earlier than LDA. The skill differences across the ML models were smaller than the difference between the ML models and LDA. Large yearly variation in forecast lead time was common to all models because of the limited sampling available from an orbiting satellite. This study highlights that ML approaches provide improved skill for detecting TC formation compared with conventional linear approaches.

    User relationship classification of facebook messenger mobile data using WEKA

    © Springer Nature Switzerland AG 2018. Mobile devices hold a wealth of information about their users and those users' digital and physical activities (e.g. online browsing and physical location). Therefore, in any crime investigation, artifacts obtained from a mobile device can be crucial. However, the variety of mobile platforms and applications (apps) and the significant size of the data compound existing challenges in forensic investigations. In this paper, we explore the potential of machine learning in mobile forensics, specifically in the context of Facebook messenger artifact acquisition and analysis. Using Quick and Choo (2017)’s Digital Forensic Intelligence Analysis Cycle (DFIAC) as the guiding framework, we demonstrate how one can acquire Facebook messenger app artifacts from an Android device and an iOS device (the latter is, using existing forensic tools. Based on the acquired evidence, we create 199 data-instances to train WEKA classifiers (i.e. ZeroR, J48 and Random tree) with the aim of classifying the device owner’s contacts and determining their mutual relationship strength.

    Comparison of machine learning algorithms for retrieval of water quality indicators in case-II waters: a case study of Hong Kong

    Anthropogenic activities in coastal regions are endangering marine ecosystems. Coastal waters classified as case-II waters are especially complex due to the presence of different constituents. Recent advances in remote sensing technology have made it possible to capture the spatiotemporal variability of the constituents in coastal waters. The present study evaluates the potential of remote sensing, using machine learning techniques, for improving water quality estimation over the coastal waters of Hong Kong. Concentrations of suspended solids (SS), chlorophyll-a (Chl-a), and turbidity were estimated with several machine learning techniques including Artificial Neural Network (ANN), Random Forest (RF), Cubist regression (CB), and Support Vector Regression (SVR). Landsat (5, 7, 8) reflectance data were compared with in situ reflectance data to evaluate the performance of machine learning models. The highest accuracies for the water quality indicators were achieved by ANN for both in situ reflectance data (89%-Chl-a, 93%-SS, and 82%-turbidity) and satellite data (91%-Chl-a, 92%-SS, and 85%-turbidity). The water quality parameters retrieved by the ANN model were further compared to those retrieved by the “standard Case-2 Regional/Coast Colour” (C2RCC) processing chain model C2RCC-Nets. The root mean square errors (RMSEs) for estimating SS and Chl-a were 3.3 mg/L and 2.7 µg/L, respectively, using ANN, whereas RMSEs were 12.7 mg/L and 12.9 µg/L for suspended particulate matter (SPM) and Chl-a concentrations, respectively, when C2RCC was applied on Landsat-8 data. A relative variable importance analysis was also conducted to investigate the consistency between in situ reflectance data and satellite data, and the results show that both datasets are similar. The red band (wavelength ≈ 0.665 µm) and the product of red and green band (wavelength ≈ 0.560 µm) were influential inputs in both reflectance data sets for estimating SS and turbidity, and the ratio between red and blue band (wavelength ≈ 0.490 µm) as well as the ratio between infrared (wavelength ≈ 0.865 µm) and blue band and green band proved to be more useful for the estimation of Chl-a concentration, due to their sensitivity to high turbidity in the coastal waters. The results indicate that the NN-based machine learning approaches perform better and, thus, can be used for improved water quality monitoring with satellite data in optically complex coastal waters.

    Non-linear Autoregressive Neural Networks to Forecast Short-Term Solar Radiation for Photovoltaic Energy Predictions

    Nowadays, green energy is considered a viable solution for curbing CO2 emissions and greenhouse effects. Indeed, it is expected that Renewable Energy Sources (RES) will cover 40% of total energy demand by 2040. This will advance decentralized and cooperative power distribution systems, also called smart grids. Among RES, solar energy will play a crucial role. However, reliable models and tools are needed to forecast and estimate renewable energy production with good accuracy over short-term time periods. These tools will unlock new services for smart grid management. In this paper, we propose an innovative methodology for implementing two different non-linear autoregressive neural networks to forecast Global Horizontal Solar Irradiance (GHI) over short-term time periods (i.e., from 15 to 120 min ahead). Both neural networks have been implemented, trained and validated using a dataset consisting of four years of solar radiation values collected by a real weather station. We also present the experimental results, discussing and comparing the accuracy of both neural networks. The resulting GHI forecast is then given as input to a Photovoltaic simulator to predict energy production over short-term time periods. Finally, we present the results of this Photovoltaic energy estimation and discuss their accuracy.

    Drivers and Socioeconomic Impacts of Tourism Participation in Protected Areas

    Nature-based tourism has the potential to enhance global biodiversity conservation by providing alternative livelihood strategies for local people, which may alleviate poverty in and around protected areas. Despite the popularity of the concept of nature-based tourism as an integrated conservation and development tool, empirical research on its actual socioeconomic benefits, on the distributional pattern of these benefits, and on its direct driving factors is lacking, because relevant long-term data are rarely available. In a multi-year study in Wolong Nature Reserve, China, we followed a representative sample of 220 local households from 1999 to 2007 to investigate the diverse benefits that these households received from recent development of nature-based tourism in the area. Within eight years, the number of households directly participating in tourism activities increased from nine to sixty. In addition, about two-thirds of the other households received indirect financial benefits from tourism. We constructed an empirical household economic model to identify the factors that led to household-level participation in tourism. The results reveal the effects of local households' livelihood assets (i.e., financial, human, natural, physical, and social capital) on the likelihood of participating directly in tourism. In general, households with greater financial (e.g., income), physical (e.g., access to key tourism sites), human (e.g., education), and social (e.g., kinship with local government officials) capital and less natural capital (e.g., cropland) were more likely to participate in tourism activities. We found that residents in households participating in tourism tended to perceive more non-financial benefits, as well as more negative environmental impacts of tourism, compared with households not participating in tourism. These findings suggest that socioeconomic impact analysis and change monitoring should be included in nature-based tourism management systems for long-term sustainability of protected areas.

    Recurrent Signature Patterns in HIV-1 B Clade Envelope Glycoproteins Associated with either Early or Chronic Infections

    Here we have identified HIV-1 B clade Envelope (Env) amino acid signatures from early in infection that may be favored at transmission, as well as patterns of recurrent mutation in chronic infection that may reflect common pathways of immune evasion. To accomplish this, we compared thousands of sequences derived by single genome amplification from several hundred individuals who were either sampled early in infection or chronically infected. Samples were divided at the outset into hypothesis-forming and validation sets, and we used phylogenetically corrected statistical strategies to identify signatures, systematically scanning all of Env. Signatures included single amino acids, glycosylation motifs, and multi-site patterns based on functional or structural groupings of amino acids. We identified signatures near the CCR5 co-receptor-binding region, near the CD4 binding site, and in the signal peptide and cytoplasmic domain, which may influence Env expression and processing. Two signature patterns associated with transmission were particularly interesting. The first was the most statistically robust signature, located at position 12 in the signal peptide. The second was the loss of an N-linked glycosylation site at positions 413–415; the presence of this site has recently been found to be associated with escape from potent and broad neutralizing antibodies, consistent with enabling a common pathway for immune escape during chronic infection. Its recurrent loss in early infection suggests it may impact fitness at the time of transmission or during early viral expansion. The signature patterns we identified implicate Env expression levels in selection at viral transmission or in early expansion, and suggest that immune evasion patterns that recur in many individuals during chronic infection when antibodies are present can be selected against when the infection is being established prior to the adaptive immune response.