170 research outputs found

Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

Author: Bennin Kwabena Ebo
Chiha I.
Ghotra Baljinder
Menzies Tim
Omran M.
Pedregosa Fabian
Refaeilzadeh Payam
Tan Ming
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/02/2018

We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria, nor did they (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing. Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201
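As a rough illustration of the "5*5 cross-validation" mentioned in the abstract, the sketch below generates index splits for 5 repeats of 5-fold cross-validation. Reading "5*5" as five repeats of five folds is an assumption, and the function name `repeated_kfold` and its parameters are hypothetical; this is not the authors' code. In a study of this kind, any resampling step such as SMOTE would normally be applied to the training indices only.

```python
import random

def repeated_kfold(n_samples, n_repeats=5, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat.

    Hypothetical helper: assumes "5*5" means 5 repeats of 5-fold CV.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh random ordering for each repeat
        for k in range(n_folds):
            test = order[k::n_folds]  # every n_folds-th item, offset k
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test

splits = list(repeated_kfold(20))
print(len(splits))  # 25 train/test pairs
```

Each of the 25 pairs keeps training and test indices disjoint, so a model (and any oversampling) fitted on the training part never sees the held-out part.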

arXiv.org e-Print Archive

Crossref

Machine learning approaches for detecting tropical cyclone formation using satellite data

Author: Georanos
Helms
Jungho Im
Liaw
Minsang Kim
Myong-In Lee
Myung-Sook Park
Powers
Refaeilzadeh
Seonyoung Park
Zhang
Publication venue: 'MDPI AG'
Publication date: 01/05/2019

This study compared detection skill for tropical cyclone (TC) formation using models based on three different machine learning (ML) algorithms, namely decision trees (DT), random forest (RF), and support vector machines (SVM), and a model based on Linear Discriminant Analysis (LDA). Eight predictors were derived from WindSat satellite measurements of ocean surface wind and precipitation over the western North Pacific for 2005-2009. All of the ML approaches performed better than LDA, with significantly higher hit rates (94 to 96%, compared with ~77% for LDA), although the false alarm rate of the ML models (21-28%) was slightly higher than that of LDA (~13%). In addition, the ML models could detect TC formation as early as 26-30 h before the system was first diagnosed as a tropical depression in the JTWC best track, which was 5 to 9 h earlier than LDA. The skill differences among the ML models were smaller than the difference between the ML models and LDA. Large yearly variation in forecast lead time was common to all models because of the limited sampling from an orbiting satellite. This study highlights that ML approaches provide improved skill for detecting TC formation compared with conventional linear approaches.
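The hit rate and false alarm figures quoted above can be illustrated with a small sketch. The abstract does not state the exact definitions used, so the formulas here are assumptions: hit rate as hits / (hits + misses) and false alarm ratio as false alarms / (hits + false alarms). The function name and the counts are hypothetical and are not taken from the study.

```python
def detection_scores(hits, misses, false_alarms):
    """Return (hit_rate, false_alarm_ratio) from contingency counts.

    Assumed definitions (not confirmed by the abstract):
      hit rate          = hits / (hits + misses)
      false alarm ratio = false_alarms / (hits + false_alarms)
    """
    hit_rate = hits / (hits + misses)
    false_alarm_ratio = false_alarms / (hits + false_alarms)
    return hit_rate, false_alarm_ratio

# Hypothetical counts chosen only to illustrate the arithmetic.
hit_rate, far = detection_scores(hits=95, misses=5, false_alarms=30)
print(f"hit rate = {hit_rate:.2f}, false alarm ratio = {far:.2f}")
```

If the study instead used the probability of false detection, the denominator would be false alarms plus correct negatives, which this sketch does not model.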

Crossref

Directory of Open Access Journals

ScholarWorks@UNIST

User relationship classification of facebook messenger mobile data using WEKA

Author: C Anglano
D Quick
Daniel Walnycky
F Azuaje
F Daryabar
FN Dezfouli
K Barmpatsalou
L Breiman
NDW Cahyani
P Refaeilzadeh
TR Patil
TY Yang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

© Springer Nature Switzerland AG 2018. Mobile devices hold a wealth of information about their users and their digital and physical activities (e.g. online browsing and physical location). Therefore, in any crime investigation, artifacts obtained from a mobile device can be crucial. However, the variety of mobile platforms and applications (apps) and the significant size of the data compound existing challenges in forensic investigations. In this paper, we explore the potential of machine learning in mobile forensics, specifically in the context of Facebook messenger artifact acquisition and analysis. Using Quick and Choo (2017)’s Digital Forensic Intelligence Analysis Cycle (DFIAC) as the guiding framework, we demonstrate how one can acquire Facebook messenger app artifacts from an Android device and an iOS device (the latter is, using existing forensic tools. Based on the acquired evidence, we create 199 data-instances to train WEKA classifiers (i.e. ZeroR, J48 and Random tree) with the aim of classifying the device owner’s contacts and determining their mutual relationship strength.

Crossref

OPUS - University of Technology Sydney

Comparison of machine learning algorithms for retrieval of water quality indicators in case-II waters: a case study of Hong Kong

Author: Danling Tang
Eibe
Gin
Hung Ho
Janet Nichol
Kuhn
Kwon Lee
Lilian Pun
Majid Nazeer
Man Wong
Panchal
Refaeilzadeh
Sawaid Abbas
Sidrah Hafeez
Small
Vermote
Publication venue: 'MDPI AG'
Publication date: 01/03/2019

Anthropogenic activities in coastal regions are endangering marine ecosystems. Coastal waters classified as case-II waters are especially complex due to the presence of different constituents. Recent advances in remote sensing technology have made it possible to capture the spatiotemporal variability of the constituents in coastal waters. The present study evaluates the potential of remote sensing using machine learning techniques for improving water quality estimation over the coastal waters of Hong Kong. Concentrations of suspended solids (SS), chlorophyll-a (Chl-a), and turbidity were estimated with several machine learning techniques including Artificial Neural Network (ANN), Random Forest (RF), Cubist regression (CB), and Support Vector Regression (SVR). Landsat (5, 7, 8) reflectance data were compared with in situ reflectance data to evaluate the performance of machine learning models. The highest accuracies of the water quality indicators were achieved by ANN for both in situ reflectance data (89%-Chl-a, 93%-SS, and 82%-turbidity) and satellite data (91%-Chl-a, 92%-SS, and 85%-turbidity). The water quality parameters retrieved by the ANN model were further compared to those retrieved by the “standard Case-2 Regional/Coast Colour” (C2RCC) processing chain model C2RCC-Nets. The root mean square errors (RMSEs) for estimating SS and Chl-a were 3.3 mg/L and 2.7 µg/L, respectively, using ANN, whereas RMSEs were 12.7 mg/L and 12.9 µg/L for suspended particulate matter (SPM) and Chl-a concentrations, respectively, when C2RCC was applied to Landsat-8 data. A relative variable importance analysis was also conducted to investigate the consistency between in situ reflectance data and satellite data, and the results show that both datasets are similar. The red band (wavelength ≈ 0.665 µm) and the product of red and green band (wavelength ≈ 0.560 µm) were influential inputs in both reflectance data sets for estimating SS and turbidity, and the ratio between red and blue band (wavelength ≈ 0.490 µm) as well as the ratio between infrared (wavelength ≈ 0.865 µm) and blue band and green band proved to be more useful for the estimation of Chl-a concentration, due to their sensitivity to high turbidity in the coastal waters. The results indicate that the NN-based machine learning approaches perform better and, thus, can be used for improved water quality monitoring with satellite data in optically complex coastal waters.
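The RMSE values quoted in the abstract follow the standard root mean square error definition, which can be sketched as below. The function name and the sample values are hypothetical and only illustrate the arithmetic; they are not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical concentrations (e.g. mg/L); each prediction is off by 3 units.
print(rmse([10.0, 12.0, 8.0], [13.0, 9.0, 11.0]))  # 3.0
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the measured quantity and penalizes large individual errors more heavily than a mean absolute error would.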

Crossref

Directory of Open Access Journals

PolyU Institutional Repository

Sussex Research Online

Projected increase in obesity and non‐alcoholic‐steatohepatitis–related liver transplantation waitlist additions in the United States

Author: Agresti
Ajmera
Belli
Centers for Disease Control and Prevention
Doycheva
Dulai
Faraway
Flegal
Goldberg
Grundy
Hassan
Hassan
Jöreskog
Kuczmarski
Lazaridis
Malik
Marchesini
Musso
Ng
Noureddin
Orman
Parikh
Parikh
Patel
Promrat
Refaeilzadeh
Schlansky
Singal
Skinner
Starley
Stekhoven
Tetri
Thuluvath
United States Census Bureau
VanWagner
Waljee
Wong
Young
Publication venue: 'Wiley'
Publication date: 01/08/2019

Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/151250/1/hep29473-sup-0001-suppinfo.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/151250/2/hep29473.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/151250/3/hep29473_am.pd

Crossref

Deep Blue Documents at the University of Michigan

Non-linear Autoregressive Neural Networks to Forecast Short-Term Solar Radiation for Photovoltaic Energy Predictions

Author: A Madanchi
A Qazi
A Tealab
AG Expósito
AK Yadav
AS Weigend
C Voyant
C Voyant
CA Gueymard
CJ Willmott
DC Montgomery
DP Mandic
DR Legates
E Dickinson
G Chandrashekar
GE Box
H Rahimi-Eichi
H Xing
HT Siegelmann
IH Witten
J Aghaei
JD Hamilton
JS Vardakas
L Bottaccioli
L Bottaccioli
L Bottaccioli
LK Hansen
M Hosenuzzaman
M Kubat
M Norgaard
N Srivastava
P Refaeilzadeh
P Siano
PJ Brockwell
R Rajamani
S Haykin
S Makridakis
S Rajakaruna
S Weckx
V Badescu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Green energy is now considered a viable way to curb CO2 emissions and greenhouse effects. Indeed, Renewable Energy Sources (RES) are expected to cover 40% of total energy demand by 2040. This will advance decentralized and cooperative power distribution systems, also called smart grids. Among RES, solar energy will play a crucial role. However, reliable models and tools are needed to forecast and estimate renewable energy production with good accuracy over short-term time periods. These tools will unlock new services for smart grid management. In this paper, we propose an innovative methodology for implementing two different non-linear autoregressive neural networks to forecast Global Horizontal Solar Irradiance (GHI) over short-term time periods (i.e., from 15 to 120 min ahead). Both neural networks have been implemented, trained and validated using a dataset consisting of four years of solar radiation values collected by a real weather station. We also present the experimental results, discussing and comparing the accuracy of both neural networks. The resulting GHI forecast is then given as input to a Photovoltaic simulator to predict energy production over short-term time periods. Finally, we present the results of this Photovoltaic energy estimation and discuss their accuracy.
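An autoregressive forecaster of the kind described above predicts a future value from a window of past values. The sketch below shows only the data-preparation step, building (lagged inputs, target) pairs from a series; it does not implement either neural network from the paper. The function name, the lag count, and the assumption of a 15-minute sampling interval (so that a horizon of 1 to 8 steps would span 15 to 120 min) are all hypothetical.

```python
def make_lagged_pairs(series, n_lags, horizon):
    """Build (inputs, target) pairs for autoregressive forecasting.

    inputs: the n_lags most recent values up to time t
    target: the value `horizon` steps after time t
    """
    if n_lags < 1 or horizon < 1:
        raise ValueError("n_lags and horizon must be positive")
    pairs = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        window = list(series[start:start + n_lags])
        target = series[start + n_lags - 1 + horizon]
        pairs.append((window, target))
    return pairs

# Hypothetical series; with 15-minute samples, horizon=2 means 30 min ahead.
pairs = make_lagged_pairs(list(range(10)), n_lags=3, horizon=2)
print(pairs[0])    # ([0, 1, 2], 4)
print(len(pairs))  # 6
```

Pairs built this way could be fed to any regression model; a separate set of pairs would be needed for each forecast horizon.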

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Drivers and Socioeconomic Impacts of Tourism Participation in Protected Areas

Author: A Balmford
A Kiss
A Spiteri
A Viña
AP Kinzig
BR Su
C Li
C Li
C Wallace
CH Sekercioglu
Christine A. Vogt
CJ Stem
D Newsome
DB Weaver
DJ Timothy
ER Grumbine
F Berkes
F Ellis
F Ellis
GB Schaller
GB Schaller
GM He
Guangming He
I Scoones
I Singh
J Coria
J Liu
J Liu
J Liu
J Wheelock
JG Liu
JG Liu
JG Liu
JG Liu
JG Liu
Jianguo Liu
Junyan Luo
K Brown
K Brown
K Hirano
KB Ghimire
KB Ghimire
Kenneth A. Frank
L An
L An
L Zhong
MJ Walpole
MP Bookbinder
N Myers
N Salafsky
O Kruger
P Refaeilzadeh
PR Rosenbaum
PR Rosenbaum
R Buckley
R Hughes
R Scheyvens
RC Buckley
RW Butler
RW Butler
S Haggblade
S Wunder
S Wunder
Sebastian C. A. Ferse
T Reardon
TE Lovejoy
TO McShane
TO McShane
TO McShane
W Li
W Li
Wei Liu
Y Xie
Publication venue: Public Library of Science
Publication date: 01/01/2012

Nature-based tourism has the potential to enhance global biodiversity conservation by providing alternative livelihood strategies for local people, which may alleviate poverty in and around protected areas. Despite the popularity of the concept of nature-based tourism as an integrated conservation and development tool, empirical research on its actual socioeconomic benefits, on the distributional pattern of these benefits, and on its direct driving factors is lacking, because relevant long-term data are rarely available. In a multi-year study in Wolong Nature Reserve, China, we followed a representative sample of 220 local households from 1999 to 2007 to investigate the diverse benefits that these households received from recent development of nature-based tourism in the area. Within eight years, the number of households directly participating in tourism activities increased from nine to sixty. In addition, about two-thirds of the other households received indirect financial benefits from tourism. We constructed an empirical household economic model to identify the factors that led to household-level participation in tourism. The results reveal the effects of local households' livelihood assets (i.e., financial, human, natural, physical, and social capitals) on the likelihood to participate directly in tourism. In general, households with greater financial (e.g., income), physical (e.g., access to key tourism sites), human (e.g., education), and social (e.g., kinship with local government officials) capitals and less natural capital (e.g., cropland) were more likely to participate in tourism activities. We found that residents in households participating in tourism tended to perceive more non-financial benefits in addition to more negative environmental impacts of tourism compared with households not participating in tourism. These findings suggest that socioeconomic impact analysis and change monitoring should be included in nature-based tourism management systems for long-term sustainability of protected areas.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Recurrent Signature Patterns in HIV-1 B Clade Envelope Glycoproteins Associated with either Early or Chronic Infections

Author: A Bultmann
A Land
A Land
A Ly
A Rehm
A Trkola
AJ McMichael
AK Dhillon
Alan S. Lapedes
Allan C. DeCamp
B Chohan
B Efron
B Efron
B Gaschen
B Korber
B Korber
Barton F. Haynes
Beatrice H. Hahn
Bette Korber
BF Haynes
BF Keele
BF Keele
Brandon F. Keele
Brian Gaschen
C Rizzuto
CA Derdeyn
CD Rizzuto
Charles B. Hicks
Chunlai Jiang
CN Scanlan
Craig A. Magaret
D Boyd
DH Barouch
E Hunter
EG Cormier
EL Delwart
EL Turnbull
ES Gray
EW Fiebig
Feng Gao
FK Treurnicht
G Blot
G Pancino
G von Heijne
George M. Shaw
Georgia D. Tomaras
GH Learn
H Ellerbrok
H Li
H Xhu
Hui Li
HY Lee
J Auwerx
J Felsenstein
J Irungu
J Jiang
J Liu
J Sterjovski
JD Storey
Jeffrey A. Anderson
Jesus F. Salazar-Gonzalez
JF Salazar-Gonzalez
JF Salazar-Gonzalez
JL Kirchherr
JL Mellquist
JM Binley
JN Reitter
John A. T. Young
Joseph G. Sodroski
Joseph J. Eron
Julie M. Decker
K Katoh
K Ritola
Kelly A. Soderberg
KJ Doores
L Chen
L Kong
L Margolis
L Wu
Li-Hua Ping
LQ Zhang
M Braibant
M Coetzer
M Kearney
M Li
M Sagar
M Stone
Marcus Daniels
Martin Markowitz
Michael S. Saag
Ming Zhang
Mohammed Asmal
Mohan Krishnamoorthy
MR Abrahams
Myron S. Cohen
N Goonetilleke
N Wood
Norman L. Letvin
P Borrow
P Borrow
P Refaeilzadeh
P Yang
Paul A. Goepfert
PB Gilbert
PD Kwong
Peter B. Gilbert
Peter T. Hraber
PL Moore
PL Moore
R Kohavi
R Rong
R Rong
R Shankarappa
R Wyatt
RA McCaffrey
RD Astronomo
RE Haaland
Ronald Swanstrom
RR Bouckaert
RW Sanders
RW Sanders
S Gnanakaran
S Gnanakaran
S Rerks-Ngarm
S Salzberg
S Sato
S. Gnanakaran
SC Piller
SD Frost
Shuyi Wang
SK Wang
SM Wolinsky
SR Eddy
T Bhattacharya
T Golubchik
T Hirbod
T Murakami
T Zhou
T Zhou
T Zhu
Tanmoy Bhattacharya
TB Geijtenbeek
TF Wolfs
TG Edwards
Tongye Shen
TP Hopp
V Kalia
W Fischer
William A. Blattner
William R. Schief
X Wei
X Wu
Y Bengio
Y Furuta
Y Kliger
Y Li
Y Li
Y Li
Y Li
Y Liu
Yih-En Andrew Ban
ZL Brumme
ZL Brumme
Publication venue: Public Library of Science
Publication date: 01/01/2011

Here we have identified HIV-1 B clade Envelope (Env) amino acid signatures from early in infection that may be favored at transmission, as well as patterns of recurrent mutation in chronic infection that may reflect common pathways of immune evasion. To accomplish this, we compared thousands of sequences derived by single genome amplification from several hundred individuals who were sampled either early in infection or when chronically infected. Samples were divided at the outset into hypothesis-forming and validation sets, and we used phylogenetically corrected statistical strategies to identify signatures, systematically scanning all of Env. Signatures included single amino acids, glycosylation motifs, and multi-site patterns based on functional or structural groupings of amino acids. We identified signatures near the CCR5 co-receptor-binding region, near the CD4 binding site, and in the signal peptide and cytoplasmic domain, which may influence Env expression and processing. Two signature patterns associated with transmission were particularly interesting. The first was the most statistically robust signature, located in position 12 in the signal peptide. The second was the loss of an N-linked glycosylation site at positions 413–415; the presence of this site has been recently found to be associated with escape from potent and broad neutralizing antibodies, consistent with enabling a common pathway for immune escape during chronic infection. Its recurrent loss in early infection suggests it may impact fitness at the time of transmission or during early viral expansion. The signature patterns we identified implicate Env expression levels in selection at viral transmission or in early expansion, and suggest that immune evasion patterns that recur in many individuals during chronic infection when antibodies are present can be selected against when the infection is being established prior to the adaptive immune response.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Carolina Digital Repository