Authorship Attribution Using a Neural Network Language Model
In practice, training language models for individual authors is often
expensive because of limited data resources. In such cases, Neural Network
Language Models (NNLMs) generally outperform the traditional non-parametric
N-gram models. Here we investigate the performance of a feed-forward NNLM on an
authorship attribution problem, with moderate author set size and relatively
limited data. We also consider how the text topics impact performance. Compared
with a well-constructed N-gram baseline method with Kneser-Ney smoothing, the
proposed method achieves nearly a 2.5% reduction in perplexity and increases
author classification accuracy by 3.43% on average, given as few as 5 test
sentences. The performance is very competitive with the state of the art in
terms of accuracy and demand on test data. The source code, preprocessed
datasets, a detailed description of the methodology and results are available
at https://github.com/zge/authorship-attribution.
Comment: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16)
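The attribution idea in this abstract (score test sentences under one language model per author and pick the author whose model gives the lowest perplexity) can be sketched as follows. This is only an illustrative sketch: it uses an add-k smoothed bigram model as a stand-in, not the paper's feed-forward NNLM or its Kneser-Ney baseline, and the function names and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tokenize(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram(sentences):
    """Return (context counts, bigram counts) for one author's sentences."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = tokenize(sent)
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams

def perplexity(model, sentences, vocab_size, k=1.0):
    """Perplexity of the sentences under an add-k smoothed bigram model."""
    contexts, bigrams = model
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = tokenize(sent)
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(prev, cur)] + k) / (contexts[prev] + k * vocab_size)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

def attribute(models, test_sentences, vocab_size):
    """Pick the author whose model assigns the lowest perplexity."""
    return min(models, key=lambda a: perplexity(models[a], test_sentences, vocab_size))

# Toy corpus, invented purely for illustration.
corpus = {
    "author_a": ["the sea was calm tonight", "the sea was grey and cold"],
    "author_b": ["stocks rose sharply today", "stocks fell sharply at noon"],
}
vocab = {w for sents in corpus.values() for s in sents for w in s.lower().split()}
vocab.add("</s>")  # the end marker is also a predictable token
models = {author: train_bigram(sents) for author, sents in corpus.items()}
guess = attribute(models, ["the sea was cold"], len(vocab))
print(guess)
```

Swapping in a stronger per-author model only requires replacing `train_bigram` and `perplexity`; the lowest-perplexity decision rule stays the same.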
Evolution of statistical analysis in empirical software engineering research: Current state and steps forward
Software engineering research is evolving and papers are increasingly based
on empirical data from a multitude of sources, using statistical tests to
determine if and to what degree empirical evidence supports their hypotheses.
To investigate the practices and trends of statistical analysis in empirical
software engineering (ESE), this paper presents a review of a large pool of
papers from top-ranked software engineering journals. First, we manually
reviewed 161 papers; in the second phase of our method, we conducted a more
extensive semi-automatic classification of 5,196 papers spanning the years
2001 to 2015. Results from both review steps were used to: i) identify and
analyze the predominant practices in ESE (e.g., using t-test or ANOVA), as well
as relevant trends in usage of specific statistical methods (e.g.,
nonparametric tests and effect size measures) and, ii) develop a conceptual
model for a statistical analysis workflow with suggestions on how to apply
different statistical methods as well as guidelines to avoid pitfalls. Lastly,
we confirm existing claims that current ESE practices lack a standard to report
practical significance of results. We illustrate how practical significance can
be discussed in terms of both the statistical analysis and in the
practitioner's context.
Comment: journal submission, 34 pages, 8 figures
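To make the distinction between statistical and practical significance concrete, a test statistic can be reported together with an effect size. The sketch below computes Welch's t statistic and Cohen's d with the Python standard library; it is not the paper's workflow, and the sample values and the function names `welch_t` and `cohens_d` are assumptions made for this example.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Invented measurements, e.g. task completion times under two tools.
tool_x = [2.0, 4.0, 6.0]
tool_y = [1.0, 3.0, 5.0]

print(f"Welch t = {welch_t(tool_x, tool_y):.3f}")
print(f"Cohen's d = {cohens_d(tool_x, tool_y):.3f}")
```

The t statistic feeds a significance test, while d expresses the size of the difference in standard-deviation units, which is the quantity a practitioner would weigh against the cost of switching tools.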
Towards the optimal pixel size of DEM for automatic mapping of landslide areas
Determining the appropriate spatial resolution of a digital elevation model (DEM) is a key step for effective landslide analysis based on remote sensing data. Several studies have demonstrated that choosing the finest DEM resolution is not always the best solution, and that different DEM resolutions can suit different landslide applications. This study therefore aims to assess the influence of spatial resolution on automatic landslide mapping. A pixel-based approach was applied using a non-parametric and a parametric classification method, namely a feed-forward neural network (FFNN) and maximum likelihood (ML) classification, respectively. This also made it possible to determine how the choice of classification method affects the selection of DEM resolution. Landslide-affected areas were mapped based on four DEMs generated at 1 m, 2 m, 5 m and 10 m spatial resolution from airborne laser scanning (ALS) data. The performance of the landslide mapping was then evaluated against a landslide inventory map by computing a confusion matrix. The results suggest that the finest DEM scale is not always the best fit, although working at 1 m DEM resolution on the micro-topography scale can show different results. The best performance was found at 5 m DEM resolution for the FFNN and at 1 m DEM resolution for ML classification.
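The evaluation step described above, comparing a pixel-based landslide map with an inventory map through a confusion matrix, can be sketched as follows. This is a minimal illustration under assumptions: both maps are small binary grids (1 = landslide, 0 = stable) on the same pixel grid, and the grids and function names are invented for the example rather than taken from the study.

```python
def confusion_matrix(predicted, reference):
    """Count TP, FP, FN, TN over two equally sized binary grids."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for pred_row, ref_row in zip(predicted, reference):
        for pred, ref in zip(pred_row, ref_row):
            if pred == 1 and ref == 1:
                counts["tp"] += 1      # landslide correctly mapped
            elif pred == 1 and ref == 0:
                counts["fp"] += 1      # stable pixel mapped as landslide
            elif pred == 0 and ref == 1:
                counts["fn"] += 1      # landslide pixel missed
            else:
                counts["tn"] += 1      # stable pixel correctly mapped
    return counts

def overall_accuracy(counts):
    """Share of pixels on which the map and the inventory agree."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Invented 2x2 grids standing in for a classified map and an inventory map.
classified = [[1, 0],
              [1, 1]]
inventory = [[1, 0],
             [0, 1]]

cm = confusion_matrix(classified, inventory)
print(cm, overall_accuracy(cm))
```

Repeating this computation for the maps produced at each DEM resolution gives one comparable score per resolution and classifier.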
Model the System from Adversary Viewpoint: Threats Identification and Modeling
Security attacks are hard to understand, often expressed with unfriendly and
limited details, making it difficult for security experts and for security
analysts to create intelligible security specifications that explain, for
instance, Why (attack objective), What (system assets, goals, etc.), and
How (attack method) an adversary achieved the attack goals. We introduce in this
paper a security attack meta-model for our SysML-Sec framework, developed to
improve the threat identification and modeling through the explicit
representation of security concerns with knowledge representation techniques.
Our proposed meta-model enables the specification of these concerns through
ontological concepts which define the semantics of the security artifacts and
are introduced using SysML-Sec diagrams. This meta-model also enables representing
the relationships that tie several such concepts together. This representation
is then used for reasoning about the knowledge introduced by system designers
as well as security experts through the graphical environment of the SysML-Sec
framework.
Comment: In Proceedings AIDP 2014, arXiv:1410.322
Ensemble Committees for Stock Return Classification and Prediction
This paper considers a portfolio trading strategy formulated by algorithms in
the field of machine learning. The profitability of the strategy is measured by
the algorithm's capability to consistently and accurately identify stock
indices with positive or negative returns, and to generate a preferred
portfolio allocation on the basis of a learned model. Stocks are characterized
by time series data sets consisting of technical variables that reflect market
conditions in a previous time interval, which are utilized to produce binary
classification decisions in subsequent intervals. The learned model is
constructed as a committee of random forest classifiers, a non-linear support
vector machine classifier, a relevance vector machine classifier, and a
constituent ensemble of k-nearest neighbors classifiers. The Global Industry
Classification Standard (GICS) is used to explore the ensemble model's efficacy
within the context of various fields of investment including Energy, Materials,
Financials, and Information Technology. Data from 2006 to 2012, inclusive, are
considered, which are chosen for providing a range of market circumstances for
evaluating the model. The model is observed to achieve an accuracy of
approximately 70% when predicting stock price returns three months in advance.
Comment: 15 pages, 4 figures, Neukom Institute Computational Undergraduate
Research prize - second place
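A committee of classifiers that produces a binary up/down decision can be sketched as below. The abstract does not state how the members' outputs are combined, so unweighted majority voting is an assumption here, and the three threshold rules are hypothetical stand-ins for the random forest, support vector machine, relevance vector machine, and k-nearest neighbors members named above.

```python
def momentum_rule(features):
    """Hypothetical member: vote +1 if the prior-interval return was positive."""
    return 1 if features["prior_return"] > 0 else -1

def moving_average_rule(features):
    """Hypothetical member: vote +1 if price is above its moving average."""
    return 1 if features["price"] > features["moving_average"] else -1

def volatility_rule(features):
    """Hypothetical member: vote +1 only when volatility is below a threshold."""
    return 1 if features["volatility"] < 0.2 else -1

def committee_predict(members, features):
    """Unweighted majority vote over members that each return +1 or -1."""
    votes = [member(features) for member in members]
    return 1 if sum(votes) > 0 else -1

members = [momentum_rule, moving_average_rule, volatility_rule]

# Invented technical variables for one index in one interval.
sample = {"prior_return": 0.03, "price": 101.0, "moving_average": 99.5, "volatility": 0.35}
decision = committee_predict(members, sample)
print(decision)
```

An odd number of members avoids ties in the binary case; with an even number, the rule above resolves a tie to -1.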