Simplifying Deep-Learning-Based Model for Code Search
To accelerate software development, developers frequently search and reuse
existing code snippets from a large-scale codebase, e.g., GitHub. Over the
years, researchers proposed many information retrieval (IR) based models for
code search, which match keywords in the query with code text. However, they fail
to bridge the semantic gap between query and code. To address this challenge, Gu
et al. proposed a deep-learning-based model named DeepCS. It jointly embeds
method code and natural language description into a shared vector space, where
methods related to a natural language query are retrieved according to their
vector similarities. However, DeepCS's working process is complicated and
time-consuming. To overcome this issue, we propose a simplified model,
CodeMatcher, that leverages IR techniques but retains many features of
DeepCS. In brief, CodeMatcher combines query keywords in their original order,
performs a fuzzy search on the name and body strings of methods, and returns the
best-matched methods, i.e., those matching the longest sequence of query keywords. We verified its
effectiveness on a large-scale codebase with about 41k repositories.
Experimental results showed the simplified model CodeMatcher outperforms DeepCS
by 97% in terms of MRR (a widely used accuracy measure for code search), and it
is over 66 times faster than DeepCS. Besides, compared with the
state-of-the-art IR-based model CodeHow, CodeMatcher also improves MRR by
73%. We also observed that fusing the advantages of IR-based and
deep-learning-based models is promising, because the two approaches naturally
complement each other, and that improving the quality of method naming helps
code search, since method names play an important role in connecting query and code.
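As a rough illustration of the retrieval scheme described above, the sketch below ranks candidate methods by cosine similarity to a query embedding and computes MRR over a set of queries. The method names and vectors are invented for illustration; a real system such as DeepCS learns the embeddings from data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings for one query and three candidate methods.
query_vec = [0.9, 0.1, 0.2]
methods = {
    "readFileToString": [0.8, 0.2, 0.1],
    "parseJson":        [0.1, 0.9, 0.3],
    "openConnection":   [0.2, 0.1, 0.9],
}

# Rank methods by similarity to the query, best first.
ranked = sorted(methods, key=lambda m: cosine(query_vec, methods[m]),
                reverse=True)

def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank: each entry is the 1-based rank of the first
    relevant method for a query, or 0 if nothing relevant was retrieved."""
    return sum(1.0 / r if r else 0.0
               for r in first_relevant_ranks) / len(first_relevant_ranks)
```

With these made-up vectors, `ranked[0]` is `readFileToString`, and `mrr([1, 2, 0])` evaluates to 0.5.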
Predicting Deviations in Software Quality by Using Relative Critical Value Deviation Metrics
Abstract: We develop a new metric, Relative Critical Value Deviation (RCVD).
How Secure Are Good Loans: Validating Loan-Granting Decisions And Predicting Default Rates On Consumer Loans
The failure or success of the banking industry depends largely on the industry's ability to properly evaluate credit risk. In the consumer-lending context, the bank's goal is to maximize income by issuing as many good loans to consumers as possible while avoiding the losses associated with bad loans. Mistakes can severely affect profits because the losses associated with one bad loan may undermine the income earned on many good loans. Therefore, banks carefully evaluate the financial status of each customer as well as their creditworthiness, and weigh them against the bank's internal loan-granting policies. Recognizing that even a small improvement in credit-scoring accuracy translates into significant future savings, the banking industry and the scientific community have been employing various machine learning and traditional statistical techniques to improve credit risk prediction accuracy. This paper examines historical data from consumer loans issued by a financial institution to individuals that the institution deemed to be qualified customers. The data consists of the financial attributes of each customer and includes a mixture of loans that the customers paid off and defaulted upon. The paper uses three different data mining techniques (decision trees, neural networks, logit regression) and an ensemble model that combines the three techniques to predict whether a particular customer defaulted on or paid off his/her loan. The paper then compares the effectiveness of each technique and analyzes the risk of default inherent in each loan and group of loans. The data mining classification techniques and analysis can enable banks to more precisely classify consumers into various credit risk groups.
Knowing what risk group a consumer falls into would allow a bank to fine-tune its lending policies by recognizing high-risk groups of consumers to whom loans should not be issued, and by identifying safer loans that should be issued on terms commensurate with the risk of default.
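The ensemble step mentioned above, which combines decision trees, neural networks, and logit regression, can be as simple as a majority vote over the three models' labels. The per-model predictions below are invented; this sketch shows only the combination step, not the underlying classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model labels ('default' / 'paid') by simple majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three models for one loan applicant.
tree_pred, nn_pred, logit_pred = "paid", "default", "paid"
ensemble_pred = majority_vote([tree_pred, nn_pred, logit_pred])  # "paid"
```

An ensemble like this often outperforms its individual members when their errors are not strongly correlated, which is the usual motivation for combining heterogeneous techniques.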
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting of the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.
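As a concrete instance of the inductive process the survey describes, the sketch below learns a minimal multinomial Naive Bayes classifier from a handful of preclassified documents. The training texts and category names are made up, and Naive Bayes is only one of the many learning methods such a survey covers.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label) pairs. Learns per-category word counts,
    the ingredients of a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Pick the category with the highest log-probability,
    using Laplace (add-one) smoothing for unseen words."""
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented preclassified documents for two categories.
docs = [
    ("stocks rise on earnings", "business"),
    ("market falls amid trade fears", "business"),
    ("team wins championship final", "sport"),
    ("striker scores winning goal", "sport"),
]
model = train(docs)
```

For example, `classify("market earnings rise", *model)` returns `"business"`; the classifier was built entirely from the labeled examples, with no hand-written rules.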
Information asset analysis: credit scoring and credit suggestion
Risk assessment is important for financial institutions, especially in loan applications. Some have already implemented their own credit-scoring mechanisms to evaluate their clients' risk and make decisions based on this indicator. In fact, the data gathered by financial institutions is a valuable source of information to create information assets, from which credit-scoring mechanisms can be developed. The purpose of this paper is to create, from information assets, a decision mechanism that is able to evaluate a client's risk. Furthermore, a suggestive algorithm is presented to better explain and give insight into how the decision mechanism values attributes.
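A minimal sketch of such a decision mechanism, assuming a logistic scoring model with hand-picked attribute weights (a real institution would fit these to its historical loan data): the per-attribute contributions illustrate the kind of breakdown a suggestive algorithm can surface to explain how the mechanism values each attribute.

```python
import math

# Hypothetical attribute weights for a toy logistic credit-scoring model.
WEIGHTS = {"income": 0.8, "debt_ratio": -1.5, "late_payments": -0.9}
BIAS = 0.2

def score(client):
    """Probability that the client is a good risk: sigmoid of a linear score."""
    z = BIAS + sum(WEIGHTS[k] * client[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def contributions(client):
    """Per-attribute contribution to the raw score; negative values
    show which attributes pushed the decision toward rejection."""
    return {k: WEIGHTS[k] * client[k] for k in WEIGHTS}

# A made-up applicant with normalized attribute values.
client = {"income": 1.2, "debt_ratio": 0.4, "late_payments": 0.0}
```

Here `score(client)` is above 0.5 (a good risk), while `contributions(client)` shows the debt ratio pulling the score down, which is the sort of attribute-level explanation the paper's suggestive algorithm aims to provide.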