45 research outputs found

    Redundant feature selection using permutation methods

    Automatic feature selection aims to select the features that yield the highest performance when used in a classifier. One popular measure for estimating feature relevancy and redundancy is Mutual Information (MI), although it is biased toward features with many distinct values. Permutation methods have been successfully applied to normalize for numerous biases, including that of MI; however, they are computationally expensive, and computing all redundancies this way is infeasible. In this paper, we introduce a measure that can be used to approximate all m² redundancies between m features while performing only m permutation tests, one for each feature's relevancy. We then show on simulated data that this permutation redundancy measure has properties similar to normalized MI, and apply it to select features from example datasets using minimal Redundancy Maximal Relevancy (mRMR).
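
    As a minimal illustrative sketch of the underlying idea (not the paper's exact statistic), the following Python function estimates a permutation-normalized MI relevance score for a single discrete feature; the function name, the z-score normalization, and the use of scikit-learn's mutual_info_score are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def permutation_normalized_mi(feature, target, n_permutations=1000, seed=0):
    """Permutation-normalized MI relevance of one discrete feature.

    The observed MI is compared against a null distribution obtained by
    shuffling the target, which counteracts the bias of raw MI toward
    features with many distinct values.
    """
    rng = np.random.default_rng(seed)
    observed = mutual_info_score(feature, target)
    shuffled = np.array(target, copy=True)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        rng.shuffle(shuffled)  # break any feature-target association
        null[i] = mutual_info_score(feature, shuffled)
    # Express the observed MI as a z-score under the permutation null.
    return (observed - null.mean()) / (null.std() + 1e-12)
```

    Running m such tests, one per feature, yields the relevancies from which the paper's measure then approximates the m² pairwise redundancies without further permutations.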

    A combined feature selection scheme for developing banking models

    Machine learning methods have been successful in various aspects of bank lending. Over years of operation, banks have accumulated huge amounts of data about borrowers. On the one hand, this made it possible to predict borrower behavior more accurately; on the other, it gave rise to the problem of data redundancy, which greatly complicates model development. Feature selection methods, which help improve model quality, are applied to solve this problem. These methods can be divided into three main types: filters, wrappers, and embedded methods. Filters are simple and time-efficient methods that can discover one-dimensional relations. Wrappers and embedded methods are more effective at feature selection because they account for multi-dimensional relationships, but they are resource-consuming and may fail to process large samples with many features. In this article, the authors propose a combined feature selection scheme (CFSS) in which the first stages of selection use coarse filters and the final stages use wrappers for high-quality selection. This architecture increases the quality of selection and reduces the time needed to process the large, high-dimensional samples used in the development of industrial models. Experiments conducted by the authors on four types of bank modelling tasks (survey scoring, behavioral scoring, customer response to cross-selling, and delayed debt collection) show that the proposed method performs better than classical methods that contain only filters or only wrappers.
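
    The abstract describes CFSS only at a high level; as an illustrative sketch of the general filter-then-wrapper architecture (the scikit-learn estimators, the synthetic data, and all parameter values below are assumptions, not the authors' configuration), a cheap univariate filter can be chained with a wrapper like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stage 1: a coarse univariate filter cheaply prunes the bulk of the features.
# Stage 2: a wrapper (recursive feature elimination around a classifier)
# refines the survivors, accounting for multi-dimensional relationships.
pipeline = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=50)),
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Synthetic stand-in for a large, high-dimensional banking sample.
X, y = make_classification(n_samples=2000, n_features=500, n_informative=15,
                           random_state=0)
pipeline.fit(X, y)
print("train accuracy:", pipeline.score(X, y))
```

    The filter stage keeps the wrapper's search space small, which is what makes a combined scheme tractable on samples with many features.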

    A Two-Stage Real-time Prediction Method for Multiplayer Shooting E-Sports

    E-sports is an industry with a huge base, and the number of people who follow it continues to rise. Research results on e-sports prediction play an important role in many respects. Past game-prediction algorithms are mainly of three kinds: neural network algorithms, the AdaBoost algorithm based on the Naïve Bayesian (NB) classifier, and decision tree algorithms. These three algorithms have their own advantages and disadvantages, but they cannot predict match rankings in real time. Therefore, we propose a real-time prediction algorithm based on a random forest model. The method is divided into two stages. In the first stage, the weights are trained to obtain the optimal model for the second stage. In the second stage, each influencing factor in the data set is matched to and transformed with the data items in the competition log, and the accuracy of the prediction results and its trend over time are observed. Finally, the model is evaluated. The results show that the accuracy of real-time prediction reaches 92.29%, which addresses the lack of real-time capability in traditional prediction algorithms.
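
    As a hedged sketch of the two-stage idea (the data below are synthetic placeholders; the feature set, label encoding, and hyperparameters are assumptions, not the paper's), a random forest can be fitted offline and then re-queried from each new log snapshot during a match:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-timestep match snapshots: each row holds the
# influencing factors extracted from the competition log at one moment;
# the binary labels are placeholders for the ranking outcome being predicted.
rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 12))
y = rng.integers(0, 2, size=6000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: train (and, in the paper, tune) the model offline.
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

# Stage 2: at match time, re-predict from each incoming snapshot and track
# how the predicted outcome evolves as the match progresses.
for snapshot in X_test[:5]:
    print(model.predict_proba(snapshot.reshape(1, -1)))
```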

    Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data

    BACKGROUND: Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are promising in this regard as well. Results on real-world data illustrate that the effect sizes of interactions may frequently not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
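
    The paper's full strategy combines boosting with resampling and orthogonalization on censored survival data; the simplified Python sketch below (a plain regression forest stands in for the survival model, and the function name and top-k rule are assumptions) captures only the core idea of using forest importances to pre-select candidate two-way interactions instead of testing all pairs:

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def screen_interactions(X, y, top_k=10, random_state=0):
    """Pre-select candidate two-way interactions via random forest importances.

    Rather than evaluating all p*(p-1)/2 pairs, only pairs among the top_k
    most important variables are proposed as interaction candidates for a
    downstream model (in the paper, a boosting model for time-to-event data).
    """
    forest = RandomForestRegressor(n_estimators=500, random_state=random_state)
    forest.fit(X, y)
    top = np.argsort(forest.feature_importances_)[::-1][:top_k]
    return list(combinations(sorted(int(j) for j in top), 2))
```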

    Habitat requirements and conservation needs of peripheral populations: the case of the great crested newt (Triturus cristatus) in the Scottish Highlands

    Edge populations are of conservation importance because of their roles as reservoirs of evolutionary potential and in understanding a given species’ ecological needs. Mainly due to loss of aquatic breeding sites, the great crested newt Triturus cristatus is amongst the fastest-declining amphibian species in Europe. Focusing on the north-westerly limit of the T. cristatus range, in the Scottish Highlands, we aimed to characterise the habitat requirements and conservation needs of an isolated set of edge populations. We recorded 129 breeding-pond-related environmental parameters and used a variable-selection procedure followed by random forest analysis to build a predictive model for the species’ present occurrence, as well as for population persistence, incorporating data on population losses. The most important variables predicting T. cristatus occurrence and persistence were associated with pond quality, pond shore and surrounding terrestrial habitat (especially mixed Pinus sylvestris - Betula woodland), and differed from those identified in the species’ core range. We propose that habitat management and pond creation should focus on the locally most favourable habitat characteristics to improve the conservation status and resilience of populations. This collaborative work between conservation agencies and scientific researchers is presented as an illustrative example of linking research, management and conservation.
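
    As an illustrative sketch of this kind of analysis (the data below are synthetic stand-ins for the 129 recorded pond parameters, and all names and settings are assumptions), a random forest occurrence model can be fitted and its most influential environmental variables ranked:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical pond-level data: rows are ponds, columns are environmental
# parameters; y marks recorded presence (1) or absence (0) of the species.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))
y = rng.integers(0, 2, size=120)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
forest.fit(X, y)
print("out-of-bag accuracy:", forest.oob_score_)

# Rank the environmental parameters by their permutation importance,
# i.e. how much shuffling each one degrades the model's predictions.
result = permutation_importance(forest, X, y, n_repeats=20, random_state=1)
print("most influential parameters:", np.argsort(result.importances_mean)[::-1][:5])
```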

    Mortality and recruitment of fire-tolerant eucalypts as influenced by wildfire severity and recent prescribed fire

    Mixed-species eucalypt forests of temperate Australia are assumed tolerant of most fire regimes based on the impressive capacity of the dominant eucalypts to resprout. However, empirical data to test this assumption are rare, limiting capacity to predict forest tolerance to emerging fire regimes, including more frequent severe wildfires and extensive use of prescribed fire. We quantified tree mortality and regeneration in mixed-species eucalypt forests five years after an extensive wildfire that burnt under extreme fire weather. To examine combined site-level effects of wildfire and prescribed fire, our study included factorial replications of three wildfire severities, assessed as crown scorch and understorey consumption shortly after the wildfire (Unburnt, Low, High), and two times since last preceding fire (30 years since any fire). Our data indicate that while most trees survived low-severity wildfire through epicormic resprouting, this capacity was tested by high-severity wildfire. Five years after the wildfire, percentage mortalities of eucalypts in all size intervals from 10 to >70 cm diameter were significantly greater at High-severity than at Unburnt or Low-severity sites, and included the near loss of the 10–20 cm cohort (93% mortality). Prolific seedling regeneration at High-severity sites, and unreliable basal resprouting, indicated the importance of seedling recruitment to the resilience of these fire-tolerant forests. Recent prescribed fire had no clear effect on forest resistance (as tree survival) to wildfire, but decreased site-level resilience (as recruitment) by increasing mortalities of small stems. Our study indicates that high-severity wildfire has the potential to cause transitions to more open, simplified stand structures through increased tree mortality, including disproportionate losses in some size cohorts. Dependence on seedling recruitment could increase vulnerabilities to subsequent fires and future climates, potentially requiring direct management interventions to bolster forest resilience.

    The impact of agricultural management on soil aggregation and carbon storage is regulated by climatic thresholds across a 3000 km European gradient

    Organic carbon and aggregate stability are key features of soil quality and are important to consider when evaluating the potential of agricultural soils as carbon sinks. However, we lack a comprehensive understanding of how soil organic carbon (SOC) and aggregate stability respond to agricultural management across wide environmental gradients. Here, we assessed the impact of climatic factors, soil properties and agricultural management (including land use, crop cover, crop diversity, organic fertilization, and management intensity) on SOC and the mean weight diameter of soil aggregates, commonly used as an indicator of soil aggregate stability, across a 3000 km European gradient. Soil aggregate stability (-56%) and SOC stocks (-35%) in the topsoil (20 cm) were lower in croplands compared with neighboring grassland sites (uncropped sites with perennial vegetation and little or no external inputs). Land use and aridity were strong drivers of soil aggregation, explaining 33% and 20% of the variation, respectively. SOC stocks were best explained by calcium content (20% of explained variation), followed by aridity (15%) and mean annual temperature (10%). We also found a threshold-like pattern for SOC stocks and aggregate stability in response to aridity, with lower values at sites with higher aridity. The impact of crop management on aggregate stability and SOC stocks appeared to be regulated by these thresholds, with more pronounced positive effects of crop diversity and more severe negative effects of crop management intensity in nondryland compared with dryland regions. We link the higher sensitivity of SOC stocks and aggregate stability in nondryland regions to a higher climatic potential for aggregate-mediated SOC stabilization. The presented findings are relevant for improving predictions of management effects on soil structure and C storage, and highlight the need for site-specific agri-environmental policies to improve soil quality and C sequestration.