13 research outputs found

    Pairwise gene GO-based measures for biclustering of high-dimensional expression data

    Background: Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of functionally coherent groups of genes. In addition, a distance between genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes that establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function integrating GO information is studied in this paper. This merit function uses a term that incorporates the information through a GO measure. Results: The effect of two different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. Conclusions: It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast datasets. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view. Ministerio de Economía y Competitividad TIN2014-55894-C2-

    Student risk identification learning model using machine learning approach

    Several challenges are associated with online-based learning systems, the most important of which is the lack of student motivation for various course materials and course activities. Further, it is important to identify students who are at risk of failing to complete the course on time. Existing models apply machine learning approaches to this problem. However, these models are not efficient because they are trained on legacy data and fail to address imbalanced data in both training and testing of the classifier. Further, they are not efficient at classifying new courses. To overcome these research challenges, this work presents a novel design that trains the learning model for identifying risk using current courses. Further, we present an XGBoost classification algorithm that can classify risk for new courses. Experiments are conducted to evaluate the performance of the proposed model. The outcome shows that the proposed model attains significant performance gains over the state-of-the-art model in terms of ROC, F-measure, precision and recall.

    A neural network noise prediction model for Tehran urban roads

    Over the last decades, the number of motor vehicles has increased dramatically in Iran, where different traffic characteristics and urban structures are notable. In the present study, a multilayer perceptron neural network model trained with the Levenberg-Marquardt algorithm was used for predicting the equivalent sound level (LAeq) originating from traffic. Fifty-one samples were collected from different areas of Tehran. Input parameters consisted of total traffic volume per hour, average speed of vehicles, percentage of each category of vehicles, road gradient, density of buildings around the road section and a new parameter named “Building Reflection Factor”. These data were randomly divided, with 80%, 10% and 10% used respectively for training, validation and testing of the Artificial Neural Network (ANN). Results yielded by the ANN model were compared with field measurement data, a proposed regression model and some classical well-known models. Our study indicated that the prediction error of the neural network model was much lower than that of the regression model and other classical models. Moreover, a statistical t-test was applied to evaluate the goodness-of-fit of the proposed model and indicated that the neural network model is highly efficient in estimating road traffic noise levels.
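
    The random 80/10/10 division described in the abstract above can be illustrated with a minimal sketch. The function name `split_indices`, the fixed seed and the rounding rule are assumptions made for illustration; the abstract does not state how the authors performed the split or how many samples ended up in each subset.

```python
import random


def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Randomly partition sample indices into train/validation/test subsets.

    Hypothetical helper: the rounding rule and seed are illustrative
    assumptions, not taken from the paper.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


# 51 samples, as reported in the abstract
train_idx, val_idx, test_idx = split_indices(51)
print(len(train_idx), len(val_idx), len(test_idx))
```

    With this particular rounding, the 51 samples fall into subsets of 41, 5 and 5; a different rounding rule would give slightly different counts.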

    Ouroboros: early identification of at-risk students without models based on legacy data

    This paper focuses on the problem of identifying students who are at risk of failing their course. The presented method proposes a solution in the absence of data from previous courses, which are usually used for training machine learning models. This situation typically occurs in new courses. We present the concept of a "self-learner" that builds the machine learning models from the data generated during the current course. The approach utilises information about already submitted assessments, which introduces the problem of imbalanced data for training and testing the classification models. There are three main contributions of this paper: (1) the concept of training the models for identifying at-risk students using data from the current course, (2) specifying the problem as a classification task, and (3) tackling the challenge of imbalanced data, which appears in both training and testing data. The results show the comparison with the traditional approach of learning the models from the legacy course data, validating the proposed concept.

    A Survey of Fuzzy Systems Software: Taxonomy, Current Research Trends, and Prospects

    Fuzzy systems have been used widely thanks to their ability to successfully solve a wide range of problems in different application fields. However, their replication and application require a high level of knowledge and experience. Furthermore, few researchers publish the software and/or source code associated with their proposals, which is a major obstacle to scientific progress in other disciplines and in industry. In recent years, most fuzzy system software has been developed in order to facilitate the use of fuzzy systems. Some software is commercially distributed, but most software is available as free and open-source software, reducing such obstacles and providing many advantages: quicker detection of errors, innovative applications, faster adoption of fuzzy systems, etc. In this paper, we present an overview of freely available and open-source fuzzy systems software in order to provide a well-established framework that helps researchers to find existing proposals easily and to develop well-founded future work. To accomplish this, we propose a two-level taxonomy, and we describe the main contributions related to each field. Moreover, we provide a snapshot of the status of the publications in this field according to the ISI Web of Knowledge. Finally, some considerations regarding recent trends and potential research directions are presented. This work was supported in part by the Spanish Ministry of Economy and Competitiveness under Grants TIN2014-56633-C3-3-R and TIN2014-57251-P, the Andalusian Government under Grants P10-TIC-6858 and P11-TIC-7765, and the GENIL program of the CEI BioTIC GRANADA under Grant PYR-2014-2S.

    The diversity-accuracy duality in ensembles of classifiers

    Horizontal scaling of Machine Learning algorithms has the potential to tackle concerns over the scalability and sustainability of Deep Learning methods, viz. their consumption of energy and computational resources, as well as their increasing inaccessibility to researchers. One way to enact horizontal scaling is by employing ensemble learning methods, since they enable distribution. There is a consensus on the point that diversity between individual learners leads to better performance, which is why we have focused on it as the criterion for distributing the base models of an ensemble. However, there is no standard agreement on how diversity should be defined and thus how to exploit it to construct a high-performing classifier. Therefore, we have proposed different definitions of diversity and innovative algorithms which promote it in a systematic way. We have first considered architectural diversity with an algorithm called WILDA: Wide Learning of Diverse Architectures. In a distributed fashion, this algorithm evolves a set of neural networks that are pretrained on the target task and diverse w.r.t. architectural feature descriptors. We have then generalised this notion by defining behavioural diversity on the basis of the divergence between the errors made by different models on a dataset. We have defined several diversity metrics and used them to guide a novelty search algorithm which builds an ensemble of behaviourally diverse classifiers. The algorithm promotes diversity in ensembles by explicitly searching for it, without selecting for accuracy. We have then extended this approach with a surrogate diversity model, which reduces the computational burden of this search by eliminating the need to train each network in the population with stochastic gradient descent at each step. These methods have enabled us to investigate the role that both architectural and behavioural diversity play in contributing to the performance of an ensemble.
    In order to study the relationship between diversity and accuracy in classifier ensembles, we have then proposed several methods that extend the novelty search with accuracy objectives. Surprisingly, we have observed that, with the highest-performing diversity metrics, there is an equivalence between searching for diversity objectives and searching for accuracy objectives. This contradicts widespread assumptions that a trade-off must be found by balancing diversity and accuracy objectives. We therefore posit the existence of a diversity-accuracy duality in ensembles of classifiers. An implication of this is the possibility of evolving diverse ensembles without detriment to their accuracy, since it is implicitly ensured.
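
    The idea of behavioural diversity as a divergence between the errors of different models can be illustrated with a small sketch. The abstract does not specify its metrics, so the pairwise error-disagreement measure below is a stand-in chosen for illustration, and all function names (`error_vector`, `error_disagreement`, `ensemble_diversity`) are hypothetical.

```python
from itertools import combinations


def error_vector(predictions, labels):
    """Mark, per sample, whether a model's prediction is wrong."""
    return [p != y for p, y in zip(predictions, labels)]


def error_disagreement(errors_a, errors_b):
    """Fraction of samples on which exactly one of two models errs.

    Illustrative stand-in for a behavioural diversity metric; not the
    metric used in the thesis.
    """
    differing = sum(a != b for a, b in zip(errors_a, errors_b))
    return differing / len(errors_a)


def ensemble_diversity(all_predictions, labels):
    """Average pairwise error disagreement over all pairs of models."""
    errors = [error_vector(p, labels) for p in all_predictions]
    pairs = list(combinations(errors, 2))
    return sum(error_disagreement(a, b) for a, b in pairs) / len(pairs)


# Toy data: three models, four samples (made up for illustration)
labels = [0, 1, 1, 0]
preds = [
    [0, 1, 0, 0],  # wrong on sample 2
    [0, 0, 1, 0],  # wrong on sample 1
    [0, 1, 0, 0],  # same errors as the first model
]
diversity = ensemble_diversity(preds, labels)
print(diversity)
```

    In this toy case the first and third models make identical errors and so contribute no diversity, while each of them disagrees with the second model on half of the samples.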

    Hybrid intelligent approaches for business process sequential analysis.

    The quality of customer service is an important differentiator for service-oriented companies like telecommunication providers. In order to deliver good customer service, the underlying processes within the operations of a company have to run smoothly and must be well controlled. It is of great importance to be able to predict whether processes are likely to fail and to be aware of developing problems as early as possible. A failure in a customer service process typically results in a negative experience for a customer, and companies are keen to prevent this from happening. Process performance prediction allows companies to proactively adapt process execution in order to prevent process problems from affecting their customers. Process analytics is often complicated by a number of factors. Very often processes are only poorly documented because they have evolved over time together with the legacy IT systems that were used to implement them. The workflow data that is collected during process execution is high dimensional and can contain complex attributes and very diverse values. Since workflow data is sequential in nature, there are a number of data mining methods, such as sequential pattern mining and probabilistic models, that can be useful for predicting process transitions or process outcomes. None of these techniques alone can adequately cope with workflow data. The purpose of this thesis is to contribute a combination of methods that can analyse data from business processes in execution in order to predict severe process incidents. In order to best exploit the sequential nature of the data we have used a number of sequential data mining approaches coupled with sequence alignment and a strategy for dealing with similar sequences. The methods have been applied to real process data from a large telecommunication provider and we have conducted a number of experiments demonstrating how to predict process steps and process outcomes.
    Finally, we show that the performance of the proposed models can be significantly improved if they are applied to individual clusters of workflow data rather than the complete set of process data.