68 research outputs found

    About Challenges in Data Analytics and Machine Learning for Social Good

    The large number of new services and applications and, in general, all our everyday activities result in the mass production of data: all these data can become a golden source of information that might be used to improve our lives, wellness and working days. (Interpretable) Machine Learning approaches, the use of which is increasingly ubiquitous in various settings, are among the most effective tools for retrieving and obtaining essential information from data. However, many challenges arise in exploiting them effectively. In this paper, we analyze key scenarios in which large amounts of data and machine learning techniques can be used for social good: social network analytics for enhancing cultural heritage dissemination; game analytics to foster Computational Thinking in education; medical analytics to improve the quality of life of the elderly and reduce health care expenses; and exploration of the potential of work datafication in improving human resources management (HRM). For the first two of these scenarios, we present new results related to previously published research, framing these results in a more general discussion of the challenges that arise when adopting machine learning techniques for social good.

    Optimal Decision Trees for Local Image Processing Algorithms

    In this paper we present a novel algorithm to synthesize an optimal decision tree from OR-decision tables, an extension of standard decision tables, complete with the formal proof of optimality and computational cost analysis. As many problems that require recognizing particular patterns can be modeled with this formalism, we select two common binary image processing algorithms, namely connected components labeling and thinning, to show how these can be represented with decision tables, and the benefits of their implementation as optimal decision trees in terms of reduced memory accesses. Experiments are reported to show the computational time improvements over state-of-the-art implementations.

    Investigating Power and Limitations of Ensemble Motif Finders Using Metapredictor CE3

    Ensemble methods represent a relatively new approach to motif discovery that combines the results returned by "third-party" finders with the aim of achieving better accuracy than that obtained by the single tools. Besides the choice of the external finders, another crucial element for the success of an ensemble method is the particular strategy adopted to combine the finders' results, a.k.a. the learning function. Results that have appeared in the literature seem to suggest that ensemble methods can provide noticeable improvements over the quality of the most popular tools available for motif discovery. With the goal of better understanding the potential and limitations of ensemble methods, we developed a general software architecture whose major feature is its flexibility with respect to the crucial aspects of ensemble methods mentioned above. The architecture provides facilities for the easy addition of virtually any third-party tool for motif discovery whose code is publicly available, and for the definition of new learning functions. We present a prototype implementation of our architecture, called CE3 (Customizable and Easily Extensible Ensemble). Using CE3 and available ensemble methods, we performed experiments with three well-known datasets. The results presented here are varied. On the one hand, they confirm that ensemble methods cannot simply be considered the universal remedy for "in-silico" motif discovery. On the other hand, we found some encouraging regularities that may help to find a general setup for CE3 (and other ensemble methods as well) able to guarantee substantial improvements over single finders in a systematic way.

    CE3: Customizable and Easily Extensible Ensemble Tool for Motif Discovery

    Ensemble methods (or simply ensembles) for motif discovery represent a relatively new approach to improve the accuracy of stand-alone motif finders. The performance of an ensemble is clearly determined by the included finders as well as the strategy to combine the results returned by the latter (the so-called learning rule). A potential obstacle to a widespread adoption of ensembles is that the choice of the particular finders included is closed. Although possible in principle, the addition of a new "promising" tool to an ensemble requires knowledge of the internals of the ensemble and usually nontrivial programming skills. In this research we propose a general architecture for ensembles and a prototype called CE3: Customizable and Easily Extensible Ensemble, which is meant to be extensible and customizable at the level of its two key component modules, namely the external finding tools and the learning rule. In this way the user will be able to essentially "simulate" any existing ensemble, create his/her own ensemble according to his/her preferences on finding tools and learning functions and, finally, keep it up to date when new tools and new ideas for learning functions are proposed in the literature. These features also make CE3 a suitable tool to perform experiments that may lead to a proper configuration of ensembles in the search for novel motifs.

    Direct vs 2-stage approaches to structured motif finding

    BACKGROUND: The notion of DNA motif is a mathematical abstraction used to model regions of the DNA (known as Transcription Factor Binding Sites, or TFBSs) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, DNA structured motifs are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulation processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or simple) motifs, separated by a variable, but somewhat constrained, number of “irrelevant” base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs. RESULTS: We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then it combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach w.r.t. direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, and it also performed well in most of the cases in which it was inferior. CONCLUSIONS: A reflection on the results obtained led us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting the computations when the first halts.
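
The two-stage idea described in this abstract (locate the simple motifs first, then combine them under gap constraints) can be illustrated with a minimal Python sketch. This is not SISMA's algorithm: SISMA discovers motifs de novo, whereas the sketch assumes the simple motifs are already known and matches them exactly. The function names `find_occurrences` and `combine`, and the `(lo, hi)` gap encoding, are hypothetical choices made only for illustration.

```python
from bisect import bisect_left, bisect_right

def find_occurrences(sequence, motif):
    """Stage 1 (simplified): start positions of exact matches of a simple motif."""
    hits, start = [], sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)  # allow overlapping matches
    return hits

def combine(sequence, motifs, gaps):
    """Stage 2: chain hits of consecutive motifs whose gap lies in [lo, hi].

    gaps[i] is the (lo, hi) range of bases allowed between motifs[i] and
    motifs[i + 1]. Returns one list of start positions per structured match.
    """
    hits = [find_occurrences(sequence, m) for m in motifs]
    chains = [[h] for h in hits[0]]
    for idx in range(1, len(motifs)):
        lo, hi = gaps[idx - 1]
        prev_len = len(motifs[idx - 1])
        extended = []
        for chain in chains:
            end = chain[-1] + prev_len  # first base after the previous component
            # hits are sorted, so admissible starts form a contiguous slice
            left = bisect_left(hits[idx], end + lo)
            right = bisect_right(hits[idx], end + hi)
            extended.extend(chain + [pos] for pos in hits[idx][left:right])
        chains = extended
    return chains

seq = "TTGACAACGTTATAATGGTTGACACTATAAT"
matches = combine(seq, ["TTGACA", "TATAAT"], [(1, 5)])
```

Because the second stage only consumes lists of positions, the first stage could be swapped for any other simple-motif finder, which is one way to read the modularity argument made in the conclusions.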

    CMStalker: a combinatorial tool for composite motif discovery

    Controlling the differential expression of many thousands of different genes at any given time is a fundamental task of metazoan organisms, and this complex orchestration is controlled by the so-called regulatory genome encoding complex regulatory networks: several Transcription Factors bind to precise DNA regions, so as to perform in a cooperative manner a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity in the last two decades. In this paper we describe a new efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given as input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery which embodies a new approach that combines a constraint satisfaction formulation with a parameter relaxation technique to explore efficiently the space of possible solutions. Extensive experiments with twelve data sets and eleven state-of-the-art tools are reported, showing an average value of the correlation coefficient of 0.54 (against a value of 0.41 for the closest competitor). This improvement in output quality due to CMStalker is statistically significant.

    TRank: ranking Twitter users according to specific topics

    Twitter is the most popular real-time microblogging service and it is a platform where users provide and obtain information at a rapid pace. In this scenario, one of the biggest challenges is to find a way to automatically identify the most influential users of a given topic. Currently, there are several approaches that try to address this challenge using different Twitter signals (e.g., number of followers, lists, metadata), but results are not clear and sometimes conflicting. In this paper, we propose TRank, a novel method designed to address the problem of identifying the most influential Twitter users on specific topics identified with hashtags. The novelty of our approach is that it combines different Twitter signals (that represent both the user and the user's tweets) to provide three different indicators that are intended to capture different aspects of being influential. These indicators are not computed from the magnitude of the Twitter signals alone; they also take human factors into consideration, for example the fact that a user who follows many active accounts might have a very noisy timeline and, thus, miss many tweets. The experimental assessment confirms that our approach provides results that are more reasonable than those obtained by mechanisms based solely on the magnitude of the data.

    FPF-SB: a Scalable Algorithm for Microarray Gene Expression Data Clustering

    Efficient and effective analysis of large datasets from microarray gene expression data is one of the keys to time-critical personalized medicine. The issue we address here is the scalability of the data processing software for clustering gene expression data into groups with homogeneous expression profiles. In this paper we propose FPF-SB, a novel clustering algorithm based on a combination of the Furthest-Point-First (FPF) heuristic for solving the k-center problem and a stability-based method for determining the number of clusters k. Our algorithm improves the state of the art: it is scalable to large datasets without sacrificing output quality.
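
For readers unfamiliar with the Furthest-Point-First heuristic named in this abstract, the following is a minimal Python sketch of the classic greedy procedure for the k-center problem. It is not the FPF-SB implementation: it assumes Euclidean distance, starts arbitrarily from the first point, and takes k as given, so the stability-based choice of k is not covered. The function name `fpf_centers` is hypothetical.

```python
import math

def fpf_centers(points, k):
    """Greedy Furthest-Point-First selection of k centers.

    Returns (center indices, cluster index of each point, covering radius).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [0]  # assumption: the first point is the initial center
    # dist[i] = distance from point i to its closest center chosen so far
    dist = [math.dist(points[0], p) for p in points]
    assign = [0] * len(points)
    while len(centers) < k:
        # the next center is the point furthest from all current centers
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(far)
        for i, p in enumerate(points):
            d = math.dist(points[far], p)
            if d < dist[i]:  # point i is closer to the new center
                dist[i] = d
                assign[i] = len(centers) - 1
    return centers, assign, max(dist)

points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
centers, assign, radius = fpf_centers(points, 3)
```

Each round costs one pass over the points, so the whole selection needs on the order of n·k distance computations, which is the property that makes the heuristic attractive for large datasets.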

    Efficient Strategies for Partitioning and Querying a Hierarchical Document Space

    We consider a problem arising in the efficient management of a Hierarchical Document Space, i.e., partitioning the leaves of a tree among a set of servers in such a way that it is possible to take full advantage of the hierarchical system to efficiently answer users' queries. After proving that the problem is NP-Hard, we devise efficient approximate solutions, and we carry out a number of experiments which show that allowing for very little space inefficiency can be instrumental to achieving a significant improvement in query efficiency.