    Studying patterns of use of transport modes through data mining - Application to U.S. national household travel survey data set

    Data collection activities related to travel require large amounts of financial and human resources to be conducted successfully. When available resources are scarce, the information hidden in these data sets needs to be exploited, both to increase their added value and to gain support among decision makers not to discontinue such efforts. This study assessed the use of a data mining technique, association analysis, to better understand the patterns of mode use in the 2009 U.S. National Household Travel Survey. Only variables related to self-reported levels of use of the different transportation means are considered, along with those useful for the socioeconomic characterization of the respondents. Association rules potentially showed a substitution effect, in economic terms, between cars and public transportation, but such an effect was not observed between public transportation and nonmotorized modes (e.g., bicycling and walking). This effect was a policy-relevant finding, because transit marketing should be targeted to car drivers rather than to bikers or walkers for real improvement in the environmental performance of any transportation system. Given the competitive advantage of private modes extensively discussed in the literature, modal diversion from car to transit is seldom observed in practice. However, after such a factor was controlled for, the results suggest that modal diversion should mainly occur from cars to transit rather than from nonmotorized modes to transi

    New probabilistic interest measures for association rules

    Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework, we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
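
For readers unfamiliar with the two measures named in the abstract above, the following is a minimal sketch of how support, confidence, and lift are commonly computed for a rule X → Y. The toy transactions and the function names are illustrative assumptions, not data or code from the paper, and the paper's own hyper-lift and hyper-confidence measures are not reproduced here.

```python
# Illustrative sketch only: textbook definitions of support, confidence, and lift.
# The transactions below are made-up toy data, not the grocery data set from the paper.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs): support of the whole rule divided by support of the left-hand side."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence divided by the support of the right-hand side; 1.0 suggests independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread"},
)]

rule_confidence = confidence(transactions, {"bread"}, {"milk"})
rule_lift = lift(transactions, {"bread"}, {"milk"})
print(f"confidence(bread -> milk) = {rule_confidence:.3f}")
print(f"lift(bread -> milk) = {rule_lift:.3f}")
```

Because confidence is normalized only by the left-hand side, a frequent right-hand side can yield high confidence even without a real association; lift corrects for that but, as the abstract notes, can still be noisy for rare items.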

    Re-mining item associations: methodology and a case study in apparel retailing

    Association mining is the conventional data mining technique for analyzing market basket data and it reveals the positive and negative associations between items. While being an integral part of transaction data, pricing and time information have not been integrated into market basket analysis in earlier studies. This paper proposes a new approach to mine price, time and domain related attributes through re-mining of association mining results. The underlying factors behind positive and negative relationships can be characterized and described through this second data mining stage. The applicability of the methodology is demonstrated through the analysis of data coming from a large apparel retail chain, and its algorithmic complexity is analyzed in comparison to the existing techniques

    Structured Review of Code Clone Literature

    This report presents the results of a structured review of code clone literature. The aim of the review is to assemble a conceptual model of clone-related concepts which helps us to reason about clones. This conceptual model unifies clone concepts from a wide range of literature, so that findings about clones can be compared with each other

    Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix

    One of the most powerful techniques to study protein structures is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emerging trend consists of parsing proteins' three-dimensional (3D) structures into graphs of amino acids. Hence, the search for recurrent spatial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in the substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. The results of our experiments show that our approach is able to considerably decrease the number of motifs while enhancing their interestingness.

    THE QUALITY OF VOLUNTARY SUSTAINABILITY REPORT ASSURANCE STATEMENTS: EVIDENCE FROM FORTUNE GLOBAL 500

    The number of companies adopting sustainability report assurance is increasing rapidly. Prior research has explored factors that might drive companies to voluntarily adopt assurance on their sustainability reports, but few studies focus on the quality of the sustainability report assurance statements provided. The first objective of this research is to investigate how the quality of assurance statements differs among different assurance providers. The second objective is to explore whether the quality of assurance statements is jointly affected by the national legal environment where the company is located and the company’s choice of assurance provider. The population of this research is the 2014 Fortune Global 500 list of companies. The final sample is 135 companies. An independent-samples t-test is used to test how the quality of assurance statements differs among different assurance providers. Multivariate regression analysis is used to test whether the quality of assurance statements is jointly affected by national legal environment and assurance provider. The results indicate that national legal environment has a negative and significant effect on assurance statement quality. Assurance provider also has a negative and significant effect on the quality of assurance statements, while industry has a negative and slightly significant effect on it.

    Data Cube Approximation and Mining using Probabilistic Modeling

    On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data. Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches
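
To make the log-linear idea in the abstract above concrete, here is a minimal sketch that fits the simplest such model, the independence model log m_ij = u + u_i + u_j, to a two-way slice of a cube and uses Pearson residuals to flag cells that deviate from it. The table values, the function name `independence_fit`, and the choice of a two-way independence model are assumptions for illustration; the paper itself may select richer models over more dimensions.

```python
# Illustrative sketch only: the independence log-linear model on a 2-way table.
# `cube_slice` is made-up data standing in for one two-dimensional slice of a data cube.
import math

def independence_fit(table):
    """Return (expected, residuals) under the independence model.

    expected[i][j] = row_total[i] * col_total[j] / grand_total
    residuals[i][j] = (observed - expected) / sqrt(expected)   (Pearson residual)
    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    residuals = [
        [(obs - exp) / math.sqrt(exp) for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, residuals

cube_slice = [
    [10, 20],
    [30, 40],
]

expected, residuals = independence_fit(cube_slice)
for exp_row, res_row in zip(expected, residuals):
    print([round(e, 2) for e in exp_row], [round(r, 2) for r in res_row])
```

Cells with large absolute residuals are the ones the independence model explains poorly, which is one simple way to read "strong associations" and "possible outliers" off a fitted model; the expected values also act as a compressed approximation of the slice, since they are determined by the margins alone.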