    An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

    As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology.
    Comment: A preliminary version of this work was presented in ACM PODS 2009. 20 pages, 0 figures
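The comparison this abstract describes, an observed itemset count against the count expected in a random dataset with the same number of transactions and the same item frequencies, can be illustrated with a small sketch. This is not the authors' procedure, which the abstract does not spell out. It assumes a null model in which each item appears in each transaction independently with probability equal to its observed frequency, and all function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb, prod


def binom_tail(t, p, s):
    """Exact P(Binomial(t, p) >= s)."""
    return sum(comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(s, t + 1))


def observed_count(transactions, k, s):
    """Number of k-itemsets whose support in the data is at least s."""
    items = sorted(set().union(*transactions))
    return sum(
        1
        for itemset in combinations(items, k)
        if sum(1 for tr in transactions if set(itemset) <= tr) >= s
    )


def expected_count(transactions, k, s):
    """Expected number of k-itemsets with support >= s under the assumed
    null model: items occur independently, each with its observed frequency."""
    t = len(transactions)
    items = sorted(set().union(*transactions))
    freq = {i: sum(i in tr for tr in transactions) / t for i in items}
    return sum(
        binom_tail(t, prod(freq[i] for i in itemset), s)
        for itemset in combinations(items, k)
    )


# Toy data (hypothetical): items "a" and "b" always co-occur.
toy = [{"a", "b"}] * 4 + [{"c"}] * 4
obs = observed_count(toy, k=2, s=4)
exp = expected_count(toy, k=2, s=4)
print(f"observed={obs}, expected under null={exp:.3f}")
```

A threshold s at which the observed count clearly exceeds the expected count is a candidate for s*. Deciding how large a deviation is needed to control the false discovery rate requires the statistical analysis in the paper itself.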

    In search of grammaticalization in synchronic dialect data: General extenders in north-east England

    In this paper, we draw on a socially stratified corpus of dialect data collected in north-east England to test recent proposals that grammaticalization processes are implicated in the synchronic variability of general extenders (GEs), i.e., phrase- or clause-final constructions such as and that and or something. Combining theoretical insights from the framework of grammaticalization with the empirical methods of variationist sociolinguistics, we operationalize key diagnostics of grammaticalization (syntagmatic length, decategorialization, semantic-pragmatic change) as independent factor groups in the quantitative analysis of GE variability. While multivariate analyses reveal rapid changes in apparent time to the social conditioning of some GE variants in our data, they do not reveal any evidence of systematic changes in the linguistic conditioning of variants in apparent time that would confirm an interpretation of ongoing grammaticalization. These results lead us to questio

    Deconstructing comprehensibility: identifying the linguistic influences on listeners' L2 comprehensibility ratings

    Comprehensibility, a major concept in second language (L2) pronunciation research that denotes listeners’ perceptions of how easily they understand L2 speech, is central to interlocutors’ communicative success in real-world contexts. Although comprehensibility has been modeled in several L2 oral proficiency scales—for example, the Test of English as a Foreign Language (TOEFL) or the International English Language Testing System (IELTS)—shortcomings of existing scales (e.g., vague descriptors) reflect limited empirical evidence as to which linguistic aspects influence listeners’ judgments of L2 comprehensibility at different ability levels. To address this gap, a mixed-methods approach was used in the present study to gain a deeper understanding of the linguistic aspects underlying listeners’ L2 comprehensibility ratings. First, speech samples of 40 native French learners of English were analyzed using 19 quantitative speech measures, including segmental, suprasegmental, fluency, lexical, grammatical, and discourse-level variables. These measures were then correlated with 60 native English listeners’ scalar judgments of the speakers’ comprehensibility. Next, three English as a second language (ESL) teachers provided introspective reports on the linguistic aspects of speech that they attended to when judging L2 comprehensibility. Following data triangulation, five speech measures were identified that clearly distinguished between L2 learners at different comprehensibility levels. Lexical richness and fluency measures differentiated between low-level learners; grammatical and discourse-level measures differentiated between high-level learners; and word stress errors discriminated between learners of all levels

    From buildings to cities: techniques for the multi-scale analysis of urban form and function

    The built environment is a significant factor in many urban processes, yet direct measures of built form are seldom used in geographical studies. Representation and analysis of urban form and function could provide new insights and improve the evidence base for research. So far progress has been slow due to limited data availability, computational demands, and a lack of methods to integrate built environment data with aggregate geographical analysis. Spatial data and computational improvements are overcoming some of these problems, but there remains a need for techniques to process and aggregate urban form data. Here we develop a Built Environment Model of urban function and dwelling type classifications for Greater London, based on detailed topographic and address-based data (sourced from Ordnance Survey MasterMap). The multi-scale approach allows the Built Environment Model to be viewed at fine scales for local planning contexts, and at city-wide scales for aggregate geographical analysis, allowing an improved understanding of urban processes. This flexibility is illustrated in two examples, urban function analysis and residential type analysis, which show both local-scale urban clustering and city-wide trends in density and agglomeration. While we demonstrate the multi-scale Built Environment Model to be a viable approach, a number of accuracy issues are identified, including the limitations of 2D data, inaccuracies in commercial function data and problems with temporal attribution. These limitations currently restrict the more advanced applications of the Built Environment Model

    Semantic Stability in Social Tagging Streams

    One potential disadvantage of social tagging systems is that due to the lack of a centralized vocabulary, a crowd of users may never manage to reach a consensus on the description of resources (e.g., books, users or songs) on the Web. Yet, previous research has provided interesting evidence that the tag distributions of resources may become semantically stable over time as more and more users tag them. At the same time, previous work has raised an array of new questions such as: (i) How can we assess the semantic stability of social tagging systems in a robust and methodical way? (ii) Does semantic stabilization of tags vary across different social tagging systems and ultimately, (iii) what are the factors that can explain semantic stabilization in such systems? In this work we tackle these questions by (i) presenting a novel and robust method which overcomes a number of limitations in existing methods, (ii) empirically investigating semantic stabilization processes in a wide range of social tagging systems with distinct domains and properties and (iii) detecting potential causes for semantic stabilization, specifically imitation behavior, shared background knowledge and intrinsic properties of natural language. Our results show that tagging streams which are generated by a combination of imitation dynamics and shared background knowledge exhibit faster and higher semantic stability than tagging streams which are generated via imitation dynamics or natural language streams alone

    Applied Research Automatic Self-Talk Questionnaire for Sports (ASTQS): Development and Preliminary Validation of a Measure Identifying the Structure of Athletes’ Self-Talk

    The aim of the present investigation was to develop an instrument assessing the content and the structure of athletes’ self-talk. The study was conducted in three stages. In the first stage, a large pool of items was generated and content analysis was used to organize the items into categories. Furthermore, item-content relevance analysis was conducted to help identify the most appropriate items. In Stage 2, the factor structure of the instrument was examined by a series of exploratory factor analyses (Sample A: N = 507), whereas in Stage 3 the results of the exploratory factor analysis were retested through confirmatory factor analyses (Sample B: N = 766) and, at the same time, concurrent validity was assessed. The analyses revealed eight factors: four positive (psych up, confidence, anxiety control and instruction), three negative (worry, disengagement and somatic fatigue) and one neutral (irrelevant thoughts). The findings of the study provide evidence regarding the multidimensionality of self-talk, suggesting that the ASTQS is a psychometrically sound instrument that could help us develop cognitive-behavioral theories and interventions to examine and modify athletes’ self-talk