An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
As advances in technology allow for the collection, storage, and analysis of
vast amounts of data, the task of screening and assessing the significance of
discovered patterns is becoming a major challenge in data mining applications.
In this work, we address significance in the context of frequent itemset
mining. Specifically, we develop a novel methodology to identify a meaningful
support threshold s* for a dataset, such that the number of itemsets with
support at least s* represents a substantial deviation from what would be
expected in a random dataset with the same number of transactions and the same
individual item frequencies. These itemsets can then be flagged as
statistically significant with a small false discovery rate. We present
extensive experimental results to substantiate the effectiveness of our
methodology.
Comment: A preliminary version of this work was presented in ACM PODS 2009. 20 pages, 0 figures
In search of grammaticalization in synchronic dialect data: General extenders in north-east England
In this paper, we draw on a socially stratified corpus of dialect data collected in north-east England to test recent proposals that grammaticalization processes are implicated in the synchronic variability of general extenders (GEs), i.e., phrase- or clause-final constructions such as and that and or something. Combining theoretical insights from the framework of grammaticalization with the empirical methods of variationist sociolinguistics, we operationalize key diagnostics of grammaticalization (syntagmatic length, decategorialization, semantic-pragmatic change) as independent factor groups in the quantitative analysis of GE variability. While multivariate analyses reveal rapid changes in apparent time to the social conditioning of some GE variants in our data, they do not reveal any evidence of systematic changes in the linguistic conditioning of variants in apparent time that would confirm an interpretation of ongoing grammaticalization. These results lead us to questio
Deconstructing comprehensibility: identifying the linguistic influences on listeners' L2 comprehensibility ratings
Comprehensibility, a major concept in second language (L2) pronunciation research that denotes listeners’ perceptions of how easily they understand L2 speech, is central to interlocutors’ communicative success in real-world contexts. Although comprehensibility has been modeled in several L2 oral proficiency scales—for example, the Test of English as a Foreign Language (TOEFL) or the International English Language Testing System (IELTS)—shortcomings of existing scales (e.g., vague descriptors) reflect limited empirical evidence as to which linguistic aspects influence listeners’ judgments of L2 comprehensibility at different ability levels. To address this gap, a mixed-methods approach was used in the present study to gain a deeper understanding of the linguistic aspects underlying listeners’ L2 comprehensibility ratings. First, speech samples of 40 native French learners of English were analyzed using 19 quantitative speech measures, including segmental, suprasegmental, fluency, lexical, grammatical, and discourse-level variables. These measures were then correlated with 60 native English listeners’ scalar judgments of the speakers’ comprehensibility. Next, three English as a second language (ESL) teachers provided introspective reports on the linguistic aspects of speech that they attended to when judging L2 comprehensibility. Following data triangulation, five speech measures were identified that clearly distinguished between L2 learners at different comprehensibility levels. Lexical richness and fluency measures differentiated between low-level learners; grammatical and discourse-level measures differentiated between high-level learners; and word stress errors discriminated between learners of all levels
From buildings to cities: techniques for the multi-scale analysis of urban form and function
The built environment is a significant factor in many urban processes, yet direct measures of built form are
seldom used in geographical studies. Representation and analysis of urban form and function could provide
new insights and improve the evidence base for research. So far progress has been slow due to limited data
availability, computational demands, and a lack of methods to integrate built environment data with
aggregate geographical analysis. Spatial data and computational improvements are overcoming some of
these problems, but there remains a need for techniques to process and aggregate urban form data. Here we
develop a Built Environment Model of urban function and dwelling type classifications for Greater
London, based on detailed topographic and address-based data (sourced from Ordnance Survey
MasterMap). The multi-scale approach allows the Built Environment Model to be viewed at fine-scales for
local planning contexts, and at city-wide scales for aggregate geographical analysis, allowing an improved
understanding of urban processes. This flexibility is illustrated in two examples, urban function
and residential type analysis, in which both local-scale urban clustering and city-wide trends in density and
agglomeration are shown. While we demonstrate the multi-scale Built Environment Model to be a viable
approach, a number of accuracy issues are identified, including the limitations of 2D data, inaccuracies in
commercial function data and problems with temporal attribution. These limitations currently restrict the
more advanced applications of the Built Environment Model
Semantic Stability in Social Tagging Streams
One potential disadvantage of social tagging systems is that due to the lack
of a centralized vocabulary, a crowd of users may never manage to reach a
consensus on the description of resources (e.g., books, users or songs) on the
Web. Yet, previous research has provided interesting evidence that the tag
distributions of resources may become semantically stable over time as more and
more users tag them. At the same time, previous work has raised an array of new
questions such as: (i) How can we assess the semantic stability of social
tagging systems in a robust and methodical way? (ii) Does semantic
stabilization of tags vary across different social tagging systems? And,
ultimately, (iii) what are the factors that can explain semantic stabilization
in such systems? In this work we tackle these questions by (i) presenting a
novel and robust method which overcomes a number of limitations in existing
methods, (ii) empirically investigating semantic stabilization processes in a
wide range of social tagging systems with distinct domains and properties and
(iii) detecting potential causes for semantic stabilization, specifically
imitation behavior, shared background knowledge and intrinsic properties of
natural language. Our results show that tagging streams which are generated by
a combination of imitation dynamics and shared background knowledge exhibit
faster and higher semantic stability than tagging streams which are generated
via imitation dynamics or natural language streams alone
Automatic Self-Talk Questionnaire for Sports (ASTQS): Development and Preliminary Validation of a Measure Identifying the Structure of Athletes’ Self-Talk
The aim of the present investigation was to develop an instrument assessing the content and the structure of athletes’ self-talk. The study was conducted in three stages. In the first stage, a large pool of items was generated and content analysis was used to organize the items into categories. Furthermore, item-content relevance analysis was conducted to help identify the most appropriate items. In Stage 2, the factor structure of the instrument was examined by a series of exploratory factor analyses (Sample A: N = 507), whereas in Stage 3 the results of the exploratory factor analysis were retested through confirmatory factor analyses (Sample B: N = 766) and, at the same time, concurrent validity was assessed. The analyses revealed eight factors: four positive (psych up, confidence, anxiety control and instruction), three negative (worry, disengagement and somatic fatigue) and one neutral (irrelevant thoughts). The findings of the study provide evidence regarding the multidimensionality of self-talk, suggesting that ASTQS is a psychometrically sound instrument that could help us develop cognitive-behavioral theories and interventions to examine and modify athletes’ self-talk