Extracting Clusters of Specialist Terms from Unstructured Text
Automatically identifying related specialist terms is a difficult and important task required to understand the lexical structure of language. This paper develops a corpus-based method of extracting coherent clusters of satellite terminology (terms on the edge of the lexicon) using co-occurrence networks of unstructured text. Term clusters are identified by extracting communities in the co-occurrence graph, after which the largest is discarded and the remaining words are ranked by centrality within a community. The method is tractable on large corpora and requires no document structure and only minimal normalization. The results suggest that the model is able to extract coherent groups of satellite terms in corpora with varying size, content and structure. The findings also confirm that language consists of a densely connected core (observed in dictionaries) and systematic, semantically coherent groups of terms at the edges of the lexicon.
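The pipeline described in this abstract (build a co-occurrence graph, extract communities, discard the largest, rank the rest by centrality) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name a community-detection algorithm or a centrality measure, so this sketch assumes two simple stand-ins: connected components over edges that meet a hypothetical `min_weight` threshold, and within-community degree. Sentence-level co-occurrence, whitespace tokenization, and all function names are likewise assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    weights = defaultdict(int)
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            weights[(a, b)] += 1
    return weights


def communities(weights, min_weight=2):
    """Stand-in for community detection (an assumption, not the paper's method):
    connected components of the graph restricted to edges >= min_weight."""
    adjacency = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # iterative depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups, adjacency


def satellite_clusters(sentences, min_weight=2):
    """Drop the largest community, then rank each remaining community's words
    by within-community degree (ties broken alphabetically)."""
    groups, adjacency = communities(cooccurrence_graph(sentences), min_weight)
    if not groups:
        return []
    largest = max(groups, key=len)
    return [
        sorted(group, key=lambda w: (-len(adjacency[w] & group), w))
        for group in groups
        if group is not largest
    ]


# Toy corpus: everyday words form the dense core, physics terms a satellite group.
example = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog sat on the mat",
    "quark gluon hadron",
    "quark gluon",
    "quark hadron",
]
print(satellite_clusters(example))  # prints [['quark', 'gluon', 'hadron']]
```

On the toy corpus the everyday words form the largest community and are discarded, leaving the physics terms with "quark" ranked first because it co-occurs with both other terms. A modularity-based community detector and a different centrality measure could be substituted without changing the overall shape of the pipeline.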
Is level of prematurity a risk/plasticity factor at three years of age?
Children born preterm have poorer outcomes than children born full-term, but the caregiving environment can ameliorate some of these differences. Recent research has proposed that preterm birth may be a plasticity factor, leading to better outcomes for preterm than full-term infants in higher quality environments. This analysis uses data from two waves of an Irish study of children (at 9 months and 3 years of age, n=11,134 children) and their caregivers (n=11,132 mothers, n=9,998 fathers) to investigate differences in how caregiving affects social, cognitive, and motor skills between full-term, late preterm, and very preterm children. Results indicate that parental emotional distress and quality of attachment are important for child outcomes. Both being born very preterm and late preterm continue to be risk factors for poorer outcomes at 3 years of age. Only fathers’ emotional distress significantly moderated the effect of prematurity on infants’ cognitive and social outcomes – no other interactions between prematurity and environment were significant. These interactions were somewhat in line with diathesis stress, but the effect sizes were too small to provide strong support for this model. There is no evidence that preterm birth is a plasticity factor
It's Distributions All The Way Down!
The textual, big-data literature misses Bentley, O’Brien, & Brock’s (Bentley et al.’s) message on distributions; it largely examines the first-order effects of how a single, signature distribution can predict population behaviour, neglecting second-order effects involving distributional shifts, either between signature distributions or within a given signature distribution. Indeed, Bentley et al. themselves under-emphasise the potential richness of the latter, within-distribution effects.
Measuring discursive influence across scholarship
Assessing scholarly influence is critical for understanding the collective system of scholarship and the history of academic inquiry. Influence is multifaceted, and citations reveal only part of it. Citation counts exhibit preferential attachment and follow a rigid “news cycle” that can miss sustained and indirect forms of influence. Building on dynamic topic models that track distributional shifts in discourse over time, we introduce a variant that incorporates features, such as authorship, affiliation, and publication venue, to assess how these contexts interact with content to shape future scholarship. We perform in-depth analyses on collections of physics research (500,000 abstracts; 102 years) and scholarship generally (JSTOR repository: 2 million full-text articles; 130 years). Our measure of document influence helps predict citations and shows how outcomes, such as winning a Nobel Prize or affiliation with a highly ranked institution, boost influence. Analysis of citations alongside discursive influence reveals that citations tend to credit authors who persist in their fields over time and discount credit for works that are influential over many topics or are “ahead of their time.” In this way, our measures provide a way to acknowledge diverse contributions that take longer and travel farther to achieve scholarly appreciation, enabling us to correct citation biases and enhance sensitivity to the full spectrum of scholarly impact
A Secure Data Enclave and Analytics Platform for Social Scientists
Data-driven research is increasingly ubiquitous and data itself is a defining asset for researchers, particularly in the computational social sciences and humanities. Entire careers and research communities are built around valuable, proprietary or sensitive datasets. However, many existing computation resources fail to support secure and cost-effective storage of data while also enabling secure and flexible analysis of the data. To address these needs we present CLOUD KOTTA, a cloud-based architecture for the secure management and analysis of social science data. CLOUD KOTTA leverages reliable, secure, and scalable cloud resources to deliver capabilities to users, and removes the need for users to manage complicated infrastructure. CLOUD KOTTA implements automated, cost-aware models for efficiently provisioning tiered storage and automatically scaled compute resources. CLOUD KOTTA has been used in production for several months and currently manages approximately 10TB of data and has been used to process more than 5TB of data with over 75,000 CPU hours. It has been used for a broad variety of text analysis workflows, matrix factorization, and various machine learning algorithms, and more broadly, it supports fast, secure and cost-effective research.
Reflexive Regular Equivalence in Bipartite Data
Bipartite data is common in data engineering and brings unique challenges, particularly when it comes to clustering tasks that impose strong structural assumptions. This work presents an unsupervised method for assessing similarity in bipartite data. The method is based on regular equivalence in graphs and uses spectral properties of a bipartite adjacency matrix to estimate similarity in both dimensions. The method is reflexive in that similarity in one dimension informs similarity in the other. The method also uses local graph transitivities, a contribution governed by its only free parameter. Reflexive regular equivalence can be used to validate assumptions of co-similarity, which are required but often untested in co-clustering analyses. The method is robust to noise and asymmetric data, making it particularly suited for cluster analysis and recommendation in data of unknown structure
Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud
Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable and cost-effective storage and analysis. Here we present CLOUD KOTTA: a cloud-based data management and analytics framework. CLOUD KOTTA delivers an end-to-end solution for coordinating secure access to large datasets, and an execution model that provides both automated infrastructure scaling and support for executing analytics near to the data. CLOUD KOTTA implements a fine-grained security model ensuring that only authorized users may access, analyze, and download protected data. It also implements automated methods for acquiring and configuring low-cost storage and compute resources as they are needed. We present the architecture and implementation of CLOUD KOTTA and demonstrate the advantages it provides in terms of increased performance and flexibility. We show that CLOUD KOTTA’s elastic provisioning model can reduce costs by up to 16x when compared with statically provisioned models