A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and de novo assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for de novo
sequence assembly, all without significantly impacting the content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
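The abstract above does not spell out the procedure, so the following is only a minimal sketch of one way a single-pass normalization of this kind could work. It assumes that a read's coverage is estimated by the median count of its k-mers among the reads kept so far, and that a read is kept only while that estimate is below a cutoff. The function names, the parameters `k` and `cutoff`, and the exact dictionary counter (standing in for a memory-efficient counting structure) are all illustrative assumptions, not the authors' implementation.

```python
from statistics import median


def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only if its estimated coverage is low.

    Coverage is estimated as the median count of the read's k-mers among the
    reads kept so far (an assumption made for this sketch).
    """
    counts = {}  # exact k-mer counts; hypothetical stand-in for a compact counter
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, so no estimate is possible
        estimate = median([counts.get(km, 0) for km in read_kmers])
        if estimate < cutoff:
            kept.append(read)
            # Only kept reads contribute to the counts.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept


# Toy usage: ten copies of one read plus one unique read.
reads = ["ACGTTGCAAT"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalize(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because counts are updated only for kept reads, redundant high-coverage reads stop being admitted once the cutoff is reached, while reads from low-coverage regions continue to pass.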
Representation Bias of Adolescents in AI: A Bilingual, Bicultural Study
Popular and news media often portray teenagers with sensationalism, as both a
risk to society and at risk from society. As AI begins to absorb some of the
epistemic functions of traditional media, we study 1) how teenagers in two
countries speaking two languages are depicted by AI, and 2) how they would
prefer to be depicted. Specifically, we study the biases about teenagers
learned by static word embeddings (SWEs) and generative language models (GLMs),
comparing these with the perspectives of adolescents living in the U.S. and
Nepal. We find English-language SWEs associate teenagers with societal
problems, and more than 50% of the 1,000 words most associated with teenagers
in the pretrained GloVe SWE reflect such problems. Given prompts about
teenagers, 30% of outputs from GPT2-XL and 29% from LLaMA-2-7B GLMs discuss
societal problems, most commonly violence, but also drug use, mental illness,
and sexual taboo. Nepali models, while not free of such associations, are less
dominated by social problems. Data from workshops with N=13 U.S. adolescents
and N=18 Nepalese adolescents show that AI presentations are disconnected from
teenage life, which revolves around activities like school and friendship.
Participant ratings of how well 20 trait words describe teens are decorrelated
from SWE associations, with Pearson's r=.02, n.s. in English FastText and
r=.06, n.s. in GloVe; and r=.06, n.s. in Nepali FastText and r=-.23, n.s. in
GloVe. U.S. participants suggested AI could fairly present teens by
highlighting diversity, while Nepalese participants centered positivity.
Participants were optimistic that, if it learned from adolescents, rather than
media sources, AI could help mitigate stereotypes. Our work offers an
understanding of the ways SWEs and GLMs misrepresent a developmentally
vulnerable group and provides a template for less sensationalized
characterization. Comment: Accepted at Artificial Intelligence, Ethics, and Society 202
ML-EAT: A Multilevel Embedding Association Test for Interpretable and Transparent Social Science
This research introduces the Multilevel Embedding Association Test (ML-EAT), a method designed for interpretable and transparent measurement of intrinsic bias in language technologies. The ML-EAT addresses issues of ambiguity and difficulty in interpreting the traditional EAT measurement by quantifying bias at three levels of increasing granularity: the differential association between two target concepts with two attribute concepts; the individual effect size of each target concept with two attribute concepts; and the association between each individual target concept and each individual attribute concept. Using the ML-EAT, this research defines a taxonomy of EAT patterns describing the nine possible outcomes of an embedding association test, each of which is associated with a unique EAT-Map, a novel four-quadrant visualization for interpreting the ML-EAT. Empirical analysis of static and diachronic word embeddings, GPT-2 language models, and a CLIP language-and-image model shows that EAT patterns add otherwise unobservable information about the component biases that make up an EAT; reveal the effects of prompting in zero-shot models; and can also identify situations when cosine similarity is an ineffective metric, rendering an EAT unreliable. Our work contributes a method for rendering bias more observable and interpretable, improving the transparency of computational investigations into human minds and societies. Accepted at Artificial Intelligence, Ethics, and Society 202
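The three levels named in the abstract above can be illustrated with a rough sketch. The abstract gives no formulas, so everything here is an assumption: level 1 uses the commonly used WEAT-style effect size (difference of mean differential associations divided by the pooled sample standard deviation), level 2 is taken to be each target set's mean differential association over that same standard deviation, and level 3 is the mean cosine similarity of each target set with each attribute set. The paper's actual definitions may differ, and all function and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity of two nonzero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def differential_association(w, A, B):
    """Mean similarity of w to attribute set A minus its mean similarity to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def multilevel_eat(X, Y, A, B):
    """Return (level1, level2, level3) under the assumptions stated in the lead-in."""
    s_x = [differential_association(x, A, B) for x in X]
    s_y = [differential_association(y, A, B) for y in Y]
    pooled_sd = stdev(s_x + s_y)  # sample standard deviation over both target sets

    # Level 2 (assumed form): one standardized effect per target set.
    level2 = {"X": mean(s_x) / pooled_sd, "Y": mean(s_y) / pooled_sd}
    # Level 1: WEAT-style effect size, equal to the difference of the level-2 values.
    level1 = level2["X"] - level2["Y"]
    # Level 3: mean cosine similarity for each (target set, attribute set) pair.
    level3 = {
        (t_name, a_name): mean([cosine(t, a) for t in targets for a in attrs])
        for t_name, targets in (("X", X), ("Y", Y))
        for a_name, attrs in (("A", A), ("B", B))
    }
    return level1, level2, level3


# Toy two-dimensional "embeddings": X leans toward A, Y leans toward B.
X = [(1.0, 0.0), (0.9, 0.1)]
Y = [(0.0, 1.0), (0.1, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]
level1, level2, level3 = multilevel_eat(X, Y, A, B)
print(level1, level2, level3)
```

Under these assumed definitions the level-1 value decomposes exactly into the two level-2 values, which is one way the finer levels can expose which target concept drives an overall effect.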
Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI
Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting. Accepted at Artificial Intelligence, Ethics, and Society 202
Sensory Communication
Contains table of contents for Section 2, an introduction and reports on fourteen research projects. National Institutes of Health Grant RO1 DC00117; National Institutes of Health Grant RO1 DC02032; National Institutes of Health/National Institute on Deafness and Other Communication Disorders Grant R01 DC00126; National Institutes of Health Grant R01 DC00270; National Institutes of Health Contract N01 DC52107; U.S. Navy - Office of Naval Research/Naval Air Warfare Center Contract N61339-95-K-0014; U.S. Navy - Office of Naval Research/Naval Air Warfare Center Contract N61339-96-K-0003; U.S. Navy - Office of Naval Research Grant N00014-96-1-0379; U.S. Air Force - Office of Scientific Research Grant F49620-95-1-0176; U.S. Air Force - Office of Scientific Research Grant F49620-96-1-0202; U.S. Navy - Office of Naval Research Subcontract 40167; U.S. Navy - Office of Naval Research/Naval Air Warfare Center Contract N61339-96-K-0002; National Institutes of Health Grant R01-NS33778; U.S. Navy - Office of Naval Research Grant N00014-92-J-184
Meta-analysis of SHANK Mutations in Autism Spectrum Disorders: A Gradient of Severity in Cognitive Impairments.
SHANK genes code for scaffold proteins located at the post-synaptic density of glutamatergic synapses. In neurons, SHANK2 and SHANK3 have a positive effect on the induction and maturation of dendritic spines, whereas SHANK1 induces the enlargement of spine heads. Mutations in SHANK genes have been associated with autism spectrum disorders (ASD), but their prevalence and clinical relevance remain to be determined. Here, we performed a new screen and a meta-analysis of SHANK copy-number and coding-sequence variants in ASD. Copy-number variants were analyzed in 5,657 patients and 19,163 controls, coding-sequence variants were ascertained in 760 to 2,147 patients and 492 to 1,090 controls (depending on the gene), and individuals carrying de novo or truncating SHANK mutations underwent an extensive clinical investigation. Copy-number variants and truncating mutations in SHANK genes were present in ∼1% of patients with ASD: mutations in SHANK1 were rare (0.04%) and present in males with normal IQ and autism; mutations in SHANK2 were present in 0.17% of patients with ASD and mild intellectual disability; mutations in SHANK3 were present in 0.69% of patients with ASD and up to 2.12% of the cases with moderate to profound intellectual disability. In summary, mutations of the SHANK genes were detected in the whole spectrum of autism with a gradient of severity in cognitive impairment. Given the rare frequency of SHANK1 and SHANK2 deleterious mutations, the clinical relevance of these genes remains to be ascertained. In contrast, the frequency and the penetrance of SHANK3 mutations in individuals with ASD and intellectual disability (more than 1 in 50) warrant its consideration for mutation screening in clinical practice.
Ripe to be Heard: Worker Voice in the Fair Food Programme
The Fair Food Program (FFP) provides a mechanism through which agricultural workers’ collective voice is expressed, heard and responded to within global value chains. The FFP's model of worker-driven social responsibility presents an alternative to traditional corporate social responsibility. This article identifies the FFP's key components and demonstrates its resilience by identifying the ways in which the issues faced by a new group of migrant workers – recruited through a “guest-worker” scheme – were incorporated and dealt with. This case study highlights the important potential presented by the programme to address labour abuses across transnationalized labour markets while considering early replication possibilities.
Morphological Diversity between Culture Strains of a Chlorarachniophyte, Lotharella globosa
Chlorarachniophytes are marine unicellular algae that possess secondary plastids of green algal origin. Although chlorarachniophytes are a small group (the phylum of Chlorarachniophyta contains 14 species in 8 genera), they have variable and complex life cycles that include amoeboid, coccoid, and/or flagellate cells. The majority of chlorarachniophytes possess two or more cell types in their life cycles, and which cell types are found is one of the principal morphological criteria used for species descriptions. Here we describe an unidentified chlorarachniophyte, isolated from an artificial coral reef, that calls this criterion into question. The life cycle of the new strain includes all three major cell types, but DNA barcoding based on the established nucleomorph ITS sequences showed it to share 100% sequence identity with Lotharella globosa. The type strain of L. globosa was also isolated from a coral reef, but is defined as completely lacking an amoeboid stage throughout its life cycle. We conclude that L. globosa possesses morphological diversity between culture strains, and that the new strain is a variety of L. globosa, which we describe as Lotharella globosa var. fortis var. nov. to include the amoeboid stage in the formal description of L. globosa. This intraspecies variation suggests that gross morphological stages may be lost rather rapidly, and specifically that the type strain of L. globosa has lost the ability to form the amoeboid stage, perhaps recently. This in turn suggests that even major morphological characters used for taxonomy of this group may be variable in natural populations, and therefore misleading.
A New Chicken Genome Assembly Provides Insight into Avian Genome Structure
The importance of Gallus gallus (chicken) as a model organism and agricultural animal merits a continuation of sequence assembly improvement efforts. We present a new version of the chicken genome assembly (Gallus_gallus-5.0; GCA_000002315.3), built from combined long single molecule sequencing technology, finished BACs, and improved physical maps. In overall assembled bases, we see a gain of 183 Mb, including 16.4 Mb in placed chromosomes with a corresponding gain in the percentage of intact repeat elements characterized. Of the 1.21 Gb genome, we include three previously missing autosomes, GGA30, 31, and 33, and improve sequence contig length 10-fold over the previous Gallus_gallus-4.0. Despite the significant base representation improvements made, 138 Mb of sequence is not yet located to chromosomes. When annotated for gene content, Gallus_gallus-5.0 shows an increase of 4679 annotated genes (2768 noncoding and 1911 protein-coding) over those in Gallus_gallus-4.0. We also revisited the question of what genes are missing in the avian lineage, as assessed by the highest quality avian genome assembly to date, and found that a large fraction of the original set of missing genes are still absent in sequenced bird species. Finally, our new data support a detailed map of MHC-B, encompassing two segments: one with a highly stable gene copy number and another in which the gene copy number is highly variable. The chicken model has been a critical resource for many other fields of study, and this new reference assembly will substantially further these efforts.
