    Broad expertise retrieval in sparse data environments

    Expertise retrieval has been largely unexplored on data other than the W3C collection. At the same time, many intranets of universities and other knowledge-intensive organisations offer examples of relatively small but clean multilingual expertise data covering a broad range of expertise areas. We first present two main expertise retrieval tasks, along with a set of baseline approaches based on generative language modeling, aimed at finding expertise relations between topics and people. For our experimental evaluation, we introduce (and release) a new test set based on a crawl of a university site. Using this test set, we conduct two series of experiments. The first is aimed at determining the effectiveness of baseline expertise retrieval methods applied to the new test set. The second is aimed at assessing refined models that exploit characteristic features of the new test set, such as the organizational structure of the university and the hierarchical structure of the topics in the test set. Expertise retrieval models are shown to be robust in environments smaller than the W3C collection, and current techniques appear to be generalizable to other settings.
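
    The abstract does not spell out the baseline, but generative language modeling for expertise retrieval typically ranks people by the likelihood of the query topic under a language model estimated from their associated documents. Below is a minimal sketch of that idea, assuming Jelinek-Mercer smoothing; the function names, tokenisation, and smoothing weight are illustrative, not the paper's implementation.

```python
import math
from collections import Counter

def rank_experts(query_terms, person_docs, lam=0.5):
    """Rank people by log P(query | person) under a Jelinek-Mercer
    smoothed unigram language model over their associated documents.

    person_docs: dict mapping person -> list of tokenized documents.
    """
    # Collection statistics for smoothing.
    coll = Counter()
    for docs in person_docs.values():
        for doc in docs:
            coll.update(doc)
    coll_total = sum(coll.values())

    scores = {}
    for person, docs in person_docs.items():
        tf = Counter()
        for doc in docs:
            tf.update(doc)
        total = sum(tf.values()) or 1
        score = 0.0
        for term in query_terms:
            p_person = tf[term] / total
            p_coll = coll[term] / coll_total
            # Small epsilon guards log(0) for out-of-collection terms.
            score += math.log(lam * p_person + (1 - lam) * p_coll + 1e-12)
        scores[person] = score
    return sorted(scores, key=scores.get, reverse=True)

experts = rank_experts(
    ["information", "retrieval"],
    {"alice": [["information", "retrieval", "models"]],
     "bob": [["organic", "chemistry", "synthesis"]]},
)
print(experts)  # ['alice', 'bob']
```

    Refined models of the kind the paper describes could then reweight these scores using, for example, a person's place in the organizational hierarchy.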

    Authorship attribution in portuguese using character N-grams

    For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. For English, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, it demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus. Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN).
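
    As a concrete illustration of the approach (not the paper's exact pipeline), the sketch below classifies short Portuguese texts by author using tf-idf-weighted character 3-grams and a linear SVM. scikit-learn, the toy corpus, and the fixed n-gram length are assumptions; the paper instead tunes the length and type of the n-grams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: texts labeled with their authors.
texts = ["o rato roeu a roupa do rei de roma",
         "quem conta um conto aumenta um ponto",
         "a roupa do rei de roma o rato roeu",
         "um ponto aumenta quem conta um conto"]
authors = ["A", "B", "A", "B"]

# Character 3-grams, including n-grams that span word boundaries;
# varying ngram_range and the analyzer corresponds to the paper's
# exploration of n-gram length and type.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(),
)
model.fit(texts, authors)
print(model.predict(["o rato roeu a roupa"]))  # expected: ['A']
```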

    The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples

    Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced the samples with shotgun metagenomics at high depth (~200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short and long reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, ‘ResPipe’.
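
    ResPipe itself is not reproduced here. As an illustration of how sequencing depth affects inferred content, the sketch below performs rarefaction-style subsampling: it repeatedly draws subsets of classified reads at a given depth and counts how many distinct taxa (or AMR genes) are still detected. All names and the toy abundance distribution are hypothetical.

```python
import random

def detected_at_depth(read_assignments, depth, trials=10, seed=0):
    """Estimate how many distinct features (taxa or AMR genes) are
    detected when only `depth` reads are sampled from the full set.

    read_assignments: list with one feature label per classified read.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        subsample = rng.sample(read_assignments,
                               min(depth, len(read_assignments)))
        counts.append(len(set(subsample)))
    return sum(counts) / len(counts)

# Toy example: 1,000 reads drawn from up to 50 hypothetical taxa with a
# skewed abundance distribution; detection saturates as depth grows.
reads = [f"taxon_{min(int(random.expovariate(0.15)), 49)}"
         for _ in range(1000)]
for depth in (10, 100, 1000):
    print(depth, detected_at_depth(reads, depth))
```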

    Ribosome signatures aid bacterial translation initiation site identification

    Background: While methods for annotation of genes are increasingly reliable, the exact identification of translation initiation sites remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information, developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines, when reads from footprints are adjusted to determine the specific codon being translated. Results: Utilising these signatures in combination with nucleotide sequence information, we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this to prokaryotic translatomes, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and extensions of previously annotated coding sequences. These re-annotations are supported by the presence of structural and sequence-based features next to N-terminal peptide evidence. Finally, our model identifies 61 novel genes previously undiscovered in the Salmonella enterica genome. Conclusions: Signatures within ribosome profiling read length distributions can be used in combination with nucleotide sequence information to provide accurate genome-wide identification of translation initiation sites.
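
    The abstract does not specify the model, but the feature construction it implies can be sketched: for each candidate start codon, combine the read-length-resolved footprint distribution in a window around it with one-hot encoded sequence context. The window size, retained read lengths, and data layout below are assumptions; the resulting vectors would feed any standard classifier.

```python
import numpy as np

READ_LENGTHS = range(25, 36)  # footprint lengths retained
WINDOW = 10                   # nt on each side of the candidate start

def tis_features(candidate_pos, footprints, sequence):
    """Feature vector for one candidate translation initiation site.

    footprints: dict mapping (5'-end position, read length) -> read count.
    sequence: genome/transcript string; assumes candidate_pos >= WINDOW.
    """
    # Read-length-resolved coverage around the candidate start: these
    # positional patterns are what standard pipelines collapse away.
    profile = np.array([
        [footprints.get((candidate_pos + off, length), 0)
         for off in range(-WINDOW, WINDOW + 1)]
        for length in READ_LENGTHS
    ], dtype=float)
    profile /= profile.sum() or 1.0  # normalise to a distribution

    # One-hot nucleotide context around the candidate start codon.
    context = sequence[candidate_pos - WINDOW:candidate_pos + WINDOW + 3]
    onehot = np.array([[base == b for b in "ACGT"] for base in context],
                      dtype=float)

    return np.concatenate([profile.ravel(), onehot.ravel()])
```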

    Probabilistic Fluorescence-Based Synapse Detection

    Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in a precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network development, adaptation, and memory. In addition, abnormalities of synapse numbers or molecular components are implicated in most mental and neurological disorders. Despite their obvious importance, mammalian synapse populations have so far resisted detailed quantitative study. In human brains and most animal nervous systems, synapses are very small and very densely packed: there are approximately 1 billion synapses per cubic millimeter of human cortex. This volumetric density poses very substantial challenges to proteometric analysis at the critical level of the individual synapse. The present work describes new probabilistic image analysis methods for single-synapse analysis of synapse populations in both animal and human brains. Comment: Currently awaiting peer review.
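
    The abstract does not detail the probabilistic model, so the following is only a schematic sketch of one probabilistic detection idea: turn each fluorescence marker channel into a per-voxel foreground probability and combine channels under a conditional-independence assumption, so that only voxels bright in all synaptic markers score highly. The logistic foreground model and every name here are illustrative, not the paper's algorithm.

```python
import numpy as np

def synapse_probability(channels, bg_sigma=2.0):
    """Per-voxel probability that co-localised fluorescence marks a synapse.

    channels: list of 3-D arrays, one per synaptic marker (e.g. pre- and
    post-synaptic proteins). Each channel is mapped to a foreground
    probability with a logistic model around its own background level,
    then the channels are combined assuming conditional independence.
    """
    prob = np.ones_like(channels[0], dtype=float)
    for ch in channels:
        ch = ch.astype(float)
        mu, sigma = ch.mean(), ch.std() + 1e-9
        # Voxels far above background approach probability 1; voxels at
        # background fall at or below 0.5.
        prob *= 1.0 / (1.0 + np.exp(-(ch - mu - bg_sigma * sigma) / sigma))
    return prob

# Toy volume: two 16x16x16 marker channels with one co-localised spot.
rng = np.random.default_rng(0)
pre = rng.poisson(5, (16, 16, 16)).astype(float)
post = rng.poisson(5, (16, 16, 16)).astype(float)
pre[8, 8, 8] += 60
post[8, 8, 8] += 60
p = synapse_probability([pre, post])
print(np.unravel_index(p.argmax(), p.shape))  # (8, 8, 8)
```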

    Profiling time course expression of virus genes---an illustration of Bayesian inference under shape restrictions

    There have been several studies of the genome-wide temporal transcriptional program of viruses, based on microarray experiments, which are generally useful in the construction of gene regulation networks. Biological interpretations in these studies tend to be based directly on the normalized data and crude statistics, which provide rough estimates of limited features of the expression profile and may incur biases. This paper introduces a hierarchical Bayesian shape-restricted regression method for making inference on the time course expression of virus genes. Estimates of many salient features of the expression profile, such as onset time, inflection point, maximum value, time to maximum value, and area under the curve, can be obtained immediately with this method. Applying the method to a baculovirus microarray time course expression data set, we show that many biological questions can be formulated quantitatively, and we are able to offer insights into baculovirus biology. Comment: Published at http://dx.doi.org/10.1214/09-AOAS258 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
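
    The hierarchical Bayesian machinery is beyond a short example, but the feature-extraction idea can be sketched: fit a shape-restricted curve to one gene's time course and read the salient features off the fitted curve. The sigmoid form, the scipy least-squares fit (standing in for the paper's Bayesian posterior), and the toy data are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, lower, upper, t_mid, rate):
    """Monotone increasing expression curve (one simple shape restriction)."""
    return lower + (upper - lower) / (1.0 + np.exp(-rate * (t - t_mid)))

# Toy time course: hours post-infection vs. normalised expression.
t = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48], dtype=float)
y = np.array([0.1, 0.12, 0.3, 0.9, 1.6, 2.1, 2.25, 2.3, 2.3])

(lower, upper, t_mid, rate), _ = curve_fit(sigmoid, t, y,
                                           p0=[0.1, 2.3, 10.0, 0.5])

print("inflection point (time of fastest rise):", t_mid)
print("maximum value:", upper)

# Onset: first time the fitted curve exceeds 5% of its dynamic range.
grid = np.linspace(t.min(), t.max(), 500)
curve = sigmoid(grid, lower, upper, t_mid, rate)
print("onset time:", grid[curve > lower + 0.05 * (upper - lower)][0])
print("area under curve:", np.trapz(curve, grid))
```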

    Modeling the functional genomics of autism using human neurons.

    Human neural progenitors from a variety of sources present new opportunities to model aspects of human neuropsychiatric disease in vitro. Such in vitro models provide the advantages of a human genetic background combined with rapid and easy manipulation, making them highly useful adjuncts to animal models. Here, we examined whether a human neuronal culture system could be utilized to assess the transcriptional program involved in human neural differentiation and to model some of the molecular features of a neurodevelopmental disorder such as autism. Primary normal human neuronal progenitors (NHNPs) were differentiated into a post-mitotic neuronal state through addition of specific growth factors, and whole-genome gene expression was examined throughout a time course of neuronal differentiation. After 4 weeks of differentiation, a significant number of genes associated with autism spectrum disorders (ASDs) are either induced or repressed. This includes the ASD susceptibility gene neurexin 1, which showed a distinct pattern from neurexin 3 in vitro and which we validated in vivo in fetal human brain. Using weighted gene co-expression network analysis, we visualized the network structure of transcriptional regulation, demonstrating via this unbiased analysis that a significant number of ASD candidate genes are coordinately regulated during the differentiation process. As NHNPs are genetically tractable and manipulable, they can be used to study the effects of mutations in multiple ASD candidate genes on neuronal differentiation and gene expression, in combination with the effects of potential therapeutic molecules. These data also provide a step towards better understanding of the signaling pathways disrupted in ASD.
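
    Weighted gene co-expression network analysis is normally run via the WGCNA R package; as a rough illustration of its core step, the sketch below soft-thresholds the absolute gene-gene correlation matrix into a weighted adjacency and clusters the resulting dissimilarity into modules. The soft-threshold power, the clustering method, and the planted toy data are assumptions, and full WGCNA would use the topological overlap matrix rather than raw adjacency.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def coexpression_modules(expr, beta=6, n_modules=3):
    """Crude WGCNA-style module detection.

    expr: array of shape (genes, samples), e.g. a differentiation
    time course. Returns a module label per gene.
    """
    # Weighted adjacency: soft-thresholded absolute correlation.
    adj = np.abs(np.corrcoef(expr)) ** beta
    # Dissimilarity (1 - adjacency) feeds hierarchical clustering.
    dissim = 1.0 - adj
    condensed = dissim[np.triu_indices_from(dissim, k=1)]
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_modules, criterion="maxclust")

# Toy data: two planted co-expressed gene groups across 8 time points.
rng = np.random.default_rng(1)
base1, base2 = rng.normal(size=8), rng.normal(size=8)
expr = np.vstack([base1 + 0.1 * rng.normal(size=8) for _ in range(5)]
                 + [base2 + 0.1 * rng.normal(size=8) for _ in range(5)])
print(coexpression_modules(expr))  # first five genes should share a module
```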

    The Profiling Potential of Computer Vision and the Challenge of Computational Empiricism

    Computer vision and other biometric data science applications have commenced a new project of profiling people. Rather than using 'transaction generated information', these systems measure the 'real world' and produce an assessment of the 'world state' - in this case an assessment of some individual trait. Instead of using proxies or scores to evaluate people, they increasingly deploy a logic of revealing the truth about reality and the people within it. While these profiling knowledge claims are sometimes tentative, they increasingly suggest that only through computation can these excesses of reality be captured and understood. This article explores the bases of those claims in the systems of measurement, representation, and classification deployed in computer vision. It asks whether there is something new in this type of knowledge claim, sketches an account of a new form of computational empiricism being operationalised, and questions what kind of human subject is being constructed by these technological systems and practices. Finally, the article explores legal mechanisms for contesting the emergence of computational empiricism as the dominant knowledge platform for understanding the world and the people within it.