Annotating Synapses in Large EM Datasets
Reconstructing neuronal circuits at the level of synapses is a central
problem in neuroscience and becoming a focus of the emerging field of
connectomics. To date, electron microscopy (EM) is the most proven technique
for identifying and quantifying synaptic connections. As advances in EM make
acquiring larger datasets possible, subsequent manual synapse identification
(i.e., proofreading) for deciphering a connectome becomes a major time
bottleneck. Here we introduce a large-scale, high-throughput, and
semi-automated methodology to efficiently identify synapses. We successfully
applied our methodology to the Drosophila medulla optic lobe, annotating many
more synapses than previous connectome efforts. Our approaches are extensible
and will make the often complicated process of synapse identification
accessible to a wider community of potential proofreaders.
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers up to a 5X speedup over a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel
Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.
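The parallel candidate-identification step described in this abstract can be sketched in miniature. The paper's implementation is in Scala on Apache Spark over YARN; the stdlib-only Python analogue below is an illustrative assumption, not the authors' code. It shows only the shape of the idea: each dedispersed time window is scored independently, so the scoring map can be distributed, and windows above a signal-to-noise cutoff become pulse candidates. The scoring function and the 5.0 threshold are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev

def snr(window):
    """Crude single-pulse score: peak height in units of the
    window's standard deviation (illustrative, not the paper's metric)."""
    mu, sigma = mean(window), pstdev(window)
    return (max(window) - mu) / sigma if sigma else 0.0

def find_candidates(windows, threshold=5.0):
    """Score every time window independently and keep the high-SNR ones.
    The embarrassingly parallel map is what Spark would distribute across
    cluster executors; threads are used here only for portability."""
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(snr, windows))
    return [i for i, s in enumerate(scores) if s >= threshold]
```

Because each window is scored with no shared state, the same map scales from a thread pool to Spark partitions on a YARN cluster, which is the property the speedup experiments exploit.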
Automated legal sensemaking: the centrality of relevance and intentionality
Introduction: In a perfect world, discovery would be conducted by the senior litigator who is
responsible for developing and fully understanding every nuance of the client's legal strategy. Of
course, today we must deal with the explosion of electronically stored information (ESI), which
now amounts to tens of thousands of documents even in small cases and increasingly involves
multi-million-document populations for internal corporate investigations and litigations.
Therefore scalable processes and technologies are required as a substitute for the authority’s
judgment. The approaches taken have typically either substituted large teams of surrogate
human reviewers using vastly simplified issue coding reference materials or employed
increasingly sophisticated computational resources with little focus on quality metrics to ensure
retrieval consistent with the legal goal. What is required is a system (people, process, and
technology) that replicates and automates the senior litigator’s human judgment.
In this paper we utilize 15 years of sensemaking research to establish the minimum acceptable
basis for conducting a document review that meets the needs of a legal proceeding. There is
no substitute for a rigorous characterization of the explicit and tacit goals of the senior litigator.
Once a process has been established for capturing the authority’s relevance criteria, we argue
that literal translation of requirements into technical specifications does not properly account for
the activities or states-of-affairs of interest. Having only a data warehouse of written records, it
is also necessary to discover the intentions of actors involved in textual communications. We
present quantitative results for a process and technology approach that automates effective
legal sensemaking.
PhenDisco: phenotype discovery system for the database of genotypes and phenotypes.
The database of genotypes and phenotypes (dbGaP) developed by the National Center for Biotechnology Information (NCBI) is a resource that contains information on various genome-wide association studies (GWAS) and is currently available via NCBI's dbGaP Entrez interface. The database is an important resource, providing GWAS data that can be used for new exploratory research or cross-study validation by authorized users. However, finding studies relevant to a particular phenotype of interest is challenging, as phenotype information is presented in a non-standardized way. To address this issue, we developed PhenDisco (phenotype discoverer), a new information retrieval system for dbGaP. PhenDisco consists of two main components: (1) text processing tools that standardize phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. In a preliminary comparison involving 18 search scenarios, PhenDisco showed promising performance for both unranked and ranked search comparisons with dbGaP's search engine Entrez. The system can be accessed at http://pfindr.net