5,458 research outputs found
Data-Mining a Large Digital Sky Survey: From the Challenges to the Scientific Results
The analysis and an efficient scientific exploration of the Digital Palomar
Observatory Sky Survey (DPOSS) represents a major technical challenge. The
input data set consists of 3 Terabytes of pixel information, and contains a few
billion sources. We describe some of the specific scientific problems posed by
the data, including searches for distant quasars and clusters of galaxies, and
the data-mining techniques we are exploring in addressing them.
Machine-assisted discovery methods may become essential for the analysis of
such multi-Terabyte data sets. New and future approaches involve unsupervised
classification and clustering analysis in the Giga-object data space, including
various Bayesian techniques. In addition to the searches for known types of
objects in this data base, these techniques may also offer the possibility of
discovering previously unknown, rare types of astronomical objects.Comment: Invited paper, to appear in Applications of Digital Image Processing
XX, ed. A. Tescher, Proc. S.P.I.E. vol. 3164, in press; 10 pages, a
self-contained TeX file, and 3 separate postscript figure
Approaches to Automated Morphological Classification of Galaxies
There is an obvious need for automated classification of galaxies, as the
number of observed galaxies increases very fast. We examine several approaches
to this problem, utilising {\em Artificial Neural Networks} (ANNs). We quote
results from a recent study which show that ANNs can classsify galaxies
morphologically as well as humans can.Comment: 8 pages, uu-encoded compressed postscript file (containing 2 figures
Comparison Between Supervised and Unsupervised Classifications of Neuronal Cell Types: A Case Study
In the study of neural circuits, it becomes essential to discern the different neuronal cell types that build the circuit. Traditionally, neuronal cell types have been classified using qualitative descriptors. More recently, several attempts have been made to classify neurons quantitatively, using unsupervised clustering methods. While useful, these algorithms do not take advantage of previous information known to the investigator, which could improve the classification task. For neocortical GABAergic interneurons, the problem to discern among different cell types is particularly difficult and better methods are needed to perform objective classifications. Here we explore the use of supervised classification algorithms to classify neurons based on their morphological features, using a database of 128 pyramidal cells and 199 interneurons from mouse neocortex. To evaluate the performance of different algorithms we used, as a “benchmark,” the test to automatically distinguish between pyramidal cells and interneurons, defining “ground truth” by the presence or absence of an apical dendrite. We compared hierarchical clustering with a battery of different supervised classification algorithms, finding that supervised classifications outperformed hierarchical clustering. In addition, the selection of subsets of distinguishing features enhanced the classification accuracy for both sets of algorithms. The analysis of selected variables indicates that dendritic features were most useful to distinguish pyramidal cells from interneurons when compared with somatic and axonal morphological variables. We conclude that supervised classification algorithms are better matched to the general problem of distinguishing neuronal cell types when some information on these cell groups, in our case being pyramidal or interneuron, is known a priori. As a spin-off of this methodological study, we provide several methods to automatically distinguish neocortical pyramidal cells from interneurons, based on their morphologies
What Your Username Says About You
Usernames are ubiquitous on the Internet, and they are often suggestive of
user demographics. This work looks at the degree to which gender and language
can be inferred from a username alone by making use of unsupervised morphology
induction to decompose usernames into sub-units. Experimental results on the
two tasks demonstrate the effectiveness of the proposed morphological features
compared to a character n-gram baseline
Methods for Amharic part-of-speech tagging
The paper describes a set of experiments
involving the application of three state-of-
the-art part-of-speech taggers to Ethiopian
Amharic, using three different tagsets.
The taggers showed worse performance
than previously reported results for Eng-
lish, in particular having problems with
unknown words. The best results were
obtained using a Maximum Entropy ap-
proach, while HMM-based and SVM-
based taggers got comparable results
Recommended from our members
Minimally supervised induction of morphology through bitexts
textA knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have been consequently many attempts to reduce this cost in the development of morphological systems through the development of unsupervised or minimally supervised algorithms and learning methods for acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language–the source language–to another language–the target. This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typlogical properties of German. The two main tasks, that of clustering and segmentation, are approached as sequential tasks with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, it attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.Linguistic
- …