Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences
Given the lack of word delimiters in written Japanese, word segmentation is
generally considered a crucial first step in processing Japanese texts. Typical
Japanese segmentation algorithms rely either on a lexicon and syntactic
analysis or on pre-segmented data; but these are labor-intensive, and the
lexico-syntactic techniques are vulnerable to the unknown word problem. In
contrast, we introduce a novel, more robust statistical method utilizing
unsegmented training data. Despite its simplicity, the algorithm yields
performance on long kanji sequences comparable to and sometimes surpassing that
of state-of-the-art morphological analyzers over a variety of error metrics.
The algorithm also outperforms another mostly-unsupervised statistical
algorithm previously proposed for Chinese.
Additionally, we present a two-level annotation scheme for Japanese to
incorporate multiple segmentation granularities, and introduce two novel
evaluation metrics, both based on the notion of a compatible bracket, that can
account for multiple granularities simultaneously.
Comment: 22 pages. To appear in Natural Language Engineering
Tsallis non-extensive statistics, intermittent turbulence, SOC and chaos in the solar plasma. Part one: Sunspot dynamics
In this study, the nonlinear analysis of the sunspot index is embedded in the
non-extensive statistical theory of Tsallis. The triplet of Tsallis, as well as
the correlation dimension and the Lyapunov exponent spectrum were estimated for
the SVD components of the sunspot index time series. Also the multifractal
scaling exponent spectrum, the generalized Renyi dimension spectrum and the
spectrum of the structure function exponents were estimated experimentally and
theoretically by using the entropy principle included in Tsallis non-extensive
statistical theory, following Arimitsu and Arimitsu. Our analysis clearly
showed the following: a) a phase transition process in the solar dynamics from
a high-dimensional non-Gaussian SOC state to a low-dimensional non-Gaussian
chaotic state; b) strong intermittent solar turbulence and an anomalous
(multifractal) solar diffusion process, which is strengthened as the solar
dynamics undergoes a phase transition to low-dimensional chaos, in accordance
with the studies of Ruzmaikin, Zeleny and Milovanov; c) faithful agreement of
Tsallis non-equilibrium statistical theory with the experimental estimations of
i) the non-Gaussian probability distribution function, ii) the multifractal
scaling exponent spectrum and generalized Renyi dimension spectrum, and iii) the
exponent spectrum of the structure functions estimated for the sunspot index
and its underlying non-equilibrium solar dynamics.
Comment: 40 pages, 11 figures
Financial advisors: a case of babysitters?
We merge administrative information from a large German discount brokerage firm with regional data to examine whether financial advisors improve portfolio performance. Our data track accounts of 32,751 randomly selected individual customers over 66 months and allow direct comparison of performance across self-managed accounts and accounts run by, or in consultation with, independent financial advisors (IFAs). In contrast to the picture painted by simple descriptive statistics, econometric analysis that corrects for the endogeneity of the choice of having a financial advisor suggests that advisors are associated with lower total and excess account returns, higher portfolio risk and probabilities of losses, and higher trading frequency and portfolio turnover relative to what account owners of given characteristics tend to achieve on their own. Regression analysis of who uses an IFA suggests that IFAs are matched with richer, older investors rather than with poorer, younger ones.
Measures of Analysis of Time Series (MATS): A MATLAB Toolkit for Computation of Multiple Measures on Time Series Data Bases
In many applications, such as physiology and finance, large time series data
bases are to be analyzed, requiring the computation of linear, nonlinear and
other measures. Such measures have been developed and implemented in commercial
and freeware software rather selectively and independently. The Measures of
Analysis of Time Series (MATS) MATLAB toolkit is designed to handle
an arbitrarily large set of scalar time series and compute a large variety of
measures on them, allowing for the specification of varying measure parameters
as well. The variety of options, with added facilities for visualization of the
results, supports different settings of time series analysis, such as the
detection of changes in dynamics in long data records, resampling (surrogate or
bootstrap) tests for independence and linearity with various test statistics,
and the discrimination power of different measures and of different combinations
of their parameters. The basic features of MATS are presented and the
implemented measures are briefly described. The usefulness of MATS is
illustrated on some empirical examples along with screenshots.
Comment: 25 pages, 9 figures, two tables, the software can be downloaded at
http://eeganalysis.web.auth.gr/indexen.ht