Recovering capitalization and punctuation marks for automatic speech recognition: case study for Portuguese broadcast news
The following material presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using finite state transducers automatically built from language models, and maximum entropy models. Several resources were used, including lexica, written newspaper corpora and speech transcriptions. Finite state transducers produced the best results for written newspaper corpora, but the maximum entropy approach also proved to be a good choice, suitable for the capitalization of speech transcriptions, and allowing straightforward on-the-fly capitalization. Evaluation results are presented both for written newspaper corpora and for broadcast news speech transcriptions. The frequency of each punctuation mark in broadcast news speech transcriptions was analyzed for three different languages: English, Spanish and Portuguese. The punctuation task was performed using a maximum entropy modeling approach, which combines different types of information, both lexical and acoustic. The contribution of each feature was analyzed individually, and separate results for each focus condition are given, making it possible to analyze the performance differences between planned and spontaneous speech. All results were evaluated on speech transcriptions of a Portuguese broadcast news corpus. The benefits of enriching speech recognition with punctuation and capitalization are shown in an example, illustrating the effects of the described experiments on spoken texts.
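The generative, lexicon-driven side of this approach can be illustrated with a toy sketch (not the authors' system): estimate each word's preferred surface form from a small cased corpus, then apply those preferences to an uncased transcript, uppercasing the sentence-initial word as in written text.

```python
from collections import Counter, defaultdict

def train_capitalizer(cased_sentences):
    """Count how often each word appears in each cased surface form."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for word in sentence.split():
            forms[word.lower()][word] += 1
    # For each word, keep its most frequent cased form.
    return {w: c.most_common(1)[0][0] for w, c in forms.items()}

def capitalize(uncased_sentence, model):
    """Restore case word by word; unknown words stay lowercase."""
    words = [model.get(w, w) for w in uncased_sentence.lower().split()]
    if words:  # sentence-initial uppercase, as in written text
        words[0] = words[0][0].upper() + words[0][1:]
    return " ".join(words)

# Tiny illustrative "cased corpus" (invented Portuguese-like examples).
corpus = [
    "O Presidente visitou Lisboa",
    "Lisboa recebeu o Presidente",
    "Hoje o Presidente falou em Lisboa",
]
model = train_capitalizer(corpus)
print(capitalize("o presidente visitou lisboa", model))
```

A maximum entropy model replaces the unigram counts with weighted lexical and contextual features, but the per-word decision structure is the same.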
Complexity in economic and social systems: cryptocurrency market at around COVID-19
Social systems are characterized by an enormous network of connections and
factors that can influence the structure and dynamics of these systems. All
financial markets, including the cryptocurrency market, belong to the
economic sphere of human activity, which seems to be the most interrelated and
complex. The complexity of the cryptocurrency market can be studied from
different
perspectives. First, the dynamics of the cryptocurrency exchange rates to other
cryptocurrencies and fiat currencies can be studied and quantified by means of
multifractal formalism. Second, coupling and decoupling of the cryptocurrencies
and the conventional assets can be investigated with the advanced
cross-correlation analyses based on fractal analysis. Third, an internal
structure of the cryptocurrency market can also be a subject of analysis that
exploits, for example, a network representation of the market. We approach this
subject from all three perspectives based on data recorded between January 2019
and June 2020. This period includes the Covid-19 pandemic and we pay particular
attention to this event and investigate how strong its impact on the structure
and dynamics of the market was. The studied data also cover a few other
significant events, such as the double bull and bear phases in 2019. We show
that, throughout the considered interval, the exchange rate returns were
multifractal, with intermittent signatures of bifractality that can be
associated with the most volatile periods of the market dynamics, such as the
bull market onset in April 2019 and the Covid-19 outbreak in March 2020. The
topology of a minimal spanning tree representation of the market also changed
during these events, from a distributed type without any dominant node to a
highly centralized type with USDT as the dominating hub. However, the MST
topology during the pandemic differs in some details from that of the other
volatile periods.
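The minimal spanning tree construction mentioned above can be sketched in a few lines. This is a minimal illustration on synthetic correlated returns, not the paper's data; it assumes the standard correlation-to-distance mapping d_ij = sqrt(2 (1 - rho_ij)) commonly used for asset MSTs.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(seed=0)
n_assets, n_obs = 6, 500

# Synthetic log-returns with one common factor, so assets are correlated.
common = rng.normal(size=n_obs)
returns = 0.5 * common[:, None] + rng.normal(size=(n_obs, n_assets))

# Correlation matrix -> distance metric d_ij = sqrt(2 (1 - rho_ij)).
rho = np.corrcoef(returns.T)
dist = np.sqrt(2.0 * (1.0 - rho))
np.fill_diagonal(dist, 0.0)  # zero means "no edge" for csgraph

# The MST keeps the n - 1 shortest links that still span all assets.
mst = minimum_spanning_tree(dist)
edges = np.transpose(mst.nonzero())
degree = np.bincount(edges.ravel(), minlength=n_assets)
print(len(edges), degree.max())  # a large max degree signals a hub node
```

A distributed-type tree has low maximum degree, while a centralized topology (like the USDT-dominated one described above) concentrates most links on a single hub.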
Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach
This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament, and on video subtitles, revealing that the proposed approach is suitable for performing capitalization, not only in each of the domains, but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved both for monolingual and multilingual data. Finally, we applied domain adaptation by fine-tuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in terms of computational cost.
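The token-classification framing behind such truecasing systems can be shown without any pretrained model: cased text yields one case label per token, and a classifier such as BERT is then trained to predict those labels from lowercased input. The three-label scheme below is a common convention for this task, not necessarily the exact label set used in the paper.

```python
def case_label(token):
    """Map a cased token to a truecasing label."""
    if token.isupper() and len(token) > 1:
        return "ALL_UPPER"      # acronyms: "EU", "NATO"
    if token[:1].isupper():
        return "UPPER_FIRST"    # sentence starts, proper nouns
    return "LOWER"

def make_example(cased_sentence):
    """Turn cased text into (lowercased input, label sequence)."""
    tokens = cased_sentence.split()
    labels = [case_label(t) for t in tokens]
    return [t.lower() for t in tokens], labels

inputs, labels = make_example("The EU Parliament met in Brussels")
print(labels)
```

At inference time the predicted labels are applied in reverse to restore the casing of an uncased ASR transcript.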
Statistically validated network of portfolio overlaps and systemic risk
Common asset holding by financial institutions, namely portfolio overlap, is
nowadays regarded as an important channel for financial contagion with the
potential to trigger fire sales and thus severe losses at the systemic level.
In this paper we propose a method to assess the statistical significance of the
overlap between pairs of heterogeneously diversified portfolios, which then
allows us to build a validated network of financial institutions where links
indicate potential contagion channels due to realized portfolio overlaps. The
method is implemented on a historical database of institutional holdings
ranging from 1999 to the end of 2013, but can be in general applied to any
bipartite network where the presence of similar sets of neighbors is of
interest. We find that the proportion of validated network links (i.e., of
statistically significant overlaps) increased steadily before the 2007-2008
global financial crisis and reached a maximum when the crisis occurred. We
argue that the nature of this measure implies that systemic risk from fire
sales liquidation was maximal at that time. After a sharp drop in 2008,
systemic risk resumed its growth in 2009, with a notable acceleration in 2013,
reaching levels not seen since 2007. We finally show that market trends tend to
be amplified in the portfolios identified by the algorithm, such that it is
possible to have an informative signal about financial institutions that are
about to suffer (enjoy) the most significant losses (gains).
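A standard null model for the significance of an overlap between two heterogeneously sized portfolios is the hypergeometric distribution: under random, independent stock picking from a universe of N assets, the number of common holdings follows a hypergeometric law. The sketch below uses illustrative numbers, not the paper's data.

```python
from scipy.stats import hypergeom

N = 1000            # assets in the investable universe
n1, n2 = 100, 80    # sizes of the two portfolios
k = 30              # observed number of common holdings

# Under random, independent selection, overlap ~ Hypergeometric(N, n1, n2).
expected = n1 * n2 / N
# P(overlap >= k): survival function evaluated at k - 1.
p_value = hypergeom.sf(k - 1, N, n1, n2)
print(expected, p_value)
```

An observed overlap of 30 against an expected 8 yields a vanishingly small p-value; a link between the two institutions is then validated when this p-value survives a multiple-testing correction across all pairs.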
The dynamics of correlated novelties
One new thing often leads to another. Such correlated novelties are a
familiar part of daily life. They are also thought to be fundamental to the
evolution of biological systems, human society, and technology. By opening new
possibilities, one novelty can pave the way for others in a process that
Kauffman has called "expanding the adjacent possible". The dynamics of
correlated novelties, however, have yet to be quantified empirically or modeled
mathematically. Here we propose a simple mathematical model that mimics the
process of exploring a physical, biological or conceptual space that enlarges
whenever a novelty occurs. The model, a generalization of Polya's urn, predicts
statistical laws for the rate at which novelties happen (analogous to Heaps'
law) and for the probability distribution on the space explored (analogous to
Zipf's law), as well as signatures of the hypothesized process by which one
novelty sets the stage for another. We test these predictions on four data sets
of human activity: the edit events of Wikipedia pages, the emergence of tags in
annotation systems, the sequence of words in texts, and listening to new songs
in online music catalogues. By quantifying the dynamics of correlated
novelties, our results provide a starting point for a deeper understanding of
the ever-expanding adjacent possible and its role in biological, linguistic,
cultural, and technological evolution.
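The urn model described above can be simulated in a few lines. This is a minimal sketch with illustrative parameters: each draw reinforces the drawn color rho times, and each novelty triggers nu + 1 brand-new colors, enlarging the adjacent possible.

```python
import random

def simulate_urn(steps, rho=10, nu=5, seed=42):
    """Polya urn with triggering: novelties enlarge the adjacent possible."""
    rng = random.Random(seed)
    urn = list(range(nu + 1))   # initial, never-seen colors
    next_color = nu + 1
    seen = set()
    distinct = []               # D(t): number of novelties after t draws
    for _ in range(steps):
        ball = rng.choice(urn)
        urn.extend([ball] * rho)          # reinforcement of the drawn color
        if ball not in seen:              # a novelty occurred...
            seen.add(ball)
            # ...so nu + 1 brand-new colors enter the urn (triggering)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        distinct.append(len(seen))
    return distinct

D = simulate_urn(2000)
print(D[-1])  # grows sublinearly in t, as in Heaps' law
```

Varying the ratio nu / rho changes the Heaps exponent of D(t), which is how the model reproduces the statistical laws mentioned above.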
The Market Fraction Hypothesis under different GP algorithms
In a previous work, inspired by observations made in many agent-based financial models, we formulated and presented the Market Fraction Hypothesis, which basically predicts a short duration for any dominant type of agents, but then a uniform distribution over all types in the long run. We then proposed a two-step approach, a rule-inference step and a rule-clustering step, for testing this hypothesis. We employed genetic programming as the rule inference engine, and applied self-organizing maps to cluster the inferred rules. We then ran tests for 10 international markets and provided a general examination of the plausibility of the hypothesis. However, because the tests took place under a single GP system, it could be argued that these results are dependent on the nature of the GP algorithm. This chapter thus serves as an extension to our previous work. We test the Market Fraction Hypothesis under two new, different GP algorithms, in order to show that the previous results are rigorous and are not sensitive to the choice of GP. We thus test the hypothesis again on the same 10 empirical datasets that were used in our previous experiments. Our work shows that certain parts of the hypothesis are indeed sensitive to the algorithm. Nevertheless, this sensitivity does not apply to all aspects of our tests. This therefore allows us to conclude that our previously derived results are rigorous and can thus be generalized.
Analysis of Errors in Capital Letter Usage and Word Choice in Descriptive Essays by Eighth-Grade Students of SMP NU 1 Wonosegoro
Abstract
The purpose of this study is to describe the forms of misuse of capital letters and of word choice in descriptive essays written by eighth-grade students, and to explain the factors behind these capitalization and word-choice errors. The study used a descriptive qualitative method. The object of the research is the improper use of capital letters and word choice in descriptive essays; the data source is the students' descriptive essays. Data were collected through observation and note-taking techniques, analyzed using the identity method, and validated through triangulation. The findings are: (1) redundant word-choice errors, 56.6%; (2) errors in using a capital letter as the first letter of a sentence, 33.33%; (3) errors in capitalizing the first letter of a person's name, 23.68%; (4) errors in geographical names, 9.64%; (5) improper capitalization of the names of days, months, and years, 7.89%; (6) errors in capitalizing the names of nations, ethnic groups, and languages, 7.01%; (7) morphological errors, 7.01%; (8) errors in the use of affixed words, 4.38%; (9) errors in capitalizing the first letters of a title, 2.63%; (10) phonological errors (word spelling), 1.75%; (11) typing errors, 1.75%; (12) capitalization errors in titles, 0.87%. The causes of the errors are a weak command of linguistic rules, difficulty in determining the theme and choosing appropriate words, and students' lack of care in doing their work.
Keywords: descriptive essays, capitalization errors, word-choice errors