
    Recovering capitalization and punctuation marks for automatic speech recognition: case study for Portuguese broadcast news

    The following material presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using finite state transducers automatically built from language models, and maximum entropy models. Several resources were used, including lexica, written newspaper corpora and speech transcriptions. Finite state transducers produced the best results for written newspaper corpora, but the maximum entropy approach also proved to be a good choice, suitable for the capitalization of speech transcriptions and allowing straightforward on-the-fly capitalization. Evaluation results are presented both for written newspaper corpora and for broadcast news speech transcriptions. The frequency of each punctuation mark in BN speech transcriptions was analyzed for three different languages: English, Spanish and Portuguese. The punctuation task was performed using a maximum entropy modeling approach, which combines different types of information, both lexical and acoustic. The contribution of each feature was analyzed individually, and separate results for each focus condition are given, making it possible to analyze the performance differences between planned and spontaneous speech. All results were evaluated on speech transcriptions of a Portuguese broadcast news corpus. The benefits of enriching speech recognition output with punctuation and capitalization are shown in an example illustrating the effects of the described experiments on spoken texts.
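
    The generative, language-model-based side of this approach can be illustrated with a minimal sketch. The snippet below is not the paper's FST or maximum entropy implementation; it is a simple unigram truecaser (the training sentences are invented for illustration) that restores each token's most frequent cased surface form from a cased training corpus and forces sentence-initial capitalization:

```python
from collections import Counter

def train_case_model(cased_sentences):
    # For each lowercased token, count the cased surface forms seen in training.
    counts = {}
    for sent in cased_sentences:
        for tok in sent.split():
            counts.setdefault(tok.lower(), Counter())[tok] += 1
    return counts

def truecase(tokens, counts):
    # Restore each token's most frequent surface form; capitalize sentence starts.
    out = []
    for i, tok in enumerate(tokens):
        forms = counts.get(tok.lower())
        best = forms.most_common(1)[0][0] if forms else tok
        if i == 0:
            best = best[:1].upper() + best[1:]
        out.append(best)
    return out
```

    A real system would replace the unigram counts with an n-gram language model compiled into a transducer, or with maximum entropy features over the surrounding context, as described above.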

    Complexity in economic and social systems: cryptocurrency market at around COVID-19

    Social systems are characterized by an enormous network of connections and factors that can influence their structure and dynamics. All financial markets, including the cryptocurrency market, belong to the economic sphere of human activity, which seems to be the most interrelated and complex. The cryptocurrency market's complexity can be studied from different perspectives. First, the dynamics of cryptocurrency exchange rates against other cryptocurrencies and fiat currencies can be studied and quantified by means of multifractal formalism. Second, the coupling and decoupling of cryptocurrencies and conventional assets can be investigated with advanced cross-correlation analyses based on fractal analysis. Third, the internal structure of the cryptocurrency market can also be a subject of analysis that exploits, for example, a network representation of the market. We approach this subject from all three perspectives based on data recorded between January 2019 and June 2020. This period includes the COVID-19 pandemic, and we pay particular attention to this event, investigating how strong its impact on the structure and dynamics of the market was. The studied data also covers a few other significant events, such as the double bull and bear phases in 2019. We show that, throughout the considered interval, the exchange rate returns were multifractal, with intermittent signatures of bifractality that can be associated with the most volatile periods of the market dynamics, such as the bull market onset in April 2019 and the COVID-19 outbreak in March 2020. The topology of a minimal spanning tree (MST) representation of the market also altered during these events, from a distributed type without any dominant node to a highly centralized type with a dominating hub of USDT. However, the MST topology during the pandemic differs in some details from that of other volatile periods.
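
    The minimal spanning tree construction behind such network analyses can be sketched compactly. The example below (asset labels and correlation values are invented for illustration, and this is a generic sketch rather than the paper's pipeline) applies Prim's algorithm to the standard correlation-based distance d_ij = sqrt(2(1 - rho_ij)):

```python
import math

def mst_edges(labels, corr):
    # Convert a correlation matrix to the metric d_ij = sqrt(2 * (1 - rho_ij)),
    # then grow a minimal spanning tree with Prim's algorithm.
    n = len(labels)
    dist = [[math.sqrt(2.0 * (1.0 - corr[i][j])) for j in range(n)]
            for i in range(n)]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge crossing from the tree to the rest.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((labels[i], labels[j]))
        in_tree.add(j)
    return edges
```

    A dominating hub such as USDT would show up as one node appearing in a disproportionate share of the returned edges.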

    Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach

    This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament and on video subtitles, revealing that the proposed approach is suitable for performing capitalization not only in each of the domains but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved both for monolingual and multilingual data. Finally, we applied domain adaptation by fine-tuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in terms of computational cost.
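
    Independent of any particular model, truecasing with contextualized embeddings is typically framed as per-token classification over a small label set. The sketch below shows only that framing (the classifier itself, a fine-tuned BERT in the paper, is omitted); the four-label scheme is a common convention and not necessarily the paper's exact one:

```python
def case_label(token):
    # Map a cased token to one of four truecasing classes.
    if len(token) > 1 and token.isupper():
        return "ALL_CAPS"
    if token[:1].isupper() and token[1:].islower():
        return "UPPER_FIRST"
    if token.islower():
        return "LOWER"
    return "MIXED"

def apply_label(token, label):
    # Reapply a predicted case label to a lowercased token.
    token = token.lower()
    if label == "ALL_CAPS":
        return token.upper()
    if label == "UPPER_FIRST":
        return token[:1].upper() + token[1:]
    return token  # LOWER and MIXED are left unchanged in this sketch
```

    Training data for the classifier comes for free: running `case_label` over any cased corpus yields one gold label per token of its lowercased version.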

    Statistically validated network of portfolio overlaps and systemic risk

    Common asset holding by financial institutions, namely portfolio overlap, is nowadays regarded as an important channel for financial contagion, with the potential to trigger fire sales and thus severe losses at the systemic level. In this paper we propose a method to assess the statistical significance of the overlap between pairs of heterogeneously diversified portfolios, which then allows us to build a validated network of financial institutions where links indicate potential contagion channels due to realized portfolio overlaps. The method is implemented on a historical database of institutional holdings ranging from 1999 to the end of 2013, but can in general be applied to any bipartite network where the presence of similar sets of neighbors is of interest. We find that the proportion of validated network links (i.e., of statistically significant overlaps) increased steadily before the 2007-2008 global financial crisis and reached a maximum when the crisis occurred. We argue that the nature of this measure implies that systemic risk from fire sales liquidation was maximal at that time. After a sharp drop in 2008, systemic risk resumed its growth in 2009, with a notable acceleration in 2013, reaching levels not seen since 2007. We finally show that market trends tend to be amplified in the portfolios identified by the algorithm, such that it is possible to have an informative signal about financial institutions that are about to suffer (enjoy) the most significant losses (gains).
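
    A natural null model for "how surprising is an overlap of k assets between two portfolios?" is the hypergeometric distribution. The sketch below captures the spirit of such a validation step rather than the paper's exact procedure: it computes the right-tail p-value of the observed overlap with stdlib tools, to be compared against a significance threshold after a multiple-testing correction such as Bonferroni or FDR.

```python
from math import comb

def overlap_pvalue(n_assets, d1, d2, overlap):
    # P(X >= overlap), where X is the overlap of two portfolios holding d1 and
    # d2 distinct assets drawn at random from a universe of n_assets
    # (hypergeometric null model).
    total = comb(n_assets, d2)
    return sum(comb(d1, k) * comb(n_assets - d1, d2 - k)
               for k in range(overlap, min(d1, d2) + 1)) / total
```

    A link between two institutions would then be validated only when this p-value falls below the corrected threshold.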

    The dynamics of correlated novelties

    One new thing often leads to another. Such correlated novelties are a familiar part of daily life. They are also thought to be fundamental to the evolution of biological systems, human society, and technology. By opening new possibilities, one novelty can pave the way for others in a process that Kauffman has called "expanding the adjacent possible". The dynamics of correlated novelties, however, have yet to be quantified empirically or modeled mathematically. Here we propose a simple mathematical model that mimics the process of exploring a physical, biological or conceptual space that enlarges whenever a novelty occurs. The model, a generalization of Polya's urn, predicts statistical laws for the rate at which novelties happen (analogous to Heaps' law) and for the probability distribution on the space explored (analogous to Zipf's law), as well as signatures of the hypothesized process by which one novelty sets the stage for another. We test these predictions on four data sets of human activity: the edit events of Wikipedia pages, the emergence of tags in annotation systems, the sequence of words in texts, and listening to new songs in online music catalogues. By quantifying the dynamics of correlated novelties, our results provide a starting point for a deeper understanding of the ever-expanding adjacent possible and its role in biological, linguistic, cultural, and technological evolution
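
    A generalized Polya urn of this kind can be simulated directly. The sketch below is a hedged reconstruction in the spirit of the model described above, not its exact specification: every draw reinforces the drawn color with rho copies, and every first-time draw of a color ("a novelty") injects nu + 1 brand-new colors into the urn, enlarging the adjacent possible.

```python
import random

def urn_with_triggering(steps, rho=2, nu=1, seed=0):
    # Simulate a Polya urn with triggering; returns the running count of
    # distinct colors drawn (a Heaps-like discovery curve).
    rng = random.Random(seed)
    urn = [0]            # start with a single color
    next_color = 1
    seen = set()
    discovery_curve = []
    for _ in range(steps):
        ball = rng.choice(urn)
        urn.extend([ball] * rho)                  # reinforcement
        if ball not in seen:                      # novelty: expand the space
            seen.add(ball)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        discovery_curve.append(len(seen))
    return discovery_curve
```

    Plotting the returned curve on log-log axes against the step number would reveal the sublinear, Heaps-like growth of novelties that the model predicts.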

    The Market Fraction Hypothesis under different GP algorithms

    In a previous work, inspired by observations made in many agent-based financial models, we formulated and presented the Market Fraction Hypothesis, which basically predicts a short duration for any dominant type of agents, but a uniform distribution over all types in the long run. We then proposed a two-step approach, a rule-inference step and a rule-clustering step, to test this hypothesis. We employed genetic programming (GP) as the rule-inference engine and applied self-organizing maps to cluster the inferred rules. We then ran tests for 10 international markets and provided a general examination of the plausibility of the hypothesis. However, because the tests took place under a single GP system, it could be argued that the results were dependent on the nature of that GP algorithm. This chapter thus serves as an extension to our previous work. We test the Market Fraction Hypothesis under two new, different GP algorithms, in order to verify that the previous results are rigorous and not sensitive to the choice of GP. We thus test the hypothesis again under the same 10 empirical datasets that were used in our previous experiments. Our work shows that certain parts of the hypothesis are indeed sensitive to the algorithm. Nevertheless, this sensitivity does not apply to all aspects of our tests. This allows us to conclude that our previously derived results are rigorous and can thus be generalized.

    An Analysis of Capital Letter and Word Choice Errors in the Descriptive Essays of Eighth-Grade Students at SMP NU 1 Wonosegoro

    Abstract. The purpose of this study is to describe the forms of capital letter misuse and word choice errors in descriptive essays written by eighth-grade students, and to explain the factors behind those errors. The study used a descriptive qualitative method. The object of the research is the improper use of capital letters and word choice in descriptive essays; the data source is the students' written work. Data were collected through close reading and note-taking, analyzed using a unified method, and validated by triangulation. The findings are: (1) redundant wording, 56.6%; (2) errors in capital letters at the beginning of a sentence, 33.33%; (3) errors in capital letters as the first letter of a person's name, 23.68%; (4) errors in geographic names, 9.64%; (5) improper capital letters in the names of days, months and years, 7.89%; (6) errors in capital letters for the names of nations and ethnic groups, 7.01%; (7) morphological errors, 7.01%; (8) errors in the use of affixed words, 4.38%; (9) errors in capital letters as the first letters of a title, 2.63%; (10) phonological errors (word spelling), 1.75%; (11) typing errors, 1.75%; (12) capitalization errors in titles, 0.87%. The causes of the errors are weak mastery of linguistic rules, difficulty in determining a theme and choosing appropriate words, and a lack of care by students in doing their work. Keywords: descriptive writing, capital letter errors, word choice errors