6 research outputs found

    Relating Turing's Formula and Zipf's Law

    An asymptote is derived from Turing's local reestimation formula for population frequencies, and a local reestimation formula is derived from Zipf's law for the asymptotic behavior of population frequencies. The two are shown to be qualitatively different asymptotically, but nevertheless to be instances of a common class of reestimation-formula-asymptote pairs, in which they constitute the upper and lower bounds of the convergence region of the cumulative of the frequency function, as rank tends to infinity. The results demonstrate that Turing's formula is qualitatively different from the various extensions to Zipf's law, and suggest that it smooths the frequency estimates towards a geometric distribution. Comment: 9 pages, uuencoded, gzipped PostScript; some typos removed
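
The reestimation formula the abstract refers to can be illustrated numerically. Below is a minimal sketch of Turing's (Good-Turing) local reestimation, r* = (r + 1) · N_{r+1} / N_r, where N_r is the number of word types observed exactly r times; the toy corpus is illustrative and not from the paper.

```python
from collections import Counter

def turing_reestimate(counts):
    """Turing's local reestimation: a type observed r times gets the
    adjusted count r* = (r + 1) * N_{r+1} / N_r, where N_r is the
    number of types observed exactly r times."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for word, r in counts.items():
        adjusted[word] = (r + 1) * freq_of_freq.get(r + 1, 0) / freq_of_freq[r]
    return adjusted

# Toy corpus: counts are a=3, b=2, c=2, d=1, e=1, f=1
counts = Counter("a a a b b c c d e f".split())
adjusted = turing_reestimate(counts)
print(adjusted["d"])  # a type seen once gets 2 * N_2 / N_1 = 2 * 2 / 3
```

Zipf's law, by contrast, describes frequency as a power of rank, f(r) ∝ r⁻¹; the paper's point is that the asymptotics induced by the two are qualitatively different.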

    HOW MANY WORDS ARE THERE?

    The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated, and a more complex model is offered.
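
The empirical side of the argument can be sketched by simulation: sampling tokens from a Zipf-like distribution, the observed vocabulary keeps growing with corpus size rather than saturating. The rank cutoff and seed below are arbitrary choices for the sketch, not parameters from the paper.

```python
import random

random.seed(0)
MAX_RANK = 10_000  # arbitrary truncation for the simulation
weights = [1.0 / r for r in range(1, MAX_RANK + 1)]  # Zipf: f(r) proportional to 1/r

def vocab_size(n_tokens):
    """Number of distinct word types observed in a sample of n_tokens."""
    sample = random.choices(range(MAX_RANK), weights=weights, k=n_tokens)
    return len(set(sample))

for n in (1_000, 10_000, 100_000):
    print(n, vocab_size(n))  # distinct-type count keeps climbing with n
```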

    BIBLIOMETRIC ANALYSIS OF NEURAL BASIS EXPANSION ANALYSIS FOR INTERPRETABLE TIME SERIES (N-BEATS) FOR RESEARCH TREND MAPPING

    Bibliometrics is the statistical analysis of articles, books, and other forms of publication. Bibliometric analysis uses data on the number and authorship of scientific publications and on citations to measure the work of individual researchers, groups, organizations, and countries, to identify national and international networks, and to map developments in new multidisciplinary fields of science and technology. In addition, bibliometrics assesses and maps the research of organizations and countries in a given time period. Bibliometric analysis also offers advantages that include mapping relationships between concepts, mapping research directions or trends, mapping the state of the art (the novelty of research results), and providing insights into fields, topics, and research problems for future work. This study aims to determine the growth and development of N-BEATS publications, their distribution, keyword variables, and author collaboration using a bibliometric network. The research method screens articles obtained from the Scopus database for 2008-2022 and uses their citations as metrics, while the metadata are visualized with VOSviewer. Data were collected from the ScienceDirect database with the keyword N-BEATS. The results show that 2022 has the highest number of publications, reaching 310 (14.90%). The distribution of research publications on N-BEATS shows a perfect distribution. Terms in the N-BEATS variable that often appear and are associated with other variables
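
The "mapping relationships between concepts" that tools like VOSviewer visualize rests on co-occurrence counting over document metadata. A minimal sketch, with made-up keyword lists standing in for real Scopus records:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-article keyword lists (illustrative, not real Scopus data)
articles = [
    ["n-beats", "time series", "forecasting"],
    ["n-beats", "forecasting", "deep learning"],
    ["time series", "forecasting"],
]

cooccurrence = Counter()
for keywords in articles:
    # Count each unordered keyword pair once per article
    for pair in combinations(sorted(set(keywords)), 2):
        cooccurrence[pair] += 1

print(cooccurrence.most_common(2))
```

VOSviewer then lays out the resulting weighted co-occurrence network; the underlying counting is this simple.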

    Network analysis of large scale object oriented software systems

    PhD thesis. The evolution of software engineering knowledge, technology, tools, and practices has seen progressive adoption of new design paradigms. Currently, the predominant design paradigm is object oriented design. Despite the advocated and demonstrated benefits of object oriented design, there are known limitations of static software analysis techniques for object oriented systems, and there are many current and legacy object oriented software systems that are difficult to maintain using existing reverse engineering techniques and tools. Consequently, there is renewed interest in dynamic analysis of object oriented systems, and the emergence of large and highly interconnected systems has fuelled research into the development of new scalable techniques and tools to aid program comprehension and software testing. In dynamic analysis, a key research problem is the efficient interpretation and analysis of large volumes of precise program execution data to facilitate efficient handling of software engineering tasks. Some of the techniques employed to improve the efficiency of analysis are inspired by empirical approaches developed in other fields of science and engineering that face comparable data analysis challenges. This research focuses on the application of empirical network analysis measures to dynamic analysis data of object oriented software. The premise of this research is that the methods that contribute significantly to the object collaboration network's structural integrity are also important for delivery of the software system's function. This thesis makes two key contributions. First, a definition is proposed for the concept of the functional importance of methods of object oriented software. Second, the thesis proposes and validates a conceptual link between object collaboration networks and the properties of a network model with power law connectivity distribution.
Results from empirical software engineering experiments on JHotDraw and Google Chrome are presented. The results indicate that the five standard centrality-based network measures considered can be used to predict functionally important methods with a significant level of accuracy. The search for functional importance of software elements is an essential starting point for program comprehension and software testing activities. The proposed definition and application of network analysis have the potential to improve the efficiency of post-release software engineering activities by facilitating rapid identification of potentially functionally important methods in object oriented software. These results, with some refinement, could be used to perform change impact prediction and a host of other potentially beneficial applications to improve software engineering techniques.
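
As a rough sketch of the idea, not the thesis's actual tooling or measures: a dynamic trace yields a method collaboration network, and centrality scores over that network rank candidate functionally important methods. The call edges and method names below are invented, and degree centrality is only the simplest of the five measures the thesis considers.

```python
from collections import defaultdict

# Hypothetical (caller, callee) edges from a dynamic execution trace
edges = [
    ("Figure.draw", "Canvas.paint"),
    ("Figure.draw", "Shape.bounds"),
    ("Editor.refresh", "Figure.draw"),
    ("Editor.refresh", "Canvas.paint"),
    ("Tool.activate", "Editor.refresh"),
]

# Degree centrality: total in- plus out-degree per method
degree = defaultdict(int)
for caller, callee in edges:
    degree[caller] += 1
    degree[callee] += 1

# Methods ranked by centrality; high-degree methods are candidates
# for functional importance
ranked = sorted(degree.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```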

    Significado, distribución y frecuencia de la categoría preposicional en español. Una aproximación computacional

    [spa] The prepositional category has traditionally been a word class with controversial linguistic features and grammatical behaviors. The thesis Significado, distribución y frecuencia de la categoría preposicional del español. Una aproximación computacional examines the nature of this controversy in the light of a quantitative, computational, corpus-linguistic methodology. The most unexplained gap in the history of its grammatical analysis has been how to identify its meaning. Descriptive notions about the semantics of the preposition have frequently been treated as subsidiary to its syntactic role, via case or thematic-role assignment. This, however, is nothing more than a recognition that its meaning also bears on its overall role within the grammar. From a neo-distributionalist conception, according to which the meaning of linguistic items lies in their contextual distribution, the hypothesis put forward is that the semantic expression of Spanish prepositions is gradual. The so-called Gradual Meaning Hypothesis, applied to the Spanish prepositional category, allows us to establish prepositional subclasses, from functionality to lexicality, passing through intermediate classes such as semi-functional and semi-lexical. The empirical justification of the Gradual Meaning Hypothesis is carried out through four experiments. The first falls within the methodology of machine learning. Using the clustering technique, we examine a set of 79,097 triplets of the form X - P - Z, where P is a Spanish preposition, based on complement prepositional phrases. These triplets depend on a series of Spanish verbs of movement for the prepositions a, hacia, and hasta, and are extracted from four widely recognized linguistic corpora of Spanish.
Once the automatic groupings have been obtained, they are evaluated in percentage terms by the agreement between the predictions of the human annotator (the suggested prepositional classes) and those of the machine (the clusters). In the second and third experiments we use another methodology and turn to the measurement of entropy, a quantity from Information Theory. In the second we classify the nouns of 3,898 triplets that depend on a series of Spanish verbs representing most semantic fields; in the third, 3,903 triplets that complement other nouns. This classification of nouns is based on a proposal of six semantic categories: Animate, Inanimate, Abstract Entity, Locative, Temporal, and Event. Once the nouns are classified, their entropic organization is measured, and it is verified that there is a correlation between the degree of entropy and the prepositional class: the greater the entropy, the greater the meaning. The fourth experiment starts from prepositional usage. From a 90-item test built on the prepositional classes of the hypothesis, the responses of 366 participants are collected, and the degree of variation of those responses is analyzed according to prepositional class. Again we use entropy as an index for identifying meaning. The results are subjected to statistical control tests to verify the reliability of the samples, the significance, and the agreement between observers (Cohen's kappa coefficient). The balance of the four experiments, through their results, favors the prediction of the hypothesis. Likewise, the diversity of analysis tools is a methodologically robust approach for the research and its conclusions.
Finally, the hypothesis is shown to open future perspectives in areas such as cross-linguistic contrast (of families typologically diverse in their adpositional expression), and aphasiology as a discipline that asks about the relations between errors and grammatical values. [eng] The prepositional category has traditionally been a word class endowed with controversial linguistic features and grammatical behaviors. In this thesis the controversy is examined from a quantitative, computational, and corpus-linguistic methodological point of view. The most unexplained gap in the history of its analysis lies in how its meaning can be identified. From a neo-distributionalist conception, according to which the meaning of linguistic items lies in their contextual distribution, the hypothesis that arises is that the semantic expression of Spanish prepositions is gradual. The so-called Gradual Meaning Hypothesis establishes four prepositional subclasses, from functional to lexical, through intermediate phases such as semi-functional and semi-lexical. The empirical justification of the Gradual Meaning Hypothesis is performed with four experiments. The first experiment falls within the machine learning methodology. Using the clustering technique, we observed a set of 79,097 triplets of the form X - P - Z, where P is a Spanish preposition, based on complement prepositional phrases. They are triplets with the prepositions a, hacia, and hasta governed by movement verbs, extracted from four well-known linguistic corpora of Spanish. Once the automatic groupings have been obtained, we measure the percentage agreement between the predictions of the human annotator (the suggested prepositional classes) and those of the machine (the clusters). In the second and third experiments we changed our methodology and turned to the measurement of entropy, a quantity from Information Theory.
In the second one we classify the nouns of 3,898 triplets that depend on verbs covering most semantic fields in Spanish; in the third one we classify 3,903 triplets that complement other nouns. This classification is based on a proposal of six semantic categories: Animate, Inanimate, Abstract Entity, Locative, Temporal, and Event. Once the nouns are classified, their entropic organization is measured, and it is verified that there is a correlation between the degree of entropy and the prepositional class: the greater the entropy, the greater the meaning. The fourth experiment starts with prepositional use. From a test, the degree of variation of the responses is analyzed according to prepositional class. Again we use entropy as an index for identifying meaning. The balance of the four experiments, through their results, favors the prediction of the hypothesis. The diversity of analysis tools is a methodologically robust approach for the research and its conclusions.
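
The entropy measurement used in the second and third experiments can be sketched as Shannon entropy over the distribution of semantic categories of the nouns a preposition governs. The category counts below are invented; under the Gradual Meaning Hypothesis, a skewed (low-entropy) profile would suggest a functional preposition and a spread-out (high-entropy) profile a lexical one.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as category counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)

# Hypothetical category counts for the nouns governed by two prepositions
functional = {"Locative": 95, "Temporal": 5}  # skewed -> low entropy
lexical = {"Animate": 20, "Inanimate": 22, "Abstract Entity": 18,
           "Locative": 21, "Temporal": 9, "Event": 10}  # spread -> high entropy

print(round(entropy(functional), 3), round(entropy(lexical), 3))
```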