New Fundamental Technologies in Data Mining
The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The book series entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond explaining each topic in depth, the two books offer useful hints and strategies for solving the problems discussed in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.
A comparative analysis of metal subgenres in terms of lexical richness and keyness
Metal music is realized in a vast variety of subgenres, all of which have their unique (or shared) characteristics not only in sound but also in their lyrics. Much research has been done to distinguish or classify subgenres, but little has addressed the linguistic differences across them. This study investigates the lexical richness and keyness levels of heavy metal, thrash metal and death metal using a corpus of 200 songs from each subgenre, 600 songs in total. The bands and songs were selected on the basis of references in the metal literature, which in the present study comprises academic books and articles on metal as well as noteworthy media productions, websites and metal blogs such as Metal Evolution and Encyclopaedia Metallum.
The song lyrics were manually processed: meta-data, mark-up and repeats were removed so that differences in repeat length would not affect the comparisons. This step matters because the analyses used in the study are sensitive to repeats, as they measure the frequencies and repeat ratios of words. After processing, song lengths were limited to lower and upper thresholds of 100 and 400 words.
The songs were analyzed for their lexical richness levels in three aspects: 1) lexical variation, 2) lexical sophistication and 3) lexical density. Lexical variation was operationalized as TTR, Guiraud, Uber and HD-D. Lexical sophistication was measured using a lexical frequency profile with two different frequency lists, the GSL and the BNC/COCA, by looking at the ratios of tokens and types falling beyond the most frequent two thousand words (Laufer 1995). Another sophistication measure, P_Lex, which also runs on the GSL, was applied. Lexical density analysis was based on the ratio of content words to all tokens in the texts. To complement this quantitative, data-driven approach, a keyness analysis was conducted to add a qualitative dimension to the research.
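To make two of the simpler measures concrete, the sketch below computes TTR and Guiraud's index, plus a naive lexical density. The study's actual tooling is not specified, and HD-D and Uber require more involved probabilistic formulas, so treat this as an illustration only; the tokenizer and function-word list are simplifying assumptions.

```python
# Minimal sketch of lexical variation and density measures (illustrative,
# not the study's actual implementation).
import math
import re

def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def ttr(tokens):
    """Type-token ratio: vocabulary size over text length."""
    return len(set(tokens)) / len(tokens)

def guiraud(tokens):
    """Guiraud's index: types / sqrt(tokens), less length-sensitive than TTR."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def lexical_density(tokens, function_words):
    """Share of content words, given a stop list of function words."""
    content = [t for t in tokens if t not in function_words]
    return len(content) / len(tokens)

lyrics = "run through the night we run and run until the dawn"
toks = tokenize(lyrics)
print(f"TTR={ttr(toks):.2f}  Guiraud={guiraud(toks):.2f}")
print(f"density={lexical_density(toks, {'the', 'we', 'and', 'until', 'through'}):.2f}")
```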
All lexical richness analyses pointed to statistically significant differences between all subgenres, marking heavy metal as the least and death metal as the most lexically rich. Keyness analysis likewise indicated differences among all three subgenres. Heavy metal keywords tended to be Dionysian, whereas thrash and death metal keywords were more Chaotic, in the terms proposed by Weinstein (2000). Finally, a correlation analysis showed that all lexical richness measures were significantly correlated with each other. Based on the findings, it can be claimed that 1) these three subgenres differ from each other not only in music but also in lexical richness levels and keywords, and 2) lexical richness analyses, coupled with keyness, are capable of reflecting genre differences in song lyrics. However, a discriminant analysis of the present corpus showed that the reverse approach, classifying genres on the basis of lexical features, does not yield a pattern that fully corresponds to the existing classifications.
An Initial Framework Assessing the Safety of Complex Systems
Work presented at the Conference on Complex Systems, held online on 7-11 December 2020.
Atmospheric blocking events, that is, large-scale, nearly stationary atmospheric pressure patterns, are often associated with extreme weather in the mid-latitudes, such as heat waves and cold spells, which have significant consequences for ecosystems, human health and the economy. The high impact of blocking events has motivated numerous studies. However, there is not yet a comprehensive theory explaining their onset, maintenance and decay, and their numerical prediction remains a challenge. In recent years, a number of studies have successfully employed complex network descriptions of fluid transport to characterize dynamical patterns in geophysical flows. The aim of the current work is to investigate the potential of so-called Lagrangian flow networks for the detection, and perhaps forecasting, of atmospheric blocking events. The network is constructed by associating nodes with regions of the atmosphere and establishing links based on the flux of material between these nodes during a given time interval. One can then use effective tools and metrics developed in the context of graph theory to explore the atmospheric flow properties. In particular, Ser-Giacomi et al. [1] showed how optimal paths in a Lagrangian flow network highlight distinctive circulation patterns associated with atmospheric blocking events. We extend these results by studying the behavior of selected network measures (such as degree, entropy and harmonic closeness centrality) at the onset of and during blocking situations, demonstrating their ability to trace the spatio-temporal characteristics of these events.
This research was conducted as part of the CAFE (Climate Advanced Forecasting of sub-seasonal Extremes) Innovative Training Network, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 813844.
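The construction described above (nodes as atmospheric regions, links weighted by material flux) can be sketched roughly as follows: count tracer particles moving between grid cells over the time interval, row-normalize to a transition matrix, and read off node measures such as out-degree and entropy. The synthetic trajectories and grid here are placeholders, not the abstract's actual data or code.

```python
# Hedged sketch of a Lagrangian flow network: P[i, j] is the fraction of
# particles starting in cell i that end in cell j after the chosen interval.
import numpy as np

def flow_network(start_cells, end_cells, n_cells):
    """Row-normalized transition matrix from particle start/end cell indices."""
    counts = np.zeros((n_cells, n_cells))
    for i, j in zip(start_cells, end_cells):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

def node_degree(P):
    """Out-degree: number of distinct cells reachable from each node."""
    return (P > 0).sum(axis=1)

def node_entropy(P):
    """Shannon entropy of each node's outgoing flux distribution; low entropy
    suggests coherent, trapped transport, as expected inside a block."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P), 0.0)
    return -terms.sum(axis=1)

rng = np.random.default_rng(0)
start = rng.integers(0, 10, size=1000)               # synthetic start cells
end = (start + rng.integers(0, 3, size=1000)) % 10   # synthetic end cells
P = flow_network(start, end, 10)
print(node_degree(P), node_entropy(P).round(2))
```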
Computational Tools for the Dynamic Categorization and Augmented Utilization of the Gene Ontology
Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as the Gene Ontology, are of interest for their role in organizing terminology used to describe, among other concepts, the functions, locations, and processes of genes and gene products. Because ontologies bring consistency and automation to such annotations, methods have been developed to find enriched biological terminology in a set of differentially identified genes from a tissue or cell sample, aiding the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often yield many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. Such large sets of weakly enriched terms are difficult to interpret without manually sorting them into appropriate functional or descriptive categories. Moreover, the relationships that organize terminology within these ontologies carry no description of semantic scoping or scaling among terms, an ambiguity that complicates automated term categorization and thus interpretability.
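For context, the enrichment analyses critiqued here typically rest on a hypergeometric test per term: given k study genes annotated to a term out of n study genes, with K of N background genes annotated, how surprising is k? The sketch below shows that generic test with hypothetical numbers; it illustrates the standard approach, not the authors' own method.

```python
# Standard hypergeometric test behind GO annotation enrichment (generic
# illustration; numbers are hypothetical).
from scipy.stats import hypergeom

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeom(N, K, n): probability of drawing k or
    more annotated genes in a random sample of n from the background."""
    return hypergeom.sf(k - 1, N, K, n)

# E.g. 12 of 200 study genes hit a term annotated to 300 of 20000 genes.
print(f"p = {enrichment_pvalue(12, 200, 300, 20000):.3g}")
```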
We emphasize that existing methods risk producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of the ontologies are used that omit the problematic relations. Such ambiguities can have a significant impact on term categorization: we have calculated upper-bound estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. Omitting problematic relationships, however, results in a significant loss of retrievable information: in the Gene Ontology, omitting this single relation reduces retrievable information by 6%, and this percentage is expected to increase drastically when all relations in an ontology are considered. To address these issues, we have developed methods that categorize individual ontology terms into broad, biologically related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, while also addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph.
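The has_part problem can be made concrete with a toy traversal. In GO, a gene annotated to a term is implicitly annotated to that term's is_a and part_of ancestors, but not to terms reached through has_part, whose scope runs the other way; a naive traversal that follows every edge conflates the two. The miniature ontology below is hypothetical and only illustrates the scoping issue, not the authors' categorization method.

```python
# Minimal sketch of scoping-aware annotation propagation over a toy ontology.
ONTOLOGY = {
    # term: list of (relation, related term) pairs
    "mitochondrial membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle"),
                      ("has_part", "mitochondrial ribosome")],
    "organelle": [],
    "mitochondrial ribosome": [("is_a", "ribosome")],
    "ribosome": [],
}

SAFE_RELATIONS = {"is_a", "part_of"}  # propagate annotations only along these

def ancestors(term, ontology):
    """Terms whose annotation is implied by an annotation to `term`."""
    found, stack = set(), [term]
    while stack:
        for rel, parent in ontology[stack.pop()]:
            if rel in SAFE_RELATIONS and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

# Yields {mitochondrion, organelle}; a naive traversal through has_part
# would wrongly pull in "ribosome" as well.
print(ancestors("mitochondrial membrane", ONTOLOGY))
```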
We show that, compared to similar term categorization methods, our method matches hand-curated categorizations with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically relevant terms, and by extension putative gene targets, are identified in our annotation enrichment results that would otherwise be missed by traditional methods. Additionally, we observed a marginal yet consistent improvement in statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.
Automatic Analysis of Linguistic Complexity and Its Application in Language Learning Research
The constructs of complexity, accuracy, and fluency have become central foci of language learning research in recent years. This dissertation focuses on complexity, a multidimensional construct with its own working mechanisms, cognitive and psycholinguistic processes, and developmental dynamics. Six studies revolving around complexity, covering its conceptualization, automatic measurement, and application in language acquisition research, are reported.
The basis of these studies is the automatic multidimensional analysis of linguistic complexity, implemented in a Web platform called the Common Text Analysis Platform using state-of-the-art Natural Language Processing (NLP) technologies. The system provides a rich set of complexity measures that are easily accessible to non-expert users and supports collaborative development of complexity feature extractors.
An application study examining text-level readability through the word-level feature of lexical frequency is reported next. It was found that the lexical complexity measure of word frequency was highly predictive of text readability. Another application study investigates the developmental interrelationship between complexity and accuracy, an issue on which conflicting theories and research results have been reported. Our findings support the simultaneous development account.
The remaining studies apply automatic complexity analysis to promote language development, which involves analyzing both learning input and learner production, as well as linking the two spaces. We first proposed and validated an approach that links input and production via distances between complexity feature vectors, as sketched below. The ICALL system SyB implementing this approach was then developed and demonstrated. The system's effectiveness was tested in a randomized controlled experiment examining the effects of different levels of input challenge on L2 development. The results supported the comprehensible input hypothesis in Second Language Acquisition (SLA) and provided an automatizable operationalization of the theory.
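A rough sketch of the linking idea: represent a learner text and candidate input texts as vectors of complexity measures, then rank inputs by standardized vector distance. The feature names, the z-scoring, and the Euclidean distance are illustrative assumptions, since the dissertation's exact feature set and metric are not given in the abstract.

```python
# Hedged sketch of input-production linking via complexity feature vectors.
import numpy as np

FEATURES = ["mean_sentence_len", "ttr", "clauses_per_sentence"]  # hypothetical

def standardize(X):
    """Z-score each complexity feature so no single scale dominates."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def rank_inputs(learner_vec, input_vecs):
    """Indices of input texts, nearest complexity profile first."""
    X = standardize(np.vstack([learner_vec, input_vecs]))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return np.argsort(dists)

learner = np.array([12.0, 0.55, 1.4])    # learner production profile
inputs = np.array([[11.5, 0.57, 1.5],    # candidate reading texts
                   [22.0, 0.70, 2.8],
                   [8.0, 0.40, 1.1]])
print(rank_inputs(learner, inputs))      # nearest-profile text first
```

To operationalize "input challenge", one could then select texts at a controlled distance above the learner's profile rather than simply the nearest one.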
The series of studies in this dissertation demonstrates how language learning research can benefit from NLP technologies, and, conversely, how these technologies can be applied to build practical language learning systems on solid theoretical and research foundations in SLA.
DOCODE 3.0 (DOcument COpy DEtector): A system for plagiarism detection by applying an information fusion process from multiple documental data sources
Plagiarism refers to the act of presenting external words, thoughts, or ideas as one's own, without providing references to the sources from which they were taken. The exponential growth of digital document sources available on the Web has facilitated the spread of this practice, making its accurate detection a crucial task for educational institutions. In this article, we present DOCODE 3.0, a Web system for educational institutions that performs automatic analysis of large quantities of digital documents with respect to their degree of originality. Since plagiarism is a complex problem that is frequently tackled at different levels, our system applies algorithms that perform an information fusion process over multiple data sources at all of these levels. These algorithms have been successfully tested in the scientific community on tasks such as the identification of plagiarized passages and the retrieval of source candidates from the Web and other data sources such as digital libraries, and have proven to be very effective. We integrate these algorithms into a multi-tier, robust and scalable JEE architecture, allowing many different types of clients with different requirements to consume our services. For users, DOCODE produces a number of visualizations and reports from the different outputs, letting teachers and professors gain insight into the originality of the documents they review, allowing them to discover, understand and handle possible plagiarism cases, and making it easier and much faster to analyze a vast number of documents. Our experience so far is focused on the Chilean situation and the Spanish language, offering solutions to Chilean educational institutions in any of their preferred Virtual Learning Environments. However, DOCODE can easily be adapted to increase language coverage.
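The abstract does not spell out DOCODE's matching algorithms, so as a generic illustration of the kind of passage-level comparison such systems perform, here is a word n-gram fingerprint comparison using Jaccard similarity. This is a textbook baseline under our own assumptions, not DOCODE's actual fusion pipeline.

```python
# Generic n-gram fingerprint comparison for passage-level similarity
# (illustrative baseline only).
import re

def ngrams(text, n=5):
    """Set of word n-grams used as a document fingerprint."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=5):
    """Overlap of two fingerprints: 0 = disjoint, 1 = identical."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

suspect = "the exponential growth of digital document sources on the web"
source = "exponential growth of digital document sources on the web today"
print(f"similarity = {jaccard(suspect, source, n=4):.2f}")
```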
Advances in knowledge discovery and data mining Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II