
    Inducing Language Networks from Continuous Space Word Representations

    Recent advances in unsupervised feature learning have produced powerful latent representations of words. However, it is still not clear what makes one representation better than another, or how the ideal representation can be learned. Understanding the structure of the latent spaces obtained is key to any future advance in unsupervised learning. In this work, we introduce a new view of continuous-space word representations as language networks. We explore two techniques for creating language networks from learned features, inducing networks for two popular word-representation methods and examining the properties of the resulting graphs. We find that the induced networks differ from those produced by other methods of creating language networks, and that they contain meaningful community structure.
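    One plausible way to induce such a network — a minimal Python sketch, not necessarily either of the paper's two techniques — is to link every word to its k nearest neighbours in embedding space under cosine similarity. The function name and the default k=10 are illustrative assumptions:

```python
import numpy as np
import networkx as nx

def knn_language_network(words, vectors, k=10):
    """Build an undirected k-nearest-neighbour graph over word embeddings.

    words:   list of n word strings
    vectors: (n, d) array of embedding vectors
    k:       number of neighbours to link per word
    """
    # Cosine similarity via normalised dot products.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-links

    g = nx.Graph()
    g.add_nodes_from(words)
    for i, word in enumerate(words):
        for j in np.argsort(sims[i])[-k:]:  # k most similar words
            g.add_edge(word, words[j], weight=float(sims[i, j]))
    return g
```

    Community structure in the resulting graph could then be probed with any standard community-detection routine, e.g. greedy modularity maximization from networkx.algorithms.community.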

    On the Modeling of Musical Solos as Complex Networks

    Notes in a musical piece are building blocks employed in non-random ways to create melodies. It is the "interaction" among a limited set of notes that allows the variety of musical compositions written over the centuries and across cultures to be constructed. Networks are a modeling tool commonly employed to represent a set of interacting entities. Notes composing a melody can thus be seen as nodes of a network, connected whenever they are played in sequence; the outcome of this process is a directed graph. Using complex network theory, the main metrics of such musical graphs can be measured to characterize the corresponding pieces. In this paper, we define a framework for representing melodies as networks, and we provide an analysis of a set of guitar solos performed by prominent musicians. The results indicate that the presented model can benefit audio and multimedia applications such as music classification, identification, e-learning, automatic music generation, and multimedia entertainment.
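    A minimal sketch of the note-transition construction the abstract describes, in Python with networkx (the helper name and the toy note sequence are illustrative, not from the paper's data):

```python
import networkx as nx

def melody_network(notes):
    """Model a melody as a directed graph: nodes are distinct notes,
    and an edge u -> v is added whenever v is played right after u."""
    g = nx.DiGraph()
    for u, v in zip(notes, notes[1:]):
        if g.has_edge(u, v):
            g[u][v]["weight"] += 1  # repeated transitions accumulate weight
        else:
            g.add_edge(u, v, weight=1)
    return g

# Toy solo fragment:
solo = ["E4", "G4", "A4", "G4", "E4", "D4", "E4"]
g = melody_network(solo)
print(g.number_of_nodes(), g.number_of_edges())
print(nx.density(g))  # one of the metrics one might measure per piece
```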

    An investigation into the effects and effectiveness of correlation network filtration methods with financial returns

    When studying financial markets, we often need to estimate a correlation matrix from asset returns. These estimates tend to be noisy, with many more dimensions than samples, so the resulting correlation matrix is often filtered. Popular filtering methods include the minimum spanning tree, the planar maximally filtered graph and the triangulated maximally filtered graph, which treat the correlation matrix as the adjacency matrix of a graph and then apply tools from graph theory. Each of these methods assumes the data fit some particular shape, yet we do not necessarily have a reason to believe that they do, and there have been few empirical investigations comparing how the methods perform. In this paper we look at how the filtered networks differ from the original networks, using stock returns from the US, UK, German, Indian and Chinese markets, and at how these methods affect our ability to distinguish between datasets created from different correlation matrices using a graph embedding algorithm. We find that the similarity between the full and filtered networks depends on the data and the state of the market and decreases as the networks grow, and that the filtered networks do not improve classification accuracy compared to the full networks.
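    As an illustration of the filtering step, a minimal Python sketch of the minimum-spanning-tree method: correlations are mapped to distances via the standard d = sqrt(2(1 - rho)) transform and only the MST of the complete distance graph is kept. The function name is an assumption; the PMFG and TMFG variants additionally enforce planarity-type constraints and are omitted here.

```python
import numpy as np
import networkx as nx

def filtered_mst(returns):
    """Filter a correlation matrix of asset returns down to its
    minimum spanning tree.

    returns: (T, n) array, T observations of n assets
    """
    corr = np.corrcoef(returns, rowvar=False)
    # Map correlations to distances; clip guards against tiny
    # negative values from floating-point error.
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))

    g = nx.Graph()
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=dist[i, j])
    return nx.minimum_spanning_tree(g)  # keeps n - 1 of n(n-1)/2 edges
```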

    A Corpus-based Language Network Analysis of Near-synonyms in a Specialized Corpus

    As the international medium of communication for seafarers throughout the world, English has long been recognized as important in the maritime industry. Many studies have been conducted on Maritime English teaching and learning; nevertheless, although the language contains many near-synonyms, few studies have examined the near-synonyms used in the maritime industry. The objective of this study is to answer the following three questions. First, what are the differences and similarities between different near-synonyms in English? Second, can collocation network analysis provide a new perspective that explains the distinctions between near-synonyms at the microscopic level? Third, is semantic domain network analysis useful for distinguishing one near-synonym from another at the macroscopic level? In pursuit of these research questions, I first illustrated how the idea of incorporating collocates has been studied in corpus linguistics, Maritime English, near-synonyms, semantic domains and language networks. Then important concepts such as Maritime English, English for Specific Purposes, corpus linguistics, synonymy, collocation, semantic domains and language network analysis were introduced. Third, I compiled a 2.5-million-word specialized Maritime English Corpus and proposed a new method of tagging English multi-word compounds, discussing the comparison with and without multi-word compounds with regard to tokens, types, STTR and mean word length. Fourth, I examined the collocates of five groups of near-synonyms, i.e., ship vs. vessel, maritime vs. marine, ocean vs. sea, safety vs. security, and harbor vs. port, extracting data with WordSmith 6.0, tagging semantic domains in Wmatrix 3.0, and conducting network analyses in NetMiner 4.0. In the final stage, the results and discussion allowed me to answer the research questions. First, maritime near-synonyms generally show clear preferences for specific collocates. Because of the specialized nature of Maritime English, general definitions are not helpful for distinguishing near-synonyms, so a new perspective is needed to view the behavior of maritime words. Second, as a special visualization method, collocation network analysis can give learners a direct view of the relationships between words. Compared with traditional collocation tables, learners can identify collocates more quickly and find the relationships between several node words. In addition, it is much easier for learners to find the collocates exclusive to a specific word, thereby helping them understand the meaning specific to that word. Third, if the collocation network shows learners the relationships between words, the semantic domain network can offer cognitive guidance: given a specific word, how one can process it mentally and so find the more appropriate synonym for it to collocate with. Main semantic domain network analysis shows us the domains exclusive to a certain near-synonym, and thereby defines the concepts exclusive to it; furthermore, main semantic domain network analysis and sub-semantic domain network analysis together can tell us how near-synonyms show a preference or tendency for one synonym rather than another, even when they share semantic domains. The options for identifying relationships between near-synonyms can be presented through the classic metaphor of "the forest and the trees."
Generally speaking, we see only the vein of a tree leaf through traditional sentence-level analysis. We see the full leaf through collocation network analysis. We see the tree, even the whole forest, through semantic domain network analysis.
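    As a rough illustration of the collocation-network construction (the thesis itself uses WordSmith 6.0 and NetMiner 4.0; this Python sketch, its window size, and its frequency threshold are stand-in assumptions), each node word is linked to every word co-occurring within a fixed window:

```python
from collections import Counter
import networkx as nx

def collocation_network(tokens, node_words, window=4, min_count=5):
    """Link each node word to the words co-occurring within
    +/- `window` tokens of it, keeping collocates seen at least
    `min_count` times.  `tokens` is a tokenized, lowercased corpus."""
    counts = {w: Counter() for w in node_words}
    for i, tok in enumerate(tokens):
        if tok in counts:
            lo, hi = max(0, i - window), i + window + 1
            for j in range(lo, min(hi, len(tokens))):
                if j != i:
                    counts[tok][tokens[j]] += 1

    g = nx.Graph()
    for w, coll in counts.items():
        for c, n in coll.items():
            if n >= min_count:
                g.add_edge(w, c, weight=n)
    return g

# Collocates linked to both members of a pair are "shared"; the rest are
# exclusive to one near-synonym, e.g. for ship vs. vessel:
#   shared = set(g["ship"]) & set(g["vessel"])
```

    Plotting such a graph makes the exclusive collocates of each near-synonym visible at a glance, which is the visual advantage the abstract claims over traditional collocation tables.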

    Development of Computer-aided Concepts for the Optimization of Single-Molecules and their Integration for High-Throughput Screenings

    In the field of synthetic biology, highly interdisciplinary approaches to the design and modelling of functional molecules using computer-assisted methods have become established in recent decades. These methods are mainly used where experimental approaches reach their limits, as computer models can, for example, elucidate the temporal behaviour of nucleic acid polymers or proteins through single-molecule simulations, and illustrate the functional relationships of amino acid residues or nucleotides to one another. The knowledge gained from computer modelling can continuously inform the further experimental process (screening) as well as the shape or function (rational design) of the molecule under consideration. Such human-guided optimization of biomolecules is often necessary, since the substrates considered for the biocatalysts and enzymes are usually synthetic ("man-made" materials such as PET), and evolution has not had time to provide efficient biocatalysts for them. With regard to the computer-aided design of single molecules, two fundamental paradigms share supremacy in synthetic biology: on the one hand, probabilistic experimental methods (e.g., evolutionary design processes such as directed evolution) combined with High-Throughput Screening (HTS); on the other, rational, computer-aided single-molecule design methods. For both topics, computer models and concepts were developed, evaluated and published. The first contribution in this thesis describes a computer-aided design approach for the Fusarium solani cutinase (FsC). The loss of enzyme activity during longer incubation with PET was investigated in molecular detail. For this purpose, Molecular Dynamics (MD) simulations of the spatial structure of FsC together with a water-soluble degradation product of the synthetic substrate PET (ethylene glycol) were computed, and the existing model was extended by combining it with reduced models. This simulation study identified certain areas of FsC that interact strongly with ethylene glycol and thus have a significant influence on the flexibility and structure of the enzyme. The subsequent original publication establishes a new method for selecting High-Throughput assays for use in protein chemistry. The selection is made via a meta-optimization of the candidate assays: control reactions are carried out for each assay, the distance between the control distributions is evaluated using classical statistical methods such as the Kolmogorov-Smirnov test, and a performance score is then assigned to each assay. These control experiments are performed before the actual screening, and the assay with the highest performance is used for further screening. By applying this generic method, high success rates can be achieved, as we demonstrated experimentally using lipases and esterases. In the area of green chemistry, the above-mentioned processes can help find enzymes for the degradation of synthetic materials more quickly, or modify naturally occurring enzymes so that they can efficiently convert synthetic substrates after successful optimization. Throughout, the experimental effort (consumption of materials) is kept to a minimum during practical implementation.
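    A minimal sketch of the assay meta-optimization idea, assuming the performance score is simply the two-sample Kolmogorov-Smirnov statistic between positive and negative control readouts (the assay names and readout values below are hypothetical, not from the publication):

```python
from scipy.stats import ks_2samp

def rank_assays(controls):
    """Assign each candidate assay a performance score from its control
    reactions: the KS statistic between the positive and negative
    control distributions (larger = better separated controls).

    controls: dict mapping assay name -> (positive_readouts, negative_readouts)
    """
    scores = {name: ks_2samp(pos, neg).statistic
              for name, (pos, neg) in controls.items()}
    best = max(scores, key=scores.get)  # assay used for further screening
    return best, scores

# Hypothetical control readouts for two enzyme assays:
controls = {
    "lipase_assay":   ([0.90, 1.10, 1.00, 0.95], [0.10, 0.20, 0.15, 0.12]),
    "esterase_assay": ([0.60, 0.70, 0.50, 0.65], [0.40, 0.50, 0.45, 0.35]),
}
best, scores = rank_assays(controls)
print(best, scores)
```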
Especially for large-scale screenings, prior consideration or restriction of the possible sequence space can contribute significantly to maximizing the success rate of screenings and minimizing the total time they require. In addition to classical methods such as MD simulations combined with reduced models, new graph-based methods for the representation and analysis of MD simulations were developed. For this purpose, simulations were converted into distance-dependent dynamic graphs. Based on this reduced representation, efficient analysis algorithms were developed and tested. In particular, network motifs were investigated to determine whether this type of semantics is more suitable than spatial coordinates for describing molecular structures and interactions within MD simulations. The concept was evaluated on MD simulations of various molecules, such as water, synthetic pores, proteins, peptides and RNA structures, and this novel form of semantics proved to be an excellent way to describe (bio)molecular structures and their dynamics. Furthermore, an algorithm (StreAM-Tg) was developed for creating motif-based Markov models, especially for the analysis of single-molecule simulations of nucleic acids; it is used for the design of RNAs, and the insights obtained from StreAM-Tg analyses (Markov models) can provide useful recommendations for the (re)design of functional RNA. In this context, a new method was developed to quantify the environment (i.e., water; the solvent context) and its influence on biomolecules in MD simulations, using three-vertex motifs to describe the structure of individual water molecules. This method describes the structure and dynamics of water accurately: for example, we were able to reproduce the thermodynamic entropy of water in the liquid and vapor phases along the vapor-liquid equilibrium curve from the triple point to the critical point. Another major field covered in this thesis is the development of new computer-aided HTS approaches for the design of functional RNA. For the production of functional RNA (e.g., aptamers and riboswitches), an experimental, round-based HTS process such as SELEX is typically used. By combining Next Generation Sequencing (NGS) with the SELEX process, this design process can be studied at the nucleotide and secondary-structure levels for the first time. A special feature of small RNA molecules, compared to proteins, is that a minimum-free-energy secondary structure (topology) can be determined directly from the nucleotide sequence with a high degree of certainty. Using M. Zuker's algorithm together with NGS and SELEX, it was possible to quantify the structural diversity of individual RNA molecules while taking the genetic context into account. This combination of methods allowed the prediction of the round in which the first ciprofloxacin riboswitch emerged. In this example, only a simple structural comparison (Levenshtein distance) was used to quantify the diversity of each round. To improve on this, a new representation of RNA structure as a directed graph was modeled and compared using a probabilistic subgraph isomorphism. Finally, the NGS dataset (ciprofloxacin riboswitch) was modeled as a dynamic graph and analyzed for the occurrence of defined seven-vertex motifs.
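    To illustrate the conversion into distance-dependent dynamic graphs described above (a minimal sketch under assumed names and cutoff, not the published algorithms), each trajectory frame becomes a contact graph whose edges join particles closer than a fixed cutoff:

```python
import numpy as np
import networkx as nx

def frames_to_dynamic_graph(frames, cutoff=5.0):
    """Convert an MD trajectory into a sequence of contact graphs:
    in each frame, two particles are connected whenever their distance
    falls below `cutoff` (same length unit as the coordinates).

    frames: (T, n, 3) array of particle coordinates over T frames
    """
    graphs = []
    for coords in frames:
        # Pairwise distance matrix for this frame.
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        adj = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
        graphs.append(nx.from_numpy_array(adj.astype(int)))
    return graphs
```

    Per-frame motif statistics on such graphs (e.g., triangle counts) could then feed transition counts into a Markov model, roughly in the spirit of the StreAM-Tg approach named above.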
With this motif analysis, motif-based semantics were integrated into HTS for RNA molecules for the first time. The identified motifs could be assigned to secondary-structure elements that had been identified experimentally in the ciprofloxacin aptamer R10k6. Finally, all the algorithms presented were integrated into an R library, published, and made available to scientists worldwide.
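    For the round-wise diversity quantification via Levenshtein distance mentioned above, a minimal sketch: the mean pairwise edit distance between the dot-bracket secondary structures observed in one SELEX round (the helper names and toy structures are illustrative):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def round_diversity(structures):
    """Mean pairwise edit distance between the dot-bracket structures
    observed in one SELEX round."""
    pairs = [(a, b) for i, a in enumerate(structures)
                    for b in structures[i + 1:]]
    return sum(levenshtein(a, b) for a, b in pairs) / max(1, len(pairs))

# Dot-bracket strings as produced by a minimum-free-energy folder:
print(round_diversity(["((..))", "((...))", "(....)"]))
```

    Tracking this quantity across SELEX rounds gives a simple signal of when the pool collapses onto a dominant structure, which is the kind of round-wise analysis the abstract describes before moving to the subgraph-isomorphism refinement.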