38 research outputs found

    Testing word embeddings for Polish

    Distributional semantics postulates that word meaning can be represented by numeric vectors derived from the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of lemma-based and form-based models created with the Continuous Bag of Words (CBOW) and skip-gram approaches on different Polish corpora. The comparison uses the results of two typical tasks solved with distributional semantics: synonymy recognition and analogy recognition. The results show that no single universal approach to vector creation can be identified that works across tasks. The most important factor is the quality and size of the data, but different strategy choices can also lead to significantly different results.
    Testowanie wektorowych reprezentacji dystrybucyjnych słów języka polskiego [Testing distributional vector representations of Polish words] Distributional semantics rests on the assumption that the meaning of words is expressed by vectors that represent, directly or indirectly, the contexts in which a word is used in a large collection of texts. This article concerns the evaluation of many such models constructed for Polish. The paper compares the effectiveness of models based on lemmas and on word forms, created with neural networks on data from two different Polish corpora. The evaluation is based on the results of two typical tasks solved with distributional semantics methods, namely recognizing synonymy and analogy between specific pairs of words. The results show that no single universal approach to building distributional models can be identified, since their effectiveness varies with the application. The most important factor affecting model quality is the quality and size of the data, but the choice of network training strategy can also lead to significantly different results.

    Wspomnienie o Doktor Czesławie Leszczyk (1932–2016)


    Simultaneous Reconstruction of Duplication Episodes and Gene-Species Mappings

    We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of gene trees with missing labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial-time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.

    Conflict Resolution Algorithms for Deep Coalescence Phylogenetic Networks

    We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence cost. We propose an O(|G||N|)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where |G| and |N| are the sizes of G and N, respectively. This algorithm can state whether the cost is exact or is a lower bound. In addition, our algorithm provides a set of reticulation edges that correspond to the obtained cost. If the cost is exact, the set induces an optimal displayed tree that yields the cost. If the cost is a lower bound, the set contains pairs of conflicting edges, that is, edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires 2^{r+1}-1 invocations of DP in the worst case, where r is the number of reticulations. We propose a similar O(2^k|G||N|)-time algorithm for level-k networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also show how our algorithms can be extended to a broader class of phylogenetic networks. Despite their exponential complexity in the worst case, our solutions perform well on empirical and simulated datasets, thanks to the strategy of resolving internal dissimilarities between gene trees and networks. In particular, experiments on simulated data indicate that the runtime of our solution is Θ(2^{0.543 k}|G||N|) on average. Therefore, our solution is an efficient alternative to enumeration strategies commonly proposed in the literature and enables analyses of complex networks with dozens of reticulations.

    A word from the editors

    (no abstract)

    Clustering of human liver proteins according to their taxonomic profiles

    No full text
    liver protein clusters