14 research outputs found

    Utilisation de PLSI en recherche d'information (Using PLSI in Information Retrieval)

    Get PDF
    The PLSI model ("Probabilistic Latent Semantic Indexing") offers a document indexing scheme based on probabilistic latent category models. It has led to applications in diverse fields, notably in information retrieval (IR). Nevertheless, PLSI cannot process documents not seen during parameter inference, a major liability for queries in IR. A method known as "folding-in" allows one to circumvent this problem up to a point, but has its own weaknesses. The present paper introduces a new document-query similarity measure for PLSI based on language models that entirely avoids the problem of query projection. We compare this similarity to Fisher kernels, the state-of-the-art similarities for PLSI. Moreover, we present an evaluation of PLSI on a particularly large training set of almost 7,500 documents and over one million term occurrences, created from the TREC-AP collection.
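    The central idea is easier to see in code. Below is a minimal sketch, not the paper's exact similarity measure: it scores a query by the likelihood of its terms under a PLSI document language model P(w|d) = sum_z P(w|z) P(z|d), smoothed with a background unigram model, so the query never needs to be folded into the latent space. The toy parameters stand in for what PLSI's EM training would produce.

```python
import numpy as np

# Minimal sketch (not the paper's exact measure): score a query by the
# log-likelihood of its terms under a PLSI document language model
#   P(w|d) = sum_z P(w|z) P(z|d),
# smoothed with a background unigram model (Jelinek-Mercer style), so the
# query never has to be projected into the latent space.

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 1000, 16, 50

# Toy stand-ins for the parameters that PLSI's EM training would produce.
p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics).T   # (words, topics)
p_z_given_d = rng.dirichlet(np.ones(n_topics), size=n_docs).T    # (topics, docs)
p_w_background = rng.dirichlet(np.ones(n_words))                 # corpus unigram model

def query_log_likelihood(query_word_ids, doc_id, lam=0.7):
    """Log P(query | doc) under the smoothed PLSI document model."""
    p_w_given_d = p_w_given_z @ p_z_given_d[:, doc_id]            # (words,)
    mixed = lam * p_w_given_d + (1.0 - lam) * p_w_background
    return float(np.sum(np.log(mixed[query_word_ids])))

# Rank documents for a toy query consisting of word ids 3, 17 and 42.
query = np.array([3, 17, 42])
scores = [query_log_likelihood(query, d) for d in range(n_docs)]
print("best document:", int(np.argmax(scores)))
```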

    Variational Gaussian Inference for Bilinear Models of Count Data

    Get PDF
    Bilinear models of count data with Poisson distribution are popular in applications such as matrix factorization for recommendation systems, modeling of receptive fields of sensory neurons, and modeling of neural-spike trains. Bayesian inference in such models remains challenging due to the product term of two Gaussian random vectors. In this paper, we propose new algorithms for such models based on variational Gaussian (VG) inference. We make two contributions. First, we show that the VG lower bound for these models, previously known to be intractable, is available in closed form under certain non-trivial constraints on the form of the posterior. Second, we show that the lower bound is biconcave and can be efficiently optimized for mean-field approximations. We also show that biconcavity generalizes to the larger family of log-concave likelihoods, which subsumes the Poisson distribution. We present new inference algorithms based on these results and demonstrate better performance on real-world problems at the cost of a modest increase in computation. Our contributions therefore provide more choices for Bayesian inference in terms of a speed-vs-accuracy tradeoff.
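    As a rough illustration of why VG inference hinges on closed-form expectations of exponentials, the sketch below evaluates a VG expected log-likelihood for the much simpler linear Poisson model, using E_q[exp(x^T w)] = exp(x^T m + 0.5 x^T S x) under q(w) = N(m, S). The paper's actual setting, a bilinear product of two Gaussian vectors, is harder and is not reproduced here; the data and posterior below are toy stand-ins.

```python
import numpy as np
from scipy.special import gammaln

# Sketch of a variational-Gaussian expected log-likelihood for the simpler
# *linear* Poisson model y ~ Poisson(exp(x^T w)) with q(w) = N(m, S).
# The identity E_q[exp(x^T w)] = exp(x^T m + 0.5 * x^T S x) makes the
# expectation tractable; the paper's bilinear case is not shown here.

def vg_expected_loglik(y, X, m, S):
    """E_q[ log p(y | X, w) ] for Poisson regression under q(w) = N(m, S)."""
    eta_mean = X @ m                                   # E[x^T w] per row
    eta_var = np.einsum('ij,jk,ik->i', X, S, X)        # Var[x^T w] per row
    expected_rate = np.exp(eta_mean + 0.5 * eta_var)   # E[exp(x^T w)]
    return float(np.sum(y * eta_mean - expected_rate - gammaln(y + 1.0)))

# Toy usage with made-up data and a small Gaussian posterior.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.poisson(np.exp(X @ np.array([0.3, -0.2, 0.1, 0.0, 0.4])))
m = np.zeros(5)
S = 0.1 * np.eye(5)
print("expected log-likelihood:", vg_expected_loglik(y, X, m, S))
```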

    The Bayesian Learning Rule

    Full text link
    We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms, and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
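    As a hedged illustration of how the rule specializes, the sketch below applies it to a diagonal Gaussian candidate with the delta approximation E_q[loss(theta)] ~ loss(m), under which it reduces to an online Newton-like update, one of the classical algorithms the paper recovers. The quadratic toy loss is ours, purely for illustration.

```python
import numpy as np

# Sketch: Bayesian learning rule with a diagonal Gaussian candidate
# q(theta) = N(m, diag(1/s)) and the delta approximation E_q[loss] ~ loss(m).
# Under these simplifications the natural-gradient update becomes an
# online Newton-like rule; the toy loss below is a stand-in.

def toy_loss_grad_hess(theta):
    # loss(theta) = 0.5 * ||theta - target||^2: gradient and diagonal Hessian
    target = np.array([1.0, -2.0, 0.5])
    return theta - target, np.ones_like(theta)

m = np.zeros(3)          # mean of the Gaussian candidate
s = np.ones(3)           # precision (inverse variance) of the candidate
rho = 0.1                # step size of the natural-gradient update

for _ in range(200):
    g, h = toy_loss_grad_hess(m)
    s = (1.0 - rho) * s + rho * h     # precision tracks the (diagonal) Hessian
    m = m - rho * g / s               # preconditioned, Newton-like mean update

print("posterior mean:", m)           # approaches the minimizer [1, -2, 0.5]
```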

    Latentna semantička analiza, varijante i primjene (Latent Semantic Analysis, Variants and Applications)

    Get PDF
    Nowadays, it is increasingly important to make a computer perform tasks that humans do routinely, just as fast and efficiently. One such task is finding the few documents in a collection that are most relevant to a user's query. The first step in solving this problem is representing the collection of documents as a term-document matrix whose elements are the tf-idf weights of words in the documents. In this way, each document is represented as a vector in the space of terms. If the query is represented as a vector as well, standard similarity measures, such as cosine similarity, can be used to compare the query with the documents. In such a space, synonyms will be orthogonal, and polysemous words will be represented by a single vector regardless of the context in which they appear. Motivated by this fact, and by the large dimension of the term-document matrix, we approximate it with a matrix of lower rank. The approximation is obtained using the singular value decomposition (SVD) of the matrix. We show that the approximation takes the context of the words into account. The query must also be transformed into the new space so that it can be compared with the document vectors in this lower-dimensional space. We show how, in the case of a dynamic collection, new documents and terms can be added to the existing latent space. While this method, known for short as LSA, solves the problem of synonyms to some extent, the problem of polysemy remains. In addition, LSA assumes that the noise in the data (arising from language variability) has a Gaussian distribution, which is not a natural assumption. The next method, pLSA, assumes that each document comes from a generative, probabilistic process whose parameters we seek by maximizing the likelihood. Each document is a mixture of latent concepts, and we look for the posterior probabilities of these concepts given the observations. However, pLSA treats these probabilities as model parameters, which leads to over-fitting. We therefore present another model, LDA, which treats these probabilities as a distribution governed by a parameter. Like pLSA, LDA represents documents as a mixture of latent topics, but the topics are now distributions over the words of the dictionary. It is therefore necessary to define a distribution over distributions, for which the Dirichlet distribution is a natural choice. Finally, we briefly present topic modeling on a collection of Wikipedia articles.
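    The LSA retrieval pipeline described above can be summarized in a short sketch: tf-idf weighting, a low-rank approximation via truncated SVD, projection of the query into the same latent space, and cosine similarity for ranking. The toy corpus and query below are ours, purely for illustration; this is a generic sketch, not the thesis's exact experimental setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of the LSA pipeline: tf-idf matrix, truncated SVD, query folded
# into the latent space, cosine similarity for ranking. Toy corpus only.

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a recipe for apple pie and cake",
    "baking a cake with apples",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # document-term tf-idf matrix

svd = TruncatedSVD(n_components=2, random_state=0)
docs_latent = svd.fit_transform(X)            # documents in the latent space

query = ["apple cake recipe"]
query_latent = svd.transform(vectorizer.transform(query))  # fold the query in

scores = cosine_similarity(query_latent, docs_latent)[0]
print("ranking (best first):", np.argsort(-scores))
```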
