35 research outputs found

    Extracting Hierarchies of Search Tasks & Subtasks via a Bayesian Nonparametric Approach

    A significant number of search queries originate from some real-world information need or task. To improve the search experience of end users, it is important to have accurate representations of tasks. As a result, a significant amount of research has been devoted to extracting proper representations of tasks so that search systems can help users complete their tasks, and can provide better query suggestions, better recommendations, satisfaction prediction, and improved task-based personalization. Most existing task extraction methodologies focus on representing tasks as flat structures. However, tasks often have multiple subtasks associated with them, and a more natural representation would be a hierarchy in which each task can be composed of multiple (sub)tasks. To this end, we propose an efficient Bayesian nonparametric model for extracting hierarchies of such tasks & subtasks. We evaluate our method on real-world query log data through both quantitative and crowdsourced experiments and highlight the importance of considering task/subtask hierarchies. Comment: 10 pages. Accepted at SIGIR 2017 as a full paper.

    Automatic Taxonomy Construction from Keywords via Scalable Bayesian Rose Trees

    In this paper, we study the challenging problem of deriving a taxonomy from a set of keyword phrases. A solution can benefit many real-world applications because i) keywords give users the flexibility and ease to characterize a specific domain; and ii) in many applications, such as online advertisements, the domain of interest is already represented by a set of keywords. However, it is impossible to create a taxonomy out of a keyword set itself. We argue that additional knowledge and context are needed. To this end, we first use a general-purpose knowledge base and keyword search to supply the required knowledge and context. Then we develop a Bayesian approach to build a hierarchical taxonomy for a given set of keywords. We reduce the complexity of previous hierarchical clustering approaches from O(n^2 log n) to O(n log n) using a nearest-neighbor-based approximation, so that we can derive a domain-specific taxonomy from one million keyword phrases in less than an hour. Finally, we conduct comprehensive large-scale experiments to show the effectiveness and efficiency of our approach. A real-life example of building an insurance-related Web search query taxonomy illustrates the usefulness of our approach for specific domains.

    Data-driven hierarchical structures in multi-task learning (Estruturas hierárquicas orientadas por dados em aprendizado multi-tarefa)

    Advisor: Fernando José Von Zuben. Dissertation (master's) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: In multi-task learning, a set of learning tasks is considered simultaneously during the learning process so that performance can be improved by exploiting similarities among the tasks. In a significant number of approaches, such similarities are encoded as additional information within the regularization framework. Although some sort of structure is taken into account by several proposals, such as the existence of task clusters or a graph-based relationship, others have shown that using a properly defined hierarchical structure may lead to competitive results. Focusing on a hierarchical relationship, the extension pursued in this research is based on the idea of learning the structure directly from data, enabling the multi-task methodology to be extended to a wider range of applications. Thus, the hypothesis raised is that obtaining a representative hierarchy-based task relationship from data and using this additional information as a penalty term in the regularization framework would be beneficial, relaxing the need for a domain specialist and improving overall generalization performance. Therefore, the novelty of the data-driven hierarchical approaches proposed in this dissertation for multi-task learning is that information exchange among associated real tasks is promoted by auxiliary hypothetical tasks at the upper nodes, given that the real tasks are not directly connected in the hierarchy. Since the main idea involves obtaining a hierarchical structure, several studies were performed focusing on combining the areas of hierarchical clustering and multi-task learning. Three promising strategies for automatically obtaining hierarchical structures were adapted to the context of multi-task learning. Two of them are Bayesian approaches, and one of those two is characterized by non-binary branching. The possibility of cutting edges in the structure is also investigated, as it is a powerful tool for detecting outlier tasks. Moreover, a general concept called Hierarchical Multi-Task Learning Framework is proposed, which groups individual modules that can be easily extended in future research. Extensive experiments are presented and discussed, showing the potential of employing a hierarchical structure obtained directly from task data within the regularization framework. Both synthetic datasets with known underlying relations among tasks and real-world benchmark datasets from the literature are adopted in the experiments, providing evidence that the proposed framework consistently outperforms well-established multi-task learning strategies. Master's program: Computer Engineering. Degree: Master in Electrical Engineering. CAPE

    Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s gibbs latent dirichlet allocation

    Agglomerative hierarchical clustering is a bottom-up method in which the distances between documents can be obtained by extracting feature values with a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn's Idea. Together, these methods can be used to build better document clusters, but little research discusses this combination. Therefore, in this research, the term weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values are used as the distance between documents, which are then clustered with the single-, complete-, and average-link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little. However, determining the topic of each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding more relevant documents. The use of the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering gives consistent metric values. This clustering method is suggested as a better way to cluster documents so that the clusters are more relevant to the gold standard.
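
    A minimal sketch of two of the steps named in this abstract, under stated assumptions: Luhn's cut-offs are simplified here to plain corpus-frequency thresholds, and the distance matrix is hypothetical hand-written data rather than values derived from latent Dirichlet allocation or the fuzzy Sugeno method. The names `luhn_select` and `agglomerate` are illustrative and do not come from the paper.

```python
from collections import Counter


def luhn_select(docs, lower, upper):
    """Keep terms whose corpus frequency lies within [lower, upper].

    Simplified stand-in for Luhn's upper and lower cut-offs.
    """
    freq = Counter(term for doc in docs for term in doc)
    return {term for term, count in freq.items() if lower <= count <= upper}


def agglomerate(dist, k, linkage="complete"):
    """Merge clusters bottom-up until k remain.

    dist is a symmetric matrix of pairwise document distances.
    linkage is one of "single", "complete", or "average".
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda ds: sum(ds) / len(ds),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = combine([dist[a][b] for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical tokenized documents and pairwise distances.
docs = [["the", "topic", "model"], ["the", "topic", "cluster"], ["the", "rare"]]
selected = luhn_select(docs, lower=2, upper=2)

dist = [
    [0, 1, 9, 10],
    [1, 0, 8, 9],
    [9, 8, 0, 2],
    [10, 9, 2, 0],
]
groups = agglomerate(dist, k=2, linkage="complete")
```

    The naive pair search is quadratic per merge, which is acceptable for a small illustration but not for large document collections.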

    Geographical queries reformulation using a parallel association rules generator to build spatial taxonomies

    Geographical queries need a special reformulation process in information retrieval systems (IRS) because of their specificities and hierarchical structure. Most web search engines ignore this fact. In this paper, we propose an automatic approach for building a spatial taxonomy that models the notion of adjacency, which is then used to reformulate the spatial part of a geographical query. This approach exploits the documents at the top of the list retrieved when submitting a spatial entity, which is composed of a spatial relation and the name of a city. Then, a transactional database is constructed in which each extracted document is a transaction containing the names of the cities that share the country of the submitted query's city. The frequent pattern growth algorithm (FP-growth) is applied to this database in its parallel version (parallel FP-growth: PFP) to generate association rules, which form the country's taxonomy in a Big Data context. Experiments were conducted on Spark, and their results show that query reformulation using the taxonomy constructed with our proposed approach improves the precision and the effectiveness of the IRS.
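
    A minimal sketch of the transaction-to-rules step described in this abstract, assuming brute-force pair counting in place of FP-growth or its parallel Spark version. The city names, thresholds, and the function name `pair_rules` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return sorted rules (antecedent, consequent, support, confidence).

    Only single-item antecedents and consequents are considered, so this
    covers a small subset of what FP-growth would produce.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in both directions.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical transactions: city names found in each top-ranked document.
transactions = [
    {"Rabat", "Sale", "Kenitra"},
    {"Rabat", "Sale"},
    {"Rabat", "Casablanca"},
    {"Sale", "Kenitra"},
]
rules = pair_rules(transactions)
```

    Each resulting rule links two cities that co-occur often enough in the retrieved documents, which is the kind of adjacency signal the abstract describes using to build the taxonomy.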