61 research outputs found

    Improving the Community Question Retrieval Performance Using Attention-based Siamese LSTM

    In this paper, we focus on the problem of question retrieval in community Question Answering (cQA), which aims to retrieve from the community archives the previous questions that are semantically equivalent to new queries. The major challenges in this task are the shortness of the questions and the word mismatch problem, since users can formulate the same query using different wording. While numerous attempts have been made to address this problem, most existing methods rely on supervised models that depend heavily on large training data sets and manual feature engineering. Such methods are also limited in that they disregard word order and ignore syntactic and semantic relationships. In this work, we rely on Neural Networks (NNs), which can learn rich dense representations of text data and enable the prediction of the textual similarity between community questions. We propose a deep learning approach based on a Siamese architecture with LSTM networks, augmented with an attention mechanism. We test different similarity measures to predict the semantic similarity between the community questions. Experiments conducted on real cQA data sets in English and Arabic show that question retrieval performance improves compared to other competitive methods.
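As a rough illustration of the two ingredients named in the abstract above (attention over recurrent hidden states, and a choice of similarity measure between the two encoded questions), here is a minimal pure-Python sketch. It is not the authors' model: the hidden states stand in for LSTM outputs, the `context` vector is a hypothetical learned parameter, and cosine and Manhattan-based similarity are simply two common candidates, since the abstract does not say which measures were tested.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Collapse a sequence of hidden states into one vector.

    Each state is scored by its dot product with `context` (an assumed,
    hypothetical learned vector); the scores are normalised with softmax
    and used as weights in a weighted sum of the states.
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def manhattan_similarity(u, v):
    """exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge."""
    return math.exp(-sum(abs(a - b) for a, b in zip(u, v)))
```

In a Siamese setup, the same encoder (and the same `context` vector) would be applied to both questions, and the chosen similarity function would be applied to the two pooled vectors.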

    Technological troubleshooting based on sentence embedding with deep transformers

    In today's manufacturing, each technical assistance operation is digitally tracked. This results in a huge amount of textual data that can be exploited as a knowledge base to improve these operations. For instance, an ongoing problem can be addressed by retrieving potential solutions among those used to cope with similar problems during past operations. To be effective, most approaches to semantic textual similarity need to be supported by a structured semantic context (e.g. an industry-specific ontology), resulting in high development and management costs. We overcome this limitation with a textual similarity approach featuring three functional modules. The data preparation module provides punctuation and stop-word removal, and word lemmatization. The pre-processed sentences are passed to the sentence embedding module, which is based on Sentence-BERT (Bidirectional Encoder Representations from Transformers) and transforms the sentences into fixed-length vectors. Their cosine similarity is processed by the scoring module to match the expected similarity between the two original sentences. Finally, this similarity measure is employed to retrieve the most suitable recorded solutions for the ongoing problem. The effectiveness of the proposed approach is tested (i) against a state-of-the-art competitor and two well-known textual similarity approaches, and (ii) with two case studies, i.e. private company technical assistance reports and a benchmark dataset for semantic textual similarity. With respect to the state of the art, the proposed approach results in comparable retrieval performance and significantly lower management cost: 30-minute questionnaires are sufficient to obtain the semantic context knowledge to be injected into our textual search engine.
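The three-module pipeline described above (data preparation, sentence embedding, cosine-based scoring for retrieval) can be sketched in a few lines of standard-library Python. This is only a shape-of-the-pipeline sketch under stated assumptions: a bag-of-words count vector stands in for Sentence-BERT, lemmatization is omitted, and the stop-word list, the function names, and the example records are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "on"}


def preprocess(sentence):
    """Data preparation: lowercase, strip punctuation, drop stop words.
    Lemmatization, mentioned in the abstract, is left out of this sketch."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def embed(tokens, vocabulary):
    """Stand-in for the sentence embedding module: a fixed-length count
    vector over `vocabulary` (NOT Sentence-BERT)."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, records, top_k=1):
    """Return the top_k (score, solution) pairs for `query`.
    `records` is a list of (past_problem, recorded_solution) strings."""
    processed = [preprocess(problem) for problem, _ in records]
    query_tokens = preprocess(query)
    vocabulary = sorted(set(query_tokens).union(*processed))
    query_vec = embed(query_tokens, vocabulary)
    scored = [
        (cosine(query_vec, embed(tokens, vocabulary)), solution)
        for tokens, (_, solution) in zip(processed, records)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Swapping `embed` for a real sentence encoder would leave `retrieve` unchanged, which is the main point of keeping the modules separate.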

    Adversarial Domain Adaptation for Duplicate Question Detection

    We address the problem of detecting duplicate questions in forums, which is an important step towards automating the process of answering new questions. As finding and annotating such potential duplicates manually is very tedious and costly, automatic methods based on machine learning are a viable alternative. However, many forums do not have annotated data, i.e., questions labeled by experts as duplicates, and thus a promising solution is to use domain adaptation from another forum that has such annotations. Here we focus on adversarial domain adaptation, deriving important findings about when it performs well and what properties of the domains are important in this regard. Our experiments with StackExchange data show an average improvement of 5.6% over the best baseline across multiple pairs of domains. Comment: EMNLP 2018 short paper - camera ready. 8 pages.

    Semantic textual similarity with siamese neural networks

    Calculating Semantic Textual Similarity (STS) is an important research area in natural language processing that plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural network, which are used here to measure STS. Several variants of the architecture are compared with existing methods.

    Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study

    Two questions are semantically duplicate if precisely the same answer can satisfy both. Identifying semantically identical questions on Question and Answering (Q&A) social media platforms like Quora is important for ensuring the quality and quantity of content presented to users, based on the intent of the question, and thus for enriching the overall user experience. Detecting duplicate questions is challenging because natural language is very expressive, and a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to have achieved superior results over traditional natural language processing techniques in identifying similar texts. In this thesis, taking Quora as our case study, we explored and applied different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's question pair dataset. By using feature engineering, feature importance techniques, and experiments with seven selected machine learning classifiers, we demonstrated that our models outperformed a few of the previous studies on this task. An XGBoost model, when fed with character-level term frequency and inverse term frequency features, achieved superior results to the other machine learning models and also outperformed a few of the deep learning baseline models. We applied deep learning techniques to model four different deep neural networks of multiple layers consisting of GloVe embeddings, Long Short-Term Memory, convolution, max pooling, dense, batch normalization, activation functions, and model merge. Our deep learning models achieved better accuracy than the machine learning models. Three out of four proposed architectures outperformed the accuracy of previous machine learning and deep learning research work, two out of four models outperformed the accuracy of a previous deep learning study on Quora's question pair dataset, and our best model achieved an accuracy of 85.82%, which is close to Quora's state-of-the-art accuracy.
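To make the character-level features mentioned in the abstract above more concrete, the following sketch computes character n-gram TF-IDF vectors for questions and compares a pair by cosine similarity. It rests on assumptions: "character-level term frequency and inverse term frequency" is read here as TF-IDF over character trigrams, the smoothing formula and the function names are illustrative choices, and the classifier that would consume such features (XGBoost in the thesis) is not shown.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Smoothed inverse document frequency per n-gram (assumed formula:
    log((1 + N) / (1 + df)) + 1, where N is the number of questions)."""
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(char_ngrams(text, n)))
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf(text, idf, n=3):
    """Sparse TF-IDF vector (dict) for one question; unseen n-grams are skipped."""
    counts = Counter(char_ngrams(text, n))
    return {g: c * idf[g] for g, c in counts.items() if g in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A pair-level feature such as `cosine(tfidf(q1, idf), tfidf(q2, idf))`, or the two sparse vectors themselves, could then be passed to a downstream classifier.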

    Social Search: retrieving information in Online Social Platforms -- A Survey

    Social Search research deals with studying methodologies exploiting social information to better satisfy user information needs in Online Social Media while simplifying the search effort and consequently reducing the time spent and the computational resources utilized. Starting from previous studies, in this work, we analyze the current state of the art of the Social Search area, proposing a new taxonomy and highlighting current limitations and open research directions. We divide the Social Search area into three subcategories, where the social aspect plays a pivotal role: Social Question&Answering, Social Content Search, and Social Collaborative Search. For each subcategory, we present the key concepts and selected representative approaches in the literature in greater detail. We found that, up to now, a large body of studies model users' preferences and their relations by simply combining social features made available by social platforms. This paves the way for significant research to exploit more structured information about users' social profiles and behaviors (as they can be inferred from data available on social platforms) to further optimize the satisfaction of their information needs.