125,529 research outputs found

    Deep Just-In-Time Inconsistency Detection Between Comments and Source Code

    Full text link
    Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a code base. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes.Comment: Accepted in AAAI 202

    SemAxis: A Lightweight Framework to Characterize Domain-Specific Word Semantics Beyond Sentiment

    Full text link
    Because word semantics can substantially change across communities and contexts, capturing domain-specific word semantics is an important challenge. Here, we propose SEMAXIS, a simple yet powerful framework to characterize word semantics using many semantic axes in word- vector spaces beyond sentiment. We demonstrate that SEMAXIS can capture nuanced semantic representations in multiple online communities. We also show that, when the sentiment axis is examined, SEMAXIS outperforms the state-of-the-art approaches in building domain-specific sentiment lexicons.Comment: Accepted in ACL 2018 as a full pape

    Toxic comment classification using convolutional and recurrent neural networks

    Get PDF
    This thesis aims to provide a reasonable solution for categorizing automatically sentences into types of toxicity using different types of neural networks. There are six types of categories: Toxic, severe toxic, obscene, threat, insult and identity hate. Three different implementations have been studied to accomplish the objective: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) and convolutional neural networks. The thesis is not thought to aim on improving the performance of every individual model but on the comparison between them in terms of natural language processing adequacy. In addition, one differential aspect about this project is the research of LSTM neurons activations and thus the relationship of the words with the final sentence classificatory decision. In conclusion, the three models performed almost equally and the extraction of LSTM activations provided a very accurate and visual understanding of the decisions taken by the network.Esta tesis tiene como objetivo aportar una buena solución para la categorización automática de comentarios abusivos haciendo uso de distintos tipos de redes neuronales. Hay seis categorías: Tóxico, muy tóxico, obsceno, insulto, amenaza y racismo. Se ha hecho una investigación de tres implementaciones para llevar a cabo el objetivo: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) y redes convolucionales. El objetivo de este trabajo no es intentar mejorar al máximo el resultado de la clasificación sino hacer una comparación de los 3 modelos para los mismos parámetros e intentar saber cuál funciona mejor para este caso de procesado de lenguaje. Además, un aspecto diferencial de este proyecto es la investigación sobre las activaciones de las neuronas en el modelo LSTM y su relación con la importancia de las palabras respecto a la clasificación final de la frase. En conclusión, los tres modelos han funcionado de forma casi idéntica y la extracción de las activaciones han proporcionado un conocimiento muy preciso y visual de las decisiones tomadas por la red.Aquesta tesi té com a objectiu aportar una bona solució per categoritzar automàticament comentaris abusius usant diferents tipus de xarxes neuronals. Hi ha sis tipus de categories: Tòxic, molt tòxic, obscè, insult, amenaça i racisme. S'ha fet una recerca de tres implementacions per dur a terme l'objectiu: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) i xarxes convolucionals. L'objectiu d'aquest treball no és intentar millorar al màxim els resultats de classificació sinó fer una comparació dels 3 models pels mateixos paràmetres per tal d'esbrinar quin funciona millor en aquest cas de processat de llenguatge. A més, un aspecte diferencial d'aquest projecte és la recerca sobre les activacions de les neurones al model LSTM i la seva relació amb la importància de les paraules respecte la classificació final de la frase. En conclusió, els tres models han funcionat gairebé idènticament i l'extracció de les activacions van proporcionar un enteniment molt acurat i visual de les decisions preses per la xarxa

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
    corecore