1,252 research outputs found

    The big five: Discovering linguistic characteristics that typify distinct personality traits across Yahoo! answers members

    Get PDF
    Indexing: Scopus. This work was partially supported by the FONDECYT project “Bridging the Gap between Askers and Answers in Community Question Answering Services” (11130094), funded by the Chilean Government. In psychology, it is widely believed that five big factors determine the different personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness. In recent years, researchers have started to examine how these factors are manifested across social networks such as Facebook and Twitter. However, to the best of our knowledge, other kinds of social networks, such as social/informational question-answering communities (e.g., Yahoo! Answers), have been left unexplored. This work therefore explores several predictive models to automatically recognize these factors across Yahoo! Answers members. As a means of devising powerful generalizations, these models were combined with assorted linguistic features. Since we could not ask community members to volunteer to take the personality test, we built a study corpus by conducting a discourse analysis based on deconstructing the test into 112 adjectives. Our results reveal that it is plausible to lessen the dependency upon answered tests, and that the most effective models differ sharply across factors. Also, sentiment analysis and dependency parsing proved fundamental for dealing with extraversion, agreeableness and conscientiousness. Furthermore, medium and low levels of neuroticism were found to be related to initial stages of depression and anxiety disorders. © 2018 Lithuanian Institute of Philosophy and Sociology. All rights reserved. https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/275
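The corpus-construction idea described above, deconstructing the personality test into adjectives and scoring members' text against them, can be sketched as a simple word-count pass. The three short adjective lists below are illustrative stand-ins for the 112 test adjectives, not the paper's actual lists.

```python
# Toy adjective-based trait scoring; the adjective lists are invented
# stand-ins for the 112 adjectives the paper deconstructs the test into.
from collections import Counter

TRAIT_ADJECTIVES = {
    "extraversion": {"talkative", "outgoing", "energetic"},
    "agreeableness": {"kind", "helpful", "warm"},
    "neuroticism": {"anxious", "moody", "tense"},
}

def trait_scores(text):
    """Count how many adjectives from each trait list occur in the text."""
    tokens = Counter(text.lower().split())
    return {trait: sum(tokens[adj] for adj in adjs)
            for trait, adjs in TRAIT_ADJECTIVES.items()}

answer = "I am an outgoing and energetic person, maybe a bit anxious sometimes"
```

Scores of this kind can then serve as weak labels for training predictive models without administering the full test.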

    A Maximum-Entropy approach for accurate document annotation in the biomedical domain

    Get PDF
    The increasing amount of scientific literature on the Web and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of search and the quality of the results. Previous studies have shown that the usage of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and takes one step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH) vocabulary.
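As a rough illustration of the Maximum-Entropy idea (which, for a single binary label, reduces to logistic regression over bag-of-words features), the sketch below trains a tiny annotator that decides whether a document should receive one hypothetical MeSH-like label. The mini-corpus and the label split are invented for illustration, not the paper's data or pipeline.

```python
# Minimal Maximum-Entropy (binary logistic regression) annotator over
# bag-of-words features, trained by plain gradient ascent.
import math
from collections import defaultdict

def featurize(text):
    feats = defaultdict(int)
    for tok in text.lower().split():
        feats[tok] += 1
    feats["__bias__"] = 1
    return feats

def train_maxent(docs, labels, epochs=200, lr=0.5):
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in zip(docs, labels):
            z = sum(w[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))      # P(label=1 | doc)
            for f, v in feats.items():          # gradient step on log-likelihood
                w[f] += lr * (y - p) * v
    return w

def predict(w, text):
    z = sum(w[f] * v for f, v in featurize(text).items())
    return 1 if z > 0 else 0

# Invented toy corpus: label 1 = hypothetical "Neoplasms"-like concept.
corpus = [
    ("tumor growth observed in tissue sample", 1),
    ("malignant tumor cells detected", 1),
    ("patient recovered after influenza infection", 0),
    ("viral infection treated with antivirals", 0),
]
docs = [featurize(t) for t, _ in corpus]
weights = train_maxent(docs, [y for _, y in corpus])
```

In the multi-label MeSH setting, one such classifier would be trained per concept; the binary case above only shows the core estimation step.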

    Authorship identification of translation algorithms.

    Get PDF
    Authorship analysis is the process of identifying the true writer of a given document and has been studied for decades. However, only a handful of studies on authorship analysis of translators are available, despite the fact that online translations are widely available and popularly employed for automatic translation of posts in social networking services. The identification of translation algorithms has potential to contribute to the investigation of cybercrimes in which scam messages are machine-translated to reach speakers of foreign languages. This study tested the bag-of-words (BOW) approach to authorship attribution and the existing approaches to translator attribution. We also proposed a simple but accurate feature that extracts combinations of lexical and syntactic information from texts. Our experiments show that the proposed feature is invariant to text size.
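A common baseline form of the BOW approach the abstract tests is nearest-profile attribution: build a word-frequency profile per candidate and assign an unseen text to the closest profile by cosine similarity. The two candidate "systems" and their texts below are invented stand-ins for outputs of two translation algorithms, not the study's data.

```python
# Bag-of-words translator attribution by cosine similarity to
# per-candidate frequency profiles (illustrative data).
import math
from collections import Counter

def profile(texts):
    """Aggregate a word-frequency profile from a candidate's known outputs."""
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(text, profiles):
    """Attribute a text to the candidate with the closest BOW profile."""
    bow = Counter(text.lower().split())
    return max(profiles, key=lambda name: cosine(bow, profiles[name]))

profiles = {
    "system_a": profile(["the meeting will be held tomorrow",
                         "the report will be sent soon"]),
    "system_b": profile(["meeting is happening on tomorrow",
                         "report is sending to you now"]),
}
```

The study's proposed feature additionally folds in syntactic information; the sketch only shows the lexical-profile baseline it is compared against.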

    Customer Behavior Analysis for Social Media

    Full text link
    It is essential for a business organization to get customer feedback in order to grow as a company. Business organizations collect customer feedback using various methods, but the question is: are those methods efficient and effective? The current market is customer-oriented, and all business organizations are competing to achieve customer delight through their products and services. Social media plays a huge role in people's lives, and customers tend to reveal their true opinion about brands on social media rather than giving routine feedback to producers or sellers. For this reason, social media can be used as a tool to analyze customer behavior. If relevant data can be gathered from customers' social media feeds and analyzed properly, companies can gain a clear idea of what customers really think about their brand.

    Aportaciones de las técnicas de aprendizaje automático a la clasificación de partes de alta hospitalarios reales en castellano

    Get PDF
    Hospitals attached to the Spanish Ministry of Health currently use the International Classification of Diseases 9 Clinical Modification (ICD9-CM) to classify health discharge records. Nowadays, this work is done manually by experts. This paper tackles the automatic classification of real discharge records in Spanish following the ICD9-CM standard. The challenge is that the discharge records are written in spontaneous language. We explore several machine learning techniques to deal with the classification problem. Random Forest proved the most competitive, achieving an F-measure of 0.876. This work was partially supported by the European Commission (SEP-210087649), the Spanish Ministry of Science and Innovation (TIN2012-38584-C06-02) and the Industry of the Basque Government (IT344-10).
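To make the Random Forest idea concrete in this bag-of-words setting, the toy sketch below builds an ensemble of one-word decision stumps, each trained on a bootstrap sample with a random feature subset, voting on an ICD-like label. Real Random Forests use full decision trees; the records and the two labels here are invented for illustration, not the paper's data.

```python
# Toy "forest" of word-presence stumps with bagging and random
# feature subsets, voting on an invented discharge-record label.
import random
from collections import Counter

def stump(train, vocab_sample):
    """Pick the single word in vocab_sample that best splits the labels."""
    best, best_word = -1, vocab_sample[0]
    for w in vocab_sample:
        right = [y for x, y in train if w in x]
        left = [y for x, y in train if w not in x]
        score = sum(max(Counter(side).values()) for side in (left, right) if side)
        if score > best:
            best, best_word = score, w
    right = [y for x, y in train if best_word in x] or [y for _, y in train]
    left = [y for x, y in train if best_word not in x] or [y for _, y in train]
    return (best_word,
            Counter(right).most_common(1)[0][0],   # label when word present
            Counter(left).most_common(1)[0][0])    # label when word absent

def train_forest(data, n_trees=51, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for x, _ in data for w in x})
    forest = []
    for _ in range(n_trees):
        boot = [data[rng.randrange(len(data))] for _ in data]  # bootstrap sample
        feats = rng.sample(vocab, k=max(1, len(vocab) // 2))   # random feature subset
        forest.append(stump(boot, feats))
    return forest

def predict(forest, words):
    votes = Counter(yes if w in words else no for w, yes, no in forest)
    return votes.most_common(1)[0][0]

# Invented token sets standing in for Spanish discharge records.
data = [
    ({"fractura", "femur", "caida"}, "injury"),
    ({"fractura", "costilla"}, "injury"),
    ({"neumonia", "fiebre", "tos"}, "respiratory"),
    ({"bronquitis", "tos"}, "respiratory"),
]
forest = train_forest(data)
```

Bagging plus random feature subsets is what gives the ensemble its variance reduction; each individual stump is a weak learner.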

    Towards Automated Recipe Genre Classification using Semi-Supervised Learning

    Full text link
    Sharing cooking recipes is a great way to exchange culinary ideas and provide instructions for food preparation. However, categorizing raw recipes found online into appropriate food genres can be challenging due to a lack of adequate labeled data. In this study, we present a dataset named the "Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Recipe Dataset" that contains two million culinary recipes labeled in respective categories, with extended named entities extracted from recipe descriptions. This collection includes features such as title, NER, directions, and extended NER, as well as nine labels representing genres: bakery, drinks, non-veg, vegetables, fast food, cereals, meals, sides, and fusions. The proposed 3A2M+ pipeline extends the size of the Named Entity Recognition (NER) list to address missing named entities like heat, time or process from the recipe directions using two NER extraction tools. The 3A2M+ dataset provides a comprehensive resource for various challenging recipe-related tasks, including classification, named entity recognition, and recipe generation. Furthermore, we have applied traditional machine learning, deep learning and pre-trained language models to classify the recipes into their corresponding genres, achieving an overall accuracy of 98.6%. Our investigation indicates that the title feature played the most significant role in classifying the genre.
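The NER-extension step, recovering entities like heat and time that a base recipe NER list misses, can be approximated with simple surface patterns over the directions text. The regexes and the example direction below are my own illustration, not the two NER tools the 3A2M+ pipeline actually uses.

```python
# Illustrative regex pass pulling TIME and HEAT mentions out of
# free-text recipe directions (patterns are not the 3A2M+ toolchain).
import re

TIME_RE = re.compile(r"\b(\d+)\s*(minutes?|mins?|hours?|hrs?)\b", re.I)
HEAT_RE = re.compile(r"\b\d+\s*(?:degrees?|°)\s*[CF]\b|\b(?:low|medium|high)\s+heat\b", re.I)

def extend_ner(direction):
    """Return (entity_type, surface_text) pairs found in a direction."""
    ents = [("TIME", m.group(0)) for m in TIME_RE.finditer(direction)]
    ents += [("HEAT", m.group(0)) for m in HEAT_RE.finditer(direction)]
    return ents

direction = "Bake at 350 degrees F for 25 minutes, then simmer over low heat for 2 hours."
```

A production extractor would normalize units and handle ranges ("20-25 minutes"); the sketch only shows why such entities are recoverable from directions at all.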

    Deep learning in medical document classification

    Get PDF
    Text-based data is produced at an ever-growing rate each year, which has in turn increased the need for automatic text processing. Thus, to keep up with the amount of data, automatic natural language processing techniques have been increasingly researched and developed, especially in the last decade. This has led to substantial improvements in various natural language processing tasks such as classification, translation and information retrieval. A major breakthrough has been the utilization of deep neural networks and massive amounts of data to train them. Using such methods in areas where time is valuable, such as the medical field, could provide considerable value. In this thesis, an overview is given of natural language processing with respect to deep learning and text classification. Additionally, a dataset of medical reports in Finnish was preprocessed and used to train and evaluate a number of text classifiers for diagnosis code prediction, in order to assess the feasibility of such methods for medical text classification. The chosen methods include the deep learning-based FinBERT, ULMFiT and ELECTRA, and a simpler linear baseline classifier, fastText. The results show that with a limited dataset, linear methods like fastText work surprisingly well. Deep learning-based methods, on the other hand, seem to work reasonably well and show a lot of potential, especially in utilizing larger amounts of training data. To determine the full potential of such methods, further investigation is required with different datasets and classification tasks.
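Part of what makes a fastText-style linear baseline so cheap is hashing words and word-bigrams into a fixed number of buckets and training a single linear layer on the bucket counts. The sketch below captures that hashing trick with a multiclass perceptron in place of fastText's softmax layer; the diagnosis-flavored snippets and labels are invented, not the thesis's Finnish dataset.

```python
# fastText-style featurization: words + word-bigrams hashed into a
# fixed bucket space, classified by a simple multiclass perceptron.
import zlib
from collections import defaultdict

BUCKETS = 1 << 16  # fixed hash space, echoing fastText's n-gram hashing

def hashed_features(text):
    toks = text.lower().split()
    grams = toks + [a + "_" + b for a, b in zip(toks, toks[1:])]
    feats = defaultdict(int)
    for g in grams:
        feats[zlib.crc32(g.encode()) % BUCKETS] += 1
    return feats

def train(data, labels, epochs=20):
    w = {lab: defaultdict(float) for lab in labels}
    for _ in range(epochs):
        for text, y in data:
            feats = hashed_features(text)
            pred = max(w, key=lambda lab: sum(w[lab][f] * v for f, v in feats.items()))
            if pred != y:  # perceptron: update weights only on mistakes
                for f, v in feats.items():
                    w[y][f] += v
                    w[pred][f] -= v
    return w

def classify(w, text):
    feats = hashed_features(text)
    return max(w, key=lambda lab: sum(w[lab][f] * v for f, v in feats.items()))

# Invented English stand-ins for diagnosis-coded report snippets.
data = [
    ("chest pain shortness of breath", "cardiac"),
    ("irregular heart rhythm detected", "cardiac"),
    ("persistent cough and wheezing", "pulmonary"),
    ("asthma attack with wheezing", "pulmonary"),
]
model = train(data, ["cardiac", "pulmonary"])
```

Because the model never stores a vocabulary, memory stays constant regardless of corpus size, which is one reason such linear baselines remain competitive on limited data.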