7 research outputs found

    A System for Determining Text Authorship

    No full text
    A new, effective system for identifying and verifying the authorship of a document has been developed, built on machine learning. The originality of the model lies in the proposed unique profile of author features, which, combined with the Support Vector Machine (SVM) method, yields high accuracy
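The abstract does not specify the feature profile or the SVM configuration, so the following is only a hypothetical sketch of the general idea: build a small stylometric profile per text and compare profiles. A simple cosine-similarity check stands in for the trained SVM classifier; the function-word list and the threshold are illustrative assumptions.

```python
from collections import Counter
import math

FUNCTION_WORDS = ["the", "of", "and", "to", "in"]  # hypothetical subset

def author_profile(text):
    """Build a small stylometric feature vector: average word length
    plus relative frequencies of a few function words."""
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    avg_len = sum(len(w) for w in words) / n
    return [avg_len] + [counts[w] / n for w in FUNCTION_WORDS]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def verify(candidate_text, known_text, threshold=0.95):
    """Accept the authorship claim when the two profiles are close enough
    (a trained SVM would replace this thresholded similarity in practice)."""
    return cosine(author_profile(candidate_text),
                  author_profile(known_text)) >= threshold
```

In a real system the profile would hold many more features and the decision boundary would be learned from labelled samples rather than fixed by a threshold.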

    Authorship Authentication of Short Messages from Social Networks Machines

    Get PDF
    The dataset consists of 17,000 tweets collected from Twitter: 500 tweets for each of 34 authors who meet certain criteria. The raw data were collected using the NVivo software and preprocessed to extract the frequencies of 200 features; 128 of these features were eliminated because they are rare in tweets. In a progressive series of experiments, five, fifteen, twenty, twenty-five, thirty, and thirty-four of these authors were selected each time. Since recurrent artificial neural networks are more stable, and ANNs in general are more successful at distinguishing two classes, N×N neural networks were trained for pairwise classification of N authors. These experts were then organised into N competing teams (CANNT) to aggregate the decisions of the N×N experts. The procedure was repeated seven times, and committees of seven members voted on the final decision. With most-common-vote aggregation, accuracy was boosted by around ten percent. The number of authors was found to have little effect on authentication accuracy, and around 80% accuracy was achieved for any number of authors
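The pairwise-expert-plus-voting scheme can be sketched abstractly. The trained neural experts are not reproducible from the abstract, so a generic scoring function stands in for each expert here; only the aggregation logic (one expert per author pair, majority vote over their decisions) follows the description above.

```python
from collections import Counter
from itertools import combinations

def majority_vote(votes):
    """Return the most common label among the committee's votes."""
    return Counter(votes).most_common(1)[0][0]

def authenticate(message, authors, score_fn):
    """Run one pairwise 'expert' per author pair and vote on the winner,
    mimicking the N x N pairwise scheme: each expert picks whichever of
    its two authors the scoring function prefers for the message."""
    votes = []
    for a, b in combinations(authors, 2):
        votes.append(a if score_fn(message, a) >= score_fn(message, b) else b)
    return majority_vote(votes)
```

In the paper each pairwise decision comes from a trained recurrent network rather than a single scoring function, and the whole procedure is repeated seven times before the final committee vote.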

    Predicting the age of social network users from user-generated texts with word embeddings

    Get PDF
    © 2016 FRUCT. Many web-based applications, such as advertising or recommender systems, often depend critically on demographic information, which may be unavailable for new or anonymous users. We study the problem of predicting demographic information from user-generated texts on a Russian-language dataset from a large social network. We evaluate the efficiency of age-prediction algorithms based on word2vec word embeddings and conduct a comprehensive experimental evaluation, comparing these algorithms with each other and with classical baseline approaches
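A common way to use word2vec embeddings for text-level prediction tasks like this is to average the vectors of the words in a text and feed the result to a regressor or classifier. The toy two-dimensional embeddings below are invented for illustration; trained word2vec vectors would be hundreds of dimensions.

```python
# toy word vectors standing in for trained word2vec embeddings
EMBEDDINGS = {
    "school": [0.9, 0.1],
    "homework": [0.8, 0.2],
    "pension": [0.1, 0.9],
    "retirement": [0.2, 0.8],
}

def text_vector(text, dim=2):
    """Average the embeddings of known words -- a standard way to turn
    per-word vectors into a fixed-size text representation for a
    downstream age-prediction model."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

The resulting fixed-length vector can then be passed to any regressor; the abstract does not say which prediction model the authors used on top of the embeddings.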

    Discriminatory Expressions to Produce Interpretable Models in Short Documents

    Full text link
    Social Networking Sites (SNS) are one of the most important means of communication. In particular, microblogging sites are used as analysis avenues because of their peculiarities (promptness, short texts...). Countless studies use SNS in novel ways, but machine learning work has focused mainly on classification performance rather than interpretability and/or other goodness metrics. As a result, state-of-the-art models are black boxes that should not be used to solve problems that may have a social impact. When a problem requires transparency, it is necessary to build interpretable pipelines. Even when the classifier itself is interpretable, the resulting models are often too complex to be considered comprehensible, making it impossible for humans to understand the actual decisions. This paper presents a feature selection mechanism that improves comprehensibility by using fewer but more meaningful features while achieving good performance in microblogging contexts where interpretability is mandatory. Moreover, we present a ranking method to evaluate features in terms of statistical relevance and bias. We conducted exhaustive tests with five different datasets to evaluate the classification performance, generalisation capacity, and complexity of the model. Results show that our proposal is the best and most stable one in terms of accuracy, generalisation, and comprehensibility
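The idea of keeping fewer but more discriminatory features can be sketched with a simple selection criterion. The paper's actual ranking method is not given in the abstract; the score below (the gap between a feature's highest and lowest per-class relative frequency) is a hypothetical stand-in for it.

```python
def discrimination_score(feature_counts, class_sizes):
    """Score a feature by how unevenly it is distributed across classes:
    the gap between its highest and lowest per-class relative frequency."""
    rates = [feature_counts[c] / class_sizes[c] for c in feature_counts]
    return max(rates) - min(rates)

def select_features(stats, class_sizes, k):
    """Keep only the k most discriminatory features, trading a little
    raw accuracy for a far more comprehensible model."""
    ranked = sorted(stats,
                    key=lambda f: discrimination_score(stats[f], class_sizes),
                    reverse=True)
    return ranked[:k]
```

Here `stats` maps each candidate feature to its per-class occurrence counts; a feature that appears equally often in every class (like a common stop word) scores near zero and is dropped first.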

    Using Arabic Twitter to support analysis of the spread of Infectious Diseases

    Get PDF
    This study investigates how to use Arabic social media content, especially Twitter, to measure the incidence of infectious diseases. People use social media applications such as Twitter to find news related to diseases and/or to express their opinions and feelings about them. As a result, a vast amount of information can be exploited by NLP researchers for a myriad of analyses, despite the informal writing style of social media. Systematic monitoring of social media posts (infodemiology or infoveillance) can be useful for detecting misinformation outbreaks, reducing the reporting lag time, and providing an independent, complementary source of data compared with traditional surveillance approaches. However, there has been a lack of research on analysing Arabic tweets for health surveillance purposes, owing to the scarcity of Arabic social media datasets compared with what is available for English and some other languages; it was therefore necessary to create our own corpus. In addition, building ontologies is a crucial part of the semantic web endeavour. In recent years, research interest has grown rapidly in supporting languages such as Arabic in NLP in general, but there has been very little research on medical ontologies for Arabic. In this thesis, the first and largest Arabic Twitter dataset in the area of health surveillance was created, for use in training and testing in the research studies presented. Machine learning algorithms, combined with NLP techniques adapted to Arabic, were used to classify tweets into five categories: academic, media, government, health professional, and the public, to assist reliability and trust judgements by taking the source of the information into account alongside the content of tweets. An Arabic Infectious Diseases Ontology was presented and evaluated as part of a new method to bridge formal and informal descriptions of infectious diseases.
Different qualitative and quantitative studies were performed to analyse Arabic tweets written during the COVID-19 pandemic, showing how public health organisations can learn from social media. A system was presented that measures the spread of two infectious diseases based on our Ontology, to illustrate what quantitative patterns and qualitative themes can be extracted
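The basic infoveillance signal behind such spread measurements is a per-day count of disease mentions, with an ontology mapping informal surface forms to formal disease concepts. The ontology fragment and the disease terms below are invented placeholders, not entries from the thesis's Arabic Infectious Diseases Ontology.

```python
from collections import defaultdict

# hypothetical ontology fragment: surface forms -> disease concept
ONTOLOGY = {"flu": "influenza", "influenza": "influenza",
            "covid": "covid-19", "corona": "covid-19"}

def mention_counts(tweets):
    """Aggregate (date, text) tweets into per-day mention counts for each
    disease concept -- the raw time series a surveillance system plots."""
    series = defaultdict(lambda: defaultdict(int))
    for date, text in tweets:
        for token in text.lower().split():
            if token in ONTOLOGY:
                series[ONTOLOGY[token]][date] += 1
    return series
```

A real pipeline would add Arabic tokenisation and normalisation before the lookup, and would classify the tweet's source (academic, media, government, health professional, public) before trusting its counts.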

    Leveraging social relevance : using social networks to enhance literature access and microblog search

    Get PDF
    An information retrieval system aims to select the relevant documents that meet the information need a user expresses through a query. Since the 1970s and 1980s, various theoretical models have been proposed to represent documents and queries on the one hand, and to match them on the other, independently of any user. More recently, the arrival of Web 2.0, also known as the social Web, has called the effectiveness of these models into question, since they ignore the environment in which the information is located. The user is no longer a simple consumer of information but also participates in its production. To accelerate the production of information and improve the quality of their work, users exchange information with a social neighbourhood that shares the same interests, and generally prefer to obtain information from a direct contact rather than from an anonymous source. Thus, the user, influenced by their social environment, gives as much importance to the social prominence of an information resource as to the textual similarity of documents to the query. To meet these new expectations, information retrieval is moving towards user-centric approaches that take the social context into account within the retrieval process. The first new challenge of an information retrieval system is therefore to model relevance with regard to the social position of individuals and the influence of their community; the second is to produce a relevance ranking that reflects, as closely as possible, the importance and social authority of information producers. Our work fits into this specific context.
Our goal is to estimate the social relevance of documents by integrating the social characteristics of resources with the relevance measures of classical information retrieval. We propose to integrate the social information network into the retrieval process and to exploit the social relations between social actors as a source of evidence for measuring the relevance of a document in response to a query. Two social information retrieval models are proposed, in different application settings: literature access and microblog search. The main contributions of each model are detailed in the following.
A social model for literature access. We proposed a generic social information retrieval model, deployed in particular for access to bibliographic resources. This model represents scientific papers within a social network and evaluates their importance according to the position of their authors in the network. Compared to previous approaches, it incorporates new social entities, namely annotators and social annotations (tags). In addition to co-authorship links, the model exploits two other types of social relationship: citation and social annotation. Finally, we propose to weight these relationships according to the position of the authors in the social network and their mutual collaborations.
A social model for microblog search. We proposed a tweet retrieval model that evaluates the quality of tweets in two contexts: the social context and the temporal context. The quality of a tweet is estimated by the social importance of the corresponding blogger, computed by applying the PageRank algorithm to the network of social influence. In the same spirit, the quality of a tweet is also evaluated according to its publication date: tweets submitted during periods of activity of a query term are given greater importance. Finally, we propose to integrate the blogger's social importance and the temporal magnitude of tweets with the other relevance factors using a Bayesian network model
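The blogger-importance score above is explicitly PageRank, which can be implemented with plain power iteration. This is the standard algorithm, not the thesis's specific influence network: the adjacency dict and parameters below are illustrative.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict
    {node: [nodes it points to]}; used here as a blogger-importance score."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank
```

On an influence graph where an edge means "influences", a blogger pointed to by many influential bloggers ends up with a high score, which the retrieval model then combines with the tweet's temporal features.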

    AUTHOR VERIFICATION OF ELECTRONIC MESSAGING SYSTEMS

    Get PDF
    Messaging systems have become a hugely popular new paradigm for sending and delivering text messages; however, online messaging platforms have also become an ideal place for criminals due to their anonymity, ease of use and low cost. Therefore, the ability to verify the identity of individuals involved in criminal activity is becoming increasingly important. The majority of research in this area has focused on traditional authorship problems that deal with single-domain datasets and large bodies of text. Few research studies have sought to explore multi-platform author verification as a possible solution to problems around forensics and security. Therefore, this research has investigated the ability to identify individuals on messaging systems, and has applied this to the modern messaging platforms of Email, Twitter, Facebook and Text messages, using different single-domain datasets for population-based and user-based verification approaches. Through a novel technique of cross-domain research using real scenarios, the domain incompatibilities of profiles from different distributions have been assessed, based on real-life corpora using data from 50 authors who use each of the aforementioned domains. The results show that the use of linguistics is likely to be similar between platforms, on average, for a population-based approach. The best corpus experimental result achieved a low EER of 7.97% for Text messages, showing the usefulness of single-domain platforms where the use of linguistics is likely to be similar, such as Text messages and Emails. For the user-based approach, there is very little evidence of a strong correlation of stylometry between platforms. It has been shown that linguistic features on some individual platforms have features in common with other platforms, and lexical features play a crucial role in the similarities between users' platforms.
Therefore, this research shows that the ability to identify individuals on messaging platforms may provide a viable solution to problems around forensics and security, and may help counter a range of criminal activities, such as sending spam texts, grooming children, and encouraging violence and terrorism.
Royal Embassy of Saudi Arabia, London
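The EER (equal error rate) quoted above is the operating point where the false-accept and false-reject rates of a verification system coincide. Given genuine and impostor similarity scores, it can be computed by sweeping a decision threshold; the score lists in the example are made up for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Sweep a threshold over all observed scores and return the point
    where the false-accept rate (impostors scoring above threshold) and
    the false-reject rate (genuine users scoring below it) are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A lower EER means better separation between genuine and impostor authors; an EER of 7.97% means that, at the balanced threshold, roughly 8% of claims are decided wrongly in each direction.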