7 research outputs found

    Combining social-based data mining techniques to extract collective trends from Twitter

    Social Networks have become an important environment for Collective Trends extraction. The interactions among users provide information about their preferences and relationships, which can be used to measure the influence of ideas and opinions and how they spread within the Network. Currently, one of the most relevant and popular Social Networks is Twitter, which was created to share comments and opinions. The information provided by users is especially useful in fields and research areas such as marketing. The data come as short text strings containing ideas expressed by real people. With this representation, Data Mining techniques (such as classification and clustering) are used to extract knowledge and distinguish the meaning of the opinions, while Complex Network techniques help discover influential actors and study how information propagates inside the Social Network. This work focuses on how clustering and classification techniques can be combined to extract collective knowledge from Twitter. In an initial phase, clustering techniques are applied to extract the main topics from the user opinions. The extracted collective knowledge is then used to relabel the dataset according to the clusters obtained, improving the classification results. Finally, these results are compared against a dataset manually labelled by human experts to analyse the accuracy of the proposed method. The preparation of this manuscript has been supported by the Spanish Ministry of Science and Innovation under the following projects: TIN2010-19872 and ECO2011-30105 (National Plan for Research, Development and Innovation), as well as the Multidisciplinary Project of Universidad Autónoma de Madrid (CEMU2012-034). The authors thank Ana M. Díaz-Martín and Mercedes Rozano for the manual classification of the Tweets.
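    A minimal sketch of the two-phase pipeline this abstract describes: cluster the opinions to discover topics, then relabel the data with the cluster assignments and train a classifier. The TF-IDF representation, k-means, naive Bayes, and the toy tweets are illustrative assumptions, not the paper's exact choices.

```python
# Hypothetical cluster-then-classify pipeline; the tweets are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

tweets = [
    "great phone, love the camera",
    "battery life is terrible",
    "camera quality is amazing",
    "the battery died after one day",
]

# Phase 1: cluster the opinions to extract the main topics.
vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
topics = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Phase 2: relabel the dataset with the discovered topics and train a classifier.
clf = MultinomialNB().fit(X, topics)
pred = clf.predict(vec.transform(["amazing camera"]))
```

The key design point is that the second phase trains on machine-produced cluster labels rather than manual annotations, which is why the paper can compare the result against an expert-labelled dataset.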

    A scalable framework for cross-lingual authorship identification

    This is an accepted manuscript of an article published by Elsevier in Information Sciences on 10/07/2018, available online: https://doi.org/10.1016/j.ins.2018.07.009. The accepted version of the publication may differ from the final published version. © 2018 Elsevier Inc. Cross-lingual authorship identification aims at finding the author of an anonymous document written in one language by using labeled documents written in other languages. The main challenge of cross-lingual authorship identification is that the stylistic markers (features) used in one language may not be applicable to the other languages in the corpus. Existing methods overcome this challenge by using external resources such as machine translation and part-of-speech tagging. However, such solutions are not applicable to languages with poor external resources (known as low-resource languages). They also fail to scale as the number of candidate authors and/or the number of languages in the corpus increases. In this investigation, we analyze different types of stylometric features and identify 10 high-performance language-independent features for cross-lingual stylometric analysis tasks. Based on these stylometric features, we propose a cross-lingual authorship identification solution that can accurately handle a large number of authors. Specifically, we partition the documents into fragments and further decompose each fragment into fixed-size chunks. Using a multilingual corpus of 400 authors with 825 documents written in 6 different languages, we show that our method can achieve an accuracy level of 96.66%. Our solution also outperforms the best existing solution that does not rely on external resources.
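    A sketch of the two ideas the abstract names: fixed-size chunk decomposition and stylometric features that need no language-specific resources (no POS tagger, no machine translation). The particular five features and the chunk size here are assumptions for illustration; the paper's actual feature set is not reproduced.

```python
# Illustrative language-independent stylometry; feature set and chunk size are
# hypothetical, not the paper's exact 10 features.
import string

def chunk(text, size=500):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def stylometric_features(chunk_text):
    """Character-level ratios that work for any language's script."""
    words = chunk_text.split()
    n_chars = max(len(chunk_text), 1)
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "punct_ratio": sum(c in string.punctuation for c in chunk_text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in chunk_text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in chunk_text) / n_chars,
        "space_ratio": sum(c.isspace() for c in chunk_text) / n_chars,
    }

doc = "An anonymous document, written in any language, is cut into chunks."
features = [stylometric_features(c) for c in chunk(doc, size=30)]
```

Each chunk then becomes one training instance for the candidate author, which is what lets the method scale with the number of authors instead of handling whole documents.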

    Semantics-based language models for information retrieval and text mining

    The language modeling approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. In this thesis, we propose a novel context-sensitive semantic smoothing method referred to as a topic signature language model. It extracts explicit topic signatures from a document and then statistically maps them onto individual words in the vocabulary. To support the new language model, we developed two automated algorithms to extract multiword phrases and ontological concepts, respectively, and an EM-based algorithm to learn semantic mapping knowledge from co-occurrence data. The topic signature language model is applied to three applications: information retrieval, text classification, and text clustering. Evaluations on a news collection and biomedical literature prove the effectiveness of the topic signature language model. In the information retrieval experiment, the topic signature language model consistently outperforms the baseline two-stage language model as well as the context-insensitive semantic smoothing method in all configurations. It also beats the state-of-the-art Okapi models in all configurations. In the text classification experiment, when the number of training documents is small, the Bayesian classifier with semantic smoothing not only outperforms the classifiers with background smoothing and Laplace smoothing but also beats the active learning classifiers and SVM classifiers. On the clustering task, whether or not the dataset to cluster is small, the model-based k-means with semantic smoothing performs significantly better than the model-based k-means with either background smoothing or Laplace smoothing.
    It is also superior to the spherical k-means in terms of effectiveness. In addition, we empirically show that, within the framework of topic signature language models, the semantic knowledge learned from one collection can be effectively applied to other collections. The thesis also compares three types of topic signatures (i.e., words, multiword phrases, and ontological concepts) with respect to their effectiveness and efficiency for semantic smoothing. In general, it is more expensive to extract multiword phrases and ontological concepts than individual words, but semantic mapping based on multiword phrases and ontological concepts is more effective in handling data sparsity than mapping based on individual words. Ph.D., Information Science and Technology -- Drexel University, 200
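    A minimal numeric sketch of the semantic smoothing idea: the document model interpolates a background-smoothed estimate with probability mass "translated" from the topic signatures found in the document, p(w|d) = (1 - λ)·p_b(w|d) + λ·Σ_t p(w|t)·p(t|d). The vocabulary, topic, mapping probabilities, and mixture weights below are toy values, not learned by the thesis's EM algorithm.

```python
# Toy semantic smoothing: mix a background-smoothed estimate with mass
# translated from topic signatures. All probabilities are invented values.
def smoothed_prob(w, doc_counts, bg, trans, topic_weights, lam=0.4, mu=0.6):
    """p(w|d) = (1 - lam) * p_b(w|d) + lam * sum_t p(w|t) * p(t|d)."""
    total = sum(doc_counts.values())
    # Background (Jelinek-Mercer) smoothing of the maximum-likelihood estimate.
    p_b = (1 - mu) * doc_counts.get(w, 0) / total + mu * bg[w]
    # Translation from topic signatures present in the document.
    p_t = sum(trans[t].get(w, 0.0) * p for t, p in topic_weights.items())
    return (1 - lam) * p_b + lam * p_t

bg = {"gene": 0.01, "protein": 0.02, "cell": 0.03}          # corpus model
trans = {"genomics": {"gene": 0.5, "protein": 0.3, "cell": 0.2}}  # p(w|topic)
doc_counts = {"gene": 3, "cell": 1}                          # document counts
p = smoothed_prob("protein", doc_counts, bg, trans, {"genomics": 1.0})
```

Note how "protein" receives substantial probability even though it never occurs in the document, because the "genomics" topic signature maps to it; this is the context-sensitive behavior that plain background or Laplace smoothing cannot provide.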

    Algorithmic tools for data-oriented law enforcement

    The increase in the capabilities of information technology over the last decade has led to a large increase in the creation of raw data. Data mining, a form of computer-guided, statistical data analysis, attempts to draw from these sources knowledge that is usable, human-understandable, and previously unknown. One of the potential application domains is law enforcement. This thesis describes a number of efforts in this direction and reports the results of applying the resulting algorithms to actual police data. The use of specifically tailored data mining algorithms is shown to have great potential in this area, pointing toward a future where algorithmic assistance in "combating" crime will be a valuable asset.

    LIPIcs, Volume 261, ICALP 2023, Complete Volume


    JFPC 2019 - Actes des 15es Journées Francophones de Programmation par Contraintes

    The JFPC (Journées Francophones de Programmation par Contraintes) is the main conference of the French-speaking community working on constraint satisfaction problems (CSP), propositional satisfiability (SAT), and constraint logic programming (CLP). The constraint programming community also maintains ties with operations research (OR), interval analysis, and various areas of artificial intelligence. The efficiency of resolution methods and the extension of the models allow constraint programming to tackle numerous and varied applications such as logistics, task scheduling, timetabling, robotics design, genome study in bioinformatics, optimisation of agricultural practices, etc. The JFPC aim to be a convivial venue for meetings, discussions, and exchanges within the French-speaking community, in particular between doctoral students, established researchers, and industry. The importance of the JFPC is reflected by the considerable share (about one third) of the French-speaking community in worldwide research in this field. Sponsored by the AFPC (Association Française pour la Programmation par Contraintes), JFPC 2019 takes place from 12 to 14 June 2019 at IMT Mines Albi and is organised by Xavier Lorca (chair of the scientific committee) and Élise Vareilles (chair of the organising committee).

    On the Strength of Hyperclique Patterns for Text Categorization

    The use of association patterns for text categorization has attracted great interest, and a variety of useful methods have been developed. However, the key characteristics of pattern-based text categorization remain unclear. Indeed, there are still no concrete answers to the following two questions: first, what kind of association pattern is the best candidate for pattern-based text categorization? Second, what is the most desirable way to use patterns for text categorization? In this paper, we focus on answering these two questions. More specifically, we show that hyperclique patterns are more desirable than frequent patterns for text categorization. Along this line, we develop an algorithm for text categorization using hyperclique patterns. As demonstrated by our experimental results on various real-world text documents, our method provides much better computational performance than state-of-the-art methods while retaining classification accuracy.
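    A small illustration of the h-confidence measure that defines hyperclique patterns: hconf(P) = supp(P) / max item support in P, and a pattern is a hyperclique when hconf meets a threshold. The toy transactions (standing in for the term sets of documents) and the 0.5 threshold are assumptions for illustration.

```python
# Toy hyperclique mining via h-confidence; transactions are invented term sets.
from itertools import combinations

transactions = [
    {"data", "mining", "text"},
    {"data", "mining"},
    {"data", "text"},
    {"mining", "text"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def h_confidence(itemset):
    """hconf(P) = supp(P) / max_{i in P} supp({i})."""
    return support(itemset) / max(support({i}) for i in itemset)

items = sorted({i for t in transactions for i in t})
hypercliques = [
    set(p) for r in (2, 3) for p in combinations(items, r)
    if h_confidence(set(p)) >= 0.5
]
```

Here every pair survives but the triple {"data", "mining", "text"} does not: its joint support (1/4) is too small relative to its most frequent item (3/4), which is the affinity constraint that makes hypercliques stricter than plain frequent patterns.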