
    Event detection on streams of short texts for decision-making

    The objective of this thesis is to design an event detection system on social networks to assist people in charge of decision-making in industrial contexts. The system must be able to detect both targeted, domain-specific events and general events. In particular, we are interested in applying this system to supply chains, and more specifically to those related to raw materials. The challenge is to build such a detection system, but also to determine which events potentially affect raw-material supply chains. This synthesis summarizes the stages of the research conducted to address these problems.
Architecture of an event detection system. First, we introduce the building blocks of an event detection system. These systems are classically composed of a data filtering and cleaning step, which ensures the quality of the data processed by the rest of the system. The data are then embedded so that they can be clustered by similarity. Once these clusters are created, they are analyzed to determine whether their documents discuss an event or not. Finally, the evolution of these events over time is tracked. In this thesis, we study the problems specific to each of these steps.
Textual representation of documents from social networks. We compared different text representation models in the context of our event detection system, and compared its performance to the First Story Detection (FSD) algorithm, which has the same objectives. We first demonstrated that our proposed system outperforms FSD, and also that recent neural network architectures (transformers) outperform TF-IDF in our context, contrary to what had been shown in the FSD setting. We then proposed combining different textual representations in order to exploit their strengths jointly.
Event detection, tracking, and evaluation. We proposed approaches for analyzing document clusters and for tracking the evolution of events. In particular, we use the entropy and user diversity introduced in ... to evaluate the clusters. We then track their evolution over time by comparing clusters at different instants, in order to create chains of clusters. Finally, we studied how to evaluate event detection systems in contexts where little human-annotated data is available, and proposed a method to automatically evaluate event detection systems by exploiting partially annotated data.
Application to the commodities context. To specify the types of events to monitor, we conducted a historical study of events that have impacted raw-material prices, focusing on phosphate, a strategic raw material. We studied the factors that have an influence and proposed a reproducible method that can be applied to other raw materials or other fields. Finally, we drew up a list of elements to monitor so that experts can anticipate price variations.
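The cluster-analysis step relies on entropy and user diversity to decide whether a cluster of posts describes an event. The exact formulations are in the cited work (elided above), so the sketch below assumes the common choices of Shannon entropy over the cluster's token distribution and the distinct-author ratio; the function names and semantics are illustrative, not the thesis's implementation.

```python
import math
from collections import Counter

def cluster_entropy(tokens):
    # Shannon entropy (bits) of the token distribution inside one cluster.
    # Very low entropy often indicates near-duplicate spam rather than
    # many independent reports of a real event.
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def user_diversity(user_ids):
    # Fraction of distinct authors among the cluster's posts: genuine events
    # tend to be reported by many users, not one account posting repeatedly.
    return len(set(user_ids)) / len(user_ids)
```

A cluster would then be flagged as event-like when both scores exceed thresholds calibrated on annotated data.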

    A Deep Learning Approach to Persian Plagiarism Detection

    Plagiarism detection is defined as the automatic identification of reused text material. The general availability of the internet and easy access to textual information heighten the need for automated plagiarism detection. In this regard, different algorithms have been proposed to perform plagiarism detection in text documents. Due to the drawbacks and inefficiency of traditional methods and the lack of suitable algorithms for Persian plagiarism detection, in this paper we propose a deep learning based method to detect plagiarism. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods are used to combine the word vectors into a sentence representation. By comparing the representations of source and suspicious sentences, the sentence pairs with the highest similarity are considered candidates for plagiarism. The final plagiarism decision is made using a two-level evaluation method. Our method was used in the PAN2016 Persian plagiarism detection contest and achieved 90.6% plagdet, 85.8% recall, and 95.9% precision on the provided data sets. CCS Concepts: • Information systems → Near-duplicate and plagiarism detection • Information systems → Evaluation of retrieval results
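The "simple aggregation" of word vectors into a sentence representation can be illustrated with mean pooling followed by cosine comparison. The paper does not fix the aggregation function, so treat this as one plausible instance with toy, hand-made vectors rather than the contest system itself.

```python
def sentence_vector(sentence, word_vectors, dim):
    # Mean of the vectors of in-vocabulary words; zero vector if none match.
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    # Cosine similarity between two dense vectors, 0.0 for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0
```

Source/suspicious sentence pairs whose cosine exceeds a tuned threshold would then be passed to the second evaluation level.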

    Mining, Modeling, and Analyzing Real-Time Social Trails

    Real-time social systems are the fastest growing phenomena on the web, enabling millions of users to generate, share, and consume content on a massive scale. These systems are manifestations of a larger trend toward the global sharing of the real-time interests, affiliations, and activities of everyday users, and they demand new computational approaches for monitoring, analyzing, and distilling information from the prospective web of real-time content. In this dissertation research, we focus on the real-time social trails that reflect the digital footprints of crowds of real-time web users in response to real-world events or online phenomena. These digital footprints correspond to the artifacts strewn across the real-time web, such as messages posted to Twitter or Facebook; the creation, sharing, and viewing of videos on websites like YouTube; and so on. While access to social trails could benefit many domains, there is a significant research gap toward discovering, modeling, and leveraging these social trails. Hence, this dissertation research makes three contributions:
• The first contribution is a suite of efficient techniques for discovering non-trivial social trails from large-scale real-time social systems. We first develop a communication-based method using temporal graphs for discovering social trails on a stream of conversations from social messaging systems (instant messages, emails, Twitter directed or @ messages, SMS, etc.), and then develop a content-based method using locality-sensitive hashing for discovering content-based social trails on a stream of text messages (a Tweet stream, a stream of Facebook messages, YouTube comments, etc.).
• The second contribution is a framework for modeling and predicting the spatio-temporal dynamics of social trails. In particular, we develop a probabilistic model that synthesizes two conflicting hypotheses about the nature of online information spread: (i) the spatial influence model, which asserts that social trails propagate to locations that are close by; and (ii) the community affinity influence model, which asserts that social trails propagate between locations that are culturally connected, even if they are distant.
• The third contribution is a set of methods for social trail analytics and for leveraging social trails in prognostic applications such as real-time content recommendation and personalized advertising. We first analyze geo-spatial social trails of hashtags from Twitter, investigate their spatio-temporal dynamics, and then use this analysis to develop a framework for recommending hashtags. Finally, we address the challenge of classifying social trails efficiently on real-time social systems.
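The content-based trail discovery uses locality-sensitive hashing to group near-duplicate short messages without comparing every pair in the stream. A minimal MinHash sketch of that idea follows; the seeded-MD5 hash family and 64-function signature length are illustrative choices, not the dissertation's parameters.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    # One MinHash value per seeded hash function: the minimum hash of any
    # token in the message. Similar token sets yield similar signatures.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing signature positions estimates Jaccard similarity
    # of the underlying token sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Banding these signatures into hash buckets gives candidate trail members in roughly constant time per incoming message, which is what makes the method viable on a full Tweet stream.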

    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR when the search takes place in a multi-lingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language. Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. 
Finally, we propose a general architecture in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
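"Translating to search" classically begins with probabilistic query translation, where each source-language query term is expanded into weighted target-language terms before retrieval. The sketch below shows that standard CLIR step under an assumed `translation_table` mapping each source term to `(target_term, probability)` pairs; it illustrates the general technique, not the dissertation's specific MT integration.

```python
def translate_query(query_terms, translation_table, top_k=2):
    # Expand each source-language term into its top_k candidate
    # translations, accumulating translation probabilities as term weights
    # for the downstream target-language retrieval step.
    weighted = {}
    for term in query_terms:
        candidates = sorted(translation_table.get(term, []),
                            key=lambda tp: -tp[1])[:top_k]
        for target, prob in candidates:
            weighted[target] = weighted.get(target, 0.0) + prob
    return weighted
```

A retrieval engine would then score documents against this weighted bag of translated terms instead of the original query.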

    Large-scale document labeling using supervised sequence embedding

    A critical component in the computational treatment of automated document labeling is the choice of an appropriate representation. A proper representation captures the specific phenomena of interest in the data while transforming them into a format appropriate for a classifier. For a text document, a popular choice is the bag-of-words (BoW) representation, which encodes the presence of unique words with non-zero weights such as TF-IDF. Extending this model to long, overlapping phrases (n-grams) results in an exponential explosion in the dimensionality of the representation. In this work, we develop a model that encodes long phrases in a low-dimensional latent space with a cumulative function of the individual words in each phrase. In contrast to BoW, the parameter space of the proposed model grows linearly with the length of the phrase. The proposed model requires only vector additions and multiplications with scalars to compute the latent representation of phrases, which makes it applicable to large-scale text labeling problems. Several sentiment classification and binary topic categorization problems will be used to empirically evaluate the proposed representation. The same model can also encode the relative spatial distribution of elements in higher-dimensional sequences. To verify this claim, the proposed model will be evaluated on a large-scale image classification dataset, where images are transformed into two-dimensional sequences of quantized image descriptors.
Ph.D., Computer Science -- Drexel University, 201
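The abstract leaves the cumulative function unspecified. One form consistent with "only vector additions and multiplications with scalars" is a decayed running sum of word vectors, sketched below with a hypothetical `decay` parameter and toy embeddings; the actual model's function and parameters may differ.

```python
def phrase_embedding(words, word_vectors, decay=0.7):
    # Cumulative phrase encoding: scale the running latent vector by a
    # decay factor, then add the next word's vector. Cost per word is one
    # scalar multiplication plus one vector addition, and the parameter
    # count grows with vocabulary size, never with phrase length.
    dim = len(next(iter(word_vectors.values())))
    h = [0.0] * dim
    for w in words:
        v = word_vectors.get(w, [0.0] * dim)
        h = [decay * hi + vi for hi, vi in zip(h, v)]
    return h
```

Because later words are added on top of a decayed history, the encoding is sensitive to word order, unlike a plain bag-of-words sum.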

    Multi-document Summarization System Using Rhetorical Information

    Over the past 20 years, research in automated text summarization has grown significantly in the field of natural language processing. The massive availability of scientific and technical information on the Internet, including journals, conferences, and news articles, has attracted the interest of various groups of researchers working on text summarization, including linguists, biologists, database researchers, and information retrieval experts. However, because the information available on the web is ever expanding, reading the sheer volume of information is a significant challenge. To deal with this volume of information, users need appropriate summaries to help them manage their information needs more efficiently. Although many automated text summarization systems have been proposed in the past twenty years, none of them have incorporated the use of rhetoric. To date, most automated text summarization systems have relied only on statistical approaches, which do not take into account other features of language such as antimetabole and epanalepsis. Our hypothesis is that rhetoric can provide this type of additional information. This thesis addresses these issues by investigating the role of rhetorical figuration in detecting the salient information in texts. We show that automated multi-document summarization can be improved using metrics based on rhetorical figuration. A corpus of speeches by different U.S. presidents, including campaign, State of the Union, and inaugural speeches, has been created to test our proposed multi-document summarization system. Various evaluation metrics have been used to test and compare the performance of the summaries produced by both our proposed system and other systems. Our proposed multi-document summarization system using rhetorical figures improves the produced summaries and achieves better performance than the MEAD system in most cases, especially for antimetabole, polyptoton, and isocolon. Overall, the results of our system are promising and lead to future progress on this research.
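Figures like antimetabole (words repeated in reverse order across clauses, as in "eat to live" / "live to eat") lend themselves to shallow word-position checks. The sketch below flags a clause pair when some pair of shared words appears in opposite orders; it is a crude illustrative test, not the thesis's actual detector.

```python
def has_antimetabole(clause_a, clause_b):
    # Map each word to its position in each clause, then look for a pair
    # of shared words whose relative order flips between the two clauses.
    ta, tb = clause_a.lower().split(), clause_b.lower().split()
    pos_a = {w: i for i, w in enumerate(ta)}
    pos_b = {w: i for i, w in enumerate(tb)}
    shared = [w for w in pos_a if w in pos_b]
    for i, u in enumerate(shared):
        for v in shared[i + 1:]:
            # A negative product means u precedes v in one clause but
            # follows it in the other: the order is reversed.
            if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0:
                return True
    return False
```

Counting such hits per sentence would give one simple rhetorical-figuration metric to feed into sentence scoring.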

    A fast and scalable binary similarity method for open source libraries

    The usage of third-party open source software has become more and more popular in the past years, due to the need for faster development cycles and the availability of good-quality libraries. Those libraries are integrated as dependencies, often in the form of binary artifacts. This is especially common in embedded software applications. Dependencies, however, can proliferate and also add new attack surfaces to an application due to vulnerabilities in the library code. Hence the need for binary similarity analysis methods to detect libraries compiled into applications. Binary similarity detection methods are related to text similarity methods and build upon the research in that area. In this research we focus on fuzzy matching methods, which have been used widely and successfully in text similarity analysis. In particular, we propose using locality-sensitive hashing schemes in combination with normalized binary code features. The normalization allows us to apply the similarity comparison across binaries produced by different compilers, using different optimization flags, and built for various machine architectures. To improve the matching precision, we use weighted code features. Machine learning is used to optimize the feature weights to create clusters of semantically similar code blocks extracted from different binaries. The machine learning is performed in an offline process to increase the scalability and performance of the matching system. Using the above methods, we build a database of binary similarity code signatures for open source libraries. The database is used to match, by similarity, any code block from an application to known libraries in the database. One of the goals of our system is to facilitate a fast and scalable similarity matching process. This allows integrating the system into continuous software development, testing, and integration pipelines.
The evaluation shows that our results are comparable to those of other systems proposed in related research in terms of precision, while maintaining the performance required in continuous integration systems.
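The weighted comparison of normalized code features can be illustrated with a weighted Jaccard similarity over feature-count multisets. In the paper the weights are learned offline by machine learning; the hand-set `weights` dict below is purely hypothetical, as are the feature names.

```python
def weighted_similarity(feats_a, feats_b, weights):
    # Weighted Jaccard over feature counts: features absent from `weights`
    # default to weight 1.0. Identical feature multisets score 1.0.
    keys = set(feats_a) | set(feats_b)
    num = sum(weights.get(k, 1.0) * min(feats_a.get(k, 0), feats_b.get(k, 0))
              for k in keys)
    den = sum(weights.get(k, 1.0) * max(feats_a.get(k, 0), feats_b.get(k, 0))
              for k in keys)
    return num / den if den else 0.0
```

Normalization (e.g. abstracting concrete registers and immediates before counting features) is what lets the same measure compare binaries built by different compilers for different architectures.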

    Grounding event references in news

    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference, so their computational processing proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task, which, analogous to named entity linking or disambiguation, models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking hopes to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
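The retrieval system scores candidate archival articles with temporal and term-overlap features. The sketch below combines those two signals linearly; the weights `w_term` and `w_time` and the recency decay are hypothetical stand-ins for the values learned from the hyperlink corpora.

```python
from datetime import date

def link_score(ref_terms, article_terms, ref_date, article_date,
               w_term=1.0, w_time=0.5):
    # Term overlap: fraction of the reference's terms found in the article.
    overlap = len(set(ref_terms) & set(article_terms)) / max(len(set(ref_terms)), 1)
    # Temporal proximity: decays as the article moves away from the
    # reference's publication date (one-year half-scale, illustrative).
    days = abs((ref_date - article_date).days)
    recency = 1.0 / (1.0 + days / 365.0)
    return w_term * overlap + w_time * recency
```

Ranking the archive by this score and taking the top article approximates the "first report" grounding defined by the event linking task.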