855 research outputs found

    Text based classification of companies in CrunchBase

    Get PDF
    This paper introduces two fuzzy fingerprint based text classification techniques that were successfully applied to automatically label companies from CrunchBase, based purely on their unstructured textual description. This is a real and very challenging problem due to the large set of possible labels (more than 40) and also to the fact that the textual descriptions do not have to abide by any criteria and are, therefore, extremely heterogeneous. Fuzzy fingerprints are a recently introduced technique that can be used for performing fast classification. They perform well in the presence of unbalanced datasets and can cope with a very large number of classes. In the paper, a comparison is performed against some of the best text classification techniques commonly used to address similar problems. When applied to the CrunchBase dataset, the fuzzy fingerprint based approach outperformed the other techniques.info:eu-repo/semantics/submittedVersio

    MISNIS: an intelligent platform for Twitter topic mining

    Get PDF
    Twitter has become a major tool for spreading news, for dissemination of positions and ideas, and for the commenting and analysis of current world events. However, with more than 500 million tweets flowing per day, it is necessary to find efficient ways of collecting, storing, managing, mining and visualizing all this information. This is especially relevant if one considers that Twitter has no ways of indexing tweet contents, and that the only available categorization “mechanism” is the #hashtag, which is totally dependent of a user's will to use it. This paper presents an intelligent platform and framework, named MISNIS - Intelligent Mining of Public Social Networks’ Influence in Society - that facilitates these issues and allows a non-technical user to easily mine a given topic from a very large tweet's corpus and obtain relevant contents and indicators such as user influence or sentiment analysis. When compared to other existent similar platforms, MISNIS is an expert system that includes specifically developed intelligent techniques that: (1) Circumvent the Twitter API restrictions that limit access to 1% of all flowing tweets. The platform has been able to collect more than 80% of all flowing portuguese language tweets in Portugal when online; (2) Intelligently retrieve most tweets related to a given topic even when the tweets do not contain the topic #hashtag or user indicated keywords. A 40% increase in the number of retrieved relevant tweets has been reported in real world case studies. The platform is currently focused on Portuguese language tweets posted in Portugal. However, most developed technologies are language independent (e.g. intelligent retrieval, sentiment analysis, etc.), and technically MISNIS can be easily expanded to cover other languages and locations

    Exploring events and distributed representations of text in multi-document summarization

    Get PDF
    In this article, we explore an event detection framework to improve multi-document summarization. Our approach is based on a two-stage single-document method that extracts a collection of key phrases, which are then used in a centrality-as-relevance passage retrieval model. We explore how to adapt this single-document method for multi-document summarization methods that are able to use event information. The event detection method is based on Fuzzy Fingerprint, which is a supervised method trained on documents with annotated event tags. To cope with the possible usage of different terms to describe the same event, we explore distributed representations of text in the form of word embeddings, which contributed to improve the summarization results. The proposed summarization methods are based on the hierarchical combination of single-document summaries. The automatic evaluation and human study performed show that these methods improve upon current state-of-the-art multi-document summarization systems on two mainstream evaluation datasets, DUC 2007 and TAC 2009. We show a relative improvement in ROUGE-1 scores of 16% for TAC 2009 and of 17% for DUC 2007.info:eu-repo/semantics/submittedVersio

    Resilient and Scalable Android Malware Fingerprinting and Detection

    Get PDF
    Malicious software (Malware) proliferation reaches hundreds of thousands daily. The manual analysis of such a large volume of malware is daunting and time-consuming. The diversity of targeted systems in terms of architecture and platforms compounds the challenges of Android malware detection and malware in general. This highlights the need to design and implement new scalable and robust methods, techniques, and tools to detect Android malware. In this thesis, we develop a malware fingerprinting framework to cover accurate Android malware detection and family attribution. In this context, we emphasize the following: (i) the scalability over a large malware corpus; (ii) the resiliency to common obfuscation techniques; (iii) the portability over different platforms and architectures. In the context of bulk and offline detection on the laboratory/vendor level: First, we propose an approximate fingerprinting technique for Android packaging that captures the underlying static structure of the Android apps. We also propose a malware clustering framework on top of this fingerprinting technique to perform unsupervised malware detection and grouping by building and partitioning a similarity network of malicious apps. Second, we propose an approximate fingerprinting technique for Android malware's behavior reports generated using dynamic analyses leveraging natural language processing techniques. Based on this fingerprinting technique, we propose a portable malware detection and family threat attribution framework employing supervised machine learning techniques. Third, we design an automatic framework to produce intelligence about the underlying malicious cyber-infrastructures of Android malware. We leverage graph analysis techniques to generate relevant, actionable, and granular intelligence that can be used to identify the threat effects induced by malicious Internet activity associated to Android malicious apps. In the context of the single app and online detection on the mobile device level, we further propose the following: Fourth, we design a portable and effective Android malware detection system that is suitable for deployment on mobile and resource constrained devices, using machine learning classification on raw method call sequences. Fifth, we elaborate a framework for Android malware detection that is resilient to common code obfuscation techniques and adaptive to operating systems and malware change overtime, using natural language processing and deep learning techniques. We also evaluate the portability of the proposed techniques and methods beyond Android platform malware, as follows: Sixth, we leverage the previously elaborated techniques to build a framework for cross-platform ransomware fingerprinting relying on raw hybrid features in conjunction with advanced deep learning techniques

    Software development process mining: discovery, conformance checking and enhancement

    Get PDF
    Context. Modern software projects require the proper allocation of human, technical and financial resources. Very often, project managers make decisions supported only by their personal experience, intuition or simply by mirroring activities performed by others in similar contexts. Most attempts to avoid such practices use models based on lines of code, cyclomatic complexity or effort estimators, thus commonly supported by software repositories which are known to contain several flaws. Objective. Demonstrate the usefulness of process data and mining methods to enhance the software development practices, by assessing efficiency and unveil unknown process insights, thus contributing to the creation of novel models within the software development analytics realm. Method. We mined the development process fragments of multiple developers in three different scenarios by collecting Integrated Development Environment (IDE) events during their development sessions. Furthermore, we used process and text mining to discovery developers’ workflows and their fingerprints, respectively. Results. We discovered and modeled with good quality developers’ processes during programming sessions based on events extracted from their IDEs. We unveiled insights from coding practices in distinct refactoring tasks, built accurate software complexity forecast models based only on process metrics and setup a method for characterizing coherently developers’ behaviors. The latter may ultimately lead to the creation of a catalog of software development process smells. Conclusions. Our approach is agnostic to programming languages, geographic location or development practices, making it suitable for challenging contexts such as in modern global software development projects using either traditional IDEs or sophisticated low/no code platforms.Contexto. Projetos de software modernos requerem a correta alocação de recursos humanos, técnicos e financeiros. Frequentemente, os gestores de projeto tomam decisões suportadas apenas na sua própria experiência, intuição ou simplesmente espelhando atividades executadas por terceiros em contextos similares. As tentativas para evitar tais práticas baseiam-se em modelos que usam linhas de código, a complexidade ciclomática ou em estimativas de esforço, sendo estes tradicionalmente suportados por repositórios de software conhecidos por conterem várias limitações. Objetivo. Demonstrar a utilidade dos dados de processo e respetivos métodos de análise na melhoria das práticas de desenvolvimento de software, colocando o foco na análise da eficiência e revelando aspetos dos processos até então desconhecidos, contribuindo para a criação de novos modelos no contexto de análises avançadas para o desenvolvimento de software. Método. Explorámos os fragmentos de processo de vários programadores em três cenários diferentes, recolhendo eventos durante as suas sessões de desenvolvimento no IDE. Adicionalmente, usámos métodos de descoberta e análise de processos e texto no sentido de modelar o fluxo de trabalho dos programadores e as suas características individuais, respetivamente. Resultados. Descobrimos e modelámos com boa qualidade os processos dos programadores durante as suas sessões de trabalho, usando eventos provenientes dos seus IDEs. Revelámos factos desconhecidos sobre práticas de refabricação, construímos modelos de previsão da complexidade ciclomática usando apenas métricas de processo e criámos um método para caracterizar coerentemente os comportamentos dos programadores. Este último, pode levar à criação de um catálogo de boas/más práticas no processo de desenvolvimento de software. Conclusões. A nossa abordagem é agnóstica em termos de linguagens de programação, localização geográfica ou prática de desenvolvimento, tornando-a aplicável em contextos complexos tal como em projetos modernos de desenvolvimento global que utilizam tanto os IDEs tradicionais como as atuais e sofisticadas plataformas "low/no code"

    Engineering systematic musicology : methods and services for computational and empirical music research

    Get PDF
    One of the main research questions of *systematic musicology* is concerned with how people make sense of their musical environment. It is concerned with signification and meaning-formation and relates musical structures to effects of music. These fundamental aspects can be approached from many different directions. One could take a cultural perspective where music is considered a phenomenon of human expression, firmly embedded in tradition. Another approach would be a cognitive perspective, where music is considered as an acoustical signal of which perception involves categorizations linked to representations and learning. A performance perspective where music is the outcome of human interaction is also an equally valid view. To understand a phenomenon combining multiple perspectives often makes sense. The methods employed within each of these approaches turn questions into concrete musicological research projects. It is safe to say that today many of these methods draw upon digital data and tools. Some of those general methods are feature extraction from audio and movement signals, machine learning, classification and statistics. However, the problem is that, very often, the *empirical and computational methods require technical solutions* beyond the skills of researchers that typically have a humanities background. At that point, these researchers need access to specialized technical knowledge to advance their research. My PhD-work should be seen within the context of that tradition. In many respects I adopt a problem-solving attitude to problems that are posed by research in systematic musicology. This work *explores solutions that are relevant for systematic musicology*. It does this by engineering solutions for measurement problems in empirical research and developing research software which facilitates computational research. These solutions are placed in an engineering-humanities plane. The first axis of the plane contrasts *services* with *methods*. Methods *in* systematic musicology propose ways to generate new insights in music related phenomena or contribute to how research can be done. Services *for* systematic musicology, on the other hand, support or automate research tasks which allow to change the scope of research. A shift in scope allows researchers to cope with larger data sets which offers a broader view on the phenomenon. The second axis indicates how important Music Information Retrieval (MIR) techniques are in a solution. MIR-techniques are contrasted with various techniques to support empirical research. My research resulted in a total of thirteen solutions which are placed in this plane. The description of seven of these are bundled in this dissertation. Three fall into the methods category and four in the services category. For example Tarsos presents a method to compare performance practice with theoretical scales on a large scale. SyncSink is an example of a service

    Twitter gender classification using user unstructured information

    Get PDF
    This paper describes an approach to automatically detect the gender of Twitter users, based only on clues provided by their profile information in an unstructured form. A number of features that capture phenomena specific of Twitter users is proposed and evaluated on a dataset of about 242K English language users. Different supervised and unsupervised approaches are used to assess the performance of the proposed features, including Naive Bayes variants, Logistic Regression, Support Vector Machines, Fuzzy c-Means clustering, and K-means. An unsupervised approach based on Fuzzy c-Means proved to be very suitable for this task, returning the correct gender for about 96% of the users.info:eu-repo/semantics/acceptedVersio
    corecore