    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    We live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system in which new protocols and applications arise at a constant pace. These characteristics make the Internet a valuable and challenging data source and application domain for research, both at the Transport layer, where network traffic flows are analyzed, and at the Application layer, with its ever-growing next-generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr). This thesis focuses on the study, design and development of novel algorithms and frameworks to support large-scale data mining over huge and heterogeneous data volumes, with a particular focus on Internet data as the data source, targeting network traffic classification, on-line social network analysis, recommendation systems, cloud services and Big Data.

    Interactive data analysis and its applications on multi-structured datasets

    CBR and MBR techniques: review for an application in the emergencies domain

    The purpose of this document is to provide an in-depth analysis of current reasoning engine practice and of the strategies for integrating Case Based Reasoning (CBR) and Model Based Reasoning (MBR) that will be used in the design and development of the RIMSAT system. RIMSAT (Remote Intelligent Management Support and Training) is a European Commission funded project designed to: (a) provide an innovative, 'intelligent', knowledge-based solution aimed at improving the quality of critical decisions; and (b) enhance the competencies and responsiveness of individuals and organisations involved in highly complex, safety-critical incidents, irrespective of their location. In other words, RIMSAT aims to design and implement a decision support system that applies both Case Based Reasoning and Model Based Reasoning technology to the management of emergency situations. This document is part of a deliverable for the RIMSAT project. Although it has been written in close contact with the requirements of the project, its overview is wide enough to serve as a state of the art in integration strategies between CBR and MBR technologies.

    Text mining techniques for patent analysis.

    Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so analyzing them takes considerable human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. Efficiency and effectiveness are considered in the design of these techniques. Important features of the proposed methodology include a rigorous approach to verifying the usefulness of segment extracts as document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to large volumes of patents, and an automatic procedure for creating generic cluster titles to ease interpretation of the results. These techniques were evaluated, and the results confirm that the machine-generated summaries preserve more important content words than some other sections for classification. To demonstrate feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that our approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analyses, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches.

    Automated Analysis of Customer Contacts – a Fintech Based Case Study

    The rapid development of information technologies has led to unprecedented amounts of data being generated on a daily basis, and analysing this data automatically is crucial for gaining a competitive advantage. Traditional data mining techniques have been applied efficiently in a variety of commercial applications, yet they are only applicable to structured data. However, the overwhelming majority of existing data is in an unstructured (e.g. textual) form, so it is crucial for companies to build solutions that automatically extract useful information from it. This master's thesis is practical in nature. Its purpose was to implement an automated text analysis model, using data from TransferWise Ltd., that can be used to efficiently prioritise and measure incoming customer contacts. To achieve this, the author conducted numerous experiments employing classical as well as novel natural language processing techniques. For this task, the novel methods did not yield noticeably better results than the classical ones. The resulting model is important for both the company and its customers, since it can be used to prioritise incoming contacts based on their complexity and urgency. This improves the customer experience and is likely to accelerate growth by making operational processes more efficient. Besides its practical value, the thesis also provides an extensive overview of numerous natural language processing techniques, their suitability and the opportunities they offer.

    Predicting User Interaction on Social Media using Machine Learning

    Analysis of Facebook posts provides helpful information for users on social media. Current papers about user engagement on social media explore methods for predicting user engagement. These analyses of Facebook posts have included text and image analysis, yet the studies have not incorporated both text and image data together. This research explores the usefulness of combining image and text data to predict user engagement. The study incorporates five types of machine learning models: text-based Neural Networks (NN), image-based Convolutional Neural Networks (CNN), Word2Vec, decision trees, and a combination of text-based NN and image-based CNN. The models are unique in their use of the data. The research collects 350k Facebook posts. The models are trained and tested on advertisement posts in order to predict user engagement. User engagement includes share count, comment count, and comment sentiment. The study found that combining image and text data produced the best models. The research further demonstrates that the combined models outperform random models.

    Data mining techniques for complex application domains

    The emergence of advanced communication techniques has increased the availability of large collections of data in electronic form in a number of application domains, including healthcare, e-business, and e-learning. Every day a large number of records is stored electronically. However, finding useful information in such large data collections is challenging. Data mining technology aims to automatically extract hidden knowledge from large data repositories by exploiting sophisticated algorithms. The hidden knowledge in electronic data may be used to improve the procedures, productivity, and reliability of several application domains. The PhD activity has focused on novel and effective data mining approaches to tackle the complex data coming from two main application domains: healthcare data analysis and textual data analysis. In the context of healthcare data, the research addressed the application of different data mining techniques to discover valuable knowledge from real exam-log data of patients. In particular, efforts have been devoted to the extraction of medical pathways, which can be exploited to analyze the actual treatments followed by patients. The derived knowledge not only provides useful information about treatment procedures but may also play an important role in predicting potential patient risks associated with medical treatments. The research effort in textual data analysis is twofold. On the one hand, a novel approach to the discovery of succinct summaries of large document collections has been proposed. On the other hand, the suitability of an established descriptive data mining technique for supporting domain experts in making decisions has been investigated. Both research activities focus on adopting widely used exploratory data mining techniques for textual data analysis, which requires overcoming the intrinsic limitations of traditional algorithms in handling textual documents efficiently and effectively.

    Machine Learning

    Machine learning can be defined in various ways, all relating to a scientific domain concerned with the design and development of theoretical and implementation tools for building systems with some human-like intelligent behavior. More specifically, machine learning addresses the ability to improve automatically through experience.