49 research outputs found

    Mining diverse consumer preferences for bundling and recommendation

    Utilizing AI/ML methods for measuring data quality

    High-quality data is crucial for trustworthy data-driven decisions. A considerable share of current approaches to measuring data quality relies on expensive, time-consuming expert work that involves manual effort to achieve adequate results. Furthermore, these approaches are prone to error and do not take full advantage of the potential of artificial intelligence (AI). A possible solution is to explore state-of-the-art methods based on machine learning (ML) that use the potential of AI to overcome these issues. A significant part of the thesis deals with data quality theory, which provides comprehensive insight into the field. Four ML-based state-of-the-art methods were found in the existing literature, and one novel method based on autoencoders (AE) was proposed. Experiments with AE and with association rule mining using natural language processing (NLP) were conducted. The proposed AE-based methods proved able to detect potential data quality defects in real-world datasets. The association rule mining approach was able to extract business rules for a given business question, but it required significant preprocessing effort. Alternative non-AI methods were also analyzed, but they relied on expert and domain knowledge.

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases
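
    The single-column statistics and the functional dependencies mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the survey: the helper names `profile_column` and `holds_fd`, the pattern encoding (digits become `9`, letters become `A`), and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import re


def profile_column(values):
    """Return simple single-column profiling statistics (illustrative helper)."""
    non_null = [v for v in values if v is not None]
    # Abstract each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in a list of dicts."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # The dependency is violated if the same lhs value maps to two rhs values.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical toy data, invented for this example.
zip_codes = ["10115", "10117", None, "10115", "D-101"]
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "80331", "city": "Munich"},
]

print(profile_column(zip_codes))
print(holds_fd(rows, ["zip"], ["city"]))
```

    Checking a single candidate dependency is cheap; the hard part of profiling is the search over all column combinations, which this sketch does not attempt.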

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Social media analytics and the role of twitter in the 2014 South Africa general election: a case study

    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science, 2018. Social network sites such as Twitter have created vibrant and diverse communities in which users express their opinions and views on a variety of topics, such as politics. Extensive research has been conducted in countries such as Ireland, Germany and the United States, in which text mining techniques have been used to obtain information from politically oriented tweets. The purpose of this research was to determine whether text mining techniques can be used to uncover meaningful information from a corpus of political tweets collected during the 2014 South African General Election. The Twitter Application Programming Interface was used to collect tweets related to the three major political parties in South Africa, namely the African National Congress (ANC), the Democratic Alliance (DA) and the Economic Freedom Fighters (EFF). The text mining techniques used in this research are sentiment analysis, clustering, association rule mining and word cloud analysis. In addition, a correlation analysis was performed to determine whether a relationship exists between the total number of tweets mentioning a political party and the total number of votes obtained by that party. The VADER (Valence Aware Dictionary for sEntiment Reasoning) sentiment classifier was used to determine the public's sentiment towards the three main political parties. This revealed an overwhelmingly neutral sentiment of the public towards the ANC, DA and EFF. The result produced by the VADER sentiment classifier was significantly better than any of the baselines in this research. The K-Means clustering algorithm was used to successfully cluster the corpus of political tweets into political-party clusters. Clusters containing tweets relating to the ANC and EFF were formed. However, tweets relating to the DA were scattered across multiple clusters. A fairly strong relationship was discovered between the number of positive tweets that mention the ANC and the number of votes the ANC received in the election. Due to the lack of data, no conclusions could be drawn for the DA or the EFF. The Apriori algorithm uncovered numerous association rules, some of which were found to be interesting. The results also demonstrated the usefulness of word cloud analysis in providing easy-to-understand information from the tweet corpus used in this study. This research has highlighted the many ways in which text mining techniques can be used to obtain meaningful information from a corpus of political tweets. This case study can be seen as a contribution to a research effort that seeks to unlock the information contained in textual data from social network sites.
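
    To make the association rule mining step concrete, here is a minimal sketch of the Apriori idea (level-wise candidate generation with subset pruning, followed by rule extraction by confidence). It is not the dissertation's code: the function names, the thresholds, and the tokenized toy "tweets" are invented for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {s: support(s) for s in level}
        current = {s: sup for s, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, sup, confidence))
    return found


# Hypothetical tokenized tweets, invented for this example.
tweets = [
    {"party", "vote", "election"},
    {"party", "vote"},
    {"rally", "vote", "election"},
    {"news", "march"},
    {"party", "election", "vote"},
]

frequent = apriori(tweets, min_support=0.4)
for lhs, rhs, sup, conf in rules(frequent, min_confidence=0.9):
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

    Which of the resulting rules count as interesting still depends on thresholds and domain judgement, which matches the abstract's remark that only some of the rules were found to be interesting.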

    Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors

    Web 2.0 (social media) provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where citizens generate content to share information and engage in discussions. Such a citizen sensor community (CSC) has stated or implied goals that are helpful in the work of formal organizations, such as an emergency management unit, for prioritizing their response needs. This research addresses questions related to the design of a cooperative system of organizations and citizens in a CSC. Prior research by social scientists in limited offline and online environments has provided a foundation for research on cooperative behavior challenges, including 'articulation' and 'awareness', but a Web 2.0 supported CSC offers new challenges as well as opportunities. A CSC presents information overload for the organizational actors, especially in finding reliable information providers (for awareness) and in finding actionable information in the data generated by citizens (for articulation). We also note three data-level challenges: ambiguity in interpreting unconstrained natural language text, sparsity of user behaviors, and diversity of user demographics. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues. I present a novel web information-processing framework, called the Identify-Match-Engage (IME) framework. IME allows operationalizing computation in the design problems of awareness and articulation of the cooperative system between citizens and organizations, by addressing the data problems of group engagement modeling and intent mining. The IME framework includes: (a) identification of cooperation-assistive intent (seeking-offering) from short, unstructured messages using a classification model with declarative, social and contrast pattern knowledge; (b) facilitation of coordination modeling using bipartite matching of complementary intent (seeking-offering); and (c) identification of user groups to prioritize for engagement by defining a content-driven measure of 'group discussion divergence'. The use of prior knowledge and the interplay of features of users, content, and network structures efficiently captures context for computing cooperation-assistive behavior (intent and engagement) from unstructured social data in online socio-technical systems. Our evaluation of a use case in the crisis response domain shows improved performance for both intent classification and group engagement prioritization. Real-world applications of this work include use of the engagement interface tool during various recent crises, including the 2014 Jammu and Kashmir floods, and intent classification as a service integrated by the crisis mapping pioneer Ushahidi's CrisisNET project for broader impact.
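
    The bipartite matching of complementary intent (seeking-offering) can be sketched with a standard augmenting-path algorithm. This is only an illustration under assumptions: the abstract does not say which matching algorithm or compatibility criterion the IME framework uses, and the message identifiers, resource labels, and function names below are hypothetical.

```python
def compatible_offers(seeking, offering):
    """Build seeker -> [offers] edges where the offer covers the requested resource."""
    return {
        seeker: [offer for offer, resources in offering.items() if need in resources]
        for seeker, need in seeking.items()
    }


def max_bipartite_matching(edges):
    """Maximum matching via augmenting paths; returns {offer: seeker}."""
    match = {}

    def try_assign(seeker, visited):
        for offer in edges.get(seeker, []):
            if offer in visited:
                continue
            visited.add(offer)
            # Take a free offer, or re-route the seeker currently holding it.
            if offer not in match or try_assign(match[offer], visited):
                match[offer] = seeker
                return True
        return False

    for seeker in edges:
        try_assign(seeker, set())
    return match


# Hypothetical messages already classified as seeking or offering.
seeking = {"s1": "water", "s2": "blankets"}
offering = {"o1": {"water", "blankets"}, "o2": {"water"}}

edges = compatible_offers(seeking, offering)
matching = max_bipartite_matching(edges)
print(matching)
```

    In this toy case a greedy assignment would give `o1` to `s1` and leave `s2` unmatched; the augmenting step moves `s1` to `o2` so that both requests are served.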

    Mining Twitter Sequences of Product Opinions with Multi-Word Aspect Terms

    Social media platforms have opened doors to users' opinions and perceptions. Text remains the most popular means of contact on social media, despite other means of communication (audio, video and images). Twitter is one such microblogging platform that allows people to express their thoughts within 280 characters per message. This freedom of expression has made it difficult to understand the polarity (positive, negative, or neutral) of tweets and posts. Given a corpus of microblog texts (e.g., "the new iPhone battery life is good, but camera quality is bad"), mining the aspects (e.g., battery life, camera quality) and opinions (e.g., good, bad) of these products is challenging due to the vast amount of data being generated. Aspect-Based Opinion Mining (ABOM) combines aspect extraction and opinion mining, allowing an enterprise to analyze the data in detail automatically, saving time and money. Systems such as Hate Crime Twitter Sentiment (HCTS) and Microblog Aspect Miner (MAM) have recently been proposed to perform ABOM on Twitter. These systems generally follow a four-step approach: obtaining microblog posts, identifying frequent nouns (candidate aspects), pruning the candidate aspects, and determining opinion polarity. However, they differ in how well they prune their candidate features. HCTS uses Apriori-based association rule mining to find the important aspects (single- and multi-word) of a given product. However, the Apriori-based system generates many candidate sequences, which leads to redundant candidate aspects, and HCTS also fails to summarize the category of the aspects (camera? battery?). MAM follows an approach similar to that of HCTS for finding the relevant aspects, but it further clusters the frequent nouns (aspects) to obtain the relevant aspects. However, it does not identify multi-word aspects or the aspect category of a product. This thesis proposes a system called Microblog Aspect Sequence Miner (MASM) as an extension of MAM that replaces the Apriori algorithm with a modified frequent sequential pattern mining algorithm. The system uses the power of sequential pattern mining for aspect extraction in ABOM. The sentiments of the tweets are unknown, so we build our approach in an unsupervised manner. The input posts are first classified to separate tweets that contain an opinion (subjective) from those that do not (objective). Then we extract the part-of-speech tags for the explicit aspects to identify the frequent nouns. The novel frequent pattern mining framework (CM-SPAM) is applied to segment the single- and multi-word aspects, which generates fewer sequences than previous approaches. This prior knowledge helps us to operate a topic modeling framework (Latent Dirichlet Allocation) to determine a summary of the most common aspects (aspect category) and their sentiments for a product. The findings demonstrate that the MASM model performs promisingly in finding relevant aspects, with a reduction in average vector size (the cost of candidate/aspect generation) compared with MAM and HCTS on the Sanders Twitter corpus dataset. Experimental results with the evaluation metrics of execution time, precision, recall, and F-measure indicate that our approach has higher recall and precision than the existing systems.
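
    The idea of finding single- and multi-word aspects as frequent sequences of nouns can be shown with a deliberately simplified sketch. It counts contiguous noun n-grams and keeps those that occur in enough posts; it is a stand-in for illustration, not CM-SPAM and not the thesis's modified algorithm. The function name, the support threshold, and the noun sequences are assumptions.

```python
from collections import Counter


def frequent_sequences(sequences, min_support, max_len=2):
    """Return contiguous patterns (as tuples) found in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        # Count each pattern at most once per post (sequence-level support).
        counts.update(seen)
    return {pattern: c for pattern, c in counts.items() if c >= min_support}


# Hypothetical noun sequences, as if extracted from POS-tagged subjective tweets.
noun_sequences = [
    ["battery", "life"],
    ["battery", "life", "camera"],
    ["camera", "quality"],
    ["camera", "quality", "battery"],
]

patterns = frequent_sequences(noun_sequences, min_support=2)
multi_word_aspects = sorted(p for p in patterns if len(p) > 1)
print(multi_word_aspects)
```

    Because order is preserved, "battery life" is kept as an aspect while the accidental adjacency "life camera" falls below the threshold, which is the kind of distinction an unordered itemset approach cannot make.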

    Learning lost temporal fuzzy association rules

    Fuzzy association rule mining discovers patterns in transactions, such as shopping baskets in a supermarket or Web page accesses by a visitor to a Web site. Temporal patterns can be present in fuzzy association rules because the underlying process generating the data can be dynamic. However, existing solutions may not discover all interesting patterns because of a previously unrecognised problem that is revealed in this thesis: the contextual meaning of fuzzy association rules changes because of the dynamic nature of the data, so a static fuzzy representation and a traditional search method are inadequate. The Genetic Iterative Temporal Fuzzy Association Rule Mining (GITFARM) framework solves the problem by utilising flexible fuzzy representations from a fuzzy rule-based system (FRBS). The combined temporal, fuzzy and itemset space is searched simultaneously with a genetic algorithm (GA) to overcome the problem. The framework transforms the dataset into a graph so that it can be searched efficiently. The choice of model in the fuzzy representation provides a trade-off between an approximate and a descriptive model. A method for verifying the solution to the hypothesised problem is presented. The proposed GA-based solution was compared with a traditional approach that uses an exhaustive search method, and it was shown that the GA-based solution discovered rules that the traditional approach did not. This shows that simultaneously searching for rules and membership functions with a GA is a suitable solution for mining temporal fuzzy association rules. In practice, more knowledge can therefore be discovered for making well-informed decisions that would otherwise be lost with a traditional approach. EPSRC DT
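
    Why the choice of membership function matters can be seen in a small sketch of fuzzy support. It assumes triangular membership functions and the minimum t-norm, which are common textbook choices rather than details confirmed by the abstract; the function names, the parameters, and the basket quantities are hypothetical.

```python
def triangular(a, b, c):
    """Return a triangular membership function: 0 at a and c, 1 at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def fuzzy_support(transactions, fuzzy_items):
    """Average, over transactions, of the minimum membership across the fuzzy items."""
    total = 0.0
    for t in transactions:
        total += min(mu(t.get(attr, 0.0)) for attr, mu in fuzzy_items)
    return total / len(transactions)


# Hypothetical baskets with purchased quantities.
baskets = [
    {"bread": 2, "milk": 4},
    {"bread": 4, "milk": 4},
    {"bread": 0, "milk": 2},
]

medium_wide = triangular(0, 4, 8)    # one possible meaning of "medium"
medium_narrow = triangular(0, 2, 4)  # a shifted meaning of "medium"

wide = fuzzy_support(baskets, [("bread", medium_wide), ("milk", medium_wide)])
narrow = fuzzy_support(baskets, [("bread", medium_narrow), ("milk", medium_narrow)])
print(wide, narrow)
```

    The same itemset, "medium bread and medium milk", is well supported under one definition of "medium" and unsupported under the other. A fixed membership function can therefore miss rules when the meaning of a label drifts over time, which is the motivation for searching rules and membership functions together.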

    Big Data mining and machine learning techniques applied to real world scenarios

    Data mining techniques allow the extraction of valuable information from heterogeneous and possibly very large data sources, which can be either structured or unstructured. Unstructured data, such as text files, social media and mobile data, are far more abundant than structured data and grow at a higher rate. Their high volume and the inherent ambiguity of natural language make unstructured data very hard to process and analyze. Appropriate text representations are therefore required in order to capture word semantics as well as to preserve statistical information, e.g. word counts. In Big Data scenarios, scalability is also a primary requirement. Data mining and machine learning approaches should take advantage of large-scale data, exploiting abundant information and avoiding the curse of dimensionality. The goal of this thesis is to enhance text understanding in the analysis of big data sets, introducing novel techniques that can be employed to solve real-world problems. The presented Markov methods temporarily achieved the state of the art on well-known Amazon reviews corpora for cross-domain sentiment analysis, before being outperformed by deep approaches in the analysis of large data sets. A noise detection method for the identification of relevant tweets leads to 88.9% accuracy in the daily prediction of the Dow Jones Industrial Average, which is the best result in the literature based on social networks. Dimensionality reduction approaches are used in combination with LinkedIn users' skills to perform job recommendation. A framework based on deep learning and a Markov Decision Process is designed for modeling job transitions and recommending pathways towards a given career goal. Finally, parallel primitives for vendor-agnostic implementation of Big Data mining algorithms are introduced to foster multi-platform deployment, code reuse and optimization.

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II