
    An ontology enhanced parallel SVM for scalable spam filter training

    This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright © 2013 Elsevier B.V.
    Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.
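The map/reduce split the abstract describes can be sketched in miniature. This is a toy illustration, not the paper's method: the local learners are perceptron-style rules rather than SVMs, and the reduce step averages weights rather than merging support vectors; all names and data are invented.

```python
# Toy MapReduce-style distributed training: each "mapper" trains a simple
# linear classifier on its partition of the data, and the "reducer" merges
# the local models by averaging their weights.
import random

def train_local(chunk, epochs=20, lr=0.1):
    """Map step: perceptron-style updates on one data partition."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in chunk:
            pred = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else -1
            if pred != y:  # misclassified: nudge the hyperplane
                w = [w[0] + lr*y*x[0], w[1] + lr*y*x[1]]
                b += lr*y
    return w, b

def merge(models):
    """Reduce step: average the locally trained models."""
    n = len(models)
    w = [sum(m[0][i] for m in models) / n for i in range(2)]
    b = sum(m[1] for m in models) / n
    return w, b

random.seed(0)
# Linearly separable toy data: positive quadrant vs. negative quadrant.
data = [((random.uniform(0, 1), random.uniform(0, 1)), 1) for _ in range(40)] + \
       [((random.uniform(-1, 0), random.uniform(-1, 0)), -1) for _ in range(40)]
random.shuffle(data)
chunks = [data[i::4] for i in range(4)]            # distribute over 4 "nodes"
w, b = merge([train_local(c) for c in chunks])
acc = sum((1 if w[0]*x[0] + w[1]*x[1] + b > 0 else -1) == y
          for x, y in data) / len(data)
```

Distributing the training this way trades a single expensive optimization for several small ones, which is where the reported speedup comes from; the paper's ontology augmentation then compensates for the accuracy each local model loses by seeing only a subset.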

    Privacy preserving association rule mining using attribute-identity mapping

    Association rule mining uncovers hidden yet important patterns in data. Discovery of these patterns helps data owners make the right decisions to enhance efficiency, increase profit and reduce loss. However, there are privacy concerns, especially when the data owner is not the miner or when many parties are involved. This research studied privacy preserving association rule mining (PPARM) of horizontally partitioned and outsourced data. Existing research works in the area concentrated mainly on the privacy issue and paid very little attention to the data quality issue. Meanwhile, the higher the data quality, the more accurate and reliable the association rules will be. Consequently, this research proposed Attribute-Identity Mapping (AIM) as a PPARM technique to address the data quality issue. Given a dataset, AIM identifies the set of attributes and the values each attribute takes. It then assigns a unique identity to each attribute and each of its corresponding values, and generates a sanitized dataset by replacing every attribute and value with its corresponding identity. For privacy preservation purposes, the sanitization process is carried out by the data owners. They then send the sanitized data, which is made up of only identities, to the data miner. When any or all of the data owners need ARM results from the aggregate data, they send a query to the data miner. The query comprises attributes (in the form of identities), minSup and minConf thresholds, and the number of rules they want. Results obtained on the Census Income dataset show that the PPARM technique maintains 100% data quality without compromising privacy.
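The sanitization step described above can be sketched as follows. This is a minimal illustration of the idea, assuming a simple sequential identity scheme; the record values are invented and the identity format ("A0", "V0", ...) is not taken from the paper.

```python
# Sketch of Attribute-Identity Mapping: every attribute and every attribute
# value is replaced by an opaque identity before the data leaves the owner,
# so the miner sees only identities.

def build_maps(records):
    """Assign a fresh identity to each attribute and each (attribute, value) pair."""
    attr_map, value_map = {}, {}
    for rec in records:
        for attr, val in rec.items():
            attr_map.setdefault(attr, f"A{len(attr_map)}")
            value_map.setdefault((attr, val), f"V{len(value_map)}")
    return attr_map, value_map

def sanitize(records, attr_map, value_map):
    """Replace every attribute and value with its identity."""
    return [{attr_map[a]: value_map[(a, v)] for a, v in rec.items()}
            for rec in records]

records = [
    {"education": "Bachelors", "income": ">50K"},
    {"education": "HS-grad",   "income": "<=50K"},
    {"education": "Bachelors", "income": "<=50K"},
]
attr_map, value_map = build_maps(records)
sanitized = sanitize(records, attr_map, value_map)
```

Because the mapping is one-to-one, itemset counts on the sanitized data equal those on the original data, which is why support and confidence (and hence data quality for mining) are preserved exactly.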

    DPWeka: Achieving Differential Privacy in WEKA

    Organizations belonging to the government, commercial, and non-profit sectors collect and store large amounts of sensitive data, which include medical, financial, and personal information. They use data mining methods to formulate business strategies that yield high long-term and short-term financial benefits. While analyzing such data, the private information of the individuals present in the data must be protected for moral and legal reasons. Current practices such as redacting sensitive attributes, releasing only aggregate values, and query auditing do not provide sufficient protection against an adversary armed with auxiliary information. Differential privacy is a privacy protection framework that provides mathematical guarantees against adversarial attacks even in the presence of such background information. Existing platforms for differential privacy employ specific mechanisms for limited applications of data mining. Additionally, widely used data mining tools do not contain differentially private data mining algorithms. As a result, awareness of differentially private methods for analyzing sensitive data is currently limited outside the research community. This thesis examines various mechanisms to realize differential privacy in practice and investigates methods to integrate them with a popular machine learning toolkit, WEKA. We present DPWeka, a package that provides differential privacy capabilities to WEKA for practical data mining. DPWeka includes a suite of differential privacy preserving algorithms which support a variety of data mining tasks including attribute selection and regression analysis. It has provisions for users to control privacy and model parameters, such as the privacy mechanism, privacy budget, and other algorithm-specific variables. We evaluate the private algorithms on real-world datasets, such as genetic data and census data, to demonstrate the practical applicability of DPWeka.
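The basic building block behind such tools is a noise-adding mechanism calibrated to the privacy budget. Below is a minimal sketch of the classic Laplace mechanism for a counting query; the function names and the dataset are illustrative and are not DPWeka's actual API.

```python
# Laplace mechanism sketch: a counting query has sensitivity 1 (one person
# changes the count by at most 1), so adding Laplace noise with scale
# 1/epsilon yields an epsilon-differentially-private answer.
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon, rng):
    """Epsilon-DP count of values satisfying the predicate."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(7)
incomes = [32, 58, 47, 71, 29, 66, 90, 41, 55, 62]   # made-up records
noisy = private_count(incomes, lambda v: v > 50, epsilon=1.0, rng=rng)
```

A smaller epsilon (tighter privacy budget) means larger noise scale and a less accurate answer, which is exactly the privacy/utility trade-off the user-controlled parameters in a tool like DPWeka expose.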

    Hoodsquare: Modeling and Recommending Neighborhoods in Location-based Social Networks

    Information garnered from activity on location-based social networks can be harnessed to characterize urban spaces and organize them into neighborhoods. In this work, we adopt a data-driven approach to the identification and modeling of urban neighborhoods using location-based social networks. We represent geographic points in the city using spatio-temporal information about Foursquare user check-ins and semantic information about places, with the goal of developing features to input into a novel neighborhood detection algorithm. The algorithm first employs a similarity metric that assesses the homogeneity of a geographic area, and then with a simple mechanism of geographic navigation, it detects the boundaries of a city's neighborhoods. The models and algorithms devised are subsequently integrated into a publicly available, map-based tool named Hoodsquare that allows users to explore activities and neighborhoods in cities around the world. Finally, we evaluate Hoodsquare in the context of a recommendation application where user profiles are matched to urban neighborhoods. By comparing with a number of baselines, we demonstrate how Hoodsquare can be used to accurately predict the home neighborhood of Twitter users. We also show that we are able to suggest neighborhoods geographically constrained in size, a desirable property in mobile recommendation scenarios for which geographical precision is key.
    Comment: ASE/IEEE SocialCom 201
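The two ingredients the abstract names, a homogeneity metric and a geographic navigation step, can be sketched with a toy grid. The grid cells, check-in categories, similarity threshold, and cosine metric here are all invented for illustration; the paper's actual features and metric differ.

```python
# Toy neighborhood growth: BFS outward from a seed cell, absorbing adjacent
# cells whose check-in category histogram is similar to the seed's.
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two sparse category histograms."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def grow_neighborhood(grid, seed, threshold=0.8):
    """Navigate from the seed, keeping 4-adjacent cells that stay homogeneous."""
    region, queue = {seed}, deque([seed])
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in grid and nxt not in region and \
               cosine(grid[seed], grid[nxt]) >= threshold:
                region.add(nxt)
                queue.append(nxt)
    return region

# Check-in category counts per grid cell (made up):
grid = {
    (0, 0): {"nightlife": 8, "food": 2}, (1, 0): {"nightlife": 7, "food": 3},
    (2, 0): {"office": 9, "food": 1},    (0, 1): {"nightlife": 9, "food": 1},
}
region = grow_neighborhood(grid, (0, 0))
```

The boundary of the detected neighborhood falls exactly where homogeneity drops below the threshold, here between the nightlife cells and the office cell.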

    A Multiple Classifier Approach to Improving Classification Accuracy Using Big Data Analytics Tool

    At the heart of analytics is data. Data analytics has become an indispensable part of intelligent decision making in the current digital scenario. Applications today generate a large amount of data. Alongside this data deluge, the data analytics field has seen the arrival of a large number of open source tools and software to expedite large-scale analytics. The data science community is robust, with numerous tools available for storing, processing and analysing data. This research paper makes use of KNIME, one of the popular tools for big data analytics, to perform an investigative study of the key classification algorithms in machine learning. The comparative study shows that classification accuracy can be enhanced by using a combination of the learning techniques, and proposes an ensemble technique evaluated on publicly available datasets.
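The ensemble idea behind the proposed technique can be shown in miniature. This sketch assumes simple majority voting over base classifiers; the threshold rules and labels below are invented for illustration (in KNIME this combination would be built from learner and voting nodes rather than written by hand).

```python
# Majority-vote ensemble: each base classifier votes, the most common
# label wins. Individually weak rules can combine into a stronger one.
from collections import Counter

def majority_vote(classifiers, x):
    """Combine base classifiers by majority vote (ties broken arbitrarily)."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three weak threshold rules on a single feature, each with a different bias:
clf_a = lambda x: "pos" if x > 0.3 else "neg"
clf_b = lambda x: "pos" if x > 0.5 else "neg"
clf_c = lambda x: "pos" if x > 0.7 else "neg"
ensemble = [clf_a, clf_b, clf_c]
```

With an odd number of threshold rules, the vote behaves like the median rule (here, the 0.5 threshold), which is one intuition for why combining learners can beat any single one.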

    Data Mining Techniques for Predicting Real Estate Trends

    A wide variety of businesses and government agencies support the U.S. real estate market. Examples include sales agents, national lenders, local credit unions, private mortgage and title insurers, and government sponsored entities (Freddie Mac and Fannie Mae), to name a few. The financial performance and overall success of these organizations depends in large part on the health of the overall real estate market. According to the National Association of Home Builders (NAHB), the construction of one single-family home of average size creates the equivalent of nearly 3 new jobs for a year (Greiner, 2015). The economic impact is significant, with residential construction and related activities contributing approximately 5 percent to overall gross domestic product. With these data points in mind, the ability to accurately predict housing trends has become an increasingly important function for organizations engaged in the real estate market. The government bailouts of Freddie Mac and Fannie Mae in July 2008, following the severe housing market collapse which began earlier that year, serve as an example of the risks associated with the housing market. The housing market collapse had left the two firms, which at the time owned or guaranteed about $5 trillion of home loans, in a dangerous and uncertain financial state (Olick, 2018). Countrywide Home Loans, Indy Mac, and Washington Mutual Bank are a few examples of mortgage banks that did not survive the housing market collapse and subsequent recession. In the wake of the financial crisis, businesses within the real estate market have recognized that predicting the direction of real estate is an essential business requirement. A business acquisition by Radian Group, the Philadelphia-based mortgage insurance company, illustrates the importance of predictive modeling for the mortgage industry. In January 2019, Radian Group acquired Five Bridges Advisors, a Maryland-based firm which develops data analytics and econometric predictive models leveraging artificial intelligence and machine learning techniques (Blumenthal, 2019).