
    Phishing Detection Using Natural Language Processing and Machine Learning

    Phishing emails are a primary mode of entry for attackers into an organization. A successful phishing attempt leads to unauthorized access to sensitive information and systems. However, automatically identifying phishing emails is often difficult, since many phishing emails have composite features, such as body text and metadata, that are nearly indistinguishable from those of valid emails. This paper presents a novel machine-learning-based framework, the DARTH framework, which characterizes and combines multiple models, one for each composite feature, to enable accurate identification of phishing emails. The framework analyses each composite feature independently, using a multi-faceted approach based on Natural Language Processing (NLP) and neural-network techniques, and combines the results of these analyses to classify emails as malicious or legitimate. Applying the framework to more than 150,000 emails, with training data from multiple sources including the authors’ emails and phishtank.com, yielded a precision (the ratio of correctly identified malicious observations to all observations predicted malicious) of 99.97%, an F-score of 99.98%, and correct identification of phishing emails 99.98% of the time. Combining multiple machine learning techniques in an ensemble across a range of composite features thus yields highly accurate identification of phishing emails.
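The metrics quoted above follow the standard definitions; as a minimal sketch (the confusion-matrix counts below are hypothetical, not from the paper), they can be computed as:

```python
# Precision: fraction of predicted-malicious emails that are truly malicious.
# Recall: fraction of truly malicious emails that were caught.
# F-score (F1): harmonic mean of precision and recall.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts roughly matching the reported figures:
print(round(precision(9998, 3), 4))   # ~0.9997
print(round(f_score(9998, 3, 1), 4))  # ~0.9998
```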

    Mahimahi: A Lightweight Toolkit for Reproducible Web Measurement

    This demo presents a measurement toolkit, Mahimahi, that records websites and replays them under emulated network conditions. Mahimahi is structured as a set of arbitrarily composable UNIX shells. It includes two shells to record and replay Web pages, RecordShell and ReplayShell, as well as two shells for network emulation, DelayShell and LinkShell. In addition, Mahimahi includes a corpus of recorded websites along with benchmark results and link traces (https://github.com/ravinet/sites). Mahimahi improves on prior record-and-replay frameworks in three ways. First, it preserves the multi-origin nature of Web pages, present in approximately 98% of the Alexa U.S. Top 500, when replaying. Second, Mahimahi isolates its own network traffic, allowing multiple instances to run concurrently with no impact on the host machine and collected measurements. Finally, Mahimahi is not inherently tied to browsers and can be used to evaluate many different applications. A demo of Mahimahi recording and replaying a Web page over an emulated link can be found at http://youtu.be/vytwDKBA-8s. The source code and instructions to use Mahimahi are available at http://mahimahi.mit.edu/.
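Composability here means each shell simply runs the next one as its child command. As a rough sketch (the binary names follow the Mahimahi documentation, but the directory and trace paths below are hypothetical; check the docs for exact argument order), the nested invocation can be assembled programmatically:

```python
# Sketch: build a nested Mahimahi command line where ReplayShell
# (mm-webreplay) wraps DelayShell (mm-delay), which wraps LinkShell
# (mm-link). The resulting argv could be passed to subprocess.run.

def nest(*stages):
    """Flatten shell stages into one argv list, outermost shell first."""
    argv = []
    for stage in stages:
        argv.extend(stage)
    return argv

argv = nest(
    ["mm-webreplay", "recorded_site/"],    # hypothetical recorded-site dir
    ["mm-delay", "50"],                    # 50 ms one-way delay
    ["mm-link", "up.trace", "down.trace"], # hypothetical link traces
)
print(" ".join(argv))
```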

    Predictors of mortality among hospitalized COVID-19 patients and risk score formulation for prioritizing tertiary care—An experience from South India

    BACKGROUND: We retrospectively data-mined the case records of Reverse Transcription Polymerase Chain Reaction (RT-PCR)-confirmed COVID-19 patients admitted to a tertiary care centre to derive mortality predictors and formulate a risk score for prioritizing admission. METHODS AND FINDINGS: Data on clinical manifestations, comorbidities, vital signs, and basic laboratory investigations, collected as part of routine medical management at admission to a COVID-19 tertiary care centre in Chengalpattu, South India between May and November 2020, were retrospectively analysed to ascertain predictors of mortality in univariate analysis using the relative difference in their distribution between ‘survivors’ and ‘non-survivors’. The regression coefficients of the factors remaining significant in multivariable logistic regression were used to formulate the risk score, which was validated in 1000 bootstrap datasets. Among 746 hospitalised COVID-19 patients [487 ‘survivors’ and 259 ‘non-survivors’ (deaths)], there was a slight male predilection [62.5% (466/746)], with higher mortality observed in the 40–70 years age group [59.1% (441/746)] and the highest among diabetic patients with elevated urea levels [65.4% (68/104)]. The adjusted odds ratios [OR (95% CI)] of the factors significant in the multivariable logistic regression were: low SaO2, 3.01 (1.61–5.83); age ≥50 years, 2.52 (1.45–4.43); pulse rate ≥100/min, 2.02 (1.19–3.47); and coexisting diabetes mellitus, 1.73 (1.02–2.95), with hypertension and gender not retaining significance.
    The individual risk scores (low SaO2: 11; age ≥50 years: 9; pulse rate ≥100/min: 7; coexisting diabetes mellitus: 6), collectively acronymed the ‘OUR-ARDs score’, showed that a sum of scores ≥25 predicted mortality with a sensitivity of 90%, a specificity of 64%, and an AUC of 0.85.
    CONCLUSIONS: The ‘OUR-ARDs’ risk score, derived from easily assessable factors predicting mortality, offered a tangible solution for prioritizing admission to a COVID-19 tertiary care centre, enhancing patient care without unduly straining the health system.
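A minimal sketch of the score as described in the abstract (factor weights 11, 9, 7, and 6, with mortality predicted at a total ≥25; the exact SaO2 cut-off is not legible here, so low oxygen saturation is taken as a boolean input):

```python
# OUR-ARDs score: sum the weights of the factors present in a patient,
# then compare against the mortality threshold of 25.

def our_ards_score(low_sao2: bool, age: int, pulse: int, diabetic: bool) -> int:
    score = 0
    if low_sao2:       # oxygen saturation below the paper's cut-off
        score += 11
    if age >= 50:
        score += 9
    if pulse >= 100:   # pulse rate per minute
        score += 7
    if diabetic:       # coexisting diabetes mellitus
        score += 6
    return score

def predicts_mortality(score: int) -> bool:
    return score >= 25

# e.g. a hypothetical 62-year-old diabetic with low SaO2 and pulse 88:
s = our_ards_score(low_sao2=True, age=62, pulse=88, diabetic=True)
print(s, predicts_mortality(s))  # 26 True
```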

    WatchTower: Fast, Secure Mobile Page Loads Using Remote Dependency Resolution

    Remote dependency resolution (RDR) is a proxy-driven scheme for reducing mobile page load times; a proxy loads a requested page using a local browser, fetching the page’s resources over fast proxy-origin links instead of a client’s slow last-mile links. In this paper, we describe two fundamental challenges to efficient RDR proxying: the increasing popularity of encrypted HTTPS content, and the fact that, due to time-dependent network conditions and page properties, RDR proxying can actually increase load times. We solve these problems by introducing a new, secure proxying scheme for HTTPS traffic, and by implementing WatchTower, a selective proxying system that uses dynamic models of network conditions and page structures to enable RDR only when it is predicted to help. WatchTower loads pages 21.2%–41.3% faster than state-of-the-art proxies and server push systems, while preserving end-to-end HTTPS security. (NSF Grant CNS-1407470)
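The selective part of the design reduces to a prediction: proxy only when the modelled RDR load time beats the modelled direct load time. A toy sketch under stated assumptions (these linear cost models and all parameter names are illustrative stand-ins, not WatchTower's actual models):

```python
# Toy selective-proxying decision: compare a predicted direct load
# against a predicted RDR load and enable RDR only when it wins.

def predict_direct_ms(rtt_ms: float, num_origins: int) -> float:
    """Toy model: each origin costs roughly two client round trips."""
    return 2.0 * rtt_ms * num_origins

def predict_rdr_ms(rtt_ms: float, proxy_rtt_ms: float, num_origins: int) -> float:
    """Toy model: one client round trip plus cheap proxy-origin fetches."""
    return 2.0 * rtt_ms + 2.0 * proxy_rtt_ms * num_origins

def use_rdr(rtt_ms: float, proxy_rtt_ms: float, num_origins: int) -> bool:
    return predict_rdr_ms(rtt_ms, proxy_rtt_ms, num_origins) < predict_direct_ms(rtt_ms, num_origins)

# Many origins over a slow last mile: RDR wins. A single origin: it doesn't.
print(use_rdr(100, 5, 10), use_rdr(100, 5, 1))
```

This mirrors why RDR can hurt: when the page has few origins or the client link is fast, the extra proxy hop outweighs the savings.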

    WiFi, LTE, or Both?

    Over the past two or three years, wireless cellular networks have become faster than before, most notably due to the deployment of LTE, HSPA+, and other similar networks. LTE throughputs can reach many megabits per second and can even rival WiFi throughputs in some locations. This paper addresses a fundamental question confronting transport- and application-layer protocol designers: which network should an application use? WiFi, LTE, or Multi-Path TCP (MPTCP) running over both? We compare LTE and WiFi for transfers of different sizes in both directions (i.e., uplink and downlink) using a crowd-sourced mobile application run by 750 users over 180 days in 16 different countries. We find that LTE outperforms WiFi 40% of the time, a higher fraction than one might expect at first sight. We measure flow-level MPTCP performance and compare it with the performance of TCP running over exclusively WiFi or LTE in 20 different locations across 7 cities in the United States. For short flows, we find that MPTCP performs worse than regular TCP running over the faster link; further, selecting the correct network for the primary subflow in MPTCP is critical to achieving good performance. For long flows, however, selecting the proper MPTCP congestion control algorithm is equally important. To complement our flow-level analysis, we analyze the traffic patterns of several mobile apps, finding that apps can be categorized as "short-flow dominated" or "long-flow dominated". We then record and replay these patterns over emulated WiFi and LTE links. We find that application performance has a similar dependence on the choice of networks as flow-level performance: an application dominated by short flows sees little gain from MPTCP, while an application with longer flows can benefit much more from MPTCP, provided the application picks the right network for the primary subflow and the right MPTCP congestion control. (National Science Foundation (U.S.) Grants 1407470 and 1161964)
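The flow-level finding suggests a simple selection rule: short flows take plain TCP over whichever network is faster, while long flows use MPTCP with the primary subflow on the faster network. A hedged sketch (the byte threshold and throughput inputs are illustrative assumptions, not values from the paper):

```python
# Toy transport chooser following the paper's flow-level finding.

SHORT_FLOW_BYTES = 100_000  # hypothetical short/long cut-off

def choose_transport(flow_bytes: int, wifi_mbps: float, lte_mbps: float) -> str:
    faster = "wifi" if wifi_mbps >= lte_mbps else "lte"
    if flow_bytes < SHORT_FLOW_BYTES:
        # Short flow: MPTCP overhead isn't worth it; use the faster link.
        return f"tcp over {faster}"
    # Long flow: MPTCP helps, with the primary subflow on the faster link.
    return f"mptcp, primary subflow on {faster}"

print(choose_transport(10_000, 50, 20))     # tcp over wifi
print(choose_transport(5_000_000, 20, 50))  # mptcp, primary subflow on lte
```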
