30 research outputs found

    Phishing Detection With Identity Keywords and Target Domain Name

    This thesis describes research carried out to address the problem of phishing detection and the weaknesses in existing anti-phishing methods. Phishing works by luring users to counterfeit websites, where highly confidential credentials are requested. To safeguard Internet users against phishing attacks, a hybrid anti-phishing method combining text-based, search engine-based and identity-based methods is proposed, in which the differences between the target and actual identities of a webpage are exploited for classification. The proposed method is divided into three phases. The first phase extracts identity keywords from the textual contents of the website, using a novel weighted URL token system based on the N-gram model. The second phase finds the target domain name by using a search engine, selecting the target domain name based on identity-relevant features. In the final phase, a 3-tier identity matching system exploits indirect identity relationships to decide the legitimacy of the query webpage. Experiments were conducted over 10,000 datasets, where a true positive rate of 99.68% and a true negative rate of 92.52% were achieved. Benchmarking results also suggest that the proposed method achieves overall accuracy comparable with three selected conventional methods. In summary, the key advantage of the proposed method is that it identifies phishing webpages accurately, which is highly desirable in anti-phishing applications.

    Hybrid phishing detection using joint visual and textual identity

    In recent years, phishing attacks have evolved considerably, causing existing adversarial features that were widely utilised for detecting phishing websites to become less discriminative. These developments have fuelled growing interest among security researchers in an anti-phishing strategy known as the identity-based detection technique. Identity-based detection techniques have consistently achieved high true positive rates in a rapidly changing phishing landscape, owing to their capitalisation on fundamental brand identity relations that are inherent in most legitimate webpages. However, existing identity-based techniques often suffer higher false positive rates due to complexities and challenges in establishing the webpage's brand identity. To close the existing performance gap, this paper proposes a new hybrid identity-based phishing detection technique that leverages webpage visual and textual identity. Extending earlier anti-phishing work based on the website logo as visual identity, our method incorporates novel image features that mimic human vision to enhance the logo detection accuracy. The proposed hybrid technique integrates the visual identity with a textual identity, namely, brand-specific keywords derived from the webpage content using textual analysis methods. We empirically demonstrated on multiple benchmark datasets that this joint visual-textual identity detection approach significantly improves phishing detection performance, with an overall accuracy of 98.6%. Benchmarking results against an existing technique showed comparable true positive rates and a reduction of up to 3.4% in false positive rates, thus affirming our objective of reducing the misclassification of legitimate webpages without sacrificing the phishing detection performance. The proposed hybrid identity-based technique is proven to be a significant and practical contribution that will enrich the anti-phishing community with improved defence strategies against rapidly evolving phishing schemes.

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research. Peer reviewed.
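
To illustrate one of the baseline methods named in the abstract above, the following is a minimal sketch of Term Frequency-Inverse Document Frequency ranking with cosine similarity. It is not the consortium's implementation: the whitespace tokeniser, the smoothed IDF formula, and the function names (`tfidf_vectors`, `cosine`, `rank_candidates`) are assumptions chosen for brevity, and the example texts are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenisation; real systems use richer preprocessing.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed IDF (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (cnt / len(tokens)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(seed, candidates):
    """Return candidate indices ordered from most to least similar to the seed."""
    vectors = tfidf_vectors([seed] + candidates)
    seed_vec = vectors[0]
    scores = [(cosine(seed_vec, v), i) for i, v in enumerate(vectors[1:])]
    # Sort by descending score, breaking ties by original position.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    seed = "somatic mutation calling in cancer genomes"
    candidates = [
        "asthma exacerbation burden in primary care",
        "whole genome somatic mutation calls across cancer types",
        "phishing webpage detection",
    ]
    print(rank_candidates(seed, candidates))
```

A benchmark such as the one described would then compare a ranking like this against the expert relevance annotations for each seed article.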

    Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

    Funder: NCI U24CA211006. Abstract: The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.

    Leveraging Ensemble Strategies in Identity Verification and Feature Optimisation for Phishing Website Detection

    The aim of this thesis is to enrich the ongoing efforts to protect Internet users against phishing attacks. Mainstream solutions and technical approaches for phishing detection suffer from inherent problems such as ineffectiveness against newly launched phishing webpages, misclassification of legitimate webpages, utilisation of irrelevant features, and susceptibility to intentional manipulation by adversaries. In this study, we explore whether ensemble strategies can be leveraged in website identity verification and feature optimisation to address the limitations of existing techniques. This study intends to provide a deeper understanding of the progressive state of phishing and identify potential directions where phishing detection measures should be concentrated. Through the proposal of an improved website logo extraction technique, we showed that the ensemble of visual and textual identities leads to a promising detection accuracy of 98.6%. The misclassification rate of legitimate webpages also improved by 3.4%, which is consistent with our aim of attaining robustness over legitimate webpages with varying properties that users routinely encounter. To facilitate the identification of essential features for phishing detection, we propose a novel ensemble feature selection framework, which achieved a competitive detection accuracy of 94.6% using only 20.8% of the original number of features. Based on experimental results, we also challenged the utilisation of certain conventional features that are often highly rated and falsely assumed to be effective. Lastly, we showed that the underlying phishing patterns at the webpage interconnection level can be exploited using ensemble strategies in a graph-theoretic approach, achieving up to 97.8% accuracy while demonstrating robustness and immutability against current and emerging phishing schemes.

    Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

    The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts. Peer reviewed.

    The burden of mild asthma: Clinical burden and healthcare resource utilisation in the NOVELTY study

    Background: Patients with mild asthma represent a substantial proportion of the population with asthma, yet there are limited data on their true burden of disease. We aimed to describe the clinical and healthcare resource utilisation (HCRU) burden of physician-assessed mild asthma. Methods: Patients with mild asthma were included from the NOVEL observational longiTudinal studY (NOVELTY; NCT02760329), a global, 3-year, real-world prospective study of patients with asthma and/or chronic obstructive pulmonary disease from community practice (specialised and primary care). Diagnosis and severity were based on physician discretion. Clinical burden included physician-reported exacerbations and patient-reported measures. HCRU included inpatient and outpatient visits. Results: Overall, 2004 patients with mild asthma were included; 22.8% experienced ≥1 exacerbation in the previous 12 months, of whom 72.3% experienced ≥1 severe exacerbation. Of 625 exacerbations reported, 48.0% lasted >1 week, 27.7% were preceded by symptomatic worsening lasting >3 days, and 50.1% required oral corticosteroid treatment. Health status was moderately impacted (St George's Respiratory Questionnaire score: 23.5 [standard deviation ± 17.9]). At baseline, 29.7% of patients had asthma symptoms that were not well controlled or very poorly controlled (Asthma Control Test score <20), increasing to 55.6% for those with ≥2 exacerbations in the previous year. In terms of HCRU, at least one unscheduled ambulatory visit for exacerbations was required by 9.5% of patients, including 9.2% requiring ≥1 emergency department visit and 1.1% requiring ≥1 hospital admission. Conclusions: In this global sample representing community practice, a significant proportion of patients with physician-assessed mild asthma had considerable clinical burden and HCRU.

    Treatable traits in the NOVELTY study

    Corrigendum: Volume 27, Issue 12, Respirology, page 1095. First published online: November 6, 2022. DOI: 10.1111/resp.14406. Asthma and chronic obstructive pulmonary disease (COPD) are two prevalent and complex diseases that require personalized management. Although a strategy based on treatable traits (TTs) has been proposed, the prevalence and relationship of TTs to the diagnostic label and disease severity established by the attending physician in a real-world setting are unknown. We assessed how the presence/absence of specific TTs relates to the diagnosis and severity of 'asthma', 'COPD' or 'asthma + COPD'.