1,384 research outputs found

    Global disease monitoring and forecasting with Wikipedia

    Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data such as social media and search queries are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r^2 up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art. Comment: 27 pages; 4 figures; 4 tables. Version 2: Cite McIver & Brownstein and adjust novelty claims accordingly; revise title; various revisions for clarity.
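The abstract above describes linear models that relate Wikipedia access counts to disease incidence and that retain value when forecasting some days ahead. A minimal sketch of that idea follows, assuming a single predictor and ordinary least squares; it is not the authors' implementation. The function names `fit_line` and `lagged_fit` are hypothetical and the data are synthetic.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot


def lagged_fit(views, cases, lag):
    """Regress cases[t + lag] on views[t]; lag > 0 means a forecast."""
    if lag < 0 or lag >= len(views):
        raise ValueError("lag must be in [0, len(views))")
    x = views[: len(views) - lag]  # article views at time t
    y = cases[lag:]                # case counts at time t + lag
    return fit_line(x, y)


# Synthetic example (assumption): cases follow views two steps later.
views = [10, 12, 15, 21, 30, 26, 18, 13, 11, 10]
cases = [0, 0] + [3 * v + 1 for v in views[:-2]]

a, b, r2 = lagged_fit(views, cases, lag=2)
print(f"intercept={a:.2f} slope={b:.2f} r2={r2:.2f}")
```

Comparing r^2 across several values of `lag` gives a rough sense of how far ahead the signal remains informative. The paper itself evaluates multiple articles per model and real surveillance data.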

    Studying Ransomware Attacks Using Web Search Logs

    Cyber attacks are becoming increasingly prevalent and are causing significant damage to individuals, businesses and even countries. In particular, ransomware attacks have grown significantly over the last decade. We conduct the first study on mining insights about ransomware attacks by analyzing query logs from the Bing web search engine. We first extract ransomware-related queries and then build a machine learning model to identify queries where users are seeking support for ransomware attacks. We show that user search behavior and characteristics are correlated with ransomware attacks. We also analyse trends in the temporal and geographical space and validate our findings against publicly available information. Lastly, we do a case study on 'Nemty', a popular ransomware, to show that it is possible to derive accurate insights about cyber attacks by query log analysis. Comment: To appear in the proceedings of SIGIR 202

    A Study on the Efficient Estimation of the Payment Intention in the Mail Order Industry

    This paper investigates the prediction of customer payment intention in the mail order industry. As the B2C market grows in volume, the number of fraudulent transactions increases. The primary indicators for detection are the shipping address, the recipient name, and the payment method. This information is commonly used for prediction in the Japanese mail order industry. Conventional fraud detection has so far depended on the working experience of human staff. As the number of transactions grows, fraud detection becomes difficult, and the mail order industry needs a new detection method. The results of Google Flu Trends show that accurate prediction requires heuristic knowledge. Against this background, we examine transaction data with customer attribute information gathered from a mail order company in Japan and characterize customers with a machine learning method. From the results of this intensive research, potential fraudulent transactions are identified. The research also revealed that deliberate customers and careless customers can be classified with machine learning. This result can be used for customer screening at the time an order is received.

    Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance

    A variety of obstacles, including bureaucracy and lack of resources, delay the detection and reporting of dengue in many countries where the disease is a major public health threat. Surveillance efforts have turned to modern data sources such as Internet usage data. People often seek health-related information online, and it has been found that the frequency of, for example, influenza-related web searches as a whole rises as the number of people sick with influenza rises. Tools have been developed to help track influenza epidemics by finding patterns in certain web search activity. However, few have evaluated whether this approach would also be effective for other diseases, especially those that affect many people, that have severe consequences, or for which there is no vaccine. In this study, we found that aggregated, anonymized Google search query data were also capable of tracking dengue activity in Bolivia, Brazil, India, Indonesia and Singapore. Whereas traditional dengue data from official sources are often not available until after a long delay, web search query data are available for analysis within a day. Therefore, because they could potentially provide earlier warnings, these data represent a valuable complement to traditional dengue surveillance.

    Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance

    Health surveillance systems based on online user-generated content often rely on the identification of textual markers that are related to a target disease. Given the high volume of available data, these systems benefit from an automatic feature selection process. This is accomplished either by applying statistical learning techniques, which do not consider the semantic relationship between the selected features and the inference task, or by developing labour-intensive text classifiers. In this paper, we use neural word embeddings, trained on social media content from Twitter, to determine, in an unsupervised manner, how strongly textual features are semantically linked to an underlying health concept. We then refine conventional feature selection methods by a priori operating on textual variables that are sufficiently close to a target concept. Our experiments focus on the supervised learning problem of estimating influenza-like illness rates from Google search queries. A "flu infection" concept is formulated and used to reduce spurious and potentially confounding features that were selected by previously applied approaches. In this way, we also address forms of scepticism regarding the appropriateness of the feature space, alleviating potential cases of overfitting. Ultimately, the proposed hybrid feature selection method creates a more reliable model that, according to our empirical analysis, improves the inference performance (Mean Absolute Error) of linear and nonlinear regressors by 12% and 28.7%, respectively
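The abstract above describes keeping only those textual features that lie sufficiently close to a target health concept in embedding space before applying conventional feature selection. A minimal sketch of such a filter follows, assuming plain cosine similarity to an averaged concept vector, which is a simplification and not necessarily the similarity measure used in the paper. The names `cosine`, `concept_vector` and `select_features`, the three-dimensional vectors, and the threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_vector(embeddings, terms):
    """Average the embeddings of the terms that define the concept."""
    dims = len(embeddings[terms[0]])
    return [sum(embeddings[t][i] for t in terms) / len(terms)
            for i in range(dims)]


def select_features(embeddings, concept, candidates, threshold):
    """Keep candidates whose embedding is close enough to the concept."""
    return [q for q in candidates
            if cosine(embeddings[q], concept) >= threshold]


# Toy embeddings (assumption): real ones would be trained on a large corpus.
toy = {
    "flu": [0.9, 0.1, 0.0],
    "fever": [0.8, 0.2, 0.1],
    "flu symptoms": [0.85, 0.15, 0.05],
    "cough remedy": [0.7, 0.3, 0.1],
    "football scores": [0.0, 0.1, 0.9],
}

concept = concept_vector(toy, ["flu", "fever"])
queries = ["flu symptoms", "football scores", "cough remedy"]
kept = select_features(toy, concept, queries, threshold=0.8)
print(kept)
```

The retained queries would then be passed to a regressor that estimates influenza-like illness rates; that step is omitted here.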

    "Does Vinegar Kill Coronavirus?" - Using Search Log Analysis to Estimate the Extent of COVID-19-Related Misinformation Searching Behaviour in the United States

    Health experts and government authorities' actions to combat the coronavirus outbreak are strongly compromised by the misinformation infodemic that evolved in parallel to the COVID-19 pandemic. When people get misled by unscientific and unsubstantiated claims regarding the origin or cures for COVID-19, public health response efforts get undermined and people might be less likely to comply with official guidance and thus spread the virus or even harm themselves. To prevent this from happening, a first step is to reveal the prevalence of misinformation ideas in the public. In this study, we use search log analysis to investigate the extent and characteristics of misinformation seeking behaviour in the US using the Bing Search Data-set for Coronavirus Intent. We train a machine learning model to distinguish between regular and misinformation queries and find that only around 1% of queries are related to misinformation myths or conspiracy theories. The query term "qanon", which connects the conspiracy theory to many different origin myths of COVID-19, is the most frequent and steadily increasing misinformation-related query in the data-set.