Unveiling the frontiers of deep learning: innovations shaping diverse domains
Deep learning (DL) enables the development of computer models that are
capable of learning, visualizing, optimizing, refining, and predicting data. In
recent years, DL has been applied in a range of fields, including audio-visual
data processing, agriculture, transportation prediction, natural language,
biomedicine, disaster management, bioinformatics, drug design, genomics, face
recognition, and ecology. To explore the current state of deep learning, it is
necessary to investigate the latest developments and applications of deep
learning in these disciplines. However, the literature is lacking in exploring
the applications of deep learning in all potential sectors. This paper thus
extensively investigates the potential applications of deep learning across all
major fields of study as well as the associated benefits and challenges. As
evidenced in the literature, DL offers high accuracy in prediction and
analysis, which makes it a powerful computational tool, and its capacity to
organize and optimize itself makes it effective at processing data without
hand-crafted features. Deep learning nevertheless necessitates massive amounts
of data for effective analysis and processing. To handle the challenge of
compiling huge amounts of medical,
scientific, healthcare, and environmental data for use in deep learning, gated
architectures like LSTMs and GRUs can be utilized. For multimodal learning,
the network needs neurons shared across all tasks as well as neurons
specialized for particular tasks. Comment: 64 pages, 3 figures, 3 tables
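The gated architectures mentioned above control how much past information is kept at each step. As a minimal sketch (not tied to any specific framework), a single GRU step can be written in NumPy with randomly initialized, purely illustrative weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step: gates decide how much past state to keep.
    (Conventions for which gate blends old vs. new state vary slightly
    between papers; this is one common form.)"""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde           # blend old and new

# Tiny demo with random weights: 3-dim input, 4-dim hidden state.
rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)) if i % 2 == 0
          else rng.standard_normal((4, 4)) for i in range(6)]
h = np.zeros(4)
for x in rng.standard_normal((5, 3)):   # a sequence of 5 inputs
    h = gru_step(x, h, params)
print(h.shape)  # (4,)
```

The gates are what let such models summarize long data streams without the vanishing-gradient problems of plain recurrent networks.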
Machine Learning and Alternative Data Analytics for Fashion Finance
This dissertation investigates the application of Machine Learning, Natural Language Processing and computational finance to a novel area: Fashion Finance. Specifically, it identifies investment opportunities within the Apparel industry using influential alternative data sources such as Instagram. Fashion investment is challenging due to the ephemeral nature of the industry and the difficulty for investors who lack an understanding of how to analyze trend-driven consumer brands. Unstructured online data (e-commerce stores, social media, online blogs, news, etc.) introduces new opportunities for extracting investment signals. We focus on how trading signals can be generated from Instagram data and from events reported in news articles. Part of this research work was done in collaboration with Arabesque Asset Management. Farfetch, the online luxury retailer, and Living Bridge Private Equity provided industry advice.
Research Datasets
The datasets used for this research are collected from various sources and include the following types of data:
- Financial data: daily stock prices of 50 U.S. and European Apparel and Footwear equities, daily U.S. Retail Trade and U.S. Consumer Non-Durables sector indices, Form 10-K reports.
- Instagram data: daily Instagram profile followers for 11 fashion companies.
- News data: 0.5 million news articles that mention the selected 50 equities.
Research Experiments
The thesis consists of the studies below:
1. Relationship between Instagram Popularity and Stock Prices. This study investigates a link between changes in a company's popularity (daily follower counts) on Instagram and its stock price and revenue movements. We use cross-correlation analysis to find whether signals derived from the followers' data could help to infer a company's future financial performance. Two hypothetical trading strategies are designed to test whether changes in a company's Instagram popularity could improve returns.
To test the hypotheses, the Wilcoxon signed-rank test is used.
2. Dynamic Density-based News Clustering. The aim of this study is twofold: 1) analyse the characteristics of relevant news event articles and how they differ from noisy/irrelevant news; 2) using these insights, design an unsupervised framework that clusters news articles and identifies event clusters without predefined parameters or expert knowledge. The framework incorporates the density-based clustering algorithm DBSCAN, where the clustering parameters are selected dynamically with a Gaussian Mixture Model and by maximizing the inter-cluster Information Entropy.
3. ALGA: Automatic Logic Gate Annotator for Event Detection. We design a news classification model for detecting fashion events that are likely to impact a company's stock price. The articles are represented by the following text embeddings: TF-IDF, Doc2Vec and BERT (Transformer Neural Network). The study comprises two parts: 1) we design a domain-specific automatic news labelling framework, ALGA, which incorporates topic extraction (Latent Dirichlet Allocation) and clustering (DBSCAN) algorithms in addition to other filters to annotate the dataset; 2) using the labelled dataset, we train a Logistic Regression classifier for identifying financially relevant news. The model shows state-of-the-art results on the domain-specific financial event detection problem.
Contribution to Science
This research work presents the following contributions to science:
- Introducing original work in Machine Learning and Natural Language Processing application for analysing alternative data on ephemeral fashion assets.
- Introducing new metrics to measure and track a fashion brand's popularity for investment decision making.
- Design of a dynamic news event clustering framework that finds event clusters of various sizes in news articles without predefined parameters.
- Presenting the original Automatic Logic Gate Annotator framework (ALGA) for automatic labelling of news articles for the financial event detection task.
- Design of the Apparel and Footwear news event classifier using the datasets generated by the ALGA framework, showing state-of-the-art performance in a domain-specific financial event detection task.
- Building the Fashion Finance Dictionary, which contains 320 phrases related to various financially relevant events in the Apparel and Footwear industry
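The cross-correlation analysis in the first study can be illustrated with a minimal NumPy sketch on synthetic data (the series, the injected two-day lead, and the lag window are all made up for illustration, not taken from the dissertation's datasets):

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], lag >= 0."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(1)
followers = rng.standard_normal(250)     # synthetic daily follower changes
# Synthetic returns that lag follower changes by 2 days plus noise.
# (np.roll wraps around at the edges; acceptable for an illustration.)
returns = 0.4 * np.roll(followers, 2) + rng.standard_normal(250)

best_lag = max(range(6), key=lambda k: lagged_corr(followers, returns, k))
print(best_lag)  # the injected 2-day lead should score highest
```

If popularity changes genuinely lead price moves, the correlation peaks at a positive lag, which is the premise behind the two hypothetical trading strategies.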
Achieving Accurate Predictions of Future Events Under Hardware Heterogeneity
Heterogeneous hardware is becoming increasingly available in modern systems, while research breakthroughs reinforce the expectation that heterogeneity will keep increasing in the future. Significant gains in performance and power consumption can be achieved via appropriate utilization of heterogeneity; however, poor utilization can have a detrimental effect. Intelligent scheduling and resource management is a crucial challenge we need to overcome in order to harvest the full potential of heterogeneous hardware. As systems become larger and include greater levels of hardware diversity, the importance of intelligent scheduling and resource management is further accentuated. This dissertation presents techniques that aid the process of scheduling and resource management in the presence of heterogeneous hardware by accurately predicting upcoming runtime events. With a proactive and accurate view of the near future, schedulers can utilize the underlying hardware more efficiently and take full advantage of the available benefits. By adapting a majority element heuristic, this dissertation significantly improves the accuracy of predicting memory addresses about to be accessed, while reducing prediction-related costs by a factor of ten thousand compared to previously proposed predictive approaches. Coupled with novel microarchitectural modifications, accurate address predictions are shown to improve the performance of heterogeneous memory architectures. Machine learning-based performance predictors are further presented, capable of predicting a program's performance when executed on a given general-purpose core. Trained to model the subtleties of the interaction between hardware and software, these predictors generate highly accurate predictions even for cores with varied Instruction Set Architectures. Utilizing these performance predictions for job scheduling is shown to improve overall system performance.
The trained predictors are further examined and interpreted in order to visualize the correlations between features picked up and amplified during training. Finally, this dissertation demonstrates that scheduling algorithms cannot guarantee deriving an optimal schedule during realistic execution scenarios, due to the underlying hardware heterogeneity, the wide range of runtime requirements of software, and prediction error from performance predictors. In response, deep neural networks are trained to select one scheduling approach from a list of options with varied overheads and correctness guarantees. The scheduling approach chosen is the one most likely to return the highest-performance schedule with the lowest overhead, given a particular instance of the job-to-core assignment problem
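The majority element heuristic mentioned above is classically realized by the Boyer-Moore majority-vote algorithm, which finds a dominant element in a stream in one pass and constant space. The sketch below applies it to a stream of address strides as a hypothetical illustration of stride-based address prediction; the addresses and the framing are invented, not the dissertation's actual mechanism:

```python
def majority_vote(stream):
    """Boyer-Moore majority vote: O(1) space, single pass.
    The returned candidate is guaranteed to be the majority element
    only if one element truly occupies more than half the stream."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate

# Predict the next address from the dominant stride between accesses.
addresses = [0x1000, 0x1040, 0x1080, 0x10C0, 0x2000, 0x2040, 0x2080]
strides = [b - a for a, b in zip(addresses, addresses[1:])]
stride = majority_vote(strides)          # 0x40 dominates despite one jump
print(hex(addresses[-1] + stride))       # 0x20c0
```

The appeal for hardware is exactly the property the dissertation exploits: the predictor needs only a candidate register and a counter, rather than large history tables, which is where the cost reduction comes from.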
BotCap: Machine Learning Approach for Botnet Detection Based on Statistical Features
In this paper, we describe a detailed approach to developing a botnet detection system using machine learning (ML) techniques. Detecting botnet member hosts, or identifying botnet traffic, has been the main subject of many research efforts. This research aims to overcome two serious limitations of current botnet detection systems: first, the need for Deep Packet Inspection (DPI), and second, the need to collect traffic from several infected hosts. To achieve that, we have analyzed several botware samples of known botnets. Based on this analysis, we have identified a set of statistical features that may help to distinguish between benign and malicious botnet traffic. Then, we have carried out several machine learning experiments in order to test the suitability of ML techniques and also to pick a minimal subset of the identified features that provides the best detection. We have implemented our approach in a tool called BotCap, whose test results demonstrated its ability to detect individually infected hosts in a local network
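Statistical features of the kind described require no payload inspection, only packet sizes and timestamps. The sketch below computes a few illustrative per-flow features; the feature names, thresholds, and example flows are hypothetical and not BotCap's actual feature set:

```python
import statistics

def flow_features(packet_sizes, timestamps):
    """Illustrative statistical features of one flow; no DPI needed."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "size_mean": statistics.mean(packet_sizes),
        "size_stdev": statistics.pstdev(packet_sizes),
        "iat_mean": statistics.mean(gaps),               # inter-arrival time
        "iat_stdev": statistics.pstdev(gaps),            # low for beaconing
    }

# Bot command-and-control beaconing tends to be small and regular,
# while normal browsing traffic is bursty and varied in size.
bot = flow_features([60, 60, 62, 60], [0.0, 5.0, 10.0, 15.0])
web = flow_features([1500, 400, 1500, 80], [0.0, 0.1, 2.5, 2.6])
print(bot["iat_stdev"] < web["iat_stdev"])  # True
```

Feature vectors like these can then be fed to any standard classifier, and feature-subset selection, as in the paper's experiments, keeps only the ones that actually separate the two classes.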
Anomalous behaviour detection using heterogeneous data
Anomaly detection is one of the most important methods for processing and finding abnormal data, as it can distinguish between normal and abnormal behaviour. Anomaly detection has been applied in many areas such as the medical sector, fraud detection in finance, fault detection in machines, intrusion detection in networks, surveillance systems for security, as well as forensic investigations. Abnormal behaviour can provide information or answer questions when an investigator is performing an investigation. Anomaly detection is one way to simplify big data by focusing on data that have been grouped or clustered by the anomaly detection method. Forensic data usually consist of heterogeneous data in several forms or types, such as qualitative or quantitative, structured or unstructured, and primary or secondary. For example, when a crime takes place, the evidence can take the form of various types of data, and combining all the data types can produce rich information insights. Nowadays, data has become ‘big’ because it is generated every second of every day, and processing it has become time-consuming and tedious. Therefore, in this study, a new method to detect abnormal behaviour is proposed using heterogeneous data, combined using a data fusion technique. VAST Challenge data and image data are used to demonstrate the heterogeneous data. The first contribution of this study is applying heterogeneous data to detect anomalies. The recently introduced anomaly detection technique known as Empirical Data Analytics (EDA) is applied to detect abnormal behaviour based on the data sets. Standardised eccentricity (a measure newly introduced within EDA, offering a simplified form of the well-known Chebyshev inequality) can be applied to any data distribution. The second contribution is applying image data.
The image data are processed using a pre-trained deep learning network, and classification is done using a support vector machine (SVM). The last contribution is combining the anomaly results from the heterogeneous data and the image recognition using a new data fusion technique. There are five types of data with three different modalities and different dimensionalities, so the data cannot simply be combined and integrated. Therefore, the new data fusion technique first analyses the abnormality in each data type separately, determines a degree of suspicion between 0 and 1, and then sums all the degrees of suspicion. This method is not intended to be a fully automatic system that resolves investigations, which would likely be unacceptable in any case. The aim is rather to simplify the role of the humans so that they can focus on a small number of cases to be looked at in more detail. The proposed approach does simplify the processing of such huge amounts of data, and it can later assist human experts in their investigations and in making final decisions
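The standardised eccentricity mentioned above has, in the EDA literature, a simple Euclidean form: ε(x) = 1 + ||x − μ||²/σ², where σ² is the mean squared distance of points from the mean, and a Chebyshev-style rule flags points with ε > n² + 1 (the analogue of lying more than n standard deviations from the mean). The sketch below assumes that form; the data and the choice n = 2 are illustrative only:

```python
import numpy as np

def standardized_eccentricity(X):
    """Standardised eccentricity per point, assuming the Euclidean
    simplification eps(x) = 1 + ||x - mu||^2 / sigma^2, where sigma^2
    is the mean squared distance of the points from their mean."""
    mu = X.mean(axis=0)
    d2 = ((X - mu) ** 2).sum(axis=1)
    sigma2 = d2.mean()
    return 1.0 + d2 / sigma2

# Nine inliers at the origin and one clear outlier.
X = np.vstack([np.zeros((9, 2)), [[5.0, 5.0]]])
eps = standardized_eccentricity(X)
n = 2  # Chebyshev-style rule: flag points beyond n "sigmas"
anomalies = np.where(eps > n**2 + 1)[0]
print(anomalies)  # only the outlier at index 9
```

Because the bound is distribution-free, the same threshold applies to each data type separately, which fits the fusion scheme of scoring every modality before summing the degrees of suspicion.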
Cyber Security
This open access book constitutes the refereed proceedings of the 17th International Annual Conference on Cyber Security, CNCERT 2021, held in Beijing, China, in July 2021. The 14 papers presented were carefully reviewed and selected from 51 submissions. The papers are organized according to the following topical sections: data security; privacy protection; anomaly detection; traffic analysis; social network security; vulnerability detection; text classification
Phishing website detection using intelligent data mining techniques. Design and development of an intelligent association classification mining fuzzy based scheme for phishing website detection with an emphasis on E-banking.
Phishing techniques have grown not only in number but also in sophistication. Phishers
have many approaches and tactics at their disposal for conducting a well-designed
phishing attack. The targets of phishing attacks, who are mainly online banking
consumers and payment service providers, face substantial financial loss and a loss
of trust in Internet-based services. To overcome these threats, there is an urgent
need for solutions to combat phishing attacks.
Detecting phishing websites is a complex task which requires significant expert
knowledge and experience. So far, various solutions have been proposed and developed to address these
problems. Most of these approaches are not able to decide dynamically whether a
site is in fact a phishing site, giving rise to a large number of false positives.
This is mainly due to the limitations of previously proposed approaches, for
example their reliance on fixed blacklist and whitelist databases, their lack of
human intelligence and expert knowledge, their poor scalability, and their poor
timeliness.
In this research we investigated and developed the application of an intelligent
fuzzy-based classification system for e-banking phishing website detection. The
main aim of the proposed system is to protect users from phishers' deception
tricks by giving them the ability to assess the legitimacy of websites. The
proposed intelligent phishing detection system employs a Fuzzy Logic (FL) model
with association classification mining algorithms. The approach combines the
capability of fuzzy reasoning to measure imprecise and dynamic phishing features
with the capability to classify phishing fuzzy rules.
Phishing experiments covering the full range of phishing attacks, motivations
and deception techniques have been conducted. A layered fuzzy
structure has been constructed for all gathered and extracted phishing website features and
patterns. These have been divided into six criteria and distributed across three
layers, based on their attack type. To reduce the need for human knowledge
intervention, different classification and association algorithms have been
implemented to generate fuzzy phishing rules automatically, which are integrated
into the fuzzy inference engine for the final phishing detection.
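A fuzzy rule of the kind the inference engine evaluates can be sketched in a few lines. The membership function, the two features, and the two rules below are entirely hypothetical illustrations, not the thesis's actual six criteria or its mined rules:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def phishing_risk(url_length, https_used):
    # Fuzzify two illustrative features (not the thesis's real criteria).
    long_url = tri(url_length, 40, 75, 200)
    no_https = 0.0 if https_used else 1.0
    # Two toy rules, AND as min, OR (rule aggregation) as max:
    # R1: IF url is long AND https is absent THEN risk is high
    # R2: IF url is long THEN risk is medium (weighted 0.5)
    r1 = min(long_url, no_https)
    r2 = 0.5 * long_url
    return max(r1, r2)

print(phishing_risk(90, https_used=False) > phishing_risk(30, https_used=True))  # True
```

The point of the fuzzy formulation is visible even in this toy: a long URL without HTTPS raises a graded risk rather than a brittle yes/no verdict, which is what lets the system handle the imprecise, dynamic features described above.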
Experimental results demonstrated the ability of the learning approach to
identify all relevant fuzzy rules from the training data set. A comparative
study and analysis showed that the proposed learning approach has a higher
degree of predictive and detective capability than existing models. Experiments
also showed the significance of some important phishing criteria, such as
URL & Domain Identity and Security & Encryption, to the final phishing detection rate.
Finally, our proposed intelligent phishing website detection system was developed,
tested and validated by incorporating the scheme as a web-based plug-in phishing
toolbar. The results obtained are promising and showed that our intelligent
fuzzy-based classification detection system can provide effective help for
real-time phishing website detection. The toolbar successfully recognized and
detected approximately 92% of the phishing websites selected from our test data
set, avoiding many misclassified websites and false phishing alarms
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These documents contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis, and converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model consists of a generator network implemented with the help of the Faster R-CNN model and a discriminator based on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of various processes including XML processing (when an existing OCR engine is used), bounding box pre-processing, text clean-up, OCR error correction, spell checking, type checking, pattern-based matching, and finally a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is then used to extract the relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and sought new, optimized solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine together with a novel, self-adaptive, learning-based OCR engine based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into our system to obtain a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was threefold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase the information extraction accuracy over that of existing methodologies; and third, to evaluate the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information
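The pattern-based matching and OCR-error-correction stages can be illustrated with a minimal sketch. The field patterns, the O/l/S confusion table, and the sample invoice text below are invented for illustration; the thesis's actual framework is far more elaborate:

```python
import re

# Common OCR confusions in numeric fields (illustrative subset only).
OCR_FIXES = str.maketrans({"O": "0", "l": "1", "S": "5"})

FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No|#)\.?\s*[:\-]?\s*([A-Z0-9\-]+)", re.I),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})"),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([0-9OlS]+\.[0-9OlS]{2})"),
}

def extract_fields(ocr_text):
    """Pattern-based key-value extraction with light OCR error correction."""
    out = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(ocr_text)
        if m:
            value = m.group(1)
            if name == "total":            # digits-only field: repair O/l/S
                value = value.translate(OCR_FIXES)
            out[name] = value
    return out

sample = "Invoice No: INV-2024-001\nDate: 12/03/2024\nTotal: $1O4.5O"
print(extract_fields(sample))
# {'invoice_no': 'INV-2024-001', 'date': '12/03/2024', 'total': '104.50'}
```

Type checks (is the corrected total actually a number? does the date parse?) then decide whether a field goes into the key-value output or is left for the learning mechanism to handle.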
JURI SAYS: An Automatic Judgement Prediction System for the European Court of Human Rights
In this paper we present the web platform JURI SAYS, which automatically predicts decisions of the European Court of Human Rights based on communicated cases, which are published by the court early in the proceedings and are often available many years before the final decision is made. Our system therefore predicts future judgements of the court. The platform is available at jurisays.com and shows the predictions compared to the actual decisions of the court. It is automatically updated every month with predictions for the new cases. Additionally, the system highlights the sentences and paragraphs that are most important for the prediction (i.e. violation vs. no violation of human rights)
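One common way to highlight the most important sentences under a linear bag-of-words model is to score each sentence by the sum of the classifier weights of its words. The sketch below is a generic illustration of that idea; the weights and sentences are made up, and the actual JURI SAYS model may compute importance differently:

```python
# Hypothetical word weights of a linear "violation vs. no violation"
# classifier; positive weights push toward predicting a violation.
weights = {"detention": 1.4, "torture": 2.1, "lawful": -0.8, "holiday": -0.1}

def sentence_score(sentence):
    """Sum of the model's weights for the words in the sentence."""
    return sum(weights.get(w, 0.0) for w in sentence.lower().split())

sentences = [
    "The applicant enjoyed his holiday",
    "He alleged torture during detention",
]
ranked = sorted(sentences, key=sentence_score, reverse=True)
print(ranked[0])  # the torture/detention sentence ranks as most important
```

Ranking sentences this way gives users a direct view of which parts of a communicated case drove the prediction, which is the kind of transparency the platform's highlighting provides.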
- …