Investments made on the stock market depend on timely and credible information being made available to investors. Such information can be sourced from online news articles, broker agencies, and discussion platforms such as financial discussion boards and Twitter. The monitoring of such discussion is a challenging yet necessary task to support the transparency of the financial market. Although financial discussion boards are typically monitored by administrators who respond to other users reporting posts for misconduct, actively monitoring social media such as Twitter remains a difficult task.
Users sharing news about stock-listed companies on Twitter can embed cashtags in their tweets that mimic a company’s stock ticker symbol (e.g. TSCO on the London Stock Exchange refers to Tesco PLC). A cashtag is simply the ticker characters prefixed with a ‘$’ symbol, which then becomes a clickable hyperlink – similar to a hashtag. Twitter, however, does not distinguish between companies with identical ticker symbols that belong to different exchanges. TSCO, for example, refers to Tesco PLC on the London Stock Exchange but also to the Tractor Supply Company listed on the NASDAQ. This research refers to such a scenario as a ‘cashtag collision’. Investors who wish to capitalise on the fast dissemination that Twitter provides may be misled by tweets containing colliding cashtags. This issue is further exacerbated by tweets referring to cryptocurrencies, which also feature cashtags that could be identical to those used for stock-listed companies.
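To make the notion of a cashtag collision concrete, the following is a minimal sketch of extracting cashtags from a tweet and flagging those that map to more than one listing. The regular expression is a simplified assumption rather than Twitter's own parsing rule, and the lookup table `TICKER_LISTINGS`, the ticker XMPL, and the function names are hypothetical; only the TSCO entries are taken from the text above.

```python
import re

# Hypothetical lookup table: ticker -> (exchange, company) listings.
# Only the TSCO entries come from the text; XMPL is a made-up placeholder.
TICKER_LISTINGS = {
    "TSCO": [("LSE", "Tesco PLC"), ("NASDAQ", "Tractor Supply Company")],
    "XMPL": [("LSE", "Example PLC (hypothetical)")],
}

# Simplified assumption: a cashtag is '$' followed by 1-6 letters,
# not preceded by a word character or another '$'.
CASHTAG_RE = re.compile(r"(?<![\w$])\$([A-Za-z]{1,6})\b")


def extract_cashtags(text):
    """Return the upper-cased ticker part of every cashtag in the text."""
    return [ticker.upper() for ticker in CASHTAG_RE.findall(text)]


def colliding_cashtags(text):
    """Return the cashtags in the text that map to more than one listing."""
    collisions = {}
    for ticker in extract_cashtags(text):
        listings = TICKER_LISTINGS.get(ticker, [])
        if len(listings) > 1:
            collisions[ticker] = listings
    return collisions


tweet = "$TSCO results out today, also watching $xmpl (paid $100)"
tags = extract_cashtags(tweet)
collisions = colliding_cashtags(tweet)
```

In this sketch only TSCO is reported as a collision, because it is the only ticker in the table with listings on more than one exchange. Detecting the collision is only the first step; deciding which company a tweet actually refers to is the classification problem described below.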
A system that is capable of identifying stock-specific tweets by resolving such
collisions, and assessing the credibility of such messages, would be of great benefit to
a financial market monitoring system by filtering out non-significant messages. This
project has involved the design and development of a novel, multi-layered, smart
data ecosystem to monitor potential irregularities within the financial market. This
ecosystem is primarily concerned with the behaviour of participants’ communicative
practices on discussion platforms and the activity surrounding company events
(e.g. a broker rating being issued for a company). A wide array of data sources –
such as tweets, discussion board posts, broker ratings, and share prices – is collected
to support this process. A novel data fusion model synchronises these data sources
by combining them for a given time window (based on the company the data refers to
and the date and time), which makes subsequent analysis easier. This data fusion
model, located within the data layer of the ecosystem, uses supervised machine
learning classifiers trained on a novel set of features to classify tweets as being
related to a London Stock Exchange-listed company or not. A supervised approach is
adopted because domain expertise is needed to label the origin of a tweet accurately
in a binary way. Experiments involving the training of such classifiers have
achieved accuracy scores of up to 94.9%.
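The time-window fusion described above can be illustrated with a short sketch. It is not the data fusion model developed in this project; it only shows the general idea of grouping records from different sources by company and time window. The one-hour window, the record fields (`company`, `timestamp`, `source`, `payload`), and the sample records are all illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed window size, not taken from the project
EPOCH = datetime(1970, 1, 1)


def window_start(ts, window=WINDOW):
    """Round a timestamp down to the start of its time window."""
    return EPOCH + ((ts - EPOCH) // window) * window


def fuse(records, window=WINDOW):
    """Group records by (company, window start), then by data source."""
    fused = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = (rec["company"], window_start(rec["timestamp"], window))
        fused[key][rec["source"]].append(rec["payload"])
    return {key: dict(sources) for key, sources in fused.items()}


# Illustrative records only; the payloads are placeholders, not real data.
records = [
    {"company": "TSCO", "timestamp": datetime(2020, 1, 6, 9, 5),
     "source": "tweet", "payload": "placeholder tweet text"},
    {"company": "TSCO", "timestamp": datetime(2020, 1, 6, 9, 40),
     "source": "broker_rating", "payload": "placeholder rating"},
    {"company": "TSCO", "timestamp": datetime(2020, 1, 6, 10, 15),
     "source": "share_price", "payload": "placeholder price"},
]

fused = fuse(records)
first_window = fused[("TSCO", datetime(2020, 1, 6, 9, 0))]
```

Here the tweet and the broker rating fall into the same 09:00 window for TSCO, while the share price lands in the 10:00 window, so each fused entry gathers everything known about one company in one period.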
The ecosystem also adopts supervised learning to classify tweets according to
their credibility. Credibility classifiers are trained on both general features found in
all tweets and a novel set of features found only within financial stock tweets. The
experiments in which these credibility classifiers were trained have yielded AUC
scores of up to 94.3.
Once the data has been fused and irrelevant tweets have been identified, unsupervised
clustering algorithms are used within the detection layer of the ecosystem to group
the tweets and posts for a specific time window or event and to identify those that
are potentially irregular. The results are then presented to the user within the
presentation and decision layer, where the user may wish to perform further analysis
or additional clustering.