257 research outputs found
Knowledge extraction from unstructured data
Data availability is becoming more essential given the current growth of web-based data. Data on the web are represented in unstructured, semi-structured, or structured form. To make web-based data available for Natural Language Processing or Data Mining tasks, the data need to be presented in a machine-readable, structured format. Thus, techniques for capturing knowledge from unstructured data sources are needed. Research communities address this problem with knowledge extraction methods, which capture knowledge in natural language text and map it to existing knowledge represented in knowledge graphs (KGs). These methods include Named-Entity Recognition, Named-Entity Disambiguation, Relation Recognition, and Relation Linking. This thesis addresses the problem of extracting knowledge from unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The approach effectively maps entities and relations within a text to their resources in a target KG. It overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing catalogs of linguistic and domain-specific rules, which state the criteria for recognizing entities in a sentence of a particular language, and a deductive database that encodes knowledge from community-maintained KGs. Moreover, we define a neuro-symbolic approach for knowledge extraction in encyclopedic and domain-specific settings; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limited availability of training data while maintaining accuracy in recognizing and linking entities.
Additionally, we present a context-aware, knowledge-driven framework for unveiling semantically related posts in a corpus; it retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus into the Vertex Coloring Problem. We evaluate the performance of our techniques on several benchmarks from various domains for knowledge extraction tasks. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques effectively extract knowledge encoded in unstructured data and discover patterns over the extracted knowledge presented as machine-readable data. More importantly, the evaluation results provide evidence of the effectiveness of combining the reasoning capacity of symbolic frameworks with the pattern recognition and classification power of sub-symbolic models.
Tweet categorization by combining content and structural knowledge
Twitter is a worldwide social media platform where millions of people frequently express ideas and opinions
about any topic. This widespread success makes the analysis of tweets an interesting and possibly
lucrative task, since tweets are rarely objective and have become a target for large-scale analysis. In
this paper, we explore the idea of integrating two fundamental aspects of a tweet, its textual
content and its underlying structural information, when addressing the tweet categorization task. Thus,
we analyze not only the textual content of tweets but also the structural information provided by the
relationships between tweets and users, and we propose different methods for effectively combining both
kinds of feature models extracted from the different knowledge sources. To test our approach, we
address the specific task of determining the political opinion of Twitter users within their political context,
observing that our most refined knowledge integration approach performs remarkably better (about
5 points above) than the classic text-based model.
Ministerio de Economía y Competitividad TIN2012-38536-C03-02; Junta de Andalucía P11-TIC-7684 M
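One simple way to combine a text-based model with a structure-based model is late fusion of their class probabilities. The sketch below is only an illustration under that assumption; the abstract does not say which combination methods the authors use, and the function names, labels, and `weight` parameter are hypothetical.

```python
def late_fusion(text_probs, graph_probs, weight=0.5):
    """Blend two per-class probability dicts.

    `weight` is the share given to the text model (hypothetical parameter);
    the structural model receives 1 - weight.
    """
    labels = set(text_probs) | set(graph_probs)
    fused = {
        label: weight * text_probs.get(label, 0.0)
        + (1.0 - weight) * graph_probs.get(label, 0.0)
        for label in labels
    }
    total = sum(fused.values())
    # Renormalize so the fused scores sum to 1 when possible.
    return {label: s / total for label, s in fused.items()} if total else fused


def predict(text_probs, graph_probs, weight=0.5):
    """Return the label with the highest fused score (ties broken alphabetically)."""
    fused = late_fusion(text_probs, graph_probs, weight)
    return max(sorted(fused), key=fused.get)


# Toy example: the text model leans one way, the user-tweet structure the other.
text_probs = {"left": 0.55, "right": 0.45}
graph_probs = {"left": 0.20, "right": 0.80}
label_fused = predict(text_probs, graph_probs, weight=0.5)
label_text_only = predict(text_probs, graph_probs, weight=1.0)
```

With equal weights the structural evidence dominates in this toy case, while `weight=1.0` falls back to the text-only decision.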
Resorting to Context-Aware Background Knowledge for Unveiling Semantically Related Social Media Posts
Social media networks have become a prime source for sharing news, opinions, and research accomplishments in various domains, and hundreds of millions of posts are published daily. Given this wealth of information in social media, finding related announcements has become a relevant task, particularly for trending topics (e.g., COVID-19 or lung cancer). To facilitate the search for connected posts, social networks enable users to annotate their posts, e.g., with hashtags in tweets. Albeit effective, an annotation-based search is limited because results will only include posts that share the same annotations. This paper focuses on retrieving context-related posts based on a specific topic and presents PINYON, a knowledge-driven framework that retrieves associated posts effectively. PINYON implements a two-fold pipeline. First, in an encoding phase, it represents a corpus of posts and an input post as a graph; posts are annotated with entities from existing knowledge graphs and connected based on the similarity of their entities. Second, in a decoding phase, the encoded graph is used to discover communities of related posts. We cast this problem into the Vertex Coloring Problem, where communities of similar posts include the posts annotated with entities colored with the same colors. Building on results reported in graph theory, PINYON implements the decoding phase guided by a heuristic-based method that determines relatedness among posts based on contextual knowledge and efficiently groups the most similar posts in the same communities. PINYON is empirically evaluated on various datasets and compared with state-of-the-art implementations of the decoding phase. The quality of the generated communities is also analyzed based on multiple metrics. The observed outcomes indicate that PINYON accurately identifies semantically related posts in different contexts.
Moreover, the reported results put in perspective the impact of known properties about the optimality of existing heuristics for graph vertex coloring and their implications for PINYON's scalability.
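The idea of grouping posts via vertex coloring can be illustrated with a small sketch. This is not PINYON's implementation: it assumes, for illustration only, that vertices are posts, that an edge joins two posts whose entity sets are dissimilar (Jaccard similarity below a hypothetical `threshold`), and that a greedy largest-degree-first heuristic assigns colors. Because adjacent vertices never share a color, posts with the same color are pairwise similar and form a community.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two entity sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_conflict_graph(posts, threshold=0.3):
    """Connect two posts when their entity sets are dissimilar (assumed rule)."""
    graph = {pid: set() for pid in posts}
    for p, q in combinations(posts, 2):
        if jaccard(posts[p], posts[q]) < threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def greedy_coloring(graph):
    """Largest-degree-first greedy coloring; a heuristic, not guaranteed optimal."""
    order = sorted(graph, key=lambda v: (-len(graph[v]), v))
    color = {}
    for v in order:
        used = {color[n] for n in graph[v] if n in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def communities(posts, threshold=0.3):
    """Group posts that received the same color."""
    color = greedy_coloring(build_conflict_graph(posts, threshold))
    groups = {}
    for pid, c in color.items():
        groups.setdefault(c, []).append(pid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical posts annotated with knowledge-graph entities.
posts = {
    "p1": {"COVID-19", "Vaccine"},
    "p2": {"COVID-19", "Vaccine", "WHO"},
    "p3": {"Lung_cancer", "Smoking"},
    "p4": {"Lung_cancer", "Smoking", "Chemotherapy"},
}
result = communities(posts)
```

Greedy coloring runs quickly but may use more colors than the minimum, which is one reason the optimality properties of the chosen heuristic matter for the quality of the communities.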
Real-time Event Detection on Social Data Streams
Social networks are quickly becoming the primary medium for discussing what
is happening around real-world events. The information that is generated on
social platforms like Twitter can produce rich data streams for immediate
insights into ongoing matters and the conversations around them. To tackle the
problem of event detection, we model events as a list of clusters of trending
entities over time. We describe a real-time system for discovering events that
is modular in design and novel in scale and speed: it applies clustering on a
large stream with millions of entities per minute and produces a dynamically
updated set of events. In order to assess clustering methodologies, we build an
evaluation dataset derived from a snapshot of the full Twitter Firehose and
propose novel metrics for measuring clustering quality. Through experiments and
system profiling, we highlight key results from the offline and online
pipelines. Finally, we visualize a high profile event on Twitter to show the
importance of modeling the evolution of events, especially those detected from
social data streams.
Comment: Accepted as a full paper at KDD 2019 on April 29, 201
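Modeling an event as a list of entity clusters over time can be sketched as follows. The linking rule used here (attach a cluster to the event whose cluster at the previous time step has the highest Jaccard overlap, if it reaches a hypothetical `min_overlap`) is an assumption for illustration and is not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of entities."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_clusters(timeline, min_overlap=0.5):
    """Chain per-time-step clusters into events.

    `timeline[t]` is the list of entity clusters found at step t.
    Each event is a list of (t, cluster) pairs ordered by time.
    """
    events = []
    for t, clusters in enumerate(timeline):
        for cluster in clusters:
            best, best_score = None, 0.0
            for event in events:
                last_t, last_cluster = event[-1]
                if last_t != t - 1:
                    continue  # only extend events that were active one step ago
                score = jaccard(cluster, last_cluster)
                if score > best_score:
                    best, best_score = event, score
            if best is not None and best_score >= min_overlap:
                best.append((t, frozenset(cluster)))
            else:
                events.append([(t, frozenset(cluster))])
    return events


# Hypothetical trending-entity clusters at three consecutive time steps.
timeline = [
    [{"worldcup", "france"}, {"oscars"}],
    [{"worldcup", "france", "croatia"}, {"earthquake"}],
    [{"worldcup", "croatia", "final"}],
]
events = link_clusters(timeline)
steps_of_first_event = [t for t, _ in events[0]]
```

Keeping the full chain of clusters, rather than only the latest one, is what lets the evolution of an event be inspected or visualized afterwards.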
Identifying Expert Investors on Financial Microblog via Artificial Neural Networks
In recent years, thanks to social media platforms, a plethora of information has become available to financial investors, who traditionally depended on advisors at financial institutions. Strategies are now shared among web users, stock performance is discussed in web communities, and hints and suggestions travel across the internet at a fast pace, in a way that was unthinkable a few years ago. Several attempts have been made in the recent past to predict market movements and trends from the activity of participants in financial social networks, and to evaluate whether contributions from individuals with a high level of expertise stand out from the rest of the crowd. The present work leverages 6 years of tweets extracted from the financial platform StockTwits.com, dives deep into their content, and proposes a predictive neural network algorithm of the Multi-Layer Perceptron type, based on features derived from text, social network, and sentiment analysis. Users are classified based on the performance achieved during training, the consistency of their predictions is verified over time and, finally, a trading strategy is proposed based on following the top actors. The outcomes highlight that expert investors outperform the wisdom of the crowd, and the resulting trading scheme generated a return of 38.6% in 2015, when the S&P 500 had a slightly negative balance.