
    A generic open world named entity disambiguation approach for tweets

    Social media is a rich source of information. To make use of this information, it is sometimes necessary to extract and disambiguate named entities. In this paper we focus on named entity disambiguation (NED) in Twitter messages. NED in tweets is challenging in two ways. First, the limited length of a tweet makes it hard to obtain enough context, on which many disambiguation techniques depend. Second, many named entities in tweets do not exist in a knowledge base (KB). In this paper we combine ideas from information retrieval (IR) and NED to propose solutions to both challenges. For the first problem we exploit the gregarious nature of tweets to gather the context needed for disambiguation. For the second problem we look for an alternative home page when no Wikipedia page represents the entity. Given a mention, we obtain a list of Wikipedia candidates from the YAGO KB in addition to the top-ranked pages from the Google search engine. We use a Support Vector Machine (SVM) to rank the candidate pages and find the best representative entities. Experiments conducted on two data sets show better disambiguation results compared with the baselines and a competitor.
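
    As a hedged illustration of the candidate-ranking step described above, the following Python sketch scores (mention, candidate) pairs with a linear SVM and keeps the top-scoring page. The features, the toy training data, and the extract_features helper are illustrative assumptions, not the paper's actual feature set.

        from sklearn.svm import SVC

        def extract_features(mention, candidate, tweet_context):
            # Toy features (assumed, not the paper's): context/title word overlap,
            # whether the page title starts with the mention, and the search rank.
            title = candidate["title"].lower()
            overlap = len(set(tweet_context.lower().split()) & set(title.split()))
            prefix_match = float(title.startswith(mention.lower()))
            return [overlap, prefix_match, candidate.get("search_rank", 0)]

        # A linear SVM fitted on toy labelled pairs stands in for the paper's
        # offline training on annotated (mention, candidate) examples.
        svm = SVC(kernel="linear")
        svm.fit([[3, 1.0, 1], [0, 0.0, 8], [2, 1.0, 2], [0, 0.0, 9]], [1, 0, 1, 0])

        def best_candidate(mention, candidates, tweet_context):
            X = [extract_features(mention, c, tweet_context) for c in candidates]
            scores = svm.decision_function(X)  # higher score = better match
            return max(zip(scores, candidates), key=lambda sc: sc[0])[1]

        print(best_candidate(
            "tesco",
            [{"title": "Tesco", "search_rank": 1},
             {"title": "Tractor Supply Company", "search_rank": 4}],
            "tesco opens new grocery stores across the UK"))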

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    In the real world, our DNA is unique, but many people share names. This phenomenon often causes the erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains the documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions rely substantially on feature engineering, such as biographical feature extraction or the construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most existing works for this task operate in batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that name disambiguation be performed in an online streaming fashion in order to identify records of new ambiguous entities that have no preexisting records. Finally, we investigate the potential disclosure risk of the textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation we present a number of novel approaches that address the name disambiguation task from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.
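
    The relational setting can be illustrated with a small graph-only sketch: documents associated with one ambiguous name are partitioned using nothing but an anonymized relationship graph. The community-detection step below (networkx's greedy modularity) is a generic stand-in, not one of the dissertation's actual algorithms.

        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        # Nodes are anonymized document IDs; an edge links two documents that
        # share an anonymized relational attribute (e.g. a common co-author ID).
        G = nx.Graph()
        G.add_edges_from([
            ("d1", "d2"), ("d2", "d3"),  # records of one real-life person
            ("d4", "d5"),                # records of a namesake
        ])

        # Each detected community is taken as the record set of one person.
        for person_id, docs in enumerate(greedy_modularity_communities(G)):
            print(f"person {person_id}: {sorted(docs)}")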

    LINKING ENTITIES TO A KNOWLEDGE BASE

    Ph.D. (Doctor of Philosophy)

    Adaptive Semantic Annotation of Entity and Concept Mentions in Text

    Recent years have seen an increase in interest in knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary for exchanging and organizing information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories so that the content of those documents is described in a vocabulary that is unambiguously understood across applications. Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts mentioned in a particular passage and attempting to resolve the eventual ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear whether an existing solution is applicable, leading those developers to trial and error and ad hoc usage of multiple systems in an attempt to achieve their objective. In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework, enabling the elucidation of the strengths and limitations of each technique and supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that are able to support document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions that we made to the knowledge base DBpedia to support it. I report evaluations of this tool on several well-known data sets and demonstrate applications to diverse use cases for further validation.
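
    A minimal sketch of this kind of tagging, using the public DBpedia Spotlight web service (the endpoint, parameters, and response fields shown here are the commonly documented ones; the hosted service's availability is not guaranteed):

        import requests

        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": "Berlin is the capital of Germany.",
                    "confidence": 0.5},
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        for res in resp.json().get("Resources", []):
            # Each annotation maps a surface form to an unambiguous DBpedia URI.
            print(res["@surfaceForm"], "->", res["@URI"])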

    A Smart Data Ecosystem for the Monitoring of Financial Market Irregularities

    Investments made on the stock market depend on timely and credible information being made available to investors. Such information can be sourced from online news articles, broker agencies, and discussion platforms such as financial discussion boards and Twitter. The monitoring of such discussion is a challenging yet necessary task to support the transparency of the financial market. Although financial discussion boards are typically monitored by administrators who respond to other users reporting posts for misconduct, actively monitoring social media such as Twitter remains a difficult task. Users sharing news about stock-listed companies on Twitter can embed cashtags in their tweets that mimic a company’s stock ticker symbol (e.g. TSCO on the London Stock Exchange refers to Tesco PLC). A cashtag is simply the ticker characters prefixed with a ’$’ symbol, which then becomes a clickable hyperlink, similar to a hashtag. Twitter, however, does not distinguish between companies with identical ticker symbols that belong to different exchanges. TSCO, for example, refers to Tesco PLC on the London Stock Exchange but also to the Tractor Supply Company listed on the NASDAQ. This research refers to such scenarios as a ’cashtag collision’. Investors who wish to capitalise on the fast dissemination that Twitter provides may become susceptible to tweets containing colliding cashtags. Further exacerbating this issue is the presence of tweets referring to cryptocurrencies, which also feature cashtags that can be identical to the cashtags used for stock-listed companies. A system capable of identifying stock-specific tweets by resolving such collisions, and of assessing the credibility of such messages, would be of great benefit to a financial market monitoring system by filtering out non-significant messages. This project has involved the design and development of a novel, multi-layered, smart data ecosystem to monitor potential irregularities within the financial market. This ecosystem is primarily concerned with the behaviour of participants’ communicative practices on discussion platforms and with the activity surrounding company events (e.g. a broker rating being issued for a company). A wide array of data sources – such as tweets, discussion board posts, broker ratings, and share prices – is collected to support this process. A novel data fusion model fuses these data sources together to synchronize the data and allow easier analysis by combining data sources for a given time window (based on the company the data refers to and the date and time). This data fusion model, located within the data layer of the ecosystem, utilises supervised machine learning classifiers – chosen because of the domain expertise needed to accurately describe the origin of a tweet in a binary way – that are trained on a novel set of features to classify tweets as being related to a London Stock Exchange-listed company or not. Experiments involving the training of such classifiers have achieved accuracy scores of up to 94.9%. The ecosystem also adopts supervised learning to classify the credibility of tweets. The credibility classifiers are trained on both general features found in all tweets and a novel set of features found only within financial stock tweets. The experiments in which these credibility classifiers were trained yielded AUC scores of up to 94.3. Once the data has been fused and irrelevant tweets have been identified, unsupervised clustering algorithms are used within the detection layer of the ecosystem to flag clusters of tweets and posts for a specific time window or event as potentially irregular. The results are then presented to the user within the presentation and decision layer, where the user may wish to perform further analysis or additional clustering.
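
    The cashtag-disambiguation step can be sketched as a standard supervised text classifier. The toy bag-of-words features below are a generic substitute for the project's novel feature set, and the example tweets are invented for illustration.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy labelled tweets: 1 = the $TSCO cashtag refers to the LSE-listed
        # company (Tesco PLC), 0 = it refers to another exchange's company.
        tweets = [
            "$TSCO profits up as Tesco expands UK grocery delivery",
            "$TSCO rallies in a strong NASDAQ session for Tractor Supply",
            "Tesco $TSCO names new CEO, shares rise on the LSE",
            "Tractor Supply $TSCO beats US retail estimates",
        ]
        labels = [1, 0, 1, 0]

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression())
        clf.fit(tweets, labels)
        print(clf.predict(["$TSCO upgraded by a London broker"]))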

    Data Science for Entrepreneurship Research: Studying Demand Dynamics for Entrepreneurial Skills in the Netherlands

    The recent rise of big data and artificial intelligence (AI) is changing markets, politics, organizations, and societies. It also affects the domain of research. Supported by new statistical methods that rely on computational power and computer science – data science methods – we are now able to analyze data sets that can be huge, multidimensional, unstructured, and diversely sourced. In this paper, we describe the most prominent data science methods suitable for entrepreneurship research and provide links to literature and Internet resources for self-starters. We survey how data science methods have been applied in the entrepreneurship research literature. As a showcase of data science techniques, based on a data set covering 95% of all job vacancies in the Netherlands over a 6-year period, comprising 7.7 million data points, we provide an original analysis of the demand dynamics for entrepreneurial skills in the Netherlands. We show which entrepreneurial skills are particularly important for which type of profession. Moreover, we find that demand for both entrepreneurial and digital skills has increased for managerial positions, but not for others. We also find that entrepreneurial skills were demanded significantly more often than digital skills over the entire period 2012-2017, and that for managers the absolute importance of entrepreneurial skills has increased even more than that of digital skills, despite the impact of datafication on the labor market. We conclude that further study of entrepreneurial skills in the general population – outside the domain of entrepreneurs – is a rewarding subject for future research.
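
    The showcase analysis can be imitated on toy data: compute, per profession group and year, the share of vacancy postings that request entrepreneurial versus digital skills. The column names and numbers below are illustrative, not the paper's data.

        import pandas as pd

        vacancies = pd.DataFrame({
            "year":            [2012, 2012, 2017, 2017, 2017],
            "profession":      ["manager", "clerk", "manager",
                                "manager", "clerk"],
            "entrepreneurial": [1, 0, 1, 1, 0],  # skill requested in posting?
            "digital":         [0, 0, 1, 0, 1],
        })

        # Share of postings per profession and year requesting each skill type.
        share = (vacancies
                 .groupby(["profession", "year"])[["entrepreneurial", "digital"]]
                 .mean())
        print(share)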