8 research outputs found

    Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents

    The advancement of search engines for traditional text documents has enabled the effective retrieval of massive amounts of textual information in a resource-efficient manner. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that require specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, was proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system fails to support user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, a set of enhancements to the previously proposed algorithm search engine is presented. Specifically, we propose methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
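    The feature-group idea above can be illustrated with a minimal sketch: classify each text line by computing a few content- and structure-based features. This is a hypothetical toy version (the paper's actual 60 features and trained classifier are not reproduced here; the keyword list and feature names are assumptions for illustration).

    ```python
    import re

    # Hypothetical keyword set; a stand-in for the paper's content-based features.
    PSEUDOCODE_KEYWORDS = {"for", "while", "if", "else", "return", "end",
                           "function", "procedure", "repeat", "until"}

    def line_features(line: str) -> dict:
        """Compute a few illustrative features for one text line."""
        tokens = re.findall(r"[A-Za-z_]+", line.lower())
        return {
            # structure-based: pseudo-code lines are often indented
            "indent_width": len(line) - len(line.lstrip(" ")),
            # content-based: count of pseudo-code-like keywords
            "keyword_count": sum(t in PSEUDOCODE_KEYWORDS for t in tokens),
            # content-based: assignment operators common in pseudo-code
            "has_assignment": "<-" in line or ":=" in line,
        }

    feats = line_features("    for i <- 1 to n do")
    ```

    In the actual system, vectors like these (plus font-style features unavailable from plain text) would feed a trained line classifier rather than hand-written rules.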

    Inspecting and Directing Neural Language Models

    No full text

    Definition Modeling: Learning to Define Word Embeddings in Natural Language

    No full text
    Distributed representations of words have been shown to capture lexical semantics, based on their effectiveness in word similarity and analogical relation tasks. However, these tasks only evaluate lexical semantics indirectly. In this paper, we study whether it is possible to utilize distributed representations to generate dictionary definitions of words, as a more direct and transparent representation of the embeddings' semantics. We introduce definition modeling, the task of generating a definition for a given word and its embedding. We present different definition model architectures based on recurrent neural networks and experiment with the models over multiple data sets. Our results show that a model that controls dependencies between the word being defined and the definition words performs significantly better, and that a character-level convolution layer that leverages morphology can complement word-level embeddings. Our analysis reveals which components of our models contribute to accuracy. Finally, the errors made by a definition model may provide insight into the shortcomings of word embeddings.

    DeepMetaForge: A Deep Vision-Transformer Metadata-Fusion Network for Automatic Skin Lesion Classification

    No full text
    Skin cancer is a dangerous form of cancer that develops slowly in skin cells. Delays in diagnosing and treating these malignant skin conditions may have serious repercussions. Likewise, early skin cancer detection has been shown to improve treatment outcomes. This paper proposes DeepMetaForge, a deep-learning framework for skin cancer detection from metadata-accompanied images. The proposed framework utilizes BEiT, a vision transformer pre-trained with a masked image modeling task, as the image-encoding backbone. We further propose merging the encoded metadata with the derived visual characteristics while simultaneously compressing the aggregate information, simulating how photos with metadata are interpreted. The experiment results on four public datasets of dermoscopic and smartphone skin lesion images reveal that the best configuration of our proposed framework yields 87.1% macro-average F1 on average. The empirical scalability analysis further shows that the proposed framework can be implemented in a variety of machine-learning paradigms, including applications on low-resource devices and as services. The findings not only shed light on the possibility of implementing nationwide telemedicine solutions for skin cancer that could benefit those in need of quality healthcare, but also open doors to many intelligent applications in medicine where images and metadata are collected together, such as disease detection from CT-scan images and patients' expression recognition from facial images.
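    The fusion step described above (merge encoded metadata with visual features, then compress the joint vector) can be sketched as follows. This is an illustrative toy, not the authors' code: the dimensions, the random-weight "dense layer", and the metadata encoding are all assumptions.

    ```python
    import random

    random.seed(0)  # deterministic toy weights

    def dense(vec, out_dim):
        """Toy dense layer: project `vec` to `out_dim` via random weights."""
        return [sum(w * x for w, x in
                    zip([random.uniform(-1, 1) for _ in vec], vec))
                for _ in range(out_dim)]

    image_embedding = [0.1] * 8        # stand-in for a BEiT image encoding
    metadata_vector = [1.0, 0.0, 0.5]  # e.g. encoded age, sex, lesion site

    fused = image_embedding + metadata_vector  # merge: concatenation
    compressed = dense(fused, 4)               # compress: lower-dim projection
    ```

    In a real implementation the projection would be a learned layer in the network, trained end-to-end with the image backbone.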

    DAViS: a unified solution for data collection, analyzation, and visualization in real-time stock market prediction

    No full text
    The explosion of online information with the recent advent of digital technology in information processing, information storing, information sharing, natural language processing, and text mining techniques has enabled stock investors to uncover market movement and volatility from heterogeneous content. For example, a typical stock market investor reads the news, explores market sentiment, and analyzes technical details in order to make a sound decision prior to purchasing or selling a particular company's stock. However, capturing a dynamic stock market trend is challenging owing to high fluctuation and the non-stationary nature of the stock market. Although existing studies have attempted to enhance stock prediction, few have provided a complete decision-support system for investors to retrieve real-time data from multiple sources and extract insightful information for sound decision-making. To address the above challenge, we propose a unified solution for data collection, analysis, and visualization in real-time stock market prediction to retrieve and process relevant financial data from news articles, social media, and company technical information. We aim to provide not only useful information for stock investors but also meaningful visualization that enables investors to effectively interpret storyline events affecting stock prices. Specifically, we utilize an ensemble stacking of diversified machine-learning-based estimators and innovative contextual feature engineering to predict the next day's stock prices. Experiment results show that our proposed stock forecasting method outperforms a traditional baseline with an average mean absolute percentage error of 0.93. Our findings confirm that leveraging an ensemble scheme of machine learning methods with contextual information improves stock prediction performance. Finally, our study could be further extended to a wide variety of innovative financial applications that seek to incorporate external insight from contextual information such as large-scale online news articles and social media data.
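    The evaluation metric reported above, mean absolute percentage error (MAPE), and the stacking idea of combining diverse base estimators can be sketched minimally. This is illustrative only: the base estimators and the trivial averaging meta-combiner are assumptions, not the DAViS ensemble.

    ```python
    def mape(actual, predicted):
        """MAPE in percent: mean of |a - p| / |a|, times 100."""
        return 100.0 * sum(abs(a - p) / abs(a)
                           for a, p in zip(actual, predicted)) / len(actual)

    actual = [100.0, 200.0, 400.0]   # next-day prices (toy values)
    base_a = [101.0, 198.0, 404.0]   # e.g. a news-sentiment-based estimator
    base_b = [99.0, 202.0, 396.0]    # e.g. a technical-indicator estimator

    # Trivial meta-combination standing in for a learned stacking layer:
    stacked = [(x + y) / 2 for x, y in zip(base_a, base_b)]

    err_a = mape(actual, base_a)
    err_stacked = mape(actual, stacked)
    ```

    Here the two base estimators' errors cancel under averaging, which is the intuition behind stacking diversified estimators; a real stacker learns the combination weights from held-out predictions.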

    CAMELON: A System for Crime Metadata Extraction and Spatiotemporal Visualization From Online News Articles

    No full text
    Crimes result not only in loss to individuals but also hinder national economic growth. While crime rates have been reported to decrease in developed countries, underdeveloped and developing nations still suffer from prevalent crime, especially those undergoing rapid urbanization. The ability to monitor and assess trends of different types of crimes at both regional and national levels could assist local police and national-level policymakers in proactively devising means to prevent and address the root causes of criminal incidents. Furthermore, such a system could prove useful to individuals seeking to evaluate criminal activity for travel, investment, and relocation decisions. Recent literature has opted to utilize online news articles as a reliable and timely source of information on crime activity. However, most crime monitoring systems fueled by such news sources merely classify crimes into different types and visualize individual crimes on a map using extracted geolocations, lacking crucial information for stakeholders to make relevant, informed decisions. To better serve the unique needs of the target user groups, this paper proposes a novel, comprehensive crime visualization system that mines relevant information from large-scale online news articles. The system features automatic crime-type classification and metadata extraction from news articles. The crime classification and metadata schemes are designed to serve the information needs of law enforcement and policymakers, as well as general users. Novel interactive spatiotemporal designs are integrated into the system, with the ability to assess the severity and intensity of crimes in each region through the novel Criminometer index. The system is designed to generalize to different countries with diverse prevalent crime types and news-article languages, owing to the use of deep-learning cross-lingual language models. 
    The experiment results reveal that the proposed system yielded 86%, 51%, and 67% F1 in the crime-type classification, metadata extraction, and closed-form metadata extraction tasks, respectively. Additionally, the results of the system usability tests indicated a notable level of contentment among the target user groups. The findings not only offer insights into the possible applications of interactive spatiotemporal crime visualization tools for proactive policymaking and predictive policing but also serve as a foundation for future research that utilizes online news articles for intelligent monitoring of real-world phenomena.
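    The F1 figures reported for the classification tasks above can be computed with a standard macro-averaged F1, sketched below from scratch. This is an illustrative implementation of the metric, not the authors' evaluation code, and the example labels are invented.

    ```python
    def macro_f1(y_true, y_pred):
        """Average per-class F1 over all classes seen in either list."""
        classes = set(y_true) | set(y_pred)
        scores = []
        for c in classes:
            tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
            fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
            fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            scores.append(2 * precision * recall / (precision + recall)
                          if precision + recall else 0.0)
        return sum(scores) / len(scores)

    # Toy example: the classifier over-predicts "theft".
    score = macro_f1(["theft", "assault", "theft"],
                     ["theft", "theft", "theft"])
    ```

    Macro averaging weights every crime type equally regardless of frequency, which matters when rare but serious crime types would otherwise be drowned out by common ones.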