7 research outputs found

    DERIVING TECHNOLOGY ROADMAPS WITH TECH MINING TECHNIQUES

    Technology monitoring has long been a knowledge-intensive and time-consuming task for IT managers or domain experts. Tech mining techniques can reduce this effort. This paper proposes a technology monitoring framework based on tech mining techniques to facilitate the derivation of information and communication technology (ICT) roadmaps. With this framework, a tech mining engine can locate the most relevant documents describing a category of technologies. Domain experts participated in a scan meeting to verify the roadmaps generated from the selected cluster of documents. The draft roadmaps can be further refined with domain experts' judgment for technology forecasting and assessment.

    A Frame Work for Text Mining using Learned Information Extraction System

    Text mining is an exciting research area that tries to discover knowledge from unstructured texts. These texts can be found on a computer desktop, intranets, and the internet. The aim of this paper is to give an overview of text mining in the context of its techniques, application domains, and most challenging issue. Learned Information Extraction (LIE) is about locating specific items in natural-language documents. This paper presents a framework for text mining called DTEX (Discovery Text Extraction), which uses a learned information extraction system to transform text into more structured data that is then mined for interesting relationships. The initial version of DTEX integrates an LIE module acquired by an LIE learning system and a standard rule induction module. In addition, rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of the underlying extraction system. The best results of applying these techniques to a corpus of computer job announcement postings from an Internet newsgroup are presented.
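
    The idea of using mined rules to predict additional fillers can be sketched as follows. This is only an illustration, not the DTEX implementation: it swaps in simple support/confidence association rules for the "standard rule induction module" named in the abstract, and the slot names (`language`, `area`), the functions `mine_rules` and `predict_extra`, and the thresholds are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=2, min_confidence=0.8):
    """Mine rules (slot_a, val_a) -> (slot_b, val_b) from extracted records.

    Each record is a dict mapping a slot name to a set of extracted fillers.
    Thresholds and record layout are assumptions made for this sketch.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = {(slot, val) for slot, vals in record.items() for val in vals}
        item_counts.update(items)
        # Count every ordered pair of distinct items seen in the same record.
        pair_counts.update(permutations(sorted(items), 2))
    rules = {}
    for (antecedent, consequent), n in pair_counts.items():
        if antecedent[0] == consequent[0]:
            continue  # skip rules within a single slot
        if n >= min_support and n / item_counts[antecedent] >= min_confidence:
            rules.setdefault(antecedent, []).append(consequent)
    return rules


def predict_extra(record, rules):
    """Return a copy of the record with fillers implied by the mined rules added."""
    items = {(slot, val) for slot, vals in record.items() for val in vals}
    enriched = {slot: set(vals) for slot, vals in record.items()}
    for item in items:
        for slot, val in rules.get(item, []):
            enriched.setdefault(slot, set()).add(val)
    return enriched


# Toy database "extracted" from job postings (made-up data).
records = [
    {"language": {"java"}, "area": {"web"}},
    {"language": {"java"}, "area": {"web"}},
    {"language": {"c"}, "area": {"embedded"}},
]
rules = mine_rules(records)
# A new posting where the extractor found only the language slot.
print(predict_extra({"language": {"java"}}, rules))
```

    Adding predicted fillers in this way can raise recall, at the cost of precision when a rule fires wrongly, which is why the sketch exposes a confidence threshold.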

    Proposta de utilização de mineração de textos para seleção, classificação e qualificação de documentos [Proposal for using text mining to select, classify, and qualify documents].

    Introduction. Literature review. Proposed application for the Agência. Hypotheses. Expected results. Future work. bitstream/CNPTIA/10656/1/doc47.pdf. Accessed: 29 May 2008.

    Mining Knowledge from Text Collections Using Automatically Generated Metadata

    Data mining is typically applied to large databases of highly structured information in order to discover new knowledge. In businesses and institutions, the amount of information existing in repositories of text documents usually rivals or surpasses the amount found in relational databases. Though the amount of potentially valuable knowledge contained in document collections can be great, they are often difficult to analyze. Therefore, it is important to develop methods to efficiently discover knowledge embedded in these document repositories. In this paper we describe an approach for mining knowledge from text collections by applying data mining techniques to metadata records generated via automated text categorization. By controlling the set of metadata fields as well as the set of assigned categories we can customize the knowledge discovery task to address specific questions. As an example, we apply the approach to a large collection of product reviews and evaluate the performance of the knowledge discovery

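    Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen anhand von semi-strukturierten Dokumenten [Internet-based text analysis for extracting product development knowledge from semi-structured documents]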

    With the popularization and development of the Internet over the past few decades, more and more electronic documents have appeared online. Numerous product specifications are available via the Internet, e.g. as web pages or PDFs. This dissertation helps companies automatically extract products and product development knowledge from web pages. It investigates the definition of the product named entity, the construction of the corpus, the recognition of product named entities, and finally the extraction of product names and product development knowledge. The work covers the following aspects: 1. After studying product names in web pages, we define the various components of a product name. Based on this definition, we developed a guideline for annotating the corpus. We then create a product named entity corpus using a semi-supervised learning method. 2. According to the features of product names, we divide the recognition of product names into two phases. The first phase detects the brand name, the series name, and the type name of a product. Based on these first results, the product name is recognised in the second phase. Various methods can be used for recognition in these two phases. This work discusses the hidden Markov model, the maximum entropy model, and the conditional random field model. After comparing the three models, we use the conditional random field model. 3. After the product names are successfully recognised, the product names, the product features, and the restrictions between products are extracted.

    Text mining with exploitation of user's background knowledge : discovering novel association rules from text

    The goal of text mining is to find interesting and non-trivial patterns or knowledge from unstructured documents. Both objective and subjective measures have been proposed in the literature to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because they do not consider the knowledge and interests of users. Subjective measures require explicit input of user expectations, which is difficult or even impossible to obtain in text mining environments. This study proposes a user-oriented text-mining framework and applies it to the problem of discovering novel association rules from documents. The developed system, uMining, consists of two major components: a background knowledge developer and a novel association rule miner. The background knowledge developer learns a user's background knowledge by extracting keywords from documents already known to the user (background documents) and developing a concept hierarchy to organize popular keywords. The novel association rule miner discovers association rules among noun phrases extracted from relevant documents (target documents) and compares the rules with the background knowledge to predict each rule's novelty to the particular user (user-oriented novelty). The user-oriented novelty measure is defined as the semantic distance between the antecedent and the consequent of a rule in the background knowledge. It consists of two components: occurrence distance and connection distance. The former considers the co-occurrences of two keywords in the background documents: the more co-occurrences, the shorter the distance. The latter considers the connections the two keywords share with others in the concept hierarchy. It is defined as the length of the path connecting the two keywords in the concept hierarchy: the longer the path, the greater the distance. The user-oriented novelty measure is evaluated from two perspectives: novelty prediction accuracy and usefulness indication power. The results show that the user-oriented novelty measure outperforms the WordNet novelty measure and the compared objective measures in terms of predicting novel rules and identifying useful rules.
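
    The two distance components described in this abstract can be sketched roughly as below. The abstract gives no formulas, so everything here is an assumption made for illustration: the concept hierarchy is represented as a hypothetical child-to-parent dict, the connection distance is taken as the shortest path length in edges, and the occurrence distance is taken as 1 / (1 + co-occurrence count), which merely shrinks as co-occurrences grow. It is not the uMining implementation.

```python
from collections import deque


def connection_distance(hierarchy, a, b):
    """Shortest path length (in edges) between two keywords, or None if unconnected.

    `hierarchy` maps each child concept to its parent (assumed representation).
    """
    graph = {}
    for child, parent in hierarchy.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    if a not in graph or b not in graph:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # breadth-first search over the undirected hierarchy
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def occurrence_distance(background_docs, a, b):
    """Assumed form: more co-occurring background documents give a shorter distance."""
    co_occurrences = sum(1 for doc in background_docs if a in doc and b in doc)
    return 1.0 / (1 + co_occurrences)


# Made-up background knowledge for one user.
hierarchy = {
    "svm": "classification",
    "naive bayes": "classification",
    "classification": "text mining",
    "clustering": "text mining",
}
background_docs = [
    {"svm", "classification"},
    {"svm", "naive bayes"},
    {"clustering"},
]
print(connection_distance(hierarchy, "svm", "clustering"))
print(occurrence_distance(background_docs, "svm", "naive bayes"))
```

    How the two components are weighted and combined into a single novelty score is not stated in the abstract, so the sketch leaves them separate.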

    Neural Sequence Labeling on Social Media Text

    As social media (SM) brings opportunities to study societies across the world, it also brings a variety of challenges to automate the processing of SM language. In particular, most of the textual content in SM is considered noisy; it does not always stick to the rules of the written language, and it tends to have misspellings, arbitrary abbreviations, orthographic inconsistencies, and flexible grammar. Additionally, SM platforms provide a unique space for multilingual content. This polyglot environment requires modern systems to adapt to a diverse range of languages, imposing another linguistic barrier to processing and understanding of text from SM domains. This thesis aims at providing novel sequence labeling approaches to handle noise and linguistic code-switching (i.e., the alternation of languages in the same utterance) in SM text. In particular, the first part of this thesis focuses on named entity recognition for English SM text, where I propose linguistically-inspired methods to address phonological writing and flexible syntax. Besides, I investigate whether the performance of current state-of-the-art models relies on memorization or contextual generalization of entities. In the second part of this thesis, I focus on three sequence labeling tasks for code-switched SM text: language identification, part-of-speech tagging, and named entity recognition. Specifically, I propose transfer learning methods from state-of-the-art monolingual and multilingual models, such as ELMo and BERT, to the code-switching setting for sequence labeling. These methods reduce the demand for code-switching annotations and resources while exploiting multilingual knowledge from large pre-trained unsupervised models. The methods presented in this thesis are meant to benefit higher-level NLP applications oriented to social media domains, including but not limited to question-answering, conversational systems, and information extraction