4 research outputs found

    Internet-based text analysis for the extraction of product development knowledge from semi-structured documents

    With the popularization and growth of the Internet over the past few decades, more and more electronic documents have appeared online, and numerous product specifications are accessible via the Internet, e.g. in the form of web pages or PDFs. This dissertation helps companies automatically extract products, product specifications, and product restrictions from such web pages. It investigates the definition of product named entities, the construction of a corpus, the recognition of product named entities, and finally the extraction of product names and product development knowledge. The work covers the following aspects: 1. After studying product names in web pages, we define the various components of a product name entity. Based on this definition, we developed a guideline for annotating the corpus and then built a product named entity corpus using a semi-supervised learning method. 2. According to the characteristics of product names, we divide their recognition into two phases: the first phase detects the brand name, the series name, and the type designation of a product, and based on these results the full product name is recognized in the second phase. Several methods can be used for both phases; in this work we discuss the hidden Markov model, the maximum entropy model, and the conditional random field (CRF) model, and after comparing the three we adopt the CRF model. 3. Once product names have been recognized, the products, their features, and the restrictions between products are extracted.
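    To make the two-phase approach concrete, the sketch below shows what phase-one tagging of brand, series, and type tokens with a linear-chain CRF could look like. It is a minimal illustration using the third-party sklearn-crfsuite package; the BIO tag set, the token features, and the toy example are assumptions for illustration, not the corpus, guideline, or feature templates developed in the dissertation.

```python
# Minimal sketch of phase-one product-name tagging with a linear-chain CRF,
# using the third-party sklearn-crfsuite package. The tag set
# (B-BRAND, B-SERIES, B-TYPE, O) and the token features are illustrative
# assumptions, not the feature set used in the dissertation.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features: the token itself and its neighbours."""
    feats = {
        "token": tokens[i].lower(),
        "is_upper": tokens[i].isupper(),
        "has_digit": any(c.isdigit() for c in tokens[i]),
    }
    if i > 0:
        feats["prev"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next"] = tokens[i + 1].lower()
    return feats

# Toy training data: one tokenised product mention with phase-one labels.
sentences = [["Canon", "EOS", "550D"]]
labels = [["B-BRAND", "B-SERIES", "B-TYPE"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# Phase two would assemble the predicted brand/series/type spans into full
# product names; here we only show the phase-one prediction step.
print(crf.predict(X))
```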

    Query understanding: applying machine learning algorithms for named entity recognition

    The term frequency-inverse document frequency (tf-idf) paradigm, often used in general-purpose search engines to rank the relevance of documents in a corpus to a given user query, is based on the frequency of occurrence of the search terms in the corpus. These search terms are mostly expressed in natural language and can therefore be handled with natural language processing methods. For domain-specific search engines such as a software download portal, however, search terms are usually expressed in forms that do not conform to the grammatical rules of natural language, so they cannot be tackled with standard natural language processing techniques. This thesis proposes named entity recognition with supervised machine learning as a means of understanding queries for such domain-specific search engines. Our main objective is to apply machine learning techniques that automatically learn to recognize and classify search terms according to the predefined named entity categories they belong to. In this way we can understand user intent and rank result sets according to their relevance to the named entities detected in the search query. Our approach involved three machine learning algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Neural Networks (NN). We followed a supervised learning approach, training these algorithms on labeled training data from sample queries and then evaluating their performance on new, unseen queries. Our empirical results showed precisions of 93% for the NN, which was based on the distributed representations proposed by Yoshua Bengio, 85.60% for CRF, and 82.84% for HMM. The CRF's precision improved by about 2 percentage points, reaching 87.40%, after we added gazetteer-based and morphological features. These results show that machine learning methods for named entity recognition are useful for understanding query intent.
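    For context, the tf-idf ranking baseline mentioned at the start of this abstract can be sketched in a few lines with scikit-learn. The corpus, the query, and the cosine-similarity ranking below are illustrative assumptions; the thesis's actual portal index and its NER-based re-ranking are not reproduced here.

```python
# Minimal sketch of tf-idf relevance ranking with scikit-learn. The corpus and
# query are made-up examples; a real download portal would index its own
# catalogue instead.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "video player for windows 64 bit",
    "free pdf reader and editor",
    "antivirus software free download",
]
query = ["pdf editor free"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)   # tf-idf weights per document
query_vec = vectorizer.transform(query)         # same vocabulary for the query

# Rank documents by cosine similarity of their tf-idf vectors to the query.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```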

    Named Entity Recognition in Chinese Clinical Text

    Objective: Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP). In the medical domain, there have been a number of studies on NER in English clinical notes; however, very limited NER research has been done on clinical notes written in Chinese. The goal of this study is to develop corpora, methods, and systems for NER in Chinese clinical text. Materials and methods: To study entities in Chinese clinical text, we started by building annotated clinical corpora in Chinese. We developed an NER annotation guideline in Chinese by extending the one used in the 2010 i2b2 NLP challenge. We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital (PUMCH) in China. For each note, four types of entities including clinical problems, procedures, labs, and medications were annotated according to the developed guideline. In addition, an annotation tool was developed to assist two MD students in annotating the Chinese clinical documents. A comparison of entity distribution between Chinese and English clinical notes (646 English and 400 Chinese discharge summaries) was performed using the annotated corpora, to identify the important features for NER. In the NER study, two-thirds of the 400 notes were used for training the NER systems and one-third were used for testing. We investigated the effects of different types of features including bag-of-characters, word segmentation, part-of-speech, and section information, with different machine learning (ML) algorithms including Conditional Random Fields (CRF), Support Vector Machines (SVM), Maximum Entropy (ME), and Structural Support Vector Machines (SSVM) on the Chinese clinical NER task. All classifiers were trained on the training dataset and evaluated on the test set, and microaveraged precision, recall, and F-measure were reported. Results: Our evaluation on the independent test set showed that most types of features were beneficial to Chinese NER systems, although the improvements were limited. By combining word segmentation and section information, the system achieved the highest performance, indicating that these two types of features are complementary to each other. When the same types of optimized features were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM reached the highest performance among the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries, respectively. Conclusions: In this study, we created large annotated datasets of Chinese admission notes and discharge summaries and then systematically evaluated different types of features (e.g., syntactic, semantic, and segmentation information) and four ML algorithms including CRF, SVM, SSVM, and ME for clinical NER in Chinese. To the best of our knowledge, this is one of the earliest comprehensive efforts in Chinese clinical NER research, and we believe it will provide valuable insights for NLP research on Chinese clinical text. Our results suggest that both word segmentation and section information improve NER in Chinese clinical text, and that SSVM, a recent sequential labelling algorithm, outperformed CRF and other classification algorithms. Our best system achieved F-measures of 90.01% and 93.52% on Chinese discharge summaries and admission notes, respectively, indicating a promising start for Chinese clinical NLP research.
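    As a rough illustration of the kind of sequence labelling this study describes, the sketch below tags Chinese clinical text at the character level with a linear-chain CRF using bag-of-characters and section features. The sklearn-crfsuite package, the toy sentence, the PROBLEM tag, and the section value are assumptions for illustration; they are not the study's corpus, annotation guideline, or feature templates, and the SSVM variant that performed best is not shown.

```python
# Minimal sketch of character-level BIO tagging for Chinese clinical text with
# a linear-chain CRF (sklearn-crfsuite). The example sentence, the PROBLEM tag,
# and the section value are illustrative assumptions only.
import sklearn_crfsuite

def char_features(chars, i, section):
    """Bag-of-characters style features plus the note section as context."""
    feats = {
        "char": chars[i],
        "section": section,  # e.g. which section of the note the character is in
    }
    if i > 0:
        feats["prev_char"] = chars[i - 1]
    if i < len(chars) - 1:
        feats["next_char"] = chars[i + 1]
    return feats

# Toy example: "患者发热三天" with "发热" (fever) tagged as a clinical problem.
chars = list("患者发热三天")
tags = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O"]

X = [[char_features(chars, i, section="现病史") for i in range(len(chars))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```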

    Product named entity recognition in Chinese text
