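Internetgestützte Textanalyse zur Extraktion von Produktentwicklungswissen anhand von semi-strukturierten Dokumenten (Internet-based text analysis for extracting product development knowledge from semi-structured documents)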

Author: Zou Fan
Publication date: 01/01/2013

With the spread and development of the Internet over the past few decades, more and more electronic documents have appeared online. Numerous product specifications are available on the Internet, e.g. as web pages or PDFs. This dissertation helps companies automatically extract products, product specifications, and restrictions between products from websites. It investigates the definition of the product named entity, the construction of the corpus, the recognition of product named entities, and finally the extraction of product names and product development knowledge. The work covers the following aspects:

1. After studying product names in web pages, we define the components of a product name. Based on this definition, we developed a guideline for annotating the corpus. We then create a product named entity corpus using a semi-supervised learning method.
2. Based on the characteristics of product names, we divide their recognition into two phases. The first phase detects the brand name, the series name, and the type name of a product. Based on these results, the full product name is recognized in the second phase. Several methods can be used for recognition in the two phases. This work discusses the hidden Markov model, the maximum entropy model, and the conditional random field model. After comparing the three methods, we use the conditional random field model.
3. Once the product names have been recognized, the product names, the product features, and the restrictions between products are extracted.
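The two-phase recognition described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the trained conditional random field model is replaced here by a plain dictionary lookup so that the example stays self-contained, and the lexicons `BRANDS` and `SERIES`, the label names, and both function names are assumptions made for illustration only.

```python
from typing import List

# Hypothetical lexicons standing in for a trained phase-1 model.
# The abstract uses a conditional random field here; a lookup is a stand-in.
BRANDS = {"canon", "nikon"}
SERIES = {"eos", "coolpix"}


def tag_components(tokens: List[str]) -> List[str]:
    """Phase 1: label each token as BRAND, SERIES, TYPE, or O (outside)."""
    labels: List[str] = []
    for tok in tokens:
        low = tok.lower()
        if low in BRANDS:
            labels.append("BRAND")
        elif low in SERIES:
            labels.append("SERIES")
        elif (
            any(ch.isdigit() for ch in tok)
            and labels
            and labels[-1] in ("BRAND", "SERIES")
        ):
            # Assumed heuristic: a token containing a digit that directly
            # follows a brand or series token is treated as the type name.
            labels.append("TYPE")
        else:
            labels.append("O")
    return labels


def merge_product_names(tokens: List[str], labels: List[str]) -> List[str]:
    """Phase 2: join each maximal run of component tokens into one name."""
    names: List[str] = []
    current: List[str] = []
    for tok, lab in zip(tokens, labels):
        if lab != "O":
            current.append(tok)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


if __name__ == "__main__":
    sentence = "The Canon EOS 5D is compatible with Nikon Coolpix P7000 accessories"
    toks = sentence.split()
    print(merge_product_names(toks, tag_components(toks)))
```

Splitting the task this way keeps the first phase focused on short, regular components, while the second phase only has to decide which adjacent components belong to the same product name. In a real system both phases would be learned from the annotated corpus rather than hard-coded.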