    A Context Free Grammar for Key Noun-Phrase Extraction from Text

    Topic extraction is a major field in text mining. Key noun-phrases play an important role in identifying a document's main topics because the primary information of a document is carried by its noun-phrases. In this paper, we propose a new topic extraction scheme that identifies key noun-phrases by constructing a context-free grammar (CFG) from the input documents. In our method, documents are reconstructed as a set of CFG rules using an existing algorithm called Sequitur. Sequitur infers context-free grammar rules, which can be viewed as a hierarchical structure, from a sequence of discrete symbols. This hierarchy exposes the underlying structure of the input sequence and helps capture meaningful regularities. Based on this hierarchical structure of the input document, we designed a new algorithm to identify noun-phrases and extract key noun-phrases.
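To illustrate how a token sequence can be turned into hierarchical grammar rules, here is a minimal sketch. It is not Sequitur itself: Sequitur works online and enforces digram-uniqueness and rule-utility constraints, whereas this sketch uses a simpler offline strategy that repeatedly replaces the most frequent repeated pair of adjacent symbols. The function names (`infer_grammar`, `expand`) and the `<R1>`-style rule labels are hypothetical, and the sketch assumes no input token already looks like a rule label.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        digram = (seq[i], seq[i + 1])
        if last_pos.get(digram) == i - 1:
            continue  # overlaps the previous occurrence, e.g. "a a a"
        counts[digram] += 1
        last_pos[digram] = i
    return counts


def replace_digram(seq, digram, name):
    """Replace each left-to-right occurrence of digram with the rule name."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def infer_grammar(tokens):
    """Return (start_sequence, rules) where rules maps a name to a symbol pair."""
    seq = list(tokens)
    rules = {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        digram, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no pair repeats, so no further rule is useful
        name = f"<R{len(rules) + 1}>"
        rules[name] = digram
        seq = replace_digram(seq, digram, name)
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the terminal tokens it covers."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    words = "the key noun phrase and the key noun phrase".split()
    start, rules = infer_grammar(words)
    print(start)
    for name, body in rules.items():
        print(name, "->", " ".join(body), "|", " ".join(expand(name, rules)))
```

Each rule covers a phrase that repeats in the input, and rules nest inside one another, which is the kind of hierarchy the abstract refers to. How noun-phrases are then selected from that hierarchy is specific to the paper and is not attempted here.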

    Web page content adjustment for search engines using machine learning and natural language processing

    Search engine optimization (SEO) comprises techniques by which the author of a website adjusts its pages so that they rank higher in organic (natural) search results on popular Internet search engines for selected keywords. This process includes, among other things, content optimization, that is, adjusting page content to the recommendations for search engine optimization (hereafter SEO recommendations). This study examines whether machine learning techniques can be used to classify web pages into three predefined classes according to the degree to which their content is adjusted to SEO recommendations. Using machine learning algorithms, classifiers were built and trained to assign an unknown sample (a web page) to the predefined classes and to identify the significant factors (variables) that affect the degree of adjustment. In addition, a system for correcting "misfit" pages was built and tested using techniques from the domain of natural language processing. The results show that machine learning can be used to assess the degree to which a page is adjusted to SEO recommendations and to identify significant SEO factors, and that the proposed correction system can be used to correct, that is, improve, pages that were classified as "misfits" in the earlier stages.

    Term-Based Clustering and Summarization of Web Page Collections

    Abstract. Effectively summarizing Web page collections becomes more and more critical as the amount of information on the World Wide Web continues to grow. A concise and meaningful automatically generated summary of a Web page collection can help Web users quickly understand the essential topics and main contents covered in the collection without spending much time browsing. However, automatically generating coherent summaries that are as good as human-authored ones is a challenging task, since Web page collections often contain diverse topics and contents. This research aims to cluster Web page collections using automatically extracted topical terms and to summarize the resulting clusters automatically. We experiment with word- and term-based representations of Web documents and demonstrate that term-based clustering significantly outperforms word-based clustering while using a much lower dimensionality. The summaries of the computed clusters are informative and meaningful, which indicates that clustering and summarization of large Web page collections are promising for alleviating the information overload problem.
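The difference between word-based and term-based document representations can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how topical terms are extracted or which clustering method is used, so the term list here is supplied by hand, and the helper names (`word_vector`, `term_vector`, `cosine`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_vector(text):
    """Word-based representation: one dimension per distinct word."""
    return Counter(tokenize(text))


def term_vector(text, terms):
    """Term-based representation: one dimension per topical term found.

    `terms` is an assumed, hand-supplied list of (possibly multi-word) terms.
    """
    tokens = tokenize(text)
    vec = Counter()
    for term in terms:
        parts = tokenize(term)
        if not parts:
            continue  # skip empty terms
        n = len(parts)
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == parts
        )
        if hits:
            vec[term] = hits
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    terms = ["machine learning", "search engine"]  # assumed topical terms
    d1 = "Machine learning improves search engine ranking."
    d2 = "A search engine can use machine learning."
    print(len(word_vector(d1)), len(term_vector(d1, terms)))
    print(cosine(word_vector(d1), word_vector(d2)))
    print(cosine(term_vector(d1, terms), term_vector(d2, terms)))
```

In this toy setting the term-based vectors have fewer dimensions than the word-based ones, and two documents about the same topics look more alike. A clustering algorithm would then operate on pairwise similarities of this kind.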