161 research outputs found

    The design, implementation, and effectiveness of a Farsi word stemmer

    Full text link
    A stemmer for the Farsi language has been designed, implemented, and evaluated. The stemmer uses Farsi morphology to remove affixes, producing effective stems. The implementation is written in C, using strings of unicode-encoded characters to represent Farsi words. It is meant to enhance the Farsi information retrieval system currently being developed at the Information Science Research Institute at the University of Nevada at Las Vegas. The effectiveness of the Farsi stemmer and stopword list on recall/precision was tested

    Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language

    Get PDF
    Persian natural language processing (NLP) researchers have many limitations to access linguistic tools which are suitable for text processing. Therefore, researchin Persian text processing is very limited. Since dataset is an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. The provided corpora in this article are based on HAMSHAHRI dataset which is appropriate for simple information retrieval and simple natural language processing because it has not been tagged. We converted this dataset to tagged collection and increased its text quality. The new corpora minimize the text preprocessing requirement. Here we have used STep-1 tools for text processing and have proposed some ideas to remove the bugs of these tools in order to increase their quality. At the end we used the new corpora for text retrieval and results showed performance improvement. 

    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    Get PDF
    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages.Search is not a solved problem even in the world of Google and Bing's state of the art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem -- the terms in document and user's information request don't overlap. For example, cars and automobiles. This phenomenon is called synonymy. Similarly, the user's term may be polysemous -- a user is inquiring about a river's bank, but documents about financial institutions are matched. Vocabulary mismatch exacerbated when the search occurs in Morphological Rich Language (MRL). Concept search techniques like dimensionality reduction do not improve search in Morphological Rich Languages. Names frequently occur news text and determine the "what," "where," "when," and "who" in the news text. Named Entity Recognition attempts to recognize names automatically in text, but these techniques are far from mature in MRL, especially in Arabic Script languages. Urdu is one the focus MRL of this dissertation among Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, stop word generation algorithm, a light stemmer, a baseline, and NER algorithm is created so the NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over baseline. Furthermore, this dissertation highlights the challenges for researching in low-resource MRL languages

    Clustering Blog Information

    Get PDF
    Blogs form an important source of information in today’s internet world. There are blogs on different topics such as technical, health, electronic gadgets, shopping, etc. However, most of the blog websites have the blogs arranged in chronological order rather than its contents. Such arrangement of blogs makes it difficult for the user searching information about a particular topic from the blog. To resolve this problem, we propose an approach to cluster the blogs based on its content. We studied several clustering algorithms available. The objective of this report is to understand various steps involved in clustering blog information and working of clustering algorithms which best fits in text clustering and improving clustering results by reducing its drawbacks. The report demonstrates a comparison between the K-Means, Vector Space Model (VSM), Latent Semantic Indexing (LSI), and Fuzzy C-Means (FCM) clustering algorithms discussed through the paper and selects the optimum algorithm for blog clustering. The paper proposes modification of selected optimum algorithm to get required blog clustering. A comparison table of the selected algorithm and modified algorithm is included at the end of the paper and the results showed the proposed modified algorithm has performed overall better compared to other clustering algorithms

    Improving Retrieval Accuracy in Main Content Extraction from HTML Web Documents

    Get PDF
    The rapid growth of text based information on the World Wide Web and various applications making use of this data motivates the need for efficient and effective methods to identify and separate the “main content” from the additional content items, such as navigation menus, advertisements, design elements or legal disclaimers. Firstly, in this thesis, we study, develop, and evaluate R2L, DANA, DANAg, and AdDANAg, a family of novel algorithms for extracting the main content of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other three algorithms, is to use well particularities of Right-to-Left languages for obtaining the main content of web pages. As the English character set and the Right-to-Left character set are encoded in different intervals of the Unicode character set, we can efficiently distinguish the Right-to-Left characters from the English ones in an HTML file. This enables the R2L approach to recognize areas of the HTML file with a high density of Right-to-Left characters and a low density of characters from the English character set. Having recognized these areas, R2L can successfully separate only the Right-to-Left characters. The first extension of the R2L, DANA, improves effectiveness of the baseline algorithm by employing an HTML parser in a post processing phase of R2L for extracting the main content from areas with a high density of Right-to-Left characters. DANAg is the second extension of the R2L and generalizes the idea of R2L to render it language independent. AdDANAg, the third extension of R2L, integrates a new preprocessing step to normalize the hyperlink tags. The presented approaches are analyzed under the aspects of efficiency and effectiveness. We compare them to several established main content extraction algorithms and show that we extend the state-of-the-art in terms of both, efficiency and effectiveness. Secondly, automatically extracting the headline of web articles has many applications. We develop and evaluate a content-based and language-independent approach, TitleFinder, for unsupervised extraction of the headline of web articles. The proposed method achieves high performance in terms of effectiveness and efficiency and outperforms approaches operating on structural and visual features.Das rasante Wachstum von textbasierten Informationen im World Wide Web und die Vielfalt der Anwendungen, die diese Daten nutzen, macht es notwendig, effiziente und effektive Methoden zu entwickeln, die den Hauptinhalt identifizieren und von den zusätzlichen Inhaltsobjekten wie z.B. Navigations-Menüs, Anzeigen, Design-Elementen oder Haftungsausschlüssen trennen. Zunächst untersuchen, entwickeln und evaluieren wir in dieser Arbeit R2L, DANA, DANAg und AdDANAg, eine Familie von neuartigen Algorithmen zum Extrahieren des Inhalts von Web-Dokumenten. Das grundlegende Konzept hinter R2L, das auch zur Entwicklung der drei weiteren Algorithmen führte, nutzt die Besonderheiten der Rechts-nach-links-Sprachen aus, um den Hauptinhalt von Webseiten zu extrahieren. Da der lateinische Zeichensatz und die Rechts-nach-links-Zeichensätze durch verschiedene Abschnitte des Unicode-Zeichensatzes kodiert werden, lassen sich die Rechts-nach-links-Zeichen leicht von den lateinischen Zeichen in einer HTML-Datei unterscheiden. Das erlaubt dem R2L-Ansatz, Bereiche mit einer hohen Dichte von Rechts-nach-links-Zeichen und wenigen lateinischen Zeichen aus einer HTML-Datei zu erkennen. Aus diesen Bereichen kann dann R2L die Rechts-nach-links-Zeichen extrahieren. Die erste Erweiterung, DANA, verbessert die Wirksamkeit des Baseline-Algorithmus durch die Verwendung eines HTML-Parsers in der Nachbearbeitungsphase des R2L-Algorithmus, um den Inhalt aus Bereichen mit einer hohen Dichte von Rechts-nach-links-Zeichen zu extrahieren. DANAg erweitert den Ansatz des R2L-Algorithmus, so dass eine Sprachunabhängigkeit erreicht wird. Die dritte Erweiterung, AdDANAg, integriert eine neue Vorverarbeitungsschritte, um u.a. die Weblinks zu normalisieren. Die vorgestellten Ansätze werden in Bezug auf Effizienz und Effektivität analysiert. Im Vergleich mit mehreren etablierten Hauptinhalt-Extraktions-Algorithmen zeigen wir, dass sie in diesen Punkten überlegen sind. Darüber hinaus findet die Extraktion der Überschriften aus Web-Artikeln vielfältige Anwendungen. Hierzu entwickeln wir mit TitleFinder einen sich nur auf den Textinhalt beziehenden und sprachabhängigen Ansatz. Das vorgestellte Verfahren ist in Bezug auf Effektivität und Effizienz besser als bekannte Ansätze, die auf strukturellen und visuellen Eigenschaften der HTML-Datei beruhen
    • …
    corecore