
    Kaynak Keşif Yeteneğinin Artırılması için İnternet Kaynaklarının İçeriklerinin Standart Biçimde Tanımlanması

    Since internet resources are not yet machine-understandable, there are problems in satisfying the information needs of users. The unstructured representation of resources, and the use of ad-hoc solutions to the question of how each resource should be interpreted, are the most immediately apparent reasons. Given the drastic and steady increase observed over the years in both the number and the volume of Internet resources, the need has arisen, in addition to search engines based on content terms, for software agents capable of automatically discovering and harvesting these resources. The success of such software agents depends very closely on the standardization of the modeling of the resources to be processed. A semantic model defined as the result of such an effort is RDF (Resource Description Framework), and work on this model is overseen by the WWW (World-Wide Web) Consortium. The DC (Dublin Core) metadata elements have been defined, using the extensibility property of RDF, to handle electronic catalog information. In this article, an authoring editor called H-DCEdit is introduced. This editor makes use of the RDF/DC model to define the contents of Turkish electronic resources. To serialize (or encode) an RDF model, SGML (Standard Generalized Markup Language) has been used. In addition, by using the DSSSL (Document Style Semantics and Specification Language) standard, the format information for a given RDF/DC document has been separated from its content, and hence H-DCEdit provides different views of an RDF/DC document.
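
    As a sketch of the kind of record the abstract describes, the following Python fragment builds a small RDF/DC description of a hypothetical Turkish resource. Note that the paper serializes RDF with SGML, whereas this sketch emits XML for brevity; the URI and element values are invented for illustration.

        import xml.etree.ElementTree as ET

        # Standard namespaces for RDF and the Dublin Core element set
        RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        DC_NS = "http://purl.org/dc/elements/1.1/"
        ET.register_namespace("rdf", RDF_NS)
        ET.register_namespace("dc", DC_NS)

        rdf = ET.Element(f"{{{RDF_NS}}}RDF")
        desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                             {f"{{{RDF_NS}}}about": "http://example.org/kaynak"})
        # Three illustrative DC elements of the sort such an editor would emit
        for element, value in [("title", "Örnek Kaynak"),
                               ("creator", "A. Yazar"),
                               ("language", "tr")]:
            ET.SubElement(desc, f"{{{DC_NS}}}{element}").text = value

        print(ET.tostring(rdf, encoding="unicode"))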

    Sessizliğin Kaldırılması ve Konuşmanın Parçalara Ayrılması İşleminin Türkçe Otomatik Konuşma Tanıma Üzerindeki Etkisi

    Automatic Speech Recognition systems are fundamentally built on acoustic information. Paired speech and text data are used to derive phoneme information from the acoustic signal. Acoustic models trained on such data cannot model all of the acoustic information encountered in real life. It is therefore necessary to perform certain pre-processing steps and to eliminate acoustic information that would degrade the performance of automatic speech recognition systems. In this study, a method is proposed for removing the silences that occur within speech. The aim of the proposed method is to eliminate silence information and to split into segments the speech that introduces long-range dependencies in the acoustic information. The silence-free, segmented speech produced by the method is given as input to a Turkish Automatic Speech Recognition system, and at the system's output the transcripts corresponding to the input speech segments are concatenated and presented. The experiments showed that removing silence and segmenting the speech improves the performance of Automatic Speech Recognition systems.
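
    As a minimal sketch of the kind of pre-processing the abstract describes, the fragment below removes silence and splits speech into segments using short-time energy. This is a plausible stand-in, not necessarily the authors' method; the frame length, hop and threshold values are illustrative assumptions.

        import numpy as np

        def split_on_silence(samples, rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
            """Return (start, end) sample indices of voiced regions in a
            float waveform normalized to [-1, 1]."""
            frame = int(rate * frame_ms / 1000)
            hop = int(rate * hop_ms / 1000)
            # Short-time log energy (dB) per frame
            energies = []
            for start in range(0, len(samples) - frame, hop):
                window = samples[start:start + frame].astype(np.float64)
                rms = np.sqrt(np.mean(window ** 2) + 1e-12)
                energies.append(20.0 * np.log10(rms))
            # Frames above the threshold are treated as speech
            voiced = np.array(energies) > threshold_db
            # Merge runs of voiced frames into segments
            segments, start_idx = [], None
            for i, v in enumerate(voiced):
                if v and start_idx is None:
                    start_idx = i
                elif not v and start_idx is not None:
                    segments.append((start_idx * hop, i * hop + frame))
                    start_idx = None
            if start_idx is not None:
                segments.append((start_idx * hop, len(samples)))
            return segments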

    How does language model size affect speech recognition accuracy for the Turkish language?

    In this paper we investigate the effect of Language Model (LM) size on Speech Recognition (SR) accuracy, and we provide details of our approach for obtaining an LM for Turkish. Since an LM is obtained by statistical processing of raw text, we expect that increasing the size of the data available for training the LM will improve SR accuracy. Since this study is based on recognition of Turkish, a highly agglutinative language, it is important to find the appropriate size for the training data: the minimum required data size is expected to be much larger than that needed to train a language model for a language with a low level of agglutination, such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters, as these are expected to differ for a highly agglutinative language. We show that increasing the training data size to an appropriate level improved recognition accuracy; on the other hand, changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
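
    For readers unfamiliar with the two tuning parameters, the sketch below shows where LMW and ATC typically enter a beam-search decoder. The function names and default values are illustrative assumptions, not the decoder or settings used in the paper.

        from collections import namedtuple

        Token = namedtuple("Token", ["history", "total_score"])

        def combined_score(acoustic_logprob, lm_logprob, lmw=12.0):
            # LMW scales the language model's contribution against the
            # acoustic score; a larger weight trusts the LM more.
            return acoustic_logprob + lmw * lm_logprob

        def prune(active_tokens, atc=5000):
            # ATC caps how many hypotheses survive each frame; a larger
            # count explores more alternatives at higher compute cost.
            return sorted(active_tokens, key=lambda t: t.total_score,
                          reverse=True)[:atc]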

    Truncation of Content Terms for Turkish

    Stemming, truncation, suffix stripping and decompounding algorithms, used in information retrieval (IR) to reduce content terms to their conflated forms, are well known for improving retrieval performance as well as for providing space and processing efficiency. In this paper we investigate the statistical characteristics of truncated terms for Turkish on a text corpus of more than 50 million words, and attempt to measure the vocabulary growth rates for both whole and truncated words. Findings indicate that truncated words in Turkish exhibit Zipfian behavior, and that whole words can successfully be truncated to the average word length (6.2 characters) without compromising retrieval effectiveness. The vocabulary growth rate for truncated words is about one third of that for whole words. The results of our study are twofold. First, they clearly open the way to truncation of content terms for Turkish, for which there is no publicly available stemming code equipped with morphological analysis capability. Second, the use of a truncation algorithm for indexing Turkish text may yield effectiveness values comparable to those of a stemming algorithm, and hence the need for stemming may become obsolete, given that morphological analyzers for Turkish are highly complex in nature.
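
    A minimal sketch of the truncation scheme the paper evaluates: cut every token to a fixed prefix and track how the two vocabularies grow. The 6-character cut-off rounds down the reported average word length of 6.2 characters; the sampling interval is an illustrative choice.

        def truncate(token, cutoff=6):
            # Fixed-prefix truncation used in place of stemming
            return token[:cutoff]

        def vocabulary_growth(tokens, cutoff=6, step=100000):
            """Yield (tokens seen, whole-word vocab size, truncated vocab size)."""
            whole, truncated = set(), set()
            for i, tok in enumerate(tokens, 1):
                whole.add(tok)
                truncated.add(truncate(tok, cutoff))
                if i % step == 0:
                    yield i, len(whole), len(truncated)

        print(truncate("kitaplarımızdan"))  # 'kitapl'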

    Veri Tabanlarında Bilgi Keşfine Formel Bir Yaklaşım: Kısım II: Eşleştirme Sorgularının Biçimsel Kavram Analizi ile Modellenmesi.

    In this study we utilize formal concept analysis to model association rules. Formal concept analysis provides a topological structure for a universe of objects and attributes. By exploiting the relationship between objects and attributes, formal concept analysis introduces an entity called a concept. A concept is a pair of a set of objects and a set of attributes: the attributes are maximally possessed by the set of objects and, symmetrically, the objects form the maximal set that possesses all of the attributes. Formal concept analysis supplies formal mathematical tools and techniques for developing and analyzing relationships between concepts and for building concept structures. We propose and develop a connection between association rule mining and formal concept analysis, and show that the dependencies found by an association query can be derived from a concept structure. We have extended the formal concept analysis framework to association rule mining, using the analysis of the market-basket problem, a specific case of association rule mining, to achieve this extension. The extension provides a natural basis for complexity analysis of association rule mining, and can also help in developing a unified framework for common data mining problems.
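
    A worked toy example of the central definition: a formal concept is a pair (A, B) of objects and attributes where each determines the other under the two derivation operators. The market-basket context below is an illustrative assumption.

        # Toy object-attribute context (transactions x items)
        context = {
            "o1": {"bread", "milk"},
            "o2": {"bread", "butter"},
            "o3": {"bread", "milk", "butter"},
        }

        def common_attributes(objects):
            # A -> A': attributes shared by every object in A
            if not objects:
                return set.union(*context.values())
            return set.intersection(*(context[o] for o in objects))

        def objects_with(attributes):
            # B -> B': objects possessing every attribute in B
            return {o for o, attrs in context.items() if attributes <= attrs}

        # Closing {"milk"} yields the concept ({o1, o3}, {bread, milk})
        extent = objects_with({"milk"})
        intent = common_attributes(extent)
        print(sorted(extent), sorted(intent))  # ['o1', 'o3'] ['bread', 'milk']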

    Veri Tabanlarında Bilgi Keşfine Formel Bir Yaklaşım: Kısım I: Eşleştirme Sorguları ve Algoritmalar

    In the last two decades, we have witnessed explosive growth in our capabilities both to collect and store data and to generate even more data by further computer processing. In fact, it is estimated that the amount of information in the world doubles every 20 months. Our inability to interpret and digest these data as readily as they are accumulated has created a need for a new generation of tools and techniques for automated and intelligent database analysis. Consequently, the discipline of knowledge discovery in databases (KDD), which deals with the study of such tools and techniques, has evolved into an important and active area of research, because of the theoretical challenges and practical applications associated with the problem of discovering (or extracting) valuable, interesting and previously unknown knowledge from very large real-world databases. Many aspects of KDD have been investigated in several related fields, such as database systems, machine learning, intelligent information systems, statistics and expert systems. In the first part of our study (Part I), we discuss the fundamental issues of KDD as well as its process-oriented view, with special emphasis on modelling association rules. In the second part (Part II), a follow-up to this article, association queries will be modelled by formal concept analysis.

    Türkçe Arama Motorlarında Performans Değerlendirme

    Evaluation of Information Retrieval Performance of Turkish Search Engines. This is an investigation of the information retrieval performance of search engines based on various measures. We searched 17 queries of differing types on four Turkish search engines, namely Arabul, Arama, Netbul and Superonline. We classified each document/Web site contained in the retrieval results as "relevant" or "non-relevant". Based on this classification, we calculated the precision and normalized ranking ratios at various cut-off points for each query run on each search engine. We checked for "dead" or "broken" links among the retrieval results to determine how often the crawlers of search engines visit the sites they index and how often they update their indexes, if needed. We determined the coverage and novelty ratios of each search engine by searching five keywords that have been the most frequently submitted queries to Turkish search engines: "mp3", "oyun" (game), "sex", "erotik" (erotica) and "porno" (porn). By means of two modest experiments, we tested whether Turkish search engines make use of index terms assigned by the authors of Web pages under the "keywords" and "description" meta tags of HTML documents. Using Kruskal-Wallis and Mann-Whitney statistics, we tested whether the up-to-dateness, precision, normalized ranking, coverage and novelty ratios of the search engines differ significantly from each other. The major findings of our research are as follows. On average, one in six documents retrieved by the search engines was not available due to dead or broken links; Netbul retrieved fewer documents with dead or broken links than the other search engines did. Some search engines retrieved no documents (so-called "zero retrievals"), or no relevant documents, for some queries. On average, five in six documents retrieved were not relevant. Average precision ratios of the search engines ranged between 11% (Netbul) and 28% (Arama), Superonline being 20% and Arabul 15%. Arama retrieved more relevant documents than Arabul and Netbul did within the first five documents retrieved. The search engines do not seem to make every effort to retrieve and display relevant documents in the higher ranks of retrieval results. Average normalized ranking ratios ranged between 20% (Arabul) and 54% (Arama), Superonline being 37% and Netbul 30%. Arama retrieved relevant documents at higher ranks than Arabul and Netbul did. The strong positive correlation between the precision and normalized ranking ratios weakened as the number of documents we evaluated increased. The search engines were less successful in finding relevant documents for specific queries or for queries containing broad terms. Although non-relevant documents were higher in number, the search engines were more successful on single-term queries and queries with the Boolean "OR" operator; the success rate was lower for queries with the Boolean "AND" operator. The search engines seemingly do not use stemming algorithms to better analyze queries and increase retrieval performance. The use of Turkish characters such as "ç", "ö" and "ş" in queries still creates problems for Turkish search engines, as retrieval results differed for such queries. Superonline's coverage was much higher than that of the other search engines for the most frequently searched queries. Except for Arama, the search engines index fewer documents/sites with domain names ending in ".tr"; Arama is the indisputable leader in covering documents with Turkish addresses. Almost all search engines scored high in novelty ratios for the most frequently searched queries. Different search engines tend to retrieve different relevant documents for the same queries. For retrieval purposes, Netbul and Superonline seem to index and make use of the metadata fields contained in HTML documents under the "keywords" and "description" meta tags. The research report concludes with some recommendations for improving the information retrieval performance of Turkish search engines.
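
    For concreteness, the sketch below computes precision at a cut-off point, the core measure of the evaluation, over one query's ranked relevance judgments. The judgment list is invented for illustration.

        def precision_at(relevance, k):
            """relevance: 1 (relevant) / 0 (non-relevant) in ranked order."""
            top = relevance[:k]
            return sum(top) / len(top) if top else 0.0

        judged = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # one query's first ten results
        print(precision_at(judged, 5))   # 0.4
        print(precision_at(judged, 10))  # 0.2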

    Kavram Tabanlı Bilgi Geri Getirim Yaklaşımı

    Search engines fall short in bridging the gap caused by differences between the vocabulary used by the authors of documents and the vocabulary users employ in expressing their information needs. One way to alleviate this problem is to introduce concept-based retrieval of information. In this study, a model is proposed that can be regarded in some respects as an extension of the RUBRIC system, but with distinct features especially suited to enabling descriptive-level retrieval, and the results obtained are evaluated.