4 research outputs found

    The application of statistical methods in the development of Cyrillic-Latin converter for Tatar language

    Get PDF
    The article describes the process of a software product development that allows you to convert a text written in Tatar to Latin using Cyrillic graphics. The aspects of Cyrillicgraphics to Latin graphics conversion are considered for Tatar language. The authors study the application of various statistical methods necessary for converter operation and analyzethe speed and the accuracy of the conversion algorithms. An algorithm was created and software modules were developed that made it possible to convert messages written in Tatar Cyrillic alphabet to Tatar Latin alphabet. Based on normative documents and scientific works on the use of Latin graphics in Tatar language, a verbal and an algorithmic model of conversion was constructed. In the process of development, it turned out that the process of a Tatar word conversion depends on its origin. If native Tatar words are converted according to the phonetic principle (кәлам - qäläm), the borrowed words are converted according to the rules of transliteration. The main problem of the study is the problem of a word origin determination. In order to solve this problem, the authors propose various algorithms. Software tools based on the statistical processing of linguistic data are considered and developed in the work: combined bigram analysis, naive Bayesian classification and a direct search. Each of these algorithms is used to determine the etymology of a word, on which depends the application of certain rules of conversion from Cyrillic to Latin. The result of the research is a developed software product that is capable to carry out the process of Cyrillic graphics conversion to Latin for Tatar. In the future, the authors plan to improve the software product and use it in educational activities

    Exploring Linguistic Features for Web Spam Detection: A Preliminary Study

    Get PDF
    We study the usability of linguistic features in theWeb spam classification task. The features were computed on two Web spam corpora: Webspam-Uk2006 and Webspam-Uk2007, we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.JRC.G.2-Support to external securit

    Detecting, Modeling, and Predicting User Temporal Intention

    Get PDF
    The content of social media has grown exponentially in the recent years and its role has evolved from narrating life events to actually shaping them. Unfortunately, content posted and shared in social networks is vulnerable and prone to loss or change, rendering the context associated with it (a tweet, post, status, or others) meaningless. There is an inherent value in maintaining the consistency of such social records as in some cases they take over the task of being the first draft of history as collections of these social posts narrate the pulse of the street during historic events, protest, riots, elections, war, disasters, and others as shown in this work. The user sharing the resource has an implicit temporal intent: either the state of the resource at the time of sharing, or the current state of the resource at the time of the reader \clicking . In this research, we propose a model to detect and predict the user\u27s temporal intention of the author upon sharing content in the social network and of the reader upon resolving this content. To build this model, we first examine the three aspects of the problem: the resource, time, and the user. For the resource we start by analyzing the content on the live web and its persistence. We noticed that a portion of the resources shared in social media disappear, and with further analysis we unraveled a relationship between this disappearance and time. We lose around 11% of the resources after one year of sharing and a steady 7% every following year. With this, we turn to the public archives and our analysis reveals that not all posted resources are archived and even they were an average 8% per year disappears from the archives and in some cases the archived content is heavily damaged. These observations prove that in regards to archives resources are not well-enough populated to consistently and reliably reconstruct the missing resource as it existed at the time of sharing. To analyze the concept of time we devised several experiments to estimate the creation date of the shared resources. We developed Carbon Date, a tool which successfully estimated the correct creation dates for 76% of the test sets. Since the resources\u27 creation we wanted to measure if and how they change with time. We conducted a longitudinal study on a data set of very recently-published tweet-resource pairs and recording observations hourly. We found that after just one hour, ~4% of the resources have changed by ≥30% while after a day the change rate slowed to be ~12% of the resources changed by ≥40%. In regards to the third and final component of the problem we conducted user behavioral analysis experiments and built a data set of 1,124 instances manually assigned by test subjects. Temporal intention proved to be a difficult concept for average users to understand. We developed our Temporal Intention Relevancy Model (TIRM) to transform the highly subjective temporal intention problem into the more easily understood idea of relevancy between a tweet and the resource it links to, and change of the resource through time. On our collected data set TIRM produced a significant 90.27% success rate. Furthermore, we extended TIRM and used it to build a time-based model to predict temporal intention change or steadiness at the time of posting with 77% accuracy. We built a service API around this model to provide predictions and a few prototypes. Future tools could implement TIRM to assist users in pushing copies of shared resources into public web archives to ensure the integrity of the historical record. Additional tools could be used to assist the mining of the existing social media corpus by derefrencing the intended version of the shared resource based on the intention strength and the time between the tweeting and mining

    Электронные библиотеки: перспективные методы и технологии, электронные коллекции

    Get PDF
    Электронные библиотеки – область исследований и разработок, направленных на развитие теории и практики обработки, распространения, хранения, анализа и поиска цифровых данных различной природы. Основная цель серии конференций RCDL заключается в формировании сообщества специалистов России, ведущих исследования и разработки в области электронных библиотек и близких областях. Всероссийская научная конференция 2009 г. (RCDL'2009) является одиннадцатой конференцией по данной тематике (1999 г. – Санкт-Петербург, 2000 г. – Протвино, 2001 г. – Петрозаводск, 2002 г. – Дубна, 2003 г. – Санкт-Петербург, 2004 г. – Пущино, 2005 г. – Ярославль, 2006 г. – Суздаль, 2007 г. – Переславль-Залесский, 2008 г. – Дубна). Настоящий сборник включает тексты докладов, коротких сообщений и стендовых докладов, отобранных Программным комитетом RCDL'2009 в результате проведенного рецензирования
    corecore