    Exploring Linguistic Features for Web Spam Detection: A Preliminary Study

    We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.

    Entity Summarisation with Limited Edge Budget on Undirected and Directed Knowledge Graphs

    The paper concerns the novel problem of summarising entities with a limited presentation budget on entity-relationship knowledge graphs and proposes an efficient algorithm for solving it. The algorithm has been implemented in two variants, undirected and directed, together with a visualisation tool. An experimental user evaluation of the algorithm was conducted on large real semantic knowledge graphs extracted from the web. The reported results are promising and encourage further work on improving the algorithm.

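    Niepubliczne agencje zatrudnienia osób niepełnosprawnych. Możliwości i dylematy rozwoju w sektorze pozarządowym (Non-public employment agencies for people with disabilities: opportunities and development dilemmas in the non-governmental sector)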

    This report was prepared on the initiative of Fundacja Pomocy Matematykom i Informatykom Niesprawnym Ruchowo (Foundation Supporting Disabled Mathematicians and IT Professionals) as part of the project "Centre for Education and Vocational Activation of Persons with Disabilities - Branches Bydgoszcz and Łódź". It is the result of research on the phenomenon of disability and on people with disabilities as a social category, as well as on the operation of more than 30 employment agencies specialised in supporting people with disabilities on the labour market. The first chapter discusses ways of defining the phenomenon of disability, and the second addresses capacity building in non-public employment services for people with disabilities. The third chapter describes the research methodology adopted, and the fourth presents the results of the analysis of the collected empirical material with respect to the offer of the employment agencies and their clients. The fifth chapter concerns the condition of employment agencies for people with disabilities run by non-governmental organisations. The next chapter focuses on the external environment of the employment agencies. The last part of the report presents recommendations to help resolve the development dilemmas that the employment agencies face.

    Approximation Guarantees for Max Sum and Max Min Facility Dispersion with Parameterised Triangle Inequality and Applications in Result Diversification

    The Facility Dispersion Problem, originally studied in Operations Research, has recently found important new applications in the Result Diversification approach in information sciences. This discrete optimisation problem consists of selecting a small set of p items out of a large set of candidates so as to maximise a given objective function. The function expresses the notion of dispersion of the set of selected items in terms of a pairwise distance measure between items. In most known formulations the problem is NP-hard, but there exist 2-approximation algorithms for some cases if the distance satisfies the triangle inequality. We present generalised 2/α approximation guarantees for the Facility Dispersion Problem in its two most common variants, Max Sum and Max Min, when the underlying dissimilarity measure satisfies the parameterised triangle inequality with parameter α. The results apply to both relaxed and strengthened variants of the triangle inequality. We also demonstrate potential applications of our findings in the result diversification problem, including web search and entity summarisation in semantic knowledge graphs, as well as in practical computations on finite data sets.
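
To make the Max Min variant concrete, the following is a minimal sketch of the classic greedy heuristic for it: start from the farthest pair, then repeatedly add the candidate whose nearest selected item is farthest away. The abstract does not say which algorithm the paper analyses, so this is an illustration under that assumption. The names `greedy_max_min` and `abs_diff` and the sample data are hypothetical.

```python
from itertools import combinations


def greedy_max_min(points, p, dist):
    """Greedily pick p items so that the smallest pairwise distance stays large.

    Illustrative sketch only; not taken from the paper.
    """
    if p < 2 or p > len(points):
        raise ValueError("need 2 <= p <= len(points)")
    indices = range(len(points))
    # Start from the pair of candidates that are farthest apart.
    first, second = max(
        combinations(indices, 2),
        key=lambda ij: dist(points[ij[0]], points[ij[1]]),
    )
    chosen = [first, second]
    while len(chosen) < p:
        remaining = [i for i in indices if i not in chosen]
        # Add the candidate whose nearest already-chosen item is farthest away.
        best = max(
            remaining,
            key=lambda i: min(dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
    return [points[i] for i in chosen]


def abs_diff(a, b):
    """Hypothetical distance measure: absolute difference of two numbers."""
    return abs(a - b)


# Hypothetical one-dimensional candidates; select p = 3 of them.
selected = greedy_max_min([0, 1, 5, 9, 10], 3, abs_diff)
```

Any dissimilarity function can be passed as `dist`; how far the greedy result may fall from the optimum depends on the properties of that function, which is the question the abstract addresses.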

    Can Link Analysis Tell Us about Web Traffic?

    In this paper we measure the correlation between link analysis characteristics of Web pages, such as in- and out-degree, PageRank and RBS, and characteristics obtained from real Web traffic analysis. Measurements made on real data from the Polish Web show that PageRank is observably but not strongly correlated with actual visits made by Web users to Web pages, and that our RBS algorithm is more correlated with traffic data than PageRank in some cases.
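
As an illustration of this kind of measurement, the sketch below computes a Spearman rank correlation between a link-based score and visit counts. The abstract does not state which correlation coefficient was used, so the choice of Spearman is an assumption, and the scores and visit counts are made-up values. The simple formula used here assumes there are no tied values.

```python
def ranks(values):
    """Return 1-based ranks of the values (ascending); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for two equally long, tie-free sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    # Sum of squared rank differences between the two sequences.
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical per-page data: a link-based score and observed visits.
link_score = [0.40, 0.25, 0.20, 0.10, 0.05]
visits = [900, 1200, 300, 150, 80]
rho = spearman(link_score, visits)
```

A value of `rho` near 1 means the two orderings of pages largely agree; values near 0 mean the link-based score says little about traffic.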

    String Distance Metrics for Reference Matching and Search Query Correction

    String distance metrics have been widely used in various applications concerning the processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries in the context of highly inflective languages, with a particular focus on Polish. The results of numerous experiments in different scenarios are presented, and they reveal some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.
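
A minimal sketch of distance-based query correction is given below. It uses the Levenshtein edit distance as one well-known example of a string distance metric; the abstract does not name the metrics that were evaluated, so this choice is an assumption. The function names and the small vocabulary are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def correct_query(query, vocabulary):
    """Return the vocabulary entry closest to the query in edit distance."""
    return min(vocabulary, key=lambda word: levenshtein(query, word))


# Hypothetical vocabulary of known query terms.
suggestion = correct_query("warszwa", ["warszawa", "wrocław", "kraków"])
```

For highly inflective languages a plain edit distance treats every inflected form as a separate target, which is one reason why comparing several metrics, as the abstract describes, is worthwhile.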