77 research outputs found
The Quantile Index - Succinct Self-Index for Top-k Document Retrieval
One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections.
In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.
28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland
Peer reviewed
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language
Models (LLMs). As LLMs find applications across diverse domains, the
reliability and accuracy of their outputs become vital. We define the
Factuality Issue as the probability that LLMs produce content inconsistent
with established facts. We first delve into the implications of these
inaccuracies, highlighting the potential consequences and challenges posed by
factual errors in LLM outputs. Subsequently, we analyze the mechanisms through
which LLMs store and process facts, seeking the primary causes of factual
errors. Our discussion then transitions to methodologies for evaluating LLM
factuality, emphasizing key metrics, benchmarks, and studies. We further
explore strategies for enhancing LLM factuality, including approaches tailored
for specific domains. We focus on two primary LLM configurations, standalone LLMs
and Retrieval-Augmented LLMs that utilize external data, and detail their
unique challenges and potential enhancements. Our survey offers a structured
guide for researchers aiming to fortify the factual reliability of LLMs.
Comment: 62 pages; 300+ references
Fashioning Jews: Clothing, Culture, and Commerce
This volume presents papers delivered at the 24th Annual Klutznick-Harris Symposium, held at Creighton University in October 2011. The contributors look at all aspects of the intimate relationship between Jews and clothing, through case studies from ancient, medieval, recent, and contemporary history. Papers explore topics ranging from Jewish leadership in the textile industry, through the art of fashion in nineteenth century Vienna, to the use of clothing as a badge of ethnic identity, in both secular and religious contexts.
Contents: Shmattas in the North, Shmattas in the South: The Civil War and the Birth of the American Clothing Industry (Adam Mendelsohn); Weimar Jewish Chic from Wigs to Furs: Jewish Women and Fashion in 1920s Germany (Kerry Wallach); Jewish Photographers and the Body in the Weimar Republic (Nils Roemer); Female Tallitot: Creating American Jewish Women's Religious Experience through Fashion (Rachel Gordan); Clothes and the Weaving of American Jewish Comedy (Ted Merwin); The Jewish Badge in Renaissance Italy: The Iconic O, the Yellow Hat, and the Paradoxes of Distinctive Sign Legislation (Flora Cassen); How a Rabbi Should Be Dressed: The Question of Cassock and Clerical Clothing among Italian Rabbis from the Renaissance to Contemporary Times (Asher Salah); The "Disinherited" Priesthood: A Look into Biblical Israel's Unshod Priest (Christine Palmer); Costume and Identity in the Dura Europos Synagogue Paintings (Steven Fine); Picturing Vienna's New Woman: Madame d'Ora meets Ella Zwieback-Zirner (Lisa Silverman); Aboriginal Yarmulkes, Ambivalent Attire, and Ironies of Contemporary Jewish Identity (Eric Silverman); Fashioning Jews on the Screen: The Impact of Dress on Crafting the Jewish Image in Film and Television (Brian Amkraut)
The analysis of enumerative source codes and their use in Burrows-Wheeler compression algorithms
In the late 20th century the reliable and efficient transmission, reception and storage of information proved to be central to the most successful economies all over the world. The Internet, once a classified project accessible to a selected few, is now part of the everyday lives of a large part of the human population, and as such the efficient storage of information is an important part of the information economy. The improvement of the information storage density of optical and electronic media has been remarkable, but the elimination of redundancy in stored data and the reliable reconstruction of the original data is still a desired goal. The field of source coding is concerned with the compression of redundant data and its reliable decompression. The arithmetic source code, which was independently proposed by J. J. Rissanen and R. Pasco in 1976, revolutionized the field of source coding. Compression algorithms that use an arithmetic code to encode redundant data are typically more effective and computationally more efficient than compression algorithms that use earlier source codes such as extended Huffman codes. The arithmetic source code is also more flexible than earlier source codes, and is frequently used in adaptive compression algorithms. The arithmetic code remains the source code of choice, despite having been introduced more than 30 years ago. The problem of effectively encoding data from sources with known statistics (i.e. where the probability distribution of the source data is known) was solved with the introduction of the arithmetic code. The probability distribution of practical data is seldom available to the source encoder, however. The source coding of data from sources with unknown statistics is a more challenging problem, and remains an active research topic. Enumerative source codes were introduced by T. J. Lynch and L. D. Davisson in the 1960s.
These lossless source codes have the remarkable property that they may be used to effectively encode source sequences from certain sources without requiring any prior knowledge of the source statistics. One drawback of these source codes is the computationally complex nature of their implementations. Several years after the introduction of enumerative source codes, J. G. Cleary and I. H. Witten proved that approximate enumerative source codes may be realized by using an arithmetic code. Approximate enumerative source codes are significantly less complex than the original enumerative source codes, but are less effective than the original codes. Researchers have become more interested in arithmetic source codes than enumerative source codes since the publication of the work by Cleary and Witten. This thesis concerns the original enumerative source codes and their use in Burrows-Wheeler compression algorithms. A novel implementation of the original enumerative source code is proposed. This implementation has a significantly lower computational complexity than the direct implementation of the original enumerative source code. Several novel enumerative source codes are introduced in this thesis. These codes include optimal fixed-to-fixed length source codes with manageable computational complexity. A generalization of the original enumerative source code, which includes more complex data sources, is proposed in this thesis. The generalized source code uses the Burrows-Wheeler transform, which is a low-complexity algorithm for converting the redundancy of sequences from complex data sources to a more accessible form. The generalized source code effectively encodes the transformed sequences using the original enumerative source code. It is demonstrated and proved mathematically that this source code is universal (i.e. the code has an asymptotic normalized average redundancy of zero bits).
AFRIKAANS (translated): By the end of the twentieth century, the reliable and efficient transmission, reception and storage of information formed the core of the most successful economies in the world. The Internet, once a secret project accessible to only a small group of users, is today part of the everyday life of a large percentage of humanity, and the efficient storage of information is therefore an important part of the information economy. The improvement in the storage density of optical and electronic media is remarkable, but the elimination of redundancy in stored data, as well as the reliable recovery of the original data, remains a goal to strive for. Source coding is concerned with the compression of redundant data and the reliable decompression of that data. The arithmetic source code, proposed independently by J. J. Rissanen and R. Pasco in 1976, revolutionized the field of source coding. Compression algorithms that use arithmetic source codes to encode redundant data are typically more effective and computationally more efficient than compression algorithms that use earlier source codes, such as extended Huffman codes. Arithmetic source codes, which are regularly used in adaptive compression algorithms, are also more flexible than earlier source codes. After 30 years the arithmetic source code is still the source code of first choice. The problem of effectively encoding data from sources with known statistics (i.e. where the probability distribution of the source data is known) was solved by the introduction of arithmetic source codes. The source encoder, however, seldom has access to the probability distribution of practical data. The source coding of data from sources with unknown statistics is a greater challenge and remains an active research field. T. J. Lynch and L. D. Davisson introduced enumerative source codes in the 1960s.
Enumerative source codes have the remarkable property that source sequences from certain sources can be encoded effectively with these lossless codes without the source encoder needing any prior knowledge of the statistics of the source. One drawback of enumerative source codes is the high computational complexity of their implementations. Several years after the introduction of enumerative source codes, J. G. Cleary and I. H. Witten proved that approximate enumerative source codes can be realized by using arithmetic source codes. Approximate enumerative source codes have a lower computational complexity than enumerative source codes, but are less effective than the original enumerative source codes. Since the work of Cleary and Witten, researchers have shown more interest in arithmetic source codes than in enumerative source codes. This thesis is concerned with the original enumerative source codes and their use in Burrows-Wheeler compression algorithms. A new implementation of the original enumerative source code is proposed. The proposed implementation has a significantly lower computational complexity than the direct implementation of the original enumerative source code. Several new enumerative source codes, including optimal fixed-to-fixed length enumerative source codes with manageable computational complexity, are proposed. A generalization of the original enumerative source code, which includes more complex data sources than the original enumerative source code, is proposed in this thesis. The generalized enumerative source code makes use of the Burrows-Wheeler transform. The Burrows-Wheeler transform is a low-complexity algorithm that converts the redundancy of source sequences from complex data sources into a more accessible form. The generalized source code effectively encodes the transformed sequences using the original enumerative source code. The universal nature of this source code is demonstrated and proved mathematically (i.e. it is proved that the code has an asymptotic normalized average redundancy of zero bits).
Dissertation (MEng)--University of Pretoria, 2010. Electrical, Electronic and Computer Engineering. Unrestricted.
Remote Sensing Data Compression
A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection taken as the basis of this book touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images still remains current, since such images constitute data arrays that are of extremely large size with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transition from UAV-based acquisition platforms, as well as the use of FPGA and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.
Barry Smith an sich
Festschrift in Honor of Barry Smith on the occasion of his 65th Birthday. Published as issue 4:4 of the journal Cosmos + Taxis: Studies in Emergent Order and Organization. Includes contributions by Wolfgang Grassl, Nicola Guarino, John T. Kearns, Rudolf Lüthe, Luc Schneider, Peter Simons, Wojciech Żełaniec, and Jan Woleński.
Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)
The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008). ..
Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021
The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the edition of 2020, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 represented the first moment for the Italian research community of Computational Linguistics to meet in person after more than one year of full/partial lockdown.