    Indexing Google 1T for low-turnaround wildcarded frequency queries

    We propose a technique to prepare the Google 1T n-gram data set for wildcarded frequency queries with a very low turnaround time, making unbatched applications possible. Our method supports token-level wildcarding and, given a cache of 3.3 GB of RAM, requires only a single read of less than 4 KB from the disk to answer a query. We present an indexing structure, a way to generate it, and suggestions for how it can be tuned to particular applications.

    1 Background and motivation

    The “Google 1T” data set (LDC #2006T13) is a collection of 2-, 3-, 4-, and 5-gram frequencies extracte