Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.

Kaser, Owen

Lemire, Daniel

English

Archipel - Université du Québec à Montréal

Removing Manually Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books
Owen Kaser
University of New Brunswick
Daniel Lemire
Université du Québec à Montréal
July 12, 2007
1 Introduction
The Web has encouraged the wide distribution of collaboratively edited collections of text documents. An example is Project Gutenberg¹ [14] (hereafter PG), the oldest digital library, containing over 20,000 digitized books. Meanwhile, automated text analysis is becoming more common. In any corpus of unstructured text files, including source code [2], we may find that some uninteresting “boilerplate” text coexists with interesting text that we wish to process. This problem also exists when trying to “scrape” information from Web pages [8].
Copyright © 2007 Owen Kaser and Daniel Lemire. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.
¹ Project Gutenberg is a registered trademark of the Project Gutenberg Literary Archive Foundation.
We are particularly interested in cases where no single template generates all text files; rather, there is an undetermined number and we do not initially know which template was used for a particular file. Some templates may differ only in trivial ways, such as in the use of white space, while other differences can be substantial, as is expected when distributed teams edit the files over several years.
Ikeda and Yamada [10] propose “substring amplification” to cluster files according to the templates used to generate them. The key observations are that chunks of text belonging to the template appear repeatedly in the set of files, and that a suffix tree can help detect the long and frequent strings.
Using this approach with PG is undesirable since the suffix tree would consume much memory and require much processing: the total size of the files is large and growing. Instead, we should use our domain knowledge: the boilerplate in PG is naturally organized in lines and only appears at the beginning or end of a document. We expect to find similar patterns in other hand-edited boilerplate.
1.1 Related Work
Stripping unwanted and often repeated content is a common task. Frequent patterns in text documents have been used for plagiarism detection [17], for document fingerprinting [15], for removing templates in HTML documents [6], and for spam detection [16]. Template detection in HTML pages has been shown to improve document retrieval [4].
The specific problem of detecting preamble/epilogue templates in the PG corpus has been tackled by several hand-crafted rule-based systems [1, 3, 9].
2 Stripping PG
In PG e-books, there is a preamble that provides various standard metadata. Following the transcribed body of the book, there is frequently an epilogue. We want an automated solution to remove the preamble and epilogue.
The desired preambles and epilogues used in PG e-book files have changed several times over the years, and they may change again in the future. This makes fully hand-crafted PG parsers [1, 3, 9] an unsatisfactory solution. The best way to obtain a robust solution is to use methods that automatically adjust to changes in data.
3 Algorithm
Our solution identifies frequent lines of text in the first and last sections of each file. These frequent lines are recorded in a common data structure. Then, each file is processed and a sequence of GAP_MAX infrequent lines is used to detect a transition from a preamble to the main text, and one from the main text to an epilogue. A technical report [12] gives details.
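The two passes described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation (which is described in the technical report [12]): the values of `K` and `SECTION`, the helper names, and the choice to skip blank lines are all hypothetical.

```python
from collections import Counter

GAP_MAX = 10    # run of infrequent lines that signals a transition
K = 10          # frequent-item threshold (hypothetical value)
SECTION = 300   # lines inspected at each end of a file (hypothetical value)


def count_lines(files, section=SECTION):
    """First pass: count non-blank lines in the top and bottom of every file."""
    counts = Counter()
    for lines in files:
        if len(lines) <= 2 * section:
            chunks = [lines]                          # short file: count it once
        else:
            chunks = [lines[:section], lines[-section:]]
        for chunk in chunks:
            counts.update(t for t in (l.strip() for l in chunk) if t)
    return counts


def frequent_predicate(counts, k=K):
    """Return a function telling whether a line reached the threshold k."""
    return lambda line: counts[line.strip()] >= k


def boilerplate_length(lines, is_frequent, gap_max=GAP_MAX, section=SECTION):
    """Second pass: number of leading lines judged to be boilerplate.

    Scanning stops at the first run of gap_max consecutive infrequent lines;
    the boilerplate ends just after the last frequent line seen before it.
    Blank lines are skipped (an assumption of this sketch).
    """
    end = 0   # index just past the last frequent line seen so far
    gap = 0   # length of the current run of infrequent lines
    for i, line in enumerate(lines[:section]):
        if not line.strip():
            continue
        if is_frequent(line):
            end, gap = i + 1, 0
        else:
            gap += 1
            if gap >= gap_max:
                break
    return end


def strip_boilerplate(lines, is_frequent, gap_max=GAP_MAX, section=SECTION):
    """Remove the detected preamble and, scanning backwards, the epilogue."""
    head = boilerplate_length(lines, is_frequent, gap_max, section)
    tail = boilerplate_length(lines[::-1], is_frequent, gap_max, section)
    return lines[head:len(lines) - tail]
```

The epilogue is found by running the same scan over the reversed file, which keeps the sketch short; any of the frequency structures of Section 3.2 could stand behind `is_frequent`.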
3.1 Classification-Error Effects
Two types of errors may occur when trying to identify the preamble from line frequencies. If a sequence of false negatives occurs within the preamble, then we may cut the preamble short. In the simplistic analytic model where false negatives occur with probability σ, the expected number of lines before GAP_MAX consecutive false negatives are encountered is $\sum_{k=1}^{\mathrm{GAP\_MAX}} \sigma^{-k}$, which is 1.4 million lines for GAP_MAX = 10 and σ = 0.25.
If some false positives occur shortly after the preamble, then we may overestimate the size of the preamble. Let p denote the probability of a false positive. The expected number of misclassified lines following the preamble is $\sum_{k=1}^{\mathrm{GAP\_MAX}} (1-p)^{-k} - \mathrm{GAP\_MAX}$. With GAP_MAX = 10, this is small for p ≤ 20%.
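As a quick check of the two expressions in this section, the sums can be evaluated directly. The function names below are illustrative only; the model is the simplistic one stated above, in which each line is misclassified independently.

```python
def expected_lines_before_cut(sigma, gap_max):
    """Expected lines read before gap_max consecutive false negatives occur."""
    return sum(sigma ** -k for k in range(1, gap_max + 1))


def expected_overrun(p, gap_max):
    """Expected number of misclassified lines following the preamble."""
    return sum((1 - p) ** -k for k in range(1, gap_max + 1)) - gap_max


if __name__ == "__main__":
    print(expected_lines_before_cut(0.25, 10))   # about 1.4 million
    print(expected_overrun(0.20, 10))
```

The first value grows very quickly as σ decreases, which is why a run of GAP_MAX false negatives inside a preamble is unlikely in this model.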
3.2 Data Structures
The algorithm’s first pass builds a data structure to identify the frequent lines in the corpus. Several data structures are possible, depending on whether we require exact results and how much memory we can use. One approach that we do not consider in detail is taking a random sample of the data. If the frequent-item threshold is low (say K = 5), too small a sample will lead to many new false negatives. However, when K is large, sampling might be used with any of the techniques below.
3.2.1 Exact Counts Using Internal Memory
For exact results, we could build a hash table that maps each line seen to an occurrence counter. We need about 700 MiB for our data structure.
3.2.2 Exact Counts Using External Memory
To know exactly which lines occur frequently, if we have inadequate main memory, an external-memory solution is to sort the lines. Then a pass over the sorted data can record the frequent lines, presumably in main memory.
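The pass over sorted data can be sketched as below. The sketch assumes the lines have already been sorted by some external tool and arrive as an iterable; the function name and the threshold parameter `k` are illustrative.

```python
from itertools import groupby


def frequent_from_sorted(sorted_lines, k):
    """Return the set of lines occurring at least k times in sorted input.

    Equal lines are adjacent after sorting, so one streaming pass suffices
    and only the frequent lines are kept in memory.
    """
    frequent = set()
    for line, run in groupby(sorted_lines):
        if sum(1 for _ in run) >= k:
            frequent.add(line)
    return frequent
```

Because `groupby` consumes its input lazily, the sorted data can be streamed from a file rather than loaded at once.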
3.2.3 Checksumming
For nearly exact results, we can hash lines to large integers, assuming that commonly used hashing algorithms are unlikely to generate many collisions. We chose the standard CRC-64 checksum, and a routine calculation [18] shows that with a 64-bit hash, we can expect to hash roughly $2^{64/2} = 2^{32}$ distinct lines before getting a collision.
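The routine calculation can be reproduced with the usual birthday approximation, assuming hash values are uniform and independent. The function name is illustrative.

```python
import math


def collision_probability(n, bits):
    """Approximate probability that n distinct lines produce at least one
    collision under a uniform hash with the given number of bits.

    Uses 1 - exp(-n(n-1) / 2^(bits+1)); expm1 keeps tiny values accurate.
    """
    pairs = n * (n - 1) / 2
    return -math.expm1(-pairs / 2.0 ** bits)


if __name__ == "__main__":
    print(collision_probability(2 ** 32, 64))     # collisions become likely here
    print(collision_probability(3_400_000, 64))   # a few million lines: negligible
```

Under this approximation, a corpus of a few million distinct lines is far below the point where a 64-bit checksum is expected to collide.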
3.2.4 Hashing to Millions of Counters
To use even less memory than CRC-64 hashing, one solution is to use a smaller hash (e.g., a 23-bit hash) and accept some collisions. Once the range of the hash function is small enough, it can directly index into an array of counters, rather than requiring a lookup in a secondary structure mapping checksums to counters.
In our experiments on the first PG DVD, we process only files’ tops and bottoms, and we use a 23-bit hash with the 3.4 million distinct lines. Assume that hashing distributes lines uniformly and independently across the counters. Then the probability that a randomly selected infrequent line will share a counter with one of the ≈ 3000 frequent lines is estimated as $3000 \times 2^{-23} \approx 3.6 \times 10^{-4}$. These few additional false positives should not be harmful.
It is more difficult to assess the additional false positives arising when a collection of infrequent lines share a counter and together have an aggregate frequency exceeding the frequent-item threshold, K. By assuming that the line frequency distribution is very skewed and lines are frequent with a small probability p, we have derived [12] that the probability of a false positive is less than $p(n-1)/c$, where n is the number of distinct lines and c is the number of counters ($c = 2^{23}$). This was verified experimentally.
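A counter array indexed directly by a small hash might look like the sketch below. The text does not say which 23-bit hash was used, so truncating CRC-32 (Python's `zlib.crc32`) is purely an assumption of this example, as are the class and method names.

```python
import zlib
from array import array


class HashedCounters:
    """Array of counters indexed by a small hash of each line.

    Collisions can only inflate a count, so a truly frequent line is never
    missed; the cost is occasional false positives.
    """

    def __init__(self, bits=23):
        self.mask = (1 << bits) - 1
        self.counters = array("I", [0]) * (1 << bits)

    def _slot(self, line):
        # Assumption: CRC-32 truncated to the low `bits` bits.
        return zlib.crc32(line.encode("utf-8")) & self.mask

    def add(self, line):
        self.counters[self._slot(line)] += 1

    def is_frequent(self, line, k):
        return self.counters[self._slot(line)] >= k


if __name__ == "__main__":
    # Chance that an infrequent line lands on one of ~3000 frequent counters.
    print(3000 / 2 ** 23)
```

With 23 bits the array holds about 8.4 million counters, so no secondary lookup structure is needed.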
3.2.5 Tracking Hot Items
Many algorithms have been developed for detecting “frequent items” in streams. In such a context, we are not interested in counting how many times a given item occurs; we only want to retrieve frequent items. Cormode and Muthukrishnan survey some of them [5].
A particularly simple and fast deterministic method, Generalized Majority (GM), has been developed independently by several authors [7, 11, 13]. GM uses c counters, where c ≥ 1/f − 1 and f is the minimum (relative) frequency of a frequent item; in the special case where f = 1/2, a single counter is sufficient. In the case where the distribution is very skewed, the algorithm already provides a good approximation of the frequent items: it suffices to keep only the items with the largest count values. We believe this observation is novel.
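For readers unfamiliar with GM, the following is a minimal sketch of the counter scheme as it is commonly presented (following Misra and Gries [13]); it is not taken from the authors' Java code, and the function name is illustrative.

```python
def generalized_majority(stream, c):
    """One pass over the stream using at most c counters.

    Any item occurring more than n / (c + 1) times in a stream of length n
    is guaranteed to be among the returned keys; other keys may be present
    too, and the stored values are not exact occurrence counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < c:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With c = 1 this reduces to the classic majority vote. In general, the surviving keys are candidates only, and a second pass is needed if exact confirmation is required.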
3.3 Heuristic Improvements
A large majority of PG e-books can have their preambles and epilogues detected by a few heuristic tricks. Our heuristics [12] were implemented using regular expressions.
4 Experimental Results
We implemented the data structures discussed in §3.2 in Java 1.5 (using Sun’s JDK 1.5.0) and tested them on an older machine with Pentium 3 Xeon processors (700 MHz with 2 MiB cache) and 2 GiB of main memory.
4.1 Errors
Looking at epilogues, the choice of data structure did not have much effect on accuracy. For preambles, the GM approach had moderately higher errors in about 30% of the cases. However, this always involved 10 or fewer lines.
[Plot omitted. Series: 23-bit hash, CRC-64, GM, exact. Horizontal axis ticks: 1000 to 1e+08 (logarithmic); vertical axis ticks: 50 to 200.]
Figure 1: Wall-clock times (s) vs. the number of counters, c.
Comparing results on preamble detection, the heuristics were somewhat helpful, but GM still had difficulties compared to using exact counts in about 30% of the cases.
4.2 Run Times
We had our data structures process all tops and bottoms of the files on the first PG DVD. Experiments considered a range of values for c, the number of counters used. For each data point, ten trials were made and their average is shown in Fig. 1.
GNU/Linux shell utilities, presumably highly optimized, could sort and build the list of frequent lines in under 100 s.
4.3 Comparison to GutenMark
Of those software tools that reformat PG e-books, it appears only GutenMark [3] formally attempts to detect the preamble so that it can be stripped. We used its most recent production release, dated 2002, when PG e-books did not have a long epilogue. Thus we can only test it on preamble detection.
Despite several large errors compared to our approach, in many cases the GutenMark approach worked reasonably well.
5 Conclusion
Detecting the PG-specific preambles and epilogues is perhaps surprisingly difficult. There are instances where a human without knowledge of English probably could not accurately determine where the preamble ends. Nevertheless, our approach based on line frequency can approximately (within 10%) detect the boilerplate in more than 90% of the documents.
Line frequency follows a very skewed distribution and thus, as we have demonstrated, hashing to a small number of bits will not lead to a large number of lines falsely reported as frequent. Indeed, using 23-bit line hashing, we can approximately find the frequent lines, with an accuracy sufficient so that preamble/epilogue detection is not noticeably affected. Simple rule-based heuristics can improve accuracy in some cases, as observed with epilogues.
About the Authors
Owen Kaser holds a BCSS from Acadia U. and an MS and Ph.D. from SUNY Stony Brook.
Daniel Lemire received his B.Sc. and M.Sc. from the U. of Toronto and a Ph.D. from the École Polytechnique de Montréal.
References
[1] T. Atkins. Newgut program. online: http://rumkin.com/reference/gutenberg/newgut, 2004. last checked 18-01-2007.
[2] D. C. Atkinson and W. G. Griswold. Effective pattern matching of source code using abstract syntax patterns. Softw., Pract. Exper., 36(4):413–447, 2006.
[3] R. S. Burkey. GutenMark download page. online: http://www.sandroid.org/GutenMark/download.html, 2005. last checked 18-01-2007.
[4] L. Chen, S. Ye, and X. Li. Template detection for large scale search engines. In SAC ’06, pages 1094–1098, 2006.
[5] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: tracking most frequent items dynamically. ACM Trans. Database Syst., 30(1):249–278, 2005.
[6] S. Debnath, P. Mitra, and C. L. Giles. Automatic extraction of informative blocks from webpages. In SAC ’05, pages 1722–1726, 2005.
[7] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of ESA-2002, LNCS 2461, pages 348–360. Springer-Verlag, 2002.
[8] D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In WWW ’05, pages 830–839, 2005.
[9] J. Grunenfelder. Weasel reader: Free reading. online: http://gutenpalm.sourceforge.net/, 2006. last checked 18-01-2007.
[10] D. Ikeda and Y. Yamada. Gathering text files generated from templates. In IIWeb Workshop, VLDB-2004, 2004.
[11] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1):51–55, 2003.
[12] O. Kaser and D. Lemire. Removing manually generated boilerplate from electronic texts: Experiments with Project Gutenberg e-books. Technical Report TR-07-001, Dept. of CSAS, UNBSJ, 2007. available from http://arxiv.org/abs/0707.1913.
[13] J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982.
[14] Project Gutenberg Literary Archive Foundation. Project Gutenberg. http://www.gutenberg.org/, 2007. checked 2007-05-30.
[15] S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD’2003, pages 76–85, 2003.
[16] R. Segal, J. Crawford, J. Kephart, and B. Leiba. SpamGuru: An enterprise anti-spam filtering system. In Proceedings of the First Conference on E-mail and Anti-Spam, 2004.
[17] D. Sorokina, J. Gehrke, S. Warner, and P. Ginsparg. Plagiarism detection in arXiv. In ICDM ’06: Proceedings of the Sixth International Conference on Data Mining, pages 1070–1075, Washington, DC, USA, 2006. IEEE Computer Society.
[18] Wikipedia. Birthday paradox - Wikipedia, the free encyclopedia, 2007. [Online; accessed 18-01-2007].

R-libre


http://www.archipel.uqam.ca/351/1/gutheader_CASCON2007.pdf
