Sample Data Set
ap.tgz is a sample data set, created out of the TREC AP (Associated Press) news wire articles, to test the tools with; its stats and results are shown below.
TERMINOLOGY
TREC - Stands for Text REtrieval Conference, an annual meeting of the IR community to study and evaluate retrieval methodologies. TREC creates and curates its own data sets. A sample is provided below.
RUN NAME - The name of the search result file produced as output by the search tools.
MAP - Mean Average Precision, a metric used to measure the quality of the search result.
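As a rough illustration of how MAP is computed, here is a minimal Python sketch of the textbook definition: average precision per query, then the mean over queries. The function names, the dictionary layout and the document IDs (d1, d2, ...) are made up for this example, and evaluation tools may treat edge cases (for instance, queries without relevant documents) differently.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document that was retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """run: query id -> ranked doc ids; qrels: query id -> relevant doc ids."""
    scores = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical two-query example (not taken from the sample data set).
example_run = {"201": ["d1", "d2", "d3"], "202": ["d4", "d5"]}
example_qrels = {"201": {"d1", "d3"}, "202": {"d5"}}
print(round(mean_average_precision(example_run, example_qrels), 4))  # 0.6667
```

Query 201 scores (1/1 + 2/3) / 2 and query 202 scores 1/2, so the mean is 2/3.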
STATS
2250 Documents from the Associated Press (on TREC DISK 3).
20 Queries from TREC-4 (Query IDs 201-250).
167 Relevance judgments.
RESULTS
Lucene
---------------------------------
RUN NAME MAP
---------------------------------
DEMO.a.s.bm25.20.D.x 0.4814
DEMO.a.s.bm25L.20.D.x 0.4335
DEMO.a.s.bm25e.20.D.x 0.4766
DEMO.a.s.tmpl.20.D.x 0.2402
DEMO.a.s.tmple.20.D.x 0.2402
---------------------------------
Terrier
---------------------------------
RUN NAME MAP
---------------------------------
DEMO.a.s.bm25.20.D.x 0.4728
DEMO.a.s.tf_idf.20.D.x 0.4732
DEMO.a.s.tmpl.20.D.x 0.2141
---------------------------------
On the Structure and Organization of TREC Data Sets
x       query                        qrel      corpus
-----   --------------------------   -------   -------
task    routing         adhoc        adhoc     adhoc
-----   --------------------------   -------   -------
TREC1   1-50 (cd2)      51-100       51-100    cd12
TREC2   51-100 (cd3)    101-150      101-150   cd12
TREC3   101-150 (cd3)   151-200      151-200   cd12
TREC4                   201-250      201-250   cd23
TREC5                   251-300      251-300   cd24
TREC6                   301-350      301-350   cd45
TREC7                   351-400      351-400   cd45-cr
TREC8                   401-450      401-450   cd45-cr
cd1
wsj WSJ 1987, 1988, 1989
fr FR 1989
ap AP 1989
doe DOE
ziff ZF 1989, 1990
cd2
wsj WSJ 1990, 1991, 1992
fr FR 1988
ap AP 1988
ziff ZF 1989, 1990
cd3
sjm SJM 1991
ap AP 1990
pat PT 1983-1991
ziff ZF 1991, 1992
cd4
ft FT 1991-1994
cr CR 1993
fr FR 1994
cd5
fbis FBIS 1996
lat LA 1989, 1990
These three tags seem to be always present: DOC, DOCNO, TEXT.
A title shows up in many forms: TTL, TITLE, HEADLINE, H3, HT.
Useful text blocks: SUMMARY.
Some TEXT sections are strewn with odd comment tags and other tags. In the table below, 'within+' denotes a TEXT section with one or more such tags inside it.
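Because the title tag varies by sub-collection, a parser can try each known tag in turn. The following Python sketch does that with regular expressions and strips any comment or markup tags nested inside the title. The function name, the tag order and the regex-based approach are assumptions made for illustration; a real parser may need more care with malformed markup.

```python
import re

# Title-like tags named above; the order in which they are tried is an assumption.
TITLE_TAGS = ("TTL", "TITLE", "HEADLINE", "H3", "HT")

# Matches SGML-style comments and any other tag, so they can be blanked out.
NESTED_TAG_RE = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)


def extract_title(doc_text):
    """Return the text of the first title-like element found, or None."""
    for tag in TITLE_TAGS:
        pattern = r"<{0}>(.*?)</{0}>".format(tag)
        match = re.search(pattern, doc_text, re.DOTALL | re.IGNORECASE)
        if match:
            # Drop nested comments and tags, then normalize whitespace.
            inner = NESTED_TAG_RE.sub(" ", match.group(1))
            return " ".join(inner.split())
    return None
```

Calling `extract_title` on the raw text of a single DOC element returns the cleaned title string, or `None` when none of the listed tags is present.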
TREC document structure table
cd1 cd2 cd3 cd4 cd5
doe
ap HEAD+ HEAD+ HEAD
fr within+ within+ within+
wsj HL HL
ziff TITLE TITLE TITLE
SUMMARY SUMMARY
patents TTL
sjm LEADPARA
SECTION
HEADLINE
cr TTL
ft HEADLINE
fbis H3 (within+)
HT (within+)
la HEADLINE
within+
YEAR/TAG head num dom title desc smry narr con fac nat def
1-100 x x x x x x x x x
101-150 x x x x x x x x x x
151-200 x x x x
201-250 x x
251-300 x x x x
301-350 x x x x
351-400 x x x x
401-450 x x x x
Depending on how you configure a search engine's parser (which tag
contents to pick, etc.) documents may end up being empty, having no
usable content. I usually make parsers as liberal as possible so that
everything within a
Even with the most liberal parser, some documents are truly empty; I have found 3 so far. Two of them contain only the text "None." and fall prey to the tokenizer and stemmer:
File DOCNO
cd1/doe/doe1_096 DOE1-96-1081
cd1/doe/doe2_013 DOE2-13-0573
cd1/doe/doe2_051 DOE2-51-1160
cd1/doe/doe1_096
<DOC>
<DOCNO> DOE1-96-1081 </DOCNO>
<TEXT>
</TEXT>
</DOC>
cd1/doe/doe2_013
<DOC>
<DOCNO> DOE2-13-0573 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>
cd1/doe/doe2_051
<DOC>
<DOCNO> DOE2-51-1160 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>
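A check along the following lines can flag such documents before indexing. This is a minimal Python sketch under stated assumptions: "liberal" is taken to mean that everything inside DOC except the DOCNO element counts as content, tokens are plain alphanumeric runs, and the optional `stopwords` set is a hypothetical stand-in for whatever the real tokenizer and stemmer discard. None of these names come from the search tools themselves.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TAG_RE = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def find_empty_documents(raw, stopwords=frozenset()):
    """Return the DOCNOs of documents that yield no tokens at all."""
    empty = []
    for doc in DOC_RE.findall(raw):
        docno_match = DOCNO_RE.search(doc)
        docno = docno_match.group(1).strip() if docno_match else None
        # Liberal parsing (an assumption): keep everything except DOCNO.
        body = DOCNO_RE.sub(" ", doc)
        body = TAG_RE.sub(" ", body)
        tokens = [t.lower() for t in TOKEN_RE.findall(body)]
        # Hypothetical stand-in for terms removed later in the pipeline.
        tokens = [t for t in tokens if t not in stopwords]
        if not tokens:
            empty.append(docno)
    return empty
```

With an empty `stopwords` set, only the first document above would be reported; passing `stopwords={"none"}` would also report the two "None." documents, if that is indeed why they end up empty.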
Documents may have very long terms like this one from document LA072290-0141 in the CD5 LA Times sub-collection:
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
This is a name of a village in Wales; see the Wikipedia page about Llanfairpwllgwyngyll.
It is therefore recommended that you neither allocate only a small number of bytes for tokens or terms when building parsers, nor mistake such oddities for parser errors.
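One way to follow this advice is to tokenize without any length cap and merely report unusually long tokens for inspection instead of rejecting them. The Python sketch below illustrates the idea; the function names and the threshold of 40 characters are arbitrary choices for this example, not settings from any of the tools.

```python
import re

# Runs of letters of any length; nothing is truncated or rejected by size.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into letter-only tokens with no upper bound on length."""
    return WORD_RE.findall(text)


def unusually_long(tokens, threshold=40):
    """Return tokens longer than the (hypothetical) threshold, for review only."""
    return [t for t in tokens if len(t) > threshold]
```

The long tokens stay in the token stream; `unusually_long` only lists them so that a person can decide whether they are genuine terms, such as the Welsh village name above, or real parsing problems.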