Information Retrieval

Sample Data Set

ap.tgz is a sample data set, created out of the TREC AP (Associated Press) news wire articles, to test the tools with; its stats and results are shown below.

TERMINOLOGY

TREC - Stands for Text REtrieval Conference, an annual meeting of the IR community to study and evaluate retrieval methodologies. TREC creates and curates its own data sets. A sample is provided below.
RUN NAME - The name of the search result file produced as output by the search tools.
MAP - Mean Average Precision, a metric used to measure the quality of the search result.
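
To make the MAP definition concrete, here is a minimal sketch of how it can be computed from a ranked run and a set of relevance judgments. The function names and the dictionary layout are assumptions made for illustration; this is not the code that produced the numbers below. It also assumes that a query with judged relevant documents but no results scores 0.

```python
def average_precision(ranked_docnos, relevant):
    """Average precision of one ranked list against a set of relevant DOCNOs."""
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            # Precision at the rank where a relevant document was found.
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of per-query average precision.

    run:   {query_id: [docno, ...]} in ranked order (assumed layout)
    qrels: {query_id: {relevant docno, ...}} (assumed layout)
    Only queries with at least one relevant document are averaged.
    """
    scored = [q for q, rel in qrels.items() if rel]
    if not scored:
        return 0.0
    total = sum(average_precision(run.get(q, []), qrels[q]) for q in scored)
    return total / len(scored)


# Toy example with made-up document IDs.
toy_run = {"201": ["d1", "d2", "d3"], "202": ["d4", "d5"]}
toy_qrels = {"201": {"d1", "d3"}, "202": {"d5"}}
print(round(mean_average_precision(toy_run, toy_qrels), 4))  # 0.6667
```

Query 201 scores (1/1 + 2/3) / 2 and query 202 scores (1/2) / 1, so the mean is 2/3.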

STATS

2250 Documents from the Associated Press (on TREC DISK 3).
20   Queries from TREC-4 (Query IDs 201-250).
167  Relevance judgments.

RESULTS

Lucene
---------------------------------
RUN NAME                   MAP
---------------------------------
DEMO.a.s.bm25.20.D.x       0.4814
DEMO.a.s.bm25L.20.D.x      0.4335
DEMO.a.s.bm25e.20.D.x      0.4766
DEMO.a.s.tmpl.20.D.x       0.2402
DEMO.a.s.tmple.20.D.x      0.2402
---------------------------------

Terrier
---------------------------------
RUN NAME                   MAP
---------------------------------
DEMO.a.s.bm25.20.D.x       0.4728
DEMO.a.s.tf_idf.20.D.x     0.4732
DEMO.a.s.tmpl.20.D.x       0.2141
---------------------------------

On the Structure and Organization of TREC Data Sets

Test Collections
Document Corpus
Document Structure
Query Structure
Empty Documents
Very Long Terms

TEST COLLECTIONS

x       query                     qrel     corpus
-----   ------------------------  -------  -------
task    routing         adhoc     adhoc    adhoc
-----   ------------------------  -------  -------
TREC1      1-50 (cd2)    51-100    51-100  cd12
TREC2    51-100 (cd3)   101-150   101-150  cd12
TREC3   101-150 (cd3)   151-200   151-200  cd12
TREC4                   201-250   201-250  cd23
TREC5                   251-300   251-300  cd24
TREC6                   301-350   301-350  cd45
TREC7                   351-400   351-400  cd45-cr
TREC8                   401-450   401-450  cd45-cr

DOCUMENT CORPUS

cd1
    wsj  WSJ  1987, 1988, 1989
    fr   FR   1989
    ap   AP   1989
    doe  DOE
    ziff ZF   1989, 1990
cd2 
    wsj  WSJ  1990, 1991, 1992
    fr   FR   1988
    ap   AP   1988
    ziff ZF   1989, 1990
cd3
    sjm  SJM  1991
    ap   AP   1990
    pat  PT   1983-1991
    ziff ZF   1991, 1992
cd4
    ft   FT   1991-1994
    cr   CR   1993
    fr   FR   1994
cd5
    fbis FBIS 1996
    lat  LA   1989, 1990

DOCUMENT STRUCTURE

These three tags seem to be always present: DOC, DOCNO, TEXT

A title shows up in many forms: TTL, TITLE, HEADLINE, H3, HT

Useful text blocks: SUMMARY

Some TEXT sections are strewn with odd comment tags and other tags too. 'within+' denotes a TEXT section that has one or more such tags within it.

TREC document structure table

            cd1     cd2     cd3      cd4       cd5
 doe
 ap         HEAD+   HEAD+   HEAD
 fr         within+ within+          within+
 wsj        HL      HL
 ziff       TITLE   TITLE   TITLE
                    SUMMARY SUMMARY
 patents                    TTL
 sjm                        LEADPARA
                            SECTION
                            HEADLINE
 cr                                  TTL
 ft                                  HEADLINE
 fbis                                         H3 (within+)
                                              HT (within+)
 la                                           HEADLINE
                                              within+
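
As an illustration of the structure described above, here is a minimal sketch of a liberal parser that splits a raw file into documents and keeps everything inside TEXT except comments and mark-up tags. The function name is hypothetical, and the sketch assumes comments use the SGML form <!-- ... -->. It looks only at DOC, DOCNO and TEXT; the title and summary tags listed in the table would need their own patterns.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # assumed comment syntax
TAG_RE = re.compile(r"<[^>]+>")


def parse_trec_docs(raw):
    """Yield (docno, text) pairs from the contents of one TREC file."""
    for doc in DOC_RE.finditer(raw):
        body = doc.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # no identifier, nothing to index under
        text = " ".join(TEXT_RE.findall(body))
        text = COMMENT_RE.sub(" ", text)  # drop comments inside TEXT
        text = TAG_RE.sub(" ", text)      # drop nested mark-up, keep its content
        yield docno.group(1), " ".join(text.split())
```

Regular expressions are used here instead of an XML parser because the sketch does not assume the files are well-formed XML.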

QUERY STRUCTURE

QUERY ID head num dom title desc smry narr con fac nat def
1-100    x    x   x   x     x         x    x   x       x
101-150  x    x   x   x     x    x    x    x   x       x
151-200       x       x     x         x
201-250       x             x
251-300       x       x     x         x
301-350       x       x     x         x
351-400       x       x     x         x
401-450       x       x     x         x
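
The table can be checked against a topic file with a small parser. The sketch below rests on an assumption that is not stated above: that each topic is wrapped in <top> ... </top> and that the field tags (num, title, desc, narr and so on) are opening tags without matching close tags, so a field runs until the next tag. The function name is hypothetical. Labels such as "Number:" are left in the field values.

```python
import re

TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL | re.IGNORECASE)
# A field is a tag name followed by everything up to the next "<".
FIELD_RE = re.compile(r"<(\w+)>([^<]*)")


def parse_topics(raw):
    """Return one {tag: text} dictionary per topic in a topic file."""
    topics = []
    for top in TOP_RE.finditer(raw):
        fields = {}
        for tag, value in FIELD_RE.findall(top.group(1)):
            fields[tag.lower()] = " ".join(value.split())
        topics.append(fields)
    return topics


# Made-up topic text in the assumed layout.
sample = "<top>\n<num> Number: 201\n<desc> Description:\nsample question text\n</top>\n"
print(sorted(parse_topics(sample)[0]))  # ['desc', 'num']
```

Collecting the keys seen across a range of topics gives one row of the table.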

EMPTY DOCUMENTS

Depending on how you configure a search engine's parser (which tag contents to pick, etc.), documents may end up empty, with no usable content. I usually make parsers as liberal as possible, so that everything within a tag is consumed except for comments and the mark-up tags themselves. The popular papers (and, for that matter, even recent ones) reporting experiments on TREC adhoc data don't mention details of the parser. TREC documents carry a host of mark-up, and little is known about why it is there, which search systems used it, or which TREC tasks these annotations were meant for.

So even with the most liberal parser, a few documents turn out to be empty; I have found three so far. The first has an empty TEXT section, and the other two contain only "None.", which falls prey to the tokenizer and stemmer:

File                DOCNO
cd1/doe/doe1_096    DOE1-96-1081
cd1/doe/doe2_013    DOE2-13-0573
cd1/doe/doe2_051    DOE2-51-1160

cd1/doe/doe1_096

<DOC>
<DOCNO> DOE1-96-1081 </DOCNO>
<TEXT>

</TEXT>
</DOC>

cd1/doe/doe2_013

<DOC>
<DOCNO> DOE2-13-0573 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>

cd1/doe/doe2_051

<DOC>
<DOCNO> DOE2-51-1160 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>
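
A quick way to spot such documents is to run the indexing pipeline over each one and flag those that yield no terms. The sketch below is an illustration under stated assumptions: the tokenizer is a simple lowercase alphanumeric split, there is no stemmer, and the tiny stopword list (which includes "none") is hypothetical. Whether "None." survives in a real system depends on that system's own tokenizer, stopword list and stemmer.

```python
import re

# Hypothetical stopword list; real systems ship much longer ones.
STOPWORDS = {"a", "an", "of", "the", "none"}


def index_terms(text, stopwords=STOPWORDS):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def find_empty(docs):
    """Return the DOCNOs of (docno, text) pairs that produce no index terms."""
    return [docno for docno, text in docs if not index_terms(text)]


docs = [("DOE1-96-1081", ""), ("DOE2-13-0573", "None."), ("DOE2-51-1160", "None.")]
print(find_empty(docs))  # ['DOE1-96-1081', 'DOE2-13-0573', 'DOE2-51-1160']
```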

VERY LONG TERMS

Documents may have very long terms like this one from document LA072290-0141 in the CD5 LA Times sub-collection:

Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch

This is the name of a village in Wales; see the Wikipedia page about Llanfairpwllgwyngyll.

It is therefore recommended that you neither allocate only a small, fixed number of bytes for tokens or terms when building parsers, nor mistake such oddities for parser errors.
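
One way to follow this advice is to list the longest terms a parser produces and inspect them by hand, instead of truncating or discarding anything over a fixed length. The sketch below is a minimal illustration; the function name and the letters-only token pattern are assumptions.

```python
import re


def longest_terms(text, n=5):
    """Return the n longest distinct alphabetic tokens, longest first."""
    tokens = set(re.findall(r"[^\W\d_]+", text))
    # Sort by length (descending), then alphabetically for a stable order.
    return sorted(tokens, key=lambda t: (-len(t), t))[:n]


sample = "A village called Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch in Wales"
print(longest_terms(sample, 1))
```

Python strings grow as needed, so nothing is cut off here. In a language with fixed-size buffers, the same check helps in choosing a safe upper bound.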
