Notes on Using Terrier for IR Experiments
In the text that follows, $TERRIERHOME refers to the directory into which the software is unpacked.
Terrier-4.0 is packaged with a file containing 733 stop-words which is to be found here;
Enabling stop-word removal in Terrier-4.0 most probably makes use of this list.
Here are some popular stemming algorithms. There are many more in Terrier-4.0 mentioned on this page:
PorterStemmer WeakPorterStemmer EnglishSnowballStemmer SStemmer
SStemmer is an implementation of the S-Stemmer algorithm described in the paper "How effective is suffixing?" (1991).
The code is a useful guide to modifying stemming algorithms in Terrier-4.0.
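For orientation, here is a minimal sketch of the three S-Stemmer rules as they are usually stated from that paper. This is not Terrier's SStemmer source: the class name SStemmerSketch is mine, the input is assumed to be lower-cased already, and I assume that a rule blocked by one of its exceptions lets the next rule be tried.

```java
final class SStemmerSketch {

    // Applies the first rule whose suffix matches and whose exceptions do not.
    static String stem(String w) {
        // Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 2) + "e";
        }
        // Rule 3: drop a final "s", unless the word ends in "us" or "ss".
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ponies", "churches", "cats", "glass", "corpus"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

A real Terrier stemmer would be a term pipeline class inside Terrier's source tree; the sketch only shows the string rewriting.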
The names of models are in the 'models' file, whose contents are shown below. The first column lists the names that I make Terrier use to identify the Java classes listed in the second column. Since these strings show up in the run tag, it was necessary to map the class names to shorter strings for readability. It is also useful to have descriptive strings for the cryptic names used in Terrier. The models file provides the facility to do that kind of mapping.
bm25         BM25
dfic         DFIC
dfiz         DFIZ
dfr_bm25     DFR_BM25
dfree        DFRee
dfreeklim    DFReeKLIM
dlh          DLH
dlh13        DLH13
dph          DPH
dirichletlm  DirichletLM
dl           Dl
hiemstra_lm  Hiemstra_LM
ifb2         IFB2
inb2         InB2
inl2         InL2
in_expb2     In_expB2
in_expc2     In_expC2
js_kls       Js_KLs
lemurtf_idf  LemurTF_IDF
mdl2         MDL2
ml2          ML2
pl2          PL2
tf_idf       TF_IDF
tf           Tf
xsqra_m      XSqrA_M
tmpl         TMPL
A full list of models available in Terrier-4.0 is to be found in the package org.terrier.matching.models:
TMPL.java is a no-frills version of a Java class that implements a retrieval model in Terrier-4.0. To implement a variation of BM25, say, BM25SIMPLE, you need to do the following:
Change the class name and the constructor name from 'TMPL' to 'BM25SIMPLE', and rename the file to BM25SIMPLE.java.
Initialize the constants k1 and b in the constructor.
In the score() function, change this line
double w = 1.0;
to
double w = (k1 * tf) / (k1 * (1.0 - b + b * (dl / averageDocumentLength)) + tf) * Math.log(numberOfDocuments / documentFrequency);
Note that the score() function is called with respect to a query term and the following parameters are available in score() in these variables:
dl - Document length
tf - Term frequency
numberOfDocuments - Number of documents in the corpus
documentFrequency - Number of documents in which the term exists
averageDocumentLength - Average document length in the corpus
keyFrequency - Frequency of the term in the query
Then move BM25SIMPLE.java to:
Recompile by typing "ant" in the shell.
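To check the arithmetic of BM25SIMPLE before wiring it into Terrier, the weight can be computed in a standalone class. This is a sketch only: Bm25SimpleSketch does not extend any Terrier class, the corpus statistics are plain fields that I set by hand (in Terrier the framework fills them in), and the natural logarithm is my assumption.

```java
final class Bm25SimpleSketch {
    private final double k1;
    private final double b;

    // Corpus statistics, named after the variables available in score().
    double numberOfDocuments;
    double documentFrequency;
    double averageDocumentLength;

    Bm25SimpleSketch(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    // Weight of one query term in one document.
    double score(double tf, double dl) {
        double lengthNorm = k1 * (1.0 - b + b * (dl / averageDocumentLength));
        double idf = Math.log(numberOfDocuments / documentFrequency);
        return (k1 * tf) / (lengthNorm + tf) * idf;
    }

    public static void main(String[] args) {
        // k1 = 1.2 and b = 0.75 are illustrative values, not a recommendation.
        Bm25SimpleSketch model = new Bm25SimpleSketch(1.2, 0.75);
        model.numberOfDocuments = 1000.0;
        model.documentFrequency = 10.0;
        model.averageDocumentLength = 100.0;
        System.out.println(model.score(3.0, 100.0));
    }
}
```

Raising tf raises the weight, and a longer document (larger dl) lowers it, which is the behaviour to look for once the class is running inside Terrier.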
To index a TREC corpus, execute the Bash script with the following options:
bin/trec_terrier.sh -i -Dcollection.spec=[r] -Dterrier.index.path=[s] -Dstopwords.filename=[t] -Dtermpipelines=[u] -DTrecDocTags.doctag=[v] -DTrecDocTags.idtag=[w] -DTrecDocTags.process=[x] -DTrecDocTags.skip=[y] -DTrecDocTags.casesensitive=[z]
where the configuration parameters are
[r] - A file listing the paths to each file of the corpus. Basically, this file contains the output of "find -L corpus/* -type f".
[s] - Path to the directory where the index will be put.
[t] - A file containing the stop-words.
[u] - s1,s2 (two comma-separated strings). 's1' is the string 'Stopwords', which tells Terrier to do 'stop-word removal' during indexing. 's2' is the Java class name of the stemmer. To switch off either of these two steps, use the string 'NoOp' in place of s1 or s2. For example, 'NoOp,NoOp' means no stopping and no stemming.
[v] - The tag that encloses 'one' TREC document. This is not to be confused with the 'files' listed in the file specified using parameter [r]. Many TREC documents are kept concatenated in each of the files in that list, since it is inefficient to store millions of documents in as many files on disk.
[w] - The tag that encloses the unique identifier of a TREC document. Ground-truth information maps queries to documents using these identifier strings. Queries have unique identifiers too.
[x] - The mark-up tags in documents whose content is considered to be part of Terrier's view of a document. Set this to "TEXT,TITLE,HEAD,HL" to index TREC CD 1-2, or, "TEXT,H3,DOC,TITLE,HEADLINE,TTL" to index TREC CD 4-5. To be able to index all the text within all the children of the 'DOC' tag, which is useful as a one-shot solution to index CD 1-5, leave [x] and [y] (mentioned next) blank. See the references below for sources of this information.
[y] - Comma-separated list of tags to ignore.
[z] - Set to 'false'
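The file for parameter [r] can also be produced without "find". The sketch below walks a corpus directory, follows symbolic links (as "find -L" does) and writes one path per line. The class name CollectionSpecSketch, the use of absolute paths and the sorted order are my own choices, not something Terrier requires as far as I know.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CollectionSpecSketch {

    // Lists every regular file below corpusDir, following symbolic links.
    static List<String> listCorpusFiles(Path corpusDir) throws IOException {
        try (Stream<Path> paths = Files.walk(corpusDir, FileVisitOption.FOLLOW_LINKS)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> p.toAbsolutePath().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Writes the listing to specFile, one path per line.
    static void writeSpec(Path corpusDir, Path specFile) throws IOException {
        Files.write(specFile, listCorpusFiles(corpusDir));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CollectionSpecSketch <corpus-dir> <spec-file>");
            return;
        }
        writeSpec(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The resulting file is what gets passed as -Dcollection.spec=[r].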
To retrieve, run this script with the following options:
bin/trec_terrier.sh -r -q -c [i] -Dterrier.index.path=[j] -Dtrec.topics=[k] -DTrecQueryTags.doctag=[l] -DTrecQueryTags.idtag=[m] -DTrecQueryTags.process=[n] -DTrecQueryTags.skip=[o] -DTrecQueryTags.casesensitive=[p] -Dstopwords.filename=[q] -Dtermpipelines=[r] -Dtrec.model=[s] -Dquerying.postprocesses.controls=[t] -Dquerying.postprocesses.order=[u] -Dtrec.qe.model=[v] -Dexpansion.terms=[w] -Dexpansion.documents=[x] -Dtrec.results=[y] -Dtrec.results.file=[z]
[i] - A 'term-frequency normalization constant'. Voodoo! The table shows a range of values to pick for different similarity models and for the parts of a TREC query: Title (T) and Description (D).
     BM25      PL2    LM-dirichlet   TF_IDF
---------------------------------------------
T    0.3-0.5   4-7    750-1000       0.3-0.5
D    0.6-0.8   1-2    1500-2000      0.6-0.8
---------------------------------------------
On the page below, scroll down to the 'Weighting Models and Parameters' section where this constant is mentioned:
[j] - Path to the directory that contains the index.
[k] - The file that contains the queries.
[l] - The tag that encloses each query in the file containing queries.
[m] - The tag that contains a query's unique identifier.
[n] - The tags to process. This is ugly. Terrier requires you to specify the tags in [l] and [m] (see above) and then to mention them once again here, to tell the system to use their contents.
[o] - The parts of the query to skip, specified by a comma-separated list of the names of the tags that enclose those parts. It so happens that Terrier needs to be told simultaneously what to process and what to skip. Saying 'process this, this and this' does not imply 'do not process the others'. Terrier's documentation recommends specifying 'TOP,NUM,TITLE' for the '.process' parameter and 'DESC,NARR' for the '.skip' parameter in the situation where you only want to use the 'TITLE' part of a query.
[p] - Set to 'false'
[q] - A file containing the stop-words.
[r] - s1,s2 (two comma-separated strings). This is identical to the parameter [u] used for indexing.
[s] - The name of the retrieval model (i.e. the similarity algorithm).
[t] - "qe:QueryExpansion" To switch on query expansion. Voodoo.
[u] - "QueryExpansion" Some more voodoo.
[v] - The fully qualified Java class name of the query expansion algorithm. For example:
[w] - The number of terms to add to a query to expand it.
[x] - Number of top documents from the first round of search results to use for computing query expansion statistics.
[y] - The directory in which to place the files that contain the search results.
[z] - The name of the file that contains the ranked search results.
Specifying the stop-words file does not switch on the 'removal of stop-words'. It has to be switched on by another specification (see the parameter [u] in INDEXING and the parameter [r] in RETRIEVAL).
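A toy model of the term pipeline makes this point concrete: the stop-word list is only consulted when a 'Stopwords' stage is named in the pipeline string. Everything in this sketch is hypothetical and does not use Terrier's API. TermPipelineSketch and 'ToyStemmer' (which just strips a trailing "s") are names I made up; only the stage names 'Stopwords' and 'NoOp' come from the notes above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermPipelineSketch {

    // Runs one term through one stage; null means the term was dropped.
    static String applyStage(String stage, String term, Set<String> stopWords) {
        switch (stage) {
            case "Stopwords":
                return stopWords.contains(term) ? null : term;
            case "ToyStemmer": // hypothetical stand-in for a real stemmer class
                return (term.length() > 1 && term.endsWith("s"))
                        ? term.substring(0, term.length() - 1) : term;
            case "NoOp":
                return term;
            default:
                throw new IllegalArgumentException("unknown stage: " + stage);
        }
    }

    // spec is a comma-separated stage list such as "Stopwords,ToyStemmer".
    static List<String> run(String spec, Set<String> stopWords, List<String> terms) {
        String[] stages = spec.split(",");
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            String current = term;
            for (String stage : stages) {
                if (current == null) break;
                current = applyStage(stage.trim(), current, stopWords);
            }
            if (current != null) kept.add(current);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));
        List<String> terms = Arrays.asList("the", "effects", "of", "stemmers");
        System.out.println(run("NoOp,NoOp", stopWords, terms));
        System.out.println(run("Stopwords,ToyStemmer", stopWords, terms));
    }
}
```

With 'NoOp,NoOp' the stop-word set is passed in but never used, which mirrors what happens when only the stop-words file is specified.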
Specifying a model's parameter through a single specification, dubbed 'term frequency normalization', is confusing because models have more than one parameter to be set and the parameters themselves can range from fractions to large integers. This kind of specification does not require the user to also mention the model for which it is meant. (See [i] in RETRIEVAL for a table of values.)
Specifying the tags to 'process' does not implicitly exclude the remaining tags from being processed. The set of 'do not process' items has to be specified again as a list of things to 'skip'. It is unclear what would happen if the two lists have items in common, or if one of the lists is empty. (See [x] and [y] in INDEXING, and [n] and [o] in RETRIEVAL.)
Two models in Terrier-4.0, named 'BM25' and 'TF_IDF', are a source of confusion. BM25 has two unconventional variations (which I label A and B) in the source code (BM25.java) which I cannot trace to any IR literature. In type A the factor (k1 + 1) is absent from the numerator, and in type B a term '2 * tf' is present in the denominator (K = c + tf is set, and then the expression K + tf shows up, which reduces to c + 2 * tf). I could assume it is a bug, or a tweak to improve performance, as is often done to tune a term-weighting model. Note that there are two overloaded methods (one deprecated) that implement the two types of BM25, A and B. Lastly, the 'TF_IDF' model is actually a variation of the BM25 formula, as shown:
BM25 (A)
k1 = 1.2
k3 = 8.0
b = 0.75
w = tf / (k1 * (1.0 - b + b * (dl / adl)) + tf) * log((N - n + 0.5) / (n + 0.5)) * (((k3 + 1.0) * tfq) / (k3 + tfq))

BM25 (B)
k1 = 1.2
k3 = 8.0
b = 0.75
w = ((k1 + 1.0) * tf) / (k1 * (1.0 - b + b * (dl / adl)) + 2.0 * tf) * log((N - n + 0.5) / (n + 0.5)) * (((k3 + 1.0) * tfq) / (k3 + tfq))

TF_IDF
k1 = 1.2
b = 0.75
w = (k1 * tf) / (k1 * (1.0 - b + b * (dl / adl)) + tf) * log((N / n) + 1.0)
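To get a feel for how far the three formulas drift apart, they can be evaluated side by side on the same inputs. The sketch below transcribes the formulas above into a standalone class. It is not Terrier code: the class and method names are mine, and the natural logarithm is an assumption, since the formulas do not state the base.

```java
final class Bm25VariantsSketch {
    static final double K1 = 1.2;
    static final double K3 = 8.0;
    static final double B = 0.75;

    // Length-normalisation term shared by all three formulas.
    static double lengthNorm(double dl, double adl) {
        return K1 * (1.0 - B + B * (dl / adl));
    }

    // IDF factor used by both BM25 variants.
    static double idf(double N, double n) {
        return Math.log((N - n + 0.5) / (n + 0.5));
    }

    // Query-term-frequency factor used by both BM25 variants.
    static double queryFactor(double tfq) {
        return ((K3 + 1.0) * tfq) / (K3 + tfq);
    }

    static double bm25A(double tf, double dl, double adl, double N, double n, double tfq) {
        return tf / (lengthNorm(dl, adl) + tf) * idf(N, n) * queryFactor(tfq);
    }

    static double bm25B(double tf, double dl, double adl, double N, double n, double tfq) {
        return ((K1 + 1.0) * tf) / (lengthNorm(dl, adl) + 2.0 * tf) * idf(N, n) * queryFactor(tfq);
    }

    static double tfIdf(double tf, double dl, double adl, double N, double n) {
        return (K1 * tf) / (lengthNorm(dl, adl) + tf) * Math.log((N / n) + 1.0);
    }

    public static void main(String[] args) {
        // Illustrative inputs: an average-length document in a 1000-document corpus.
        double dl = 100.0, adl = 100.0, N = 1000.0, n = 10.0, tfq = 1.0;
        for (double tf : new double[] {1.0, 5.0, 50.0}) {
            System.out.println("tf=" + tf
                    + " A=" + bm25A(tf, dl, adl, N, n, tfq)
                    + " B=" + bm25B(tf, dl, adl, N, n, tfq)
                    + " TF_IDF=" + tfIdf(tf, dl, adl, N, n));
        }
    }
}
```

One thing the formulas imply: as tf grows, the tf part of type A saturates at 1, while that of type B saturates at (k1 + 1) / 2 because of the 2 * tf in the denominator.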
Q: Why use Terrier-4.0 when there are newer versions?
A: This exposition is confined to Terrier-4.0. Newer versions of Terrier have switched the build system to Maven and I don't have time to make that shift. Until a major Terrier release swings by, most of what I document here should remain relevant.
Q: Has this mod's version number got anything to do with Terrier's version number?
A: No. I have no plans to keep up with Terrier.