Writings on Information Retrieval

  1. On the Provenance of tf-idf 2016
  2. Black Boxes are Harmful September 2016
  3. Lucene Notes 2016
  4. Terrier Notes 2016
  5. A Method for Cross-collection Comparison November 2014
  6. Forum for Information Retrieval Evaluation November 2013
  7. Overview of FIRE 2011 2013
  8. Simple Transliteration for CLIR 2013
  9. Using Negative Information in Search February 2011

On the Provenance of tf-idf 2016

Markdown document | Tables: SMART Notation PDF | Okapi BM Variants PDF | BM Constants PDF

I left NIST in August 2017, went back to Kolkata, and then moved to Vancouver in February 2018. My time in IR had concluded and I would move on to computer graphics and systems programming, long-cherished interests I had not devoted enough time to in the years past. As closure, I collected these notes and software related to the issue of repeatability and reproducibility in IR. In these notes I trace the provenance of term-weighting equations (tf-idf equations in IR parlance) to put to rest once and for all any ambiguity about their structure and form. Some of the tables I used in the article are provided as separate PDF documents.
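To give a flavour of the kind of equation whose provenance these notes trace, here is one common tf-idf variant sketched in Python. This is only an illustration of the general shape (log-scaled term frequency times inverse document frequency), not the specific variants catalogued in the notes or the SMART-notation tables.

```python
import math

def tf_idf(tf, df, n_docs):
    """One common tf-idf variant (illustrative only): log-scaled term
    frequency multiplied by inverse document frequency.
    tf     -- occurrences of the term in the document
    df     -- number of documents containing the term
    n_docs -- total number of documents in the collection"""
    if tf == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)

# A term occurring 3 times in a document, appearing in 10 of 1000 documents:
weight = tf_idf(3, 10, 1000)
```

Many published systems differ precisely in such details (which logarithm base, whether the term-frequency factor is raw, logged, or normalized), which is why tracing the provenance of each variant matters.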


Black Boxes are Harmful September 2016

Lucene4IR Workshop, University of Strathclyde, Glasgow, UK (8-9 September 2016). Report on the Lucene4IR Workshop, SIGIR Forum, 50, 2 (December 2016), 58-75.

PDF | Slides PDF

By this time I was having qualms about the correctness of results produced by the search engines in use in the IR research community. Talking to people, I found that they relied on word-of-mouth heuristics to set up their experiment pipelines, and most of this detail was omitted from papers reporting experiments. Looking at the code, I found an incorrect implementation of the BM25 term-weighting equation in Terrier-4.0, a search engine used by many researchers. Researchers were also unaware of the optimizations in Apache's Lucene software that changed a document collection's document-length distribution. This led me to spend the rest of my time in IR on repeatability and reproducibility.
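For reference, the textbook form of a single term's BM25 contribution can be sketched as below. This is the commonly cited Robertson formulation, not the specific code found in Terrier-4.0 or Lucene; the point of the workshop discussion was precisely that shipped implementations diverge from this form in subtle ways.

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score contribution of one query term
    (illustrative sketch; real systems vary in details).
    tf          -- term frequency in the document
    df          -- document frequency of the term
    n_docs      -- number of documents in the collection
    doc_len     -- length of this document
    avg_doc_len -- average document length in the collection
    k1, b       -- the usual free constants"""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Term-frequency saturation with document-length normalization:
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Note how the score depends on the document-length distribution via `doc_len / avg_doc_len`: an optimization that silently alters stored document lengths therefore changes retrieval scores, which is why the Lucene behaviour mentioned above matters.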


Lucene Notes 2016

Markdown document

Together with Terrier Notes, these are notes on the Lucene search system I was working with at the time.


Terrier Notes 2016

Markdown document

This is my collection of notes on using Terrier-4.0 for IR experiments; it clarifies some things that are hard to understand from the documentation accompanying the software. Some of the bugs and pitfalls I point out are specific to Terrier-4.0. I wrote this in the summer of 2016, and newer versions of the software have appeared since; hopefully those bugs no longer exist. Even so, I think these notes will continue to help disambiguate parts of the documentation.


A Method for Cross-collection Comparison (with Donna Harman & Ian Soboroff) November 2014

Text REtrieval Conference (TREC), National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. (November 2014)

Poster PDF

This poster was a preview of a project studying IR experiments using meta-analysis, a technique from inferential statistics. While working on it I took a detour to address the issues with reproducing IR experiments.


Forum for Information Retrieval Evaluation November 2013

Text REtrieval Conference (TREC), National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. (November 2013)

Poster PDF

In 2012 I moved from India to the U.S.A. to work in NIST's Retrieval Group. Like I.S.I.'s FIRE back in India, NIST's TREC was the annual meeting where IR researchers shared and studied text retrieval methodologies. I put up this poster at its poster session to spread the word about FIRE.


Overview of FIRE 2011 (with Prasenjit Majumder, Dipasree Pal, Ayan Bandyopadhyay, Mandar Mitra) 2013

Multilingual Information Access in South Asian Languages, Springer, Berlin, Heidelberg, (2013) 1-12.

Web page

The overview paper, written after the FIRE workshop every year, summarizes the year's participation statistics, research directions, and submissions.


Simple Transliteration for CLIR (with Prasenjit Majumder) 2013

Multilingual Information Access in South Asian Languages. Springer, Berlin, Heidelberg (2013), 241-251.

PDF

The acronym 'CLIR' in the title stands for 'Cross Lingual Information Retrieval'. This paper accompanied a set of IR experiments (search results) submitted for the workshop titled 'Forum for Information Retrieval Evaluation' (FIRE) held annually for the IR community in India. The IR Lab at I.S.I. Kolkata, where I worked, was also a co-organizer. One of FIRE's responsibilities is curating data sets in various Indian languages for IR research. At the annual workshop students and researchers share their experimental results produced using these data sets.


Using Negative Information in Search (with Sukomal Pal & Mandar Mitra) February 2011

Second International Conference on Emerging Applications of Information Technology, IEEE (February 2011), 53-56.

PDF | slides PDF

This was my introduction to the methods of setting up and running information retrieval experiments, and to writing an academic paper. I wrote it after I started working as a programmer at I.S.I. Kolkata's Information Retrieval Laboratory. I would go on to work with IR experiments for the next six years.