vocab.txt
wrong 33 front 34 ey 16 revis 28 nevar 38 rude 14
1897 26 great 6 1990 42 open 15 lewi 22 sever 13
slope 47 final 27 adventur 29 spell 37 gardner 40 remark 12
issu 36 learn 9 make 10 quill 45 flap 48 flat 32
wide 17 curios 7 person 11 produc 30 time 5 propos 24
annot 41 gave 43 answer 25 martin 39 earli 35 desk 21
raven 19 reader 44 note 31 carrol 23 hatter 3 speech 8
hair 1 write 20 alic 4 hear 18 cut 2 dip 46
Internally, tokens1 are mapped to integers, and vocab.txt records the mapping. The word Alice was normalized2 to alic.
The entry alic 4 (third column, last row) means that the term alic is represented in the tokenized corpus by the integer 4.
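The mapping can be recovered from vocab.txt by splitting each line on whitespace and pairing every token with the integer that follows it. A minimal sketch, assuming the file format shown above (alternating token and ID, six pairs per line); load_vocab and the embedded sample are illustrative names, not part of the original text:

```python
# Two sample lines copied from the vocab.txt listing above.
sample = """wrong 33 front 34 ey 16 revis 28 nevar 38 rude 14
hair 1 write 20 alic 4 hear 18 cut 2 dip 46"""

def load_vocab(text):
    """Build a token -> integer mapping from vocab.txt-style text."""
    vocab = {}
    for line in text.splitlines():
        fields = line.split()
        # Pair each token (even positions) with the ID after it (odd positions).
        for token, idx in zip(fields[::2], fields[1::2]):
            vocab[token] = int(idx)
    return vocab

vocab = load_vocab(sample)
print(vocab["alic"])  # -> 4, the integer representing "alic" in the corpus
```

Looking up vocab["alic"] returns 4, matching the entry discussed above.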
1 From this point on, we refer to the strings that were words as tokens.
2 Stemming algorithms use heuristics; they do not lemmatize, which would be the linguistically correct normalization.
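The heuristic nature of stemming is why the vocabulary contains non-words such as alic. A toy sketch of the idea (this is an illustrative suffix stripper, not the actual algorithm used to produce vocab.txt):

```python
def toy_stem(word):
    """Toy heuristic stemmer: crudely strip a common suffix.

    Unlike a lemmatizer, it applies string rules with no dictionary,
    so the result need not be a real word.
    """
    for suffix in ("ation", "ing", "er", "e", "s"):
        # Only strip if a stem of at least 3 characters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("alice"))     # -> "alic", not a dictionary word
print(toy_stem("learning"))  # -> "learn"
```

A lemmatizer would instead map alice to the lemma alice (a valid word), which is what makes stemming the linguistically "incorrect" but cheaper normalization.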