vocab.txt
wrong 33 front 34 ey 16 revis 28 nevar 38 rude 14
1897 26 great 6 1990 42 open 15 lewi 22 sever 13
slope 47 final 27 adventur 29 spell 37 gardner 40 remark 12
issu 36 learn 9 make 10 quill 45 flap 48 flat 32
wide 17 curios 7 person 11 produc 30 time 5 propos 24
annot 41 gave 43 answer 25 martin 39 earli 35 desk 21
raven 19 reader 44 note 31 carrol 23 hatter 3 speech 8
hair 1 write 20 alic 4 hear 18 cut 2 dip 46
Internally, tokens1 are mapped to integers, and vocab.txt records the mapping. The word Alice was normalized2 to alic.
The entry alic 4 (third column, last row) means that the term alic is represented in the tokenized corpus by the integer 4.
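The mapping can be recovered from vocab.txt by splitting each line on whitespace and pairing every token with the integer that follows it. A minimal sketch, assuming the file format shown above (alternating token and ID, six pairs per line); load_vocab and the embedded sample are illustrative names, not part of the original text:

```python
# Two sample lines copied from the vocab.txt listing above.
sample = """wrong 33 front 34 ey 16 revis 28 nevar 38 rude 14
hair 1 write 20 alic 4 hear 18 cut 2 dip 46"""

def load_vocab(text):
    """Build a token -> integer mapping from vocab.txt-style text."""
    vocab = {}
    for line in text.splitlines():
        fields = line.split()
        # Pair each token (even positions) with the ID after it (odd positions).
        for token, idx in zip(fields[::2], fields[1::2]):
            vocab[token] = int(idx)
    return vocab

vocab = load_vocab(sample)
print(vocab["alic"])  # -> 4, the integer representing "alic" in the corpus
```

Looking up vocab["alic"] returns 4, matching the entry discussed above.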
1 From this point on, we refer to the strings that were words as tokens.
2 Stemming algorithms use heuristics; they do not lemmatize, which would be the linguistically correct normalization.
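The heuristic nature of stemming is why the vocabulary contains non-words such as alic. A toy sketch of the idea (this is an illustrative suffix stripper, not the actual algorithm used to produce vocab.txt):

```python
def toy_stem(word):
    """Toy heuristic stemmer: crudely strip a common suffix.

    Unlike a lemmatizer, it applies string rules with no dictionary,
    so the result need not be a real word.
    """
    for suffix in ("ation", "ing", "er", "e", "s"):
        # Only strip if a stem of at least 3 characters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("alice"))     # -> "alic", not a dictionary word
print(toy_stem("learning"))  # -> "learn"
```

A lemmatizer would instead map alice to the lemma alice (a valid word), which is what makes stemming the linguistically "incorrect" but cheaper normalization.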