CS 347 Lecture 2 April 9, 2001 ©Prabhakar Raghavan
Today’s topics • Inverted index storage – Compressing dictionaries into memory • Processing Boolean queries – Optimizing term processing – Skip list encoding • Wild-card queries • Positional/phrase queries • Evaluating IR systems
Recall dictionary and postings files In memory Gap-encoded, on disk
Inverted index storage • Last time: Postings compression by gap encoding • Now: Dictionary storage – Dictionary in main memory, postings on disk • Tradeoffs between compression and query processing speed – Cascaded family of techniques
Dictionary storage - first cut • Array of fixed-width entries – 20 bytes for the term, 4 bytes each for freq and postings pointer: 28 bytes/term = 14 MB for 500K terms. – Allows for fast binary search into dictionary
Fixed-width terms are wasteful • Most of the bytes in the Term column are wasted - we allot 20 bytes even for 1-letter terms. – Still can’t handle supercalifragilisticexpialidocious. • Average word in English: ~8 characters. – Written English averages ~4.5 characters: short words dominate usage. • Store dictionary as a string of characters: – Hope to save up to 60% of dictionary space.
Compressing the term list …. systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. Total string length = 500K terms x 8 bytes = 4 MB. Pointers must resolve 4M positions: log2 4M = 22 bits = 3 bytes per pointer. Binary search these pointers.
Total space for compressed list • 4 bytes per term for freq. • 4 bytes per term for pointer to postings. • 3 bytes per term-string pointer. • Avg. 8 bytes per term in the term string. • Now avg. 11 bytes per term for the term itself, not 20. • 500K terms x 19 bytes = 9.5 MB.
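The dictionary-as-a-string layout above can be sketched as follows. This is an illustrative in-memory model, not the byte-packed layout the slide costs out: offsets stand in for the 3-byte term pointers, and a term's end is simply the next term's start, which is why no per-term length is needed.

```python
# Sketch: dictionary stored as one concatenated string plus a sorted
# list of start offsets (the "term pointers"). Sample terms are from
# the slide; freq/postings pointers are omitted for brevity.
terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
string = "".join(terms)

offsets = []
pos = 0
for t in terms:
    offsets.append(pos)
    pos += len(t)
offsets.append(len(string))  # sentinel marking the end of the last term

def term_at(i):
    """Recover the i-th term: it runs from its pointer to the next pointer."""
    return string[offsets[i]:offsets[i + 1]]

def lookup(term):
    """Binary search over the term pointers; return term index or -1."""
    lo, hi = 0, len(terms) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(mid)
        if t == term:
            return mid
        if t < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

Note that the sentinel offset is what lets the last term be decoded without storing any lengths.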
Blocking • Store pointers to every kth term on the string. • Need to store term lengths (1 extra byte) …. 7 systile 9 syzygetic 8 syzygial 6 syzygy 11 szaibelyite 8 szczecin 9 szomo…. • With k = 4: save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Exercise • Estimate the space usage (and savings compared to 9.5 MB) with blocking, for block sizes of k = 4, 8 and 16.
Impact on search • Binary search down to the 4-term block; • Then linear search through the terms in the block. • Instead of chasing 2 pointers before, now chase 0/1/2/3 - avg. of 1.5.
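Blocking and its search procedure can be sketched together. A minimal model assuming k = 4: one pointer per block, each term prefixed by a 1-byte length (the extra byte the slide charges for), binary search to the right block, then a linear scan inside it. The vocabulary here is illustrative.

```python
# Sketch of a blocked dictionary string with k = 4.
K = 4
terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomolnokite", "tackiness"]

blob_parts = []
block_ptrs = []          # one pointer per block of K terms
pos = 0
for i, t in enumerate(terms):
    if i % K == 0:
        block_ptrs.append(pos)
    blob_parts.append(chr(len(t)) + t)   # length byte, then the term
    pos += 1 + len(t)
blob = "".join(blob_parts)

def read_block(b):
    """Decode the up-to-K terms of block b by walking the length prefixes."""
    out, p = [], block_ptrs[b]
    end = block_ptrs[b + 1] if b + 1 < len(block_ptrs) else len(blob)
    while p < end:
        n = ord(blob[p])
        out.append(blob[p + 1:p + 1 + n])
        p += 1 + n
    return out

def lookup(term):
    """Binary search for the last block whose first term <= term,
    then linear-scan inside that block."""
    lo, hi = 0, len(block_ptrs) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if read_block(mid)[0] <= term:
            lo = mid
        else:
            hi = mid - 1
    block = read_block(lo)
    return (lo * K + block.index(term)) if term in block else -1
```

The linear scan inside a block is what trades a little search time for the pointer bytes saved.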
Extreme compression • Using perfect hashing to store terms “within” their pointers – not good for vocabularies that change. • Partition dictionary into pages – use a B-tree on the first terms of pages – pay a disk seek to grab each page – if we’re paying 1 disk seek anyway to get the postings, “only” another seek per query term.
Query optimization • Consider a query that is an AND of t terms. • The idea: for each of the t terms, get its postings from the term-doc incidence, then AND them together. • Process in order of increasing freq: start with the smallest set, then keep cutting further. – This is why we kept freq in the dictionary.
Query processing exercises • If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen? • How can we perform the AND of two postings entries without explicitly building the 0/1 term-doc incidence vector?
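A sketch of the answer to the second exercise question: two sorted postings lists can be intersected with a linear merge, never materializing a 0/1 incidence vector. An AND-NOT merge is included as well, which is one way the freq of countrymen matters: its postings must still be walked, so a frequent negated term keeps the merge expensive.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists in O(len(p1) + len(p2)),
    without building a 0/1 term-doc incidence vector."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_not(p1, p2):
    """p1 AND (NOT p2): docs in p1 absent from p2, again by a linear merge."""
    j = 0
    out = []
    for d in p1:
        while j < len(p2) and p2[j] < d:
            j += 1
        if j == len(p2) or p2[j] != d:
            out.append(d)
    return out
```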
General query optimization • e.g., (madding OR crowd) AND (ignoble OR strife) • Get freqs for all terms. • Estimate the size of each OR by the sum of its freqs. • Process in increasing order of OR sizes.
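The ordering heuristic above can be sketched in a few lines. The frequencies here are made-up illustration values, not taken from any real corpus; the sum of freqs is an upper bound on an OR-group's result size, since docs containing both terms are counted twice.

```python
# Sketch: order OR-groups of an AND query by estimated result size.
# Freq values below are invented for illustration.
freq = {"madding": 10, "crowd": 500, "ignoble": 40, "strife": 60}
query = [("madding", "crowd"), ("ignoble", "strife")]

def plan(groups):
    """Process order: cheapest OR-group (smallest freq sum) first."""
    return sorted(groups, key=lambda g: sum(freq[t] for t in g))
```

Here (ignoble OR strife) is estimated at 100 postings vs. 510 for (madding OR crowd), so it is processed first.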
Exercise • Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Speeding up postings merges • Insert skip pointers • Say our current list of candidate docs for an AND query is 8, 13, 21. – (having done a bunch of ANDs) • We want to AND with the following postings entry: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22 • Linear scan is slow.
Augment postings with skip pointers (at indexing time) 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, . . . • At query time: as we walk the current candidate list, concurrently walk the inverted file entry - we can skip ahead – (e.g., 8, 21). • Skip size: recommend about √(list length)
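A sketch of the skip-augmented merge, using the slide's example lists. Skips are modeled as an array mapping a posting's index to the index it may jump to; the √(list length) spacing is the recommendation above.

```python
import math

def add_skips(postings):
    """Place evenly spaced skip pointers, about sqrt(L) apart:
    skips[i] is the index reachable from i, or None."""
    L = len(postings)
    step = int(math.sqrt(L)) or 1
    return [i + step if (i % step == 0 and i + step < L) else None
            for i in range(L)]

def intersect_with_skips(p1, p2):
    """Merge-intersect, using p2's skip pointers to leap over runs of
    docIDs that cannot possibly match the current candidate."""
    skips = add_skips(p2)
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            # p2[j] < p1[i]: follow skips while the skip target still
            # does not overshoot p1[i]; otherwise advance by one.
            if skips[j] is not None and p2[skips[j]] <= p1[i]:
                while skips[j] is not None and p2[skips[j]] <= p1[i]:
                    j = skips[j]
            else:
                j += 1
    return out

# Slide's example: candidates 8, 13, 21 vs. the even-number postings.
```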
Query vs. index expansion • Recall, from lecture 1: – thesauri for term equivalents – soundex for homonyms • How do we use these? – Can “expand” the query to include equivalents • Query car tyres → automobile tires – Can expand the index • Index docs containing car under automobile, as well
Query expansion • Usually do query expansion – No index blowup – Query processing slowed down • Docs frequently contain equivalences – May retrieve more junk • e.g., puma → jaguar – Need carefully controlled wordnets
Wild-card queries • mon*: find all docs containing any word beginning with “mon”. • Solution: index all k-grams occurring in any doc (any sequence of k chars). • e.g., from text “April is the cruelest month” we get the 2-grams (bigrams): $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$ – $ is a special word boundary symbol
Processing wild-cards • Query mon* can now be run as – $m AND mo AND on • But we’d get a match on moon. • Must post-filter these results against query. • Exercise: Work out the details.
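A worked sketch of the exercise's details: build a small bigram index over an illustrative vocabulary, AND the bigrams of $mon, then post-filter. The vocabulary and the moon false positive match the slide's discussion; everything else is made up for the example.

```python
from collections import defaultdict

# Illustrative vocabulary; "moon" is the slide's false positive.
vocab = ["month", "moon", "mono", "carbon", "mon"]

def bigrams(word):
    """All 2-grams of the word, with '$' marking its boundaries."""
    w = "$" + word + "$"
    return {w[i:i + 2] for i in range(len(w) - 1)}

gram_index = defaultdict(set)
for w in vocab:
    for g in bigrams(w):
        gram_index[g].add(w)

def wildcard(prefix):
    """mon* -> AND the bigrams of '$mon' ($m, mo, on), then post-filter
    candidates like 'moon' that contain the grams but not the prefix."""
    w = "$" + prefix
    grams = [w[i:i + 2] for i in range(len(w) - 1)]
    cands = set(vocab)
    for g in grams:
        cands &= gram_index[g]
    return sorted(c for c in cands if c.startswith(prefix))
```

Before the post-filter step, moon survives the bigram AND ($m, mo, on all occur in it); the startswith check removes it.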
Further wild-card refinements • Cut down on pointers by using blocks • Wild-card queries tend to have few bigrams – keep postings on disk • Exercise: given a trigram index, how do you process an arbitrary wild-card query?
Phrase search • Search for “to be or not to be” • No longer suffices to store only <term: docs> entries. • Instead store, for each term, entries – <number of docs containing term; – doc 1: position 1, position 2 … ; – doc 2: position 1, position 2 … ; – etc. >
Positional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of these docs could contain “to be or not to be”? Can compress position values/offsets as we did with docs in the last lecture.
Processing a phrase query • Extract inverted index entries for each distinct term: to, be, or, not • Merge their doc: position lists to enumerate all positions where “to be or not to be” begins. • to: – 2: 1, 17, 74, 222, 551; 4: 8, 27, 101, 429, 433; 7: 13, 23, 191; . . . • be: – 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; . . .
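The positional merge above can be sketched directly. The toy index below is hypothetical (only doc 4, with invented positions for or and not placed so the phrase occurs once); a phrase starts at position p when term i of the phrase occurs at p + i for every i.

```python
# Sketch: positional postings as {term: {doc: [positions...]}}.
# Doc/position values are illustrative, loosely following the slide.
index = {
    "to":  {4: [8, 27, 429, 433]},
    "be":  {4: [17, 191, 430, 434]},
    "or":  {4: [431]},
    "not": {4: [432]},
}

def phrase_search(terms):
    """Return (doc, start_position) pairs where the phrase begins."""
    hits = []
    # First restrict to docs containing every term of the phrase.
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])
    # Then check consecutive positions within each surviving doc.
    for d in sorted(docs):
        for p in index[terms[0]][d]:
            if all(p + i in index[t][d] for i, t in enumerate(terms)):
                hits.append((d, p))
    return hits
```

A real implementation would merge the sorted position lists rather than probe them with membership tests, but the adjacency condition is the same.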
Evaluating an IR system • What are some measures for evaluating an IR system’s performance? – Speed of indexing – Index/corpus size ratio – Speed of query processing – “Relevance” of results
Standard relevance benchmarks • TREC - National Institute of Standards and Technology (NIST) • Reuters and other benchmark sets • “Retrieval tasks” specified – sometimes as queries • Human experts mark, for each query and for each doc, “Relevant” or “Not relevant”
Precision and recall • Precision: fraction of retrieved docs that are relevant • Recall: fraction of relevant docs that are retrieved • Both can be measured as functions of the number of docs retrieved
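The two definitions above amount to one-line set computations; a sketch with invented doc IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall    = |retrieved ∩ relevant| / |relevant|."""
    hit = len(set(retrieved) & set(relevant))
    return hit / len(retrieved), hit / len(relevant)

# Hypothetical example: 4 docs retrieved, 2 of the 6 relevant ones found.
p, r = precision_recall([1, 2, 3, 4], [3, 4, 5, 6, 7, 8])
```

Here precision is 2/4 = 0.5 and recall is 2/6 ≈ 0.33, illustrating the tradeoff discussed next: retrieving everything drives recall to 1 while precision collapses.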
Tradeoff • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved – but precision usually decreases (in a good system)
Difficulties in precision/recall • Should average over large corpus/query ensembles • Need human relevance judgements • Heavily skewed by corpus/authorship
Glimpse of what’s ahead • Building indices • Term weighting and vector space queries • Clustering documents • Classifying documents • Link analysis in hypertext • Mining hypertext • Global connectivity analysis on the web • Recommendation systems and collaborative filtering • Summarization • Large enterprise issues and the real world
Resources for today’s lecture • Managing Gigabytes, Chapter 4. • Modern Information Retrieval, Chapter 3. • Princeton Wordnet – http://www.cogsci.princeton.edu/~wn/