CS 347 Lecture 2 April 9, 2001 ©Prabhakar Raghavan

Today’s topics
• Inverted index storage
  – Compressing dictionaries into memory
• Processing Boolean queries
  – Optimizing term processing
  – Skip list encoding
• Wild-card queries
• Positional/phrase queries
• Evaluating IR systems

Recall dictionary and postings files
[figure: dictionary held in memory; postings gap-encoded, on disk]

Inverted index storage
• Last time: postings compression by gap encoding
• Now: dictionary storage
  – Dictionary in main memory, postings on disk
• Tradeoffs between compression and query processing speed
  – Cascaded family of techniques
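As a quick recap of last lecture's postings compression, here is a minimal Python sketch of gap encoding: a sorted docID list is stored as differences from the previous docID (the function names are illustrative, not from the lecture):

```python
def gaps(postings):
    """Encode a sorted docID list as gaps from the previous docID."""
    prev = 0
    out = []
    for doc in postings:
        out.append(doc - prev)
        prev = doc
    return out

def ungaps(gapped):
    """Decode a gap list back into absolute docIDs by running a prefix sum."""
    docs, total = [], 0
    for g in gapped:
        total += g
        docs.append(total)
    return docs
```

The gaps are small for frequent terms, which is what makes variable-length codes on top of them pay off.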

Dictionary storage - first cut
• Array of fixed-width entries
  – 20 bytes for the term, plus 4 bytes each for freq and the postings pointer: 28 bytes/term = 14 MB.
• Allows for fast binary search into the dictionary.

Exercise
• Is binary search really a good idea?
• What’s a better alternative?

Fixed-width terms are wasteful
• Most of the bytes in the Term column are wasted - we allot 20 bytes even for 1-letter terms.
  – Still can’t handle supercalifragilisticexpialidocious.
• Average word in English: ~8 characters.
  – Written English averages ~4.5 characters: short words dominate usage.
• Store dictionary as a string of characters:
  – Hope to save up to 60% of dictionary space.

Compressing the term list
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
• Total string length = 500 K terms x 8 bytes = 4 MB
• Pointers resolve 4 M positions: log2 4 M = 22 bits ≈ 3 bytes per pointer
• Binary search these pointers
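A minimal Python sketch of this layout (class and method names are illustrative): the dictionary is one long string, one pointer marks where each term starts, and a term ends where the next one begins, so lookup is a binary search over the pointer array.

```python
class StringDictionary:
    """Dictionary stored as one long string; ptrs[i] is where term i
    starts, and term i ends where term i+1 begins."""

    def __init__(self, sorted_terms):
        self.text = "".join(sorted_terms)
        self.ptrs = []
        pos = 0
        for t in sorted_terms:
            self.ptrs.append(pos)
            pos += len(t)
        self.ptrs.append(pos)  # sentinel: end of the string

    def term_at(self, i):
        return self.text[self.ptrs[i]:self.ptrs[i + 1]]

    def lookup(self, term):
        """Binary-search the pointers; return the term's index, or -1."""
        lo, hi = 0, len(self.ptrs) - 2
        while lo <= hi:
            mid = (lo + hi) // 2
            t = self.term_at(mid)
            if t == term:
                return mid
            if t < term:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1
```

In the real structure each entry would also carry the 4-byte freq and postings pointer; the sketch keeps only the string-plus-pointers idea.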

Blocking
• Store pointers to every kth term on the string.
• Need to store term lengths (1 extra byte per term)
…. 7 systile 9 syzygetic 8 syzygial 6 syzygy 11 szaibelyite 8 szczecin 9 szomo ….
• With k = 4: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.
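The blocked layout can be sketched in Python as follows (names and the 1-byte length encoding via `chr` are illustrative): one pointer per block of k terms, each term length-prefixed so a linear scan can walk the block.

```python
import bisect

class BlockedDictionary:
    """Pointer to every k-th term; within a block each term is
    length-prefixed (1 byte) so a linear scan can step through it."""

    def __init__(self, sorted_terms, k=4):
        self.k = k
        self.text = ""
        self.ptrs = []   # one string offset per block
        self.leads = []  # first term of each block, for binary search
        for i, t in enumerate(sorted_terms):
            if i % k == 0:
                self.ptrs.append(len(self.text))
                self.leads.append(t)
            self.text += chr(len(t)) + t  # 1-byte length, then the term

    def lookup(self, term):
        """Binary-search the block leads, then scan that block linearly."""
        b = bisect.bisect_right(self.leads, term) - 1
        if b < 0:
            return -1
        pos = self.ptrs[b]
        for j in range(self.k):
            if pos >= len(self.text):
                break
            n = ord(self.text[pos])
            if self.text[pos + 1:pos + 1 + n] == term:
                return b * self.k + j
            pos += 1 + n
        return -1
```

The binary search now touches only one lead term per block, at the cost of up to k-1 extra string comparisons inside the block.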

Exercise
• Estimate the space usage (and savings compared to 9.5 MB) with blocking, for block sizes of k = 4, 8 and 16.

Impact on search
• Binary search down to 4-term block;
• Then linear search through terms in block.
• Instead of chasing 2 pointers before, now chase 0/1/2/3 - avg. of 1.5.

Extreme compression
• Using perfect hashing to store terms “within” their pointers
  – not good for vocabularies that change.
• Partition dictionary into pages
  – use B-tree on first terms of pages
  – pay a disk seek to grab each page
  – if we’re paying 1 disk seek anyway to get the postings, “only” another seek per query term.

Query optimization
• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its postings from the inverted index, then AND them together.
• Process in order of increasing freq:
  – start with smallest set, then keep cutting further.
• This is why we kept freq in the dictionary.
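A small Python sketch of this strategy (function names are illustrative): intersect the sorted postings lists with a two-pointer merge, processing the rarest term first so the intermediate result shrinks fastest.

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings_lists):
    """AND many terms: start with the shortest (rarest) list and keep
    cutting it down against the rest, in order of increasing length."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result
```

Each later intersection can only shrink the running result, which is why starting from the smallest set pays off.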

Query processing exercises
• If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
• How can we perform the AND of two postings entries without explicitly building the 0/1 term-doc incidence vector?

General query optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get freq’s for all terms.
• Estimate the size of each OR by the sum of its freq’s.
• Process in increasing order of OR sizes.
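The ordering step can be sketched in a few lines of Python (the frequency numbers in the usage below are made-up; the sum is only an estimate, an upper bound on the OR's true size, since docs can contain several of the terms):

```python
def plan_order(or_clauses, freq):
    """or_clauses: list of OR-groups, each a list of terms.
    freq: term -> document frequency.
    Estimate each OR's size by summing its terms' freqs, and
    return the clauses in increasing order of that estimate."""
    return sorted(or_clauses, key=lambda clause: sum(freq[t] for t in clause))
```

For example, with hypothetical freqs {"madding": 10, "crowd": 200, "ignoble": 5, "strife": 50}, the (ignoble OR strife) clause (estimate 55) would be processed before (madding OR crowd) (estimate 210).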

Exercise • Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Speeding up postings merges
• Insert skip pointers.
• Say our current list of candidate docs for an AND query is 8, 13, 21.
  – (having done a bunch of ANDs)
• We want to AND with the following postings entry: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22
• Linear scan is slow.

Augment postings with skip pointers (at indexing time)
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, . . .
• At query time: as we walk the current candidate list, concurrently walk the inverted file entry - we can skip ahead
  – (e.g., to 8, then to 21).
• Skip size: recommend about √(list length) evenly spaced skips.
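A Python sketch of the skip-augmented merge (layout and names are illustrative): skips are placed roughly every √L positions, the usual heuristic, and followed during the merge whenever doing so does not overshoot the docID we are looking for.

```python
import math

def add_skips(postings):
    """skips[i] = index reachable by the skip pointer at position i,
    or None; pointers are placed about every sqrt(L) positions."""
    L = len(postings)
    step = int(math.sqrt(L)) or 1
    return [i + step if i % step == 0 and i + step < L else None
            for i in range(L)]

def intersect_with_skips(p1, skips1, p2):
    """Merge p1 (carrying skip pointers) against p2."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips as long as they don't overshoot p2[j] ...
            while skips1[i] is not None and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            # ... then fall back to a single step.
            if p1[i] < p2[j]:
                i += 1
        else:
            j += 1
    return out
```

On the slide's example (candidates 8, 13, 21 against 2, 4, …, 22), the merge skips from 2 directly to 8 instead of scanning 4 and 6.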

Query vs. index expansion
• Recall, from lecture 1:
  – thesauri for term equivalents
  – soundex for homonyms
• How do we use these?
  – Can “expand” the query to include equivalents
    • Query car tyres becomes car tyres automobile tires
  – Can expand the index
    • Index docs containing car under automobile, as well

Query expansion
• Usually do query expansion
  – No index blowup
  – Query processing slowed down
    • Docs frequently contain equivalences
  – May retrieve more junk
    • puma jaguar
  – Carefully controlled wordnets

Wild-card queries
• mon*: find all docs containing any word beginning “mon”.
• Solution: index all k-grams occurring in any doc (any sequence of k chars).
• e.g., from text “April is the cruelest month” we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
  – $ is a special word boundary symbol
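Extracting the k-grams of a single term is a one-liner in Python (the function name is illustrative); the $ boundary symbol is added before slicing:

```python
def kgrams(term, k=2):
    """All k-grams of a term, with $ marking the word boundaries."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]
```

Running this over every term in a doc and taking the union of the results yields the bigram set shown above.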

Processing wild-cards
• Query mon* can now be run as
  – $m AND mo AND on
• But we’d get a match on moon.
• Must post-filter these results against the query.
• Exercise: Work out the details.
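One possible working-out of the details in Python (names and data layout are illustrative, not a prescribed solution to the exercise): build a bigram-to-terms index, AND the bigram postings for the prefix, then post-filter the false positives such as moon.

```python
def build_bigram_index(vocabulary):
    """Map each bigram (with $ boundaries) to the set of terms containing it."""
    index = {}
    for t in vocabulary:
        padded = "$" + t + "$"
        for i in range(len(padded) - 1):
            index.setdefault(padded[i:i + 2], set()).add(t)
    return index

def wildcard_prefix(prefix, bigram_index):
    """Run e.g. mon* as $m AND mo AND on over the bigram index,
    then post-filter terms that contain the bigrams but don't
    actually start with the prefix (like 'moon')."""
    grams = ["$" + prefix[0]] + [prefix[i:i + 2]
                                 for i in range(len(prefix) - 1)]
    candidates = None
    for g in grams:
        terms = bigram_index.get(g, set())
        candidates = terms if candidates is None else candidates & terms
    return sorted(t for t in candidates if t.startswith(prefix))
```

The matching terms would then be looked up in the ordinary term dictionary to fetch their postings.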

Further wild-card refinements
• Cut down on pointers by using blocks
• Wild-card queries tend to have few bigrams
  – keep postings on disk
• Exercise: given a trigram index, how do you process an arbitrary wild-card query?

Phrase search
• Search for “to be or not to be”
• No longer suffices to store only <term: docs> entries.
• Instead store, for each term, entries
  – <number of docs containing term;
  – doc 1: position 1, position 2 … ;
  – doc 2: position 1, position 2 … ;
  – etc.>

Positional index example
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
• Which of these docs could contain “to be or not to be”?
• Can compress position values/offsets as we did with docs in the last lecture.

Processing a phrase query
• Extract inverted index entries for each distinct term: to, be, or, not
• Merge their doc:position lists to enumerate all positions where “to be or not to be” begins.
• to:
  – 2: 1, 17, 74, 222, 551; 4: 8, 27, 101, 429, 433; 7: 13, 23, 191; . . .
• be:
  – 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; . . .
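The positional merge can be sketched in Python as follows (the in-memory dict-of-dicts layout is illustrative; real postings would be compressed and streamed): restrict to docs containing every term, then check for a run of consecutive positions.

```python
def phrase_matches(positional, terms):
    """positional[term] = {doc: sorted list of positions}.
    Return (doc, start_position) pairs where the terms occur
    at consecutive positions, i.e. where the phrase begins."""
    # Only docs containing every term can match.
    docs = set(positional[terms[0]])
    for t in terms[1:]:
        docs &= set(positional[t])
    hits = []
    for d in sorted(docs):
        for p in positional[terms[0]][d]:
            # Term k of the phrase must appear at position p + k.
            if all(p + k in positional[t][d] for k, t in enumerate(terms)):
                hits.append((d, p))
    return hits
```

On the to/be postings above, doc 4 has to at 429 and 433 with be at 430 and 434, so "to be" begins at positions 429 and 433 of doc 4.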

Evaluating an IR system
• What are some measures for evaluating an IR system’s performance?
  – Speed of indexing
  – Index/corpus size ratio
  – Speed of query processing
  – “Relevance” of results

Standard relevance benchmarks
• TREC - National Institute of Standards and Technology (NIST)
• Reuters and other benchmark sets
• “Retrieval tasks” specified
  – sometimes as queries
• Human experts mark, for each query and for each doc, “Relevant” or “Not relevant”

Precision and recall
• Precision: fraction of retrieved docs that are relevant
• Recall: fraction of relevant docs that are retrieved
• Both can be measured as functions of the number of docs retrieved
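These two definitions translate directly into Python (illustrative function, operating on docID collections):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall    = |retrieved ∩ relevant| / |relevant|."""
    hit = len(set(retrieved) & set(relevant))
    return hit / len(retrieved), hit / len(relevant)
```

E.g. retrieving 5 docs of which 3 are relevant, out of 6 relevant docs in total, gives precision 0.6 and recall 0.5.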

Tradeoff
• Can get high recall (but low precision) by retrieving all docs for all queries!
• Recall is a non-decreasing function of the number of docs retrieved
  – but precision usually decreases (in a good system)

Difficulties in precision/recall
• Should average over large corpus/query ensembles
• Need human relevance judgements
• Heavily skewed by corpus/authorship

Glimpse of what’s ahead
• Building indices
• Term weighting and vector space queries
• Clustering documents
• Classifying documents
• Link analysis in hypertext
• Mining hypertext
• Global connectivity analysis on the web
• Recommendation systems and collaborative filtering
• Summarization
• Large enterprise issues and the real world

Resources for today’s lecture
• Managing Gigabytes, Chapter 4.
• Modern Information Retrieval, Chapter 3.
• Princeton WordNet
  – http://www.cogsci.princeton.edu/~wn/
