Language Model Ranking

Goal

The goal of this project is to build a mini search engine over a movie folder (which contains 2000 files/documents of movie reviews) using a language model.

"Instead of overtly modeling the probability P(R=1|q,d) of relevance of a document d to query q, as in the traditional probabilistic approach to IR, the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md)." [p. 237, Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze © 2008 Cambridge University Press.]

Intuition
Good queries contain words that are likely to appear in a relevant document.
Key Idea
The language modeling approach to IR models this idea directly: a document is a good match for a query if the document model is likely to generate the query, which in turn happens if the document contains the query words often. The basic language modeling approach builds a probabilistic language model Md from each document d and ranks documents by the probability of that model generating the query, P(q|Md).
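
As a rough illustration of the key idea, the sketch below scores a query against each document with an unsmoothed unigram model, where P(q|Md) is the product of each query term's relative frequency in the document. The function name `query_likelihood` and the two toy documents are assumptions made for this example and are not taken from the project code. Without smoothing, any query word missing from a document drives that document's score to zero.

```python
from collections import Counter


def query_likelihood(query_terms, doc_terms):
    """Unsmoothed unigram estimate of P(q | Md): product of tf(t, d) / |d|."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    if total == 0:
        return 0.0
    prob = 1.0
    for term in query_terms:
        # Counter returns 0 for unseen terms, so a missing word zeroes the score.
        prob *= counts[term] / total
    return prob


# Toy documents (hypothetical), already split into terms.
docs = {
    "d1": "a great movie with a great cast".split(),
    "d2": "a dull movie".split(),
}
query = "great movie".split()

scores = {name: query_likelihood(query, terms) for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Here d1 ranks above d2 because it contains both query words, while d2 never mentions "great".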

Reference
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze © 2008 Cambridge University Press.

Libraries

All required libraries are listed in requirements.txt.
To install them, run the following commands.

First, make the bash script executable:

> chmod +x download.sh 

Then run the script:

> ./download.sh

Parts

There are two parts in this project.
The first part is create_index, which takes the input source directory (containing the files/documents) and collects the statistics needed for the later ranking computations.
The second part is lm_query (language model query), which uses the index statistics (the language model) collected in the first part to perform the language model ranking.
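
The two parts could fit together roughly as in the sketch below. It is only an illustration under several assumptions: the signatures of `create_index` and `lm_query`, the layout of the index dictionary, the whitespace tokenization, and the use of Jelinek-Mercer smoothing with `lam=0.5` are all hypothetical and may differ from the actual project. For brevity it indexes an in-memory mapping of file names to text instead of reading a source directory.

```python
import math
from collections import Counter


def create_index(documents):
    """Collect ranking statistics from a {name: text} mapping (assumed input)."""
    doc_counts = {
        name: Counter(text.lower().split()) for name, text in documents.items()
    }
    doc_lengths = {name: sum(c.values()) for name, c in doc_counts.items()}
    collection_counts = Counter()
    for counts in doc_counts.values():
        collection_counts.update(counts)
    return {
        "doc_counts": doc_counts,
        "doc_lengths": doc_lengths,
        "collection_counts": collection_counts,
        "collection_length": sum(doc_lengths.values()),
    }


def lm_query(index, query, lam=0.5):
    """Rank documents by log P(q | Md), mixing document and collection models."""
    terms = query.lower().split()
    total = index["collection_length"]
    scores = {}
    for name, counts in index["doc_counts"].items():
        length = index["doc_lengths"][name]
        score = 0.0
        for term in terms:
            p_doc = counts[term] / length if length else 0.0
            p_coll = index["collection_counts"][term] / total if total else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            if p == 0.0:
                # Term appears nowhere in the collection.
                score = float("-inf")
                break
            score += math.log(p)
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-file collection standing in for the movie folder.
reviews = {
    "pos.txt": "a great movie with a great cast",
    "neg.txt": "a dull movie with a weak plot",
}
index = create_index(reviews)
results = lm_query(index, "great cast")
print(results)
```

Keeping the collection-wide counts in the index lets the query step smooth each document model without rereading the files, which is one reason to separate indexing from querying.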