Lecture “Information Retrieval and Web Search Engines” (SS 2010)

Information
Classification: 
Master Informatik, Master Wirtschaftsinformatik
Credits: 
4
Exam: 
Oral (scoring 50% of total exercise points is required to take final exam)
Regular Dates: 
Thursdays, 9.45–12.00; first lecture on April 8, 2010
Informatikzentrum, Mühlenpfordtstraße 23, room 161
Contents
Contents: 

This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.

Exam: Please contact our secretary Regine Dalkιran to arrange an examination date.

Materials

Lectures

Note that lectures and exercises will not be separated. Exercise parts will be interlaced into the lecture whereever adequate.

  Date Contents References Materials
1  8.4.2010 Introduction and fundamental notions (Bush, 1945) Slides
Print slides
Video
Homework 1
reuters-21578.mat
Homework update: Instead of trying to download MATLAB from Mathworks' website, please send a mail to Joachim Selke; he will send you a download link.
2 15.4.2010 Fuzzy retrieval model
Coordination level matching
Vector space retrieval model
(Zadeh, 1965)
(Zadeh, 1978)
(Ogawa et al., 1991)
(Salton et al., 1975)
(Luhn, 1961)
(Spärck Jones, 1972)
(Robertson and Spärck Jones, 1976)
(Salton et al., 1983)
Slides
Print slides
Video
3 22.4.2010 Indexing (Porter, 1980)
(Zobel and Moffat, 2006)
Slides
Print slides
Video
Homework 2
4 29.4.2010 Probabilistic retrieval models (Robertson, 1977)
(Maron and Kuhns, 1960)
(van Rijsbergen, 1977)
(Croft and Harper, 1979)
(Greiff, 1998)
Slides
Print slides
No video (recording is broken, sorry)
5  6.5.2010 Latent Semantic Indexing (Salton and Buckley, 1988)
(Deerwester et al., 1990)
(Berry et al., 1995)
(Landauer and Dumais, 1997)
Slides
Print slides
No video (recording is broken again, sorry)
Sample solutions to assignment 2
Homework 3
Homework update: Please use the stemmed version of the Reuters term–document matrix.
Second update: Due date changed to June 3.
6 20.5.2010 Document clustering   Slides
Print slides
Video
MATLAB example: K-means clustering
MATLAB example: Agglomerative clustering
7  3.6.2010 Language models
Retrieval evaluation
(Saracevic, 2007) Slides
Print slides
Video
Sample solutions to assignment 3
Homework 4
8 10.6.2010 Relevance feedback
Classification
(Rocchio, 1971) Slides (updated on July 14)
Print slides
Video
9 17.6.2010 Support vector machines (Burges, 1998)
(Ivanciuc, 2007)
(Joachims, 1998)
(Joachims, 2002)
Slides
Print slides
Video
Homework 5
reuters-21578-stemmed-with-topics.mat
Sample solutions to assignment 4
10 24.6.2010 Introduction to Web retrieval (Fetterly et al., 2004) Slides
Print slides
Video
11  1.7.2010 Web crawling   Slides
Print slides
No video (recording is broken, sorry)
12  8.7.2010 Link analysis (Boyack et al., 2005)
(Page et al., 1999)
(Kleinberg, 1999)
Slides
Print slides
Video
Homework 6
Sample solutions to assignment 5
13 15.7.2010 Miscellaneous   Slides
Print slides
Video

Books

(Manning et al., 2008)
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ http ]
(Belew, 2000)
Richard K. Belew. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 2000. [ http ]
(Baeza-Yates and Ribeiro-Neto, 1999)
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. [ http ]
(van Rijsbergen, 1979)
Cornelis Joost van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. [ .html ]

Further Readings

(Bush, 1945)
Vannevar Bush. As we may think. The Atlantic Monthly, 176(1):101-108, 1945.
(Zadeh, 1965)
Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8(3):338-353, 1965. [ DOI ]
(Zadeh, 1978)
Lotfi A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3-28, 1978. [ DOI ]
(Ogawa et al., 1991)
Yasushi Ogawa, Tetsuya Morita, and Kiyohiko Kobayashi. A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39(2):163-179, 1991. [ DOI ]
(Salton et al., 1975)
Gerard Salton, Chung-Shu Yang, and Anita Wong. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975. [ DOI ]
(Luhn, 1961)
Hans Peter Luhn. The automatic derivation of information retrieval encodements from machine-readable texts. Volume 3, Part 2 of Information Retrieval and Machine Translation, pages 1021-1028. Interscience Publishers, 1961.
(Spärck Jones, 1972)
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972. [ DOI
(Robertson and Spärck Jones, 1976)
Stephen E. Robertson and Karen Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
(Salton et al., 1983)
Gerard Salton, Edward A. Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022-1036, 1983. [ DOI ]
(Robertson, 1977)
Stephen E. Robertson. The probabilistic ranking principle in IR. The Journal of Documentation, 33(4):294-304, 1977.
(Maron and Kuhns, 1960)
Melvin E. Maron and John L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216-244, 1960. [ DOI ]
(van Rijsbergen, 1977)
Cornelis Joost van Rijsbergen. A theoretical basis for the use of co-occurrence data in information retrieval. The Journal of Documentation, 33(2):106-119, 1977. [ http ]
(Croft and Harper, 1979)
W. Bruce Croft und David J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, 1979.
(Greiff, 1998)
Warren R. Greiff. A theory of term weighting based on exploratory data analysis. In W. Bruce Croft, Alistair Moffat, Cornelis Joost van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), pages 11-19. ACM Press, 1998. [ DOI ]
(Salton and Buckley, 1988)
Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988. [ DOI ]
(Deerwester et al., 1990)
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990. [ DOI ]
(Berry et al., 1995)
Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995. [ DOI ]
(Landauer and Dumais, 1997)
Thomas K. Landauer and Susan T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997. [ DOI ]
(Saracevic, 2007)
Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13):1915-1933, 2007. [ DOI ]
(Rocchio, 1971)
Joseph J. Rocchio, Jr. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice Hall, 1971.
(Burges, 1998)
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [ DOI ]
(Ivanciuc, 2007)
Ovidiu Ivanciuc. Applications of support vector machines in chemistry. In Kenneth B. Lipkowitz, Thomas R. Cundari, and Donald B. Boyd, editors, Reviews in Computational Chemistry, volume 23, pages 291-400. Wiley, 2007. [ .pdf ]
(Joachims, 1998)
Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML 1998), volume 1398 of Lecture Notes in Artificial Intelligence, pages 137-142. Springer, 1998. [ DOI ]
(Joachims, 2002)
Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002), pages 133-142. ACM Press, 2002. [ DOI ]
(Porter, 1980)
Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
(Zobel and Moffat, 2006)
Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006. [ DOI ]
(Fetterly et al., 2004)
Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of Web pages. Software: Practice and Experience, 34(2):231-237, 2004. [ DOI ]
(Boyack et al., 2005)
Kevin W. Boyack, Richard Klavans, and Katy Börner. Mapping the backbone of science Scientometrics, 64(3):351-374, 2005. [ DOI ]
(Page et al., 1999)
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.
(Kleinberg, 1999)
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. [ DOI ]