Information Retrieval and Text Mining
Full course description
Today’s search engines let us find the needle in the haystack much more easily than before. But how do you find out what the needle looks like and where the haystack is? That is exactly the problem we will discuss in this course. An important difference is that standard information retrieval (search) techniques require the user to know what he or she is looking for, while text mining attempts to discover information that is not known beforehand. This is very relevant, for example, in criminal investigations, legal discovery, (business) intelligence, sentiment and emotion mining, and clinical research. Text mining generally refers to the process of extracting interesting and non-trivial information and knowledge from unstructured text. It draws on several computer science disciplines with a strong orientation towards artificial intelligence, including but not limited to information retrieval (building a search engine), statistical pattern recognition, natural language processing, information extraction, clustering and different methods of machine learning (including deep learning). Ultimately, these are integrated using advanced data visualization and chatbots to make the search experience easier and better.
In this course we will also discuss ethical aspects of using Artificial Intelligence for the above tasks, including the need for eXplainable AI (XAI), training deep-learning large language models in a more energy-efficient way, and several ethical problems that may arise from bias and from legal, regulatory and privacy challenges.
This course (IRTM) is closely related to the course Advanced Natural Language Processing (ANLP). In the ANLP course, the focus is more on advanced methods and architectures for complex natural language tasks such as machine translation and Q&A systems. IRTM focuses more on building search engines and using text analytics to improve the search experience. In the IRTM course, we will use a number of the architectures that are discussed in more detail in ANLP. The overlap between the two courses is kept to a minimum. There is no need to follow the courses in a specific order.
Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press, 2008. Available in bookstores and online at http://informationretrieval.org.