Machine Learning and Multivariate Statistics
Full course description
Machine Learning (ML) is the scientific study of algorithms and (multivariate) statistical methods that computer systems use to perform a task by relying on patterns in the available data and on inference. In their most general formulation, machine learning algorithms build a mathematical model in order to make predictions or decisions on new data. ML is related to computational statistics and uses methods and theories drawn from mathematical optimization. Generally, ML tasks are classified into several broad categories.
In supervised learning, a mathematical model is built from a set of data that contains both the inputs and the desired outputs (“training data”). Classification algorithms and regression algorithms are types of supervised learning. Classification algorithms are used when the outputs are restricted to a limited set of values. Regression algorithms are used when the output may take any value within a range.
In unsupervised learning, a setting sometimes also referred to as data mining, a mathematical model is built from a set of data that contains only inputs and no desired output labels. Unsupervised learning algorithms are used to find structure in the data, such as groupings or clusters of data points. They can discover patterns in the data and group the inputs into categories, as in feature learning. Dimensionality reduction is the process of reducing the number of "features", or inputs, in a set of data.
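The difference between supervised and unsupervised learning can be illustrated with a minimal sketch. It is not part of the course material: the data values, the function names (fit_centroids, classify, k_means_1d) and the choice of a nearest-centroid classifier are assumptions made purely for illustration. The supervised part learns from inputs and their labels; the unsupervised part sees only the inputs.

```python
# Toy 1-D data: two groups of measurements (hypothetical values).
inputs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # desired outputs


def fit_centroids(xs, ys):
    """Supervised: learn one mean (centroid) per labeled class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def classify(x, centroids):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def k_means_1d(xs, centers, n_iter=10):
    """Unsupervised: k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster received no points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


centroids = fit_centroids(inputs, labels)          # uses the labels
prediction = classify(4.5, centroids)              # classification of a new input
cluster_centers = k_means_1d(inputs, [0.0, 6.0])   # never sees the labels
print(prediction, cluster_centers)
```

In this toy case the cluster centers found without labels coincide with the class centroids learned from labels, which is only to be expected when the groups are well separated.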
In Systems Biology, ML algorithms can help in several ways. In data-rich scenarios (e.g. omics data), ML can help identify the pieces of information that are relevant to a given question and make sense of them (a "data mining" issue). Additionally, ML algorithms make it possible to understand how biological systems behave as the result of the integration of, and interaction between, many individual components that can be monitored simultaneously. At the same time, bioinformatics and systems biology have already prompted significant new developments of general interest in ML, for example in the context of learning with structured data, graph inference, semi-supervised learning, system identification, and novel combinations of optimization and learning algorithms.
In this course, ML will be applied to the analysis of omics data and to neuroscientific investigations that use brain signals to understand how the human brain processes information (visual, auditory, tactile). You will study basic theoretical concepts of multivariate statistics and optimization that are fundamental to all ML algorithms. The problems of supervised learning (classification and regression) and unsupervised learning (clustering and dimensionality reduction) will be introduced formally, together with common algorithmic solutions (e.g. Naïve Bayes, Support Vector Machines, Ridge Regression, [deep] neural networks, K-Means clustering). Practical examples of applications will be discussed, highlighting best practices in the use of ML as well as in the communication of the results obtained. Finally, you will be tasked with using ML on available data to answer a specific question and with reporting your results in the form of code and a written report.
Course objectives
During this course, you will acquire the fundamentals of machine learning approaches and their application to systems biology. By relating theoretical knowledge to practical examples, you will be encouraged to develop a critical view of ML applications and the way they are reported. You will train theoretical and analytical skills, as well as verbal and written communication skills.
The course specific intended learning outcomes (ILOs) are:
- The ability to describe fundamental concepts in machine learning.
- The ability to compare, on theoretical grounds, algorithms designed for specific machine learning problems.
- The ability to apply the acquired theoretical knowledge to solve practical examples.
- The ability to practice the scientific method by devising a machine learning analysis strategy to solve a problem within the systems biology field.
- The ability to justify choices in the methodological approach and to critically assess its outcomes.
- The ability to communicate results effectively in writing and to share the analysis methods and pipeline, with particular attention to fostering reproducibility of the results.
- The ability to critically evaluate the usage of machine learning in the biological literature and identify potential issues.
Recommended reading
Mandatory Literature:
Pattern Classification – R.O. Duda, P.E. Hart, D.G. Stork; John Wiley & Sons (2012)
Additional Literature:
Various articles for the journal clubs will be selected by the tutors and are available through the university library subscriptions.