• Let's equip the network with a mechanism to decide when to stop processing, and prefer networks that stop early. Let \(z\) indicate the number of layers to use; then $$H_l(x) = [z \le l]\, F_l(x) + x$$ and thus $$p(y|x,z) = \text{Categorical}(y \mid \pi(x, z)),$$ where \(\pi(x, z)\) is a residual network in which \(z\) controls when to stop processing \(x\). We choose the prior on \(z\) so that it prefers early stopping.
• How do we backpropagate through samples \(\theta_i\)? Gradient-based optimization in discrete models is hard, so we invoke the Central Limit Theorem and turn the model into a continuous one.
• Consider a model with continuous noise on the weights: $$ q(\theta_i \mid \Lambda) = \mathcal{N}(\theta_i \mid \mu_i(\Lambda), \alpha_i(\Lambda) \mu^2_i(\Lambda)) $$
• Neural networks have lots of parameters, so surely there is some redundancy in them. Let's take a prior \(p(\theta)\) that encourages large \(\alpha\).
• A large \(\alpha_i\) would imply that weight \(\theta_i\) is unbounded noise that corrupts predictions. Such a weight won't be doing anything useful, so it should be zeroed out by setting \(\mu_i(\Lambda) = 0\); the weight \(\theta_i\) then effectively turns into a deterministic 0.

This course is being taught as part of Master Datascience Paris Saclay. The notebooks are in the github repository; these notebooks only work with keras and tensorflow.

• LeCun, Yann, et al. "Backpropagation applied to handwritten zip code recognition."
• "Learning representations by back-propagating errors." Cognitive modeling 5.3 (1988): 1.
• Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. He has spoken and written a lot about what deep learning is, and is a good place to start.
• Deep Learning, by Y. LeCun et al., Nature 2015.
• 1993: Nvidia started…

"Deep Learning" systems, typified by deep neural networks, are increasingly taking over all AI tasks, ranging from language understanding, speech and image recognition, to machine translation, planning, and even game playing and autonomous driving. The course is Berkeley's current offering of deep learning. Training the model is just one part of shipping a Deep Learning project. The Deep Learning Lecture Series 2020 is a collaboration between DeepMind and the UCL Centre for Artificial Intelligence.

• We assume a two-phase data-generating process: first, we decide upon the high-level abstract features of the datum, \(z \sim p(z)\); then we unpack these features into an actual observable \(x\) using the (learnable) generator \(f_\theta\).
• This leads to the model \(p(x, z) = p(x|z) p(z)\), where $$ p(x|z) = \prod_{d=1}^D p(x_d \mid f_\theta(z)), \quad\quad p(z) = \mathcal{N}(z \mid 0, I) $$ and \(f_\theta\) is some neural network. Once we learn the generator, we can sample new \(x\) by passing samples of \(z\) through it.
• We would like to maximize the log-marginal density of the observed variables, \(\log p(x)\), but the integral is intractable: \( \log p(x) = \log \int p(x|z) p(z)\, dz \).
• So we introduce an approximate posterior \(q(z|x) = \mathcal{N}(z \mid \mu_\Lambda(x), \Sigma_\Lambda(x))\), where \(\mu_\Lambda, \Sigma_\Lambda\) are produced by an auxiliary inference network from the observation \(x\).
• Invoking the ELBO, we obtain the objective (a minimal sketch follows below): $$ \tfrac{1}{N} \sum_{n=1}^N \left[ \mathbb{E}_{q(z_n|x_n)} \log p(x_n \mid z_n) - \text{KL}(q(z_n|x_n)\,\|\,p(z_n)) \right] \to \max_\Lambda $$
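A minimal sketch of this VAE objective, assuming a Bernoulli decoder and a diagonal Gaussian \(q(z|x)\). It uses keras and tensorflow since the course notebooks do, but the layer sizes, variable names, and the 784-dimensional input are illustrative assumptions, not the course's lab code.

```python
# Single-sample Monte Carlo estimate of the ELBO for a batch x (illustrative sketch).
import tensorflow as tf

latent_dim = 2
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2 * latent_dim),        # outputs [mu, log_var] of q(z|x)
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784),                   # Bernoulli logits of p(x_d | f_theta(z))
])

def elbo(x):
    mu, log_var = tf.split(encoder(x), 2, axis=-1)
    eps = tf.random.normal(tf.shape(mu))          # reparameterization: z = mu + sigma * eps
    z = mu + tf.exp(0.5 * log_var) * eps
    logits = decoder(z)
    # E_q log p(x|z): sum of per-pixel Bernoulli log-likelihoods
    log_px_z = -tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits), axis=-1)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = 0.5 * tf.reduce_sum(tf.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return tf.reduce_mean(log_px_z - kl)          # maximize this w.r.t. both networks
```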
Bayesian methods can:
• impose useful priors on neural networks, helping discover solutions of a special form;
• provide better predictions;
• provide neural networks with uncertainty estimates (not covered here).
Conversely, neural networks help us make Bayesian inference more efficient. This area uses a lot of math and is an active area of research.

All the code in this repository is made available under the MIT license unless otherwise noted. The course covers the basics of Deep Learning…

• Probability theory is a great tool to reason about uncertainty. Bayesians quantify subjective uncertainty; frequentists quantify inherent randomness in the long run. People seem to interpret probability as beliefs, and hence are Bayesians.
• We formulate our prior beliefs about how the \( x \) might be generated.
• We collect some data of already generated \( x \): $$ \mathcal{D}_\text{train} = (x_1, ..., x_N) $$
• We update our beliefs regarding what kind of data exists by incorporating the collected data.
• We can now make predictions about unseen data, and collect some more data to improve our beliefs.

• Can we always compute the posterior exactly? Often we can't, so we use approximate posteriors; fortunately, we don't need the exact true posterior: $$ \text{KL}(q(\theta | \Lambda) \,\|\, p(\theta | \mathcal{D})) = \log p(\mathcal{D}) - \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)} $$
• Hence we seek parameters \(\Lambda_*\) maximizing the following objective (the ELBO): $$ \Lambda_* = \text{argmax}_\Lambda \left[ \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} = \mathbb{E}_{q(\theta|\Lambda)} \log p(\mathcal{D}|\theta) - \text{KL}(q(\theta|\Lambda)\,\|\,p(\theta)) \right]$$
• We can't compute the posterior predictive analytically either, but we can sample from \(q\) to get Monte Carlo estimates of the approximate posterior predictive distribution (a short sketch follows below): $$ q(y \mid x, \mathcal{D}) \approx \hat{q}(y|x, \mathcal{D}) = \frac{1}{M} \sum_{m=1}^M p(y \mid x, \theta^m), \quad\quad \theta^m \sim q(\theta \mid \Lambda_*) $$
• Recall the objective for variational inference: $$ \mathcal{L}(\Lambda) = \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} \to \max_{\Lambda} $$ We'll be using a well-known optimization method, stochastic gradient ascent, so we need a (stochastic) gradient \(\hat{g}\) of \(\mathcal{L}(\Lambda)\) s.t. \(\mathbb{E}\,\hat{g} = \nabla_\Lambda \mathcal{L}(\Lambda)\).
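To make the Monte Carlo posterior predictive above concrete, here is a minimal sketch. `sample_theta` and `predict_proba` are hypothetical stand-ins (not the course's code) for a sampler from \(q(\theta \mid \Lambda_*)\) and the network's class-probability output.

```python
# q(y | x, D) ≈ (1/M) Σ_m p(y | x, θ^m),  θ^m ~ q(θ | Λ*)
import numpy as np

def mc_posterior_predictive(x, sample_theta, predict_proba, M=100):
    """Average the class probabilities over M sampled weight vectors."""
    probs = [predict_proba(x, sample_theta()) for _ in range(M)]
    return np.mean(probs, axis=0)
```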
In addition to the lectures and programming assignments, you will also watch exclusive interviews with many Deep Learning leaders. Olivier Grisel, software engineer at Inria. We thank the Orange-Keyrus-Thalès chair for supporting this class. CNNs are the current state-of-the-art architecture for medical image analysis. "We're not really just solving a staining problem, we're also solving a save-the-tissue problem," he said.

Seriously though, it's just formal language; not much of the actual math is involved. ("We don't need no Bayes, we already learned a lot without it.")

• We'll assume random variables have densities and are described by them: \(p(X=x)\) (\(p(x)\) for short) is the probability density function, and \(\text{Pr}[X \in A] = \int_{A} p(X=x)\, dx\) gives the distribution.
• In general, several random variables \(X_1, ..., X_N\) have a joint density \(p(x_1, ..., x_N)\). It describes the joint probability $$\text{Pr}(X_1 \in A_1, ..., X_N \in A_N) = \int_{A_1} \dots \int_{A_N} p(x_1, ..., x_N)\, dx_N \dots dx_1 $$
• If (and only if) the random variables are independent, the joint density is just the product of the individual densities.
• Vector random variables are just a bunch of scalar random variables; for two or more random variables you should be considering their joint distribution.
• \(\mathbb{E}_{p(x)} X = \int x\, p(x)\, dx\) – the expected value. Expectation is linear: \( \mathbb{E} [\alpha X + \beta Y] = \alpha \mathbb{E} X + \beta \mathbb{E} Y \). The variance is \( \mathbb{V} X = \mathbb{E} [X^2] - (\mathbb{E} X)^2 = \mathbb{E}(X - \mathbb{E} X)^2 \).
• \(X\) is said to be Uniformly distributed over \((a, b)\) (denoted \(X \sim U(a, b)\)) if its probability density function is $$ p(x) = \begin{cases} \tfrac{1}{b-a}, & a < x < b \\ 0, &\text{otherwise} \end{cases} \quad\quad \mathbb{E} U = \frac{a+b}{2} \quad\quad \mathbb{V} U = \frac{(b-a)^2}{12} $$
• \(X\) is called a Multivariate Gaussian (Normal) random vector with mean \(\mu \in \mathbb{R}^n\) and positive-definite covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\) (denoted \(X \sim \mathcal{N}(\mu, \Sigma)\)) if its joint probability density function is $$ p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) $$
• \(X\) is said to be Categorically distributed with probabilities \(\pi_1, \dots, \pi_K\) (summing to one) if $$ p(X = k) = \pi_k \Leftrightarrow p(x) = \prod_{k=1}^K \pi_k^{[x = k]} $$
• \(X\) is called a Bernoulli random variable with probability (of success) \(\pi \in [0, 1]\) (denoted \(X \sim \text{Bern}(\pi)\)) if its probability mass function is $$ p(X = 1) = \pi \Leftrightarrow p(x) = \pi^{x} (1-\pi)^{1-x} $$ (yes, this is a special case of the categorical distribution).
• A joint density on \(x\) and \(y\) defines the marginal densities \(p(x)\) and \(p(y)\). Knowing the value of \(y\) can reduce uncertainty about \(x\), expressed via the conditional density \(p(x|y)\); thus $$ p(x, y) = p(y|x) p(x) = p(x|y) p(y) $$
• Suppose we have two jointly Gaussian random variables \(X\) and \(Y\): $$(X, Y) \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_x \\ \mu_y \end{array} \right], \left[\begin{array}{cc}\sigma^2_x & \rho_{xy} \\ \rho_{xy} & \sigma^2_y\end{array}\right]\right)$$ Then one can show that the marginals and conditionals are also Gaussian (a numerical check follows below): $$ p(x) = \mathcal{N}(x \mid \mu_x, \sigma^2_x), \quad p(y) = \mathcal{N}(y \mid \mu_y, \sigma^2_y), \quad p(x|y) = \mathcal{N}\left(x \,\Big|\, \mu_x + \tfrac{\rho_{xy}}{\sigma_y^2} (y - \mu_y),\; \sigma^2_x - \tfrac{\rho_{xy}^2}{\sigma_y^2}\right)$$
• If we're interested in \(y\), then these distributions are called the prior \(p(y)\), the likelihood \(p(x|y)\), the posterior \(p(y|x)\), and the evidence \(p(x)\).
• We assume some data-generating model $$p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta),$$ we obtain some observations \( \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \), and we seek to make predictions regarding \(y\) for previously unseen \(x\), having observed the training set \(\mathcal{D}\).
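A quick numerical sanity check of the bivariate Gaussian conditioning formula above. This is an illustrative script with arbitrary parameter values, not part of the course labs.

```python
# Compare the closed-form conditional p(x|y) with a brute-force estimate.
import numpy as np

mu_x, mu_y = 1.0, -2.0
sigma2_x, sigma2_y, rho_xy = 2.0, 1.5, 0.8   # rho_xy is the covariance of X and Y

rng = np.random.default_rng(0)
cov = np.array([[sigma2_x, rho_xy], [rho_xy, sigma2_y]])
samples = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000)

y0 = 0.5
sel = samples[np.abs(samples[:, 1] - y0) < 0.05, 0]   # x-samples whose y is close to y0

print(sel.mean(), mu_x + rho_xy / sigma2_y * (y0 - mu_y))   # conditional mean
print(sel.var(), sigma2_x - rho_xy**2 / sigma2_y)           # conditional variance
```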
• 2012 IPAM Summer School on deep learning and representation learning: videos and slides at IPAM.
• 2014 International Conference on Learning Representations (ICLR 2014).
• Different types of learning (supervised, unsupervised, reinforcement); dimensions of a learning system (different types of feedback, representation, use of knowledge).

What if we want to tune the dropout rates \(p\)? Consider dropout-style noise on a weight: with some fixed probability \(p\) it is 0, and with probability \(1-p\) it is some learnable value \(\Lambda_i\). Then, for some prior \(p(\theta)\), our optimization objective is $$ \mathbb{E}_{q(\theta|\Lambda)} \sum_{n=1}^N \log p(y_n | x_n, \theta) \to \max_{\Lambda}, $$ where the KL term is missing due to the model choice. There is no need to take special care about differentiating through the samples. It turns out these are Bayesian approximate inference procedures (a test-time sketch follows at the end of this part).

UC Berkeley has done a lot of remarkable work on deep learning, including the famous Caffe deep learning framework. Deep Learning algorithms aim to learn feature hierarchies, with features at higher levels in the hierarchy formed by the composition of lower-level features. This automatic feature learning has been demonstrated to uncover underlying structure in the data, leading to state-of-the-art results in vision, speech, and rapidly in other domains as well. We will be giving a two-day short course on Designing Efficient Deep Learning Systems at MIT in Cambridge, MA on July 20-21, 2020. However, while deep learning has proven itself to be extremely powerful, most of today's most successful deep learning systems suffer from a number of important limitations, ranging from the requirement for enormous training data sets to lack of interpretability to vulnerability to …

• We have a continuous density \(q(\theta_i \mid \mu_i(\Lambda), \sigma_i^2(\Lambda))\) and would like to compute the gradient of $$ \mathbb{E}_{q(\theta|\Lambda)} \log \frac{p(\mathcal{D}|\theta)\, p(\theta)}{q(\theta|\Lambda)}. $$ The gradient has two parts: the inner part (expected gradients of \(\log \frac{p(\mathcal{D}|\theta) p(\theta)}{q(\theta|\Lambda)} \)) and the sampling part (gradients through the samples \( \theta \sim q(\theta|\Lambda) \)).
• With the reparameterization \(\theta = \mu(\Lambda) + \varepsilon \sigma(\Lambda)\), \(\varepsilon \sim \mathcal{N}(0, 1)\), the objective becomes $$ \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} \log \tfrac{p(\mathcal{D}, \mu + \varepsilon \sigma)}{q(\mu + \varepsilon \sigma \mid \Lambda)}, $$ or equivalently, splitting out the KL term, $$ \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} \left[\sum_{n=1}^N \log p(y_n \mid x_n, \theta=\mu(\Lambda) + \varepsilon \sigma(\Lambda)) \right] - \text{KL}(q(\theta|\Lambda) \,\|\, p(\theta)). $$
• This amounts to training a neural network with a special kind of noise on the weights, where the magnitude of the noise is encouraged to increase; it zeroes out unnecessary weights completely.
• Essentially, we are training a whole ensemble of neural networks. Actually using the ensemble is costly (\(k\) times slower for an ensemble of \(k\) models), but a single network (a single-sample ensemble) also works. (Variational Dropout Sparsifies Deep Neural Networks, D. Molchanov, A. Ashukha, D. Vetrov, ICML 2017.)
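A test-time sketch of the ensemble view above using plain Keras dropout: averaging stochastic forward passes approximates the posterior predictive, while a single deterministic pass is the cheap alternative. The architecture and dropout rate are arbitrary illustrations, not the course's lab code.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def mc_dropout_predict(x, M=50):
    """Keep dropout active at test time and average M stochastic passes."""
    probs = np.stack([model(x, training=True).numpy() for _ in range(M)])
    return probs.mean(axis=0)    # probs.std(axis=0) gives a rough uncertainty signal

def single_pass_predict(x):
    """The usual deterministic pass (dropout off); often good enough in practice."""
    return model(x, training=False).numpy()
```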
Deep Learning for Whole Slide Image Analysis: An Overview. The widespread adoption of whole slide imaging has increased the demand for effective and efficient gigapixel image analysis. Computationally stained slides could help automate the time-consuming process of slide staining, but Shah said the ability to de-stain and preserve images for future use is the real advantage of the deep learning techniques.

Deep learning is a sub-field of machine learning dealing with algorithms inspired by the structure and function of the brain, called artificial neural networks. Deep learning algorithms are similar in structure to the nervous system, where each neuron is connected to the others and passes information along; in other words, it mirrors the functioning of our brains. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction, and it is driving significant advancements across industries, enterprises, and our everyday lives.

Learn Deep Learning from deeplearning.ai: Deep Learning is one of the most highly sought-after skills in tech, and we will help you become good at Deep Learning. The Deep Learning Handbook is a project in progress to help study the Deep Learning book by Goodfellow et al. Goodfellow's masterpiece is a vibrant and precious resource to introduce the booming topic of deep learning; however, many found the accompanying video lectures, slides, and exercises not pedagogic enough for a fresh starter. We plan to offer lecture slides accompanying all chapters of this book.

Deep Learning course: lecture slides and lab notebooks. The slides and lectures are posted online, and the course is taught by three fantastic instructors. Note: press "P" to display the presenter's notes, which include some comments and additional references. lectures-labs is maintained by m2dsupsdlclass. Topics covered:
• Convolutional Neural Networks for Image Classification
• Deep Learning for Object Detection and Image Segmentation
• Sequence to sequence, attention and memory
• Expressivity, Optimization and Generalization
• Imbalanced classification and metric learning
• Unsupervised Deep Learning and Generative models
• Demo: Object Detection with pretrained RetinaNet with Keras
• Backpropagation in Neural Networks using Numpy
• Neural Recommender Systems with Explicit Feedback
• Neural Recommender Systems with Implicit Feedback and the Triplet Loss
• Fine Tuning a pretrained ConvNet with Keras (GPU required)
• Bonus: Convolution and ConvNets with TensorFlow
• ConvNets for Classification and Localization
• Character Level Language Model (GPU required)
• Transformers (BERT fine-tuning): Joint Intent Classification and Slot Filling
• Translation of Numeric Phrases with Seq2Seq
• Stochastic Optimization Landscape in Pytorch

The Deep Learning case: we want to make predictions about some \( x \). Minimum Description Length view of the VAE: Alice wants to transmit \(x\) as compactly as possible to Bob, who knows only the prior \(p(z)\) and the decoder weights. Jeez, how is that related to this slide? It turns out the ELBO is also a lower bound on the marginal log-likelihood (hence the name), so we can maximize it with respect to the parameters \(\theta\) of the generator as well; a short derivation follows below.
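For completeness, here is the standard one-line argument behind that claim, written in the VAE notation used earlier (a standard identity, not additional slide material):

$$ \log p(x) \;=\; \underbrace{\mathbb{E}_{q(z|x)} \log \frac{p(x, z)}{q(z|x)}}_{\text{ELBO}} \;+\; \underbrace{\text{KL}\big(q(z|x) \,\|\, p(z|x)\big)}_{\ge\, 0} \;\;\Longrightarrow\;\; \log p(x) \;\ge\; \mathbb{E}_{q(z|x)} \log p(x|z) \;-\; \text{KL}\big(q(z|x) \,\|\, p(z)\big) $$

Since the KL term is nonnegative, maximizing the ELBO with respect to the decoder parameters \(\theta\) pushes up a lower bound on \(\log p(x)\).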
• The prediction and its uncertainty are quantified by the posterior predictive distribution \(p(y \mid x, \mathcal{D})\). This requires us to know the posterior distribution over the model parameters, \(p(\theta \mid \mathcal{D})\), which we obtain using Bayes' rule.
• Suppose the model is \(y \sim \mathcal{N}(\theta^T x, \sigma^2)\) with \( \theta \sim \mathcal{N}(\mu_0, \sigma_0^2 I) \), and suppose we observed some data from this model, \( \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \) (all generated using the same \( \theta^* \)).
• We don't know the optimal \(\theta\), but the more data we observe, the more the posterior concentrates around it. The posterior predictive would also be Gaussian: $$ p(y|x, \mathcal{D}) = \mathcal{N}(y \mid \mu_N^T x, \sigma_N^2) $$
• Suppose we observe a sequence of coin flips \((x_1, ..., x_N, ...)\) but don't know whether the coin is fair: $$ x \sim \text{Bern}(\pi), \quad \pi \sim U(0, 1) $$ First, we infer the posterior distribution over the hidden parameter \(\pi\) having observed \(x_1, \ldots, x_N\) (a small sketch follows below).
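A tiny sketch of that coin-flip example. A uniform prior is \(\text{Beta}(1, 1)\), so by Beta-Bernoulli conjugacy the posterior is \(\text{Beta}(1 + \#\text{heads}, 1 + \#\text{tails})\); the observed flips below are made up for illustration.

```python
import numpy as np
from scipy import stats

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # made-up observations x_1..x_N
heads = flips.sum()
tails = len(flips) - heads

posterior = stats.beta(1 + heads, 1 + tails)  # p(pi | x_1..x_N)
print("posterior mean of pi:", posterior.mean())            # (1 + heads) / (2 + N)
print("95% credible interval:", posterior.interval(0.95))
```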