Sebastian Höfer's Blog
A blog about how to understand AI, robotics and machine learning intuitively, including random posts about all sorts of different topics.
http://shoefer.github.io/
Wed, 28 Nov 2018 07:00:26 +0000
Yet Another Machine Learning 101
<p>This post is a somewhat short recap of machine learning in general. I wrote it
as lecture notes for a tutorial that I gave in one of the <a href="http://www.robotics.tu-berlin.de">robotics courses at
TU Berlin</a>.</p>
<p>It will probably be difficult to understand it in detail if you are a beginner.
For beginners, I recommend reading <a href="http://www.intuitivemi.de">my blog on intuitive machine
intelligence</a>.</p>
<p>The post is also available as an <a href="https://github.com/shoefer/shoefer.github.io/blob/master/notebooks/Yet%20Another%20Machine%20Learning%20101.ipynb">ipython notebook</a> which allows you to play around with the python examples.</p>
<h2 id="general-setting">General setting</h2>
<p>We distinguish three general paradigms in ML:</p>
<ul>
<li><strong>Supervised Learning</strong></li>
<li><strong>Unsupervised Learning</strong></li>
<li><strong>Reinforcement Learning</strong></li>
</ul>
<p>Additionally, there is a variety of other paradigms, such as
<a href="https://en.wikipedia.org/wiki/Semi-supervised_learning"><strong>semi-supervised
learning</strong></a> and
<a href="http://arxiv.org/abs/1511.06429"><strong>learning from side information</strong></a>, which will
not be covered here.</p>
<p>In all types of machine learning the goal is to learn a function to predict some
output from some input data:</p>
<script type="math/tex; mode=display">f: X \rightarrow Y</script>
<p>The different paradigms differ in</p>
<ul>
<li>the goal, i.e. what kind of data $X$ and $Y$ are</li>
<li>the training data available to learn $f$</li>
</ul>
<p>We will now review the different paradigms and give examples, focusing mostly
on <em>supervised learning</em>.</p>
<h1 id="supervised-learning">Supervised Learning</h1>
<p>Supervised learning is by far the most popular paradigm. In this setting, we are
given examples from $X$ and $Y$, i.e. $N$ pairs $\{ x^{(i)}, y^{(i)}
\}_{i=1\ldots N}$. The data $x^{(i)}$ are called the <em>input data</em> and $y^{(i)}$
the <em>labels</em>.
Together, they are called the <em>training data</em>.</p>
<p>Any ML method usually consists of (at least) three ingredients.</p>
<ul>
<li>A representation of $f$, i.e. whether it is a linear function, polynomial,
neural network, or non-parametric, etc.</li>
<li>An appropriate <em>loss function</em> $\mathcal{L}$,</li>
<li>which is minimized using an <em>optimization method</em>.</li>
</ul>
<p>The learning problem for supervised learning then looks as follows:</p>
<script type="math/tex; mode=display">f = \textrm{argmin}_{f}\ \mathcal{L} ( f, \{ x^{(i)}, y^{(i)} \}_{i=1\ldots N
})</script>
<p>A <em>loss function</em> can be thought of as an assessment of how well $f$ fits the
training data. If its value is high, it means that $f(x^{(i)})$, i.e. our
prediction of the label for $x^{(i)}$ given the current function $f$, is very
different from the known true value $y^{(i)}$. If $\mathcal{L} = 0$, it means
that $f$ perfectly fits the data.</p>
<p>An <em>optimization</em> method can be thought of as a method that is given data and a
loss function and automatically tries to find the best function minimizing the
loss. There is a wide variety of optimization methods, and they mostly differ by
the properties of $f$ they exploit (e.g. if $f$ is differentiable, we can
compute the derivative of the loss and go step by step in the direction of
steepest descent; this is called <em>gradient descent optimization</em>, see below). We
only look at some very basic optimization methods, but of course the
success in learning a function greatly depends on the optimization method used.</p>
<p>The <em>representation</em> of $f$ also drastically influences its learnability. The
easiest and best-understood representations are linear functions, but also
neural networks (which are some sort of nonlinear functions) are very common
nowadays.</p>
<p>Usually, as a practitioner, you don’t have to worry about making choices about these
three things.
An ML method usually determines the loss, optimization and function
representation, and is tuned in such a way that all of them work together
nicely.</p>
<h3 id="overfitting">Overfitting</h3>
<p>We said that the goal of supervised learning is to learn a function $f$ from
training data, using the loss $\mathcal{L}$. However, it is important to note that a
perfect loss, $\mathcal{L}=0$, on your training data does not necessarily solve the
problem – the actual task of supervised learning is to learn an $f$ that
<em>generalizes</em> to <em>unseen examples</em> $x^{(j)}$. And actually, fitting the training
data is very easy: just memorize all of them by heart! This will trivially give
us $\mathcal{L}=0$ for our training data. But as we will see, that might mean
that for unseen data we will actually make blatantly wrong predictions.</p>
<p>To repeat, the core problem of any machine learning method is to learn an $f$
that generalizes to unseen data. If $f$ only works well on the training data,
but not on unseen data, we say $f$ <em>overfits</em> the data.</p>
<p>To ensure that our function does not overfit, we divide the data into three
sets:</p>
<ul>
<li><em>Training data</em><br />The data used for actually learning/fitting the function</li>
<li><em>Validation data</em><br />Additional data, not used for learning the function, but
only used to compute the loss for our current estimate of the function. It is
also used to tune the hyperparameters of your method, or decide which ML method
to use.</li>
<li><em>Test data</em> <br />Another additional data set, neither used for training nor
for validation. You can think of it as being locked in some secret drawer, and
it is only released and used to evaluate $f$ once you are sure that you have
properly learned $f$.</li>
</ul>
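<p>The three-way split described above can be sketched in a few lines of plain Python. The function name <code>split_data</code>, the 60/20/20 proportions and the toy examples are assumptions made for illustration, not part of the original notebook.</p>

```python
import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle examples and split them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # shuffle so that each set is representative
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left
    return train, val, test

# ten toy (input, label) pairs, made up for this sketch
examples = [(x, 2 * x) for x in range(10)]
train, val, test = split_data(examples)
print(len(train), len(val), len(test))
```

<p>The important property is that the three sets do not overlap, so the test set never influences training or hyperparameter tuning.</p>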
<h3 id="example-linear-regression">Example: Linear regression</h3>
<p>Let us consider a simple example for supervised learning: <em>linear regression</em>.</p>
<p>Assume we want to predict the price of apartments in Berlin, given their size in
square meters. Here is some fictitious data:</p>
<table style="border: 1px solid black;">
<tr>
<th> $X$ (square meters) </th>
<th> $Y$ (monthly rent in Euro)</th>
</tr>
<tr>
<td>40</td><td>500</td>
</tr>
<tr>
<td>65</td><td>620</td>
</tr>
<tr>
<td>80</td><td>855</td>
</tr>
<tr>
<td>81</td><td>910</td>
</tr>
<tr>
<td>100</td><td>1100</td>
</tr>
<tr>
<td>120</td><td>1250</td>
</tr>
</table>
<p>Let’s plot the data:</p>
<p><strong>In [1]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="n">set_printoptions</span><span class="p">(</span><span class="n">suppress</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">65</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">81</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">120</span><span class="p">])</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">500</span><span class="p">,</span> <span class="mi">620</span><span class="p">,</span> <span class="mi">855</span><span class="p">,</span> <span class="mi">910</span><span class="p">,</span> <span class="mi">1100</span><span class="p">,</span> <span class="mi">1250</span><span class="p">])</span>
<span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">);</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Populating the interactive namespace from numpy and matplotlib
</code></pre></div></div>
<p><img src="http://shoefer.github.io/notebooks/yet-another-machine-learning-101_files/yet-another-machine-learning-101_2_1.png" alt="png" /></p>
<p>The idea of linear regression is to fit a linear function, i.e. a line of the
form $f(x) = w_1 x + w_0$, to the data. To learn such a line we define as loss
the <em>mean-squared error</em> which penalizes large deviations of $f(x)$ from the
known $y$:</p>
<script type="math/tex; mode=display">\mathcal{L}_\textrm{MSE}
= \frac{1}{2N} \sum_{i=1}^N ( f(x^{(i)}) - y^{(i)})^2\\
= \frac{1}{2N} \sum_{i=1}^N ( w_1 x^{(i)} + w_0 - y^{(i)})^2</script>
<p>Maybe you are a bit confused about the fraction $\frac{1}{2N}$. The reason we
use it is that if we compute the derivative of $\mathcal{L}_\textrm{MSE}$, the
factor $2$ coming from the square within the sum cancels the $\frac{1}{2}$; and the $\frac{1}{N}$
normalizes for the number of training examples.</p>
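<p>To make the loss concrete, here is a small sketch that evaluates the mean-squared error of two candidate lines on the apartment data. The helper name <code>mse_loss</code> and the two candidate weight settings are made up for illustration.</p>

```python
def mse_loss(w1, w0, xs, ys):
    """Mean-squared error (with the 1/(2N) factor) of the line f(x) = w1 * x + w0."""
    n = len(xs)
    return sum((w1 * x + w0 - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# apartment data from the table above
xs = [40, 65, 80, 81, 100, 120]
ys = [500, 620, 855, 910, 1100, 1250]

# two hypothetical candidates for f
loss_flat = mse_loss(0.0, 800.0, xs, ys)  # ignores the size completely
loss_line = mse_loss(10.0, 60.0, xs, ys)  # roughly 10 Euro per square meter
print(loss_flat, loss_line)
```

<p>The candidate that takes the apartment size into account gets a much lower loss, which is exactly the signal an optimization method uses to prefer it.</p>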
<p>Note that usually $x$ is multi-dimensional, i.e. a vector $\mathbf{x}$. Then
$w_1$ becomes a weight vector $\mathbf{w}$.
We can do another trick to get rid of $w_0$ (called the <em>bias</em> or <em>intercept</em>
term) by appending a $1$ to $\mathbf{x}$, and extending $\mathbf{w}$ by one
dimension. This facilitates notation a bit, as it allows us to write the loss
using vector notation:</p>
<script type="math/tex; mode=display">\mathcal{L}_\textrm{MSE} = \frac{1}{2N} \sum_{i=1}^N ( \mathbf{w}^T
\mathbf{x}^{(i)} - y^{(i)})^2</script>
<h3 id="gradient-descent">Gradient descent</h3>
<p>How do we optimize it? One way is to compute the partial derivative of the loss wrt. each
$j$-th element of the weight vector (together these form the so-called <em>gradient</em>
$\nabla \mathcal{L}_\textrm{MSE}$), and to update the weights accordingly:</p>
<script type="math/tex; mode=display">\frac{\partial \mathcal{L}_\textrm{MSE}}{\partial {w}_j}
= \frac{1}{2N} \sum_{i=1}^N \frac{\partial}{\partial {w}_j}
( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)})^2 \\
= \frac{1}{2N} \sum_{i=1}^N 2 ( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)}) x_j^{(i)}\\
= \frac{1}{N} \sum_{i=1}^N (f(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)}</script>
<p>The <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> algorithm
(for linear regression with MSE) then goes as follows:</p>
<ul>
<li>Initialize the weights $\mathbf{w}$ randomly</li>
<li>Repeatedly update each $j$-th entry in the weight vector by following the negative
gradient:<br />$w_j^\textrm{new} = w_j - \alpha \frac{1}{N} \sum_{i=1}^N
(f(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)}$</li>
</ul>
<p>Here, $\alpha$ denotes the step size. It has to be small enough, otherwise we might
overshoot and miss the minimum.</p>
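<p>A minimal sketch of this procedure in plain Python, assuming the MSE gradient $\frac{1}{N}\sum_i (f(x^{(i)}) - y^{(i)})\, x_j^{(i)}$ and the apartment data from above. Two choices here are assumptions of the sketch and not part of the original notebook: the weights start at zero instead of at random values (for reproducibility), and the input is standardized first so that one step size works for both weights.</p>

```python
def gradient_descent(xs, ys, alpha=0.1, n_steps=500):
    """Fit f(x) = w1 * x + w0 by gradient descent on the MSE loss."""
    n = len(xs)
    w1, w0 = 0.0, 0.0  # fixed start instead of a random one, for reproducibility
    for _ in range(n_steps):
        errors = [w1 * x + w0 - y for x, y in zip(xs, ys)]
        grad_w1 = sum(e * x for e, x in zip(errors, xs)) / n
        grad_w0 = sum(errors) / n  # gradient for the appended "1" feature
        w1 -= alpha * grad_w1
        w0 -= alpha * grad_w0
    return w1, w0

xs = [40, 65, 80, 81, 100, 120]
ys = [500, 620, 855, 910, 1100, 1250]

# standardize the input so that a single step size works for both weights
mean_x = sum(xs) / len(xs)
std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5
xs_std = [(x - mean_x) / std_x for x in xs]

w1_std, w0_std = gradient_descent(xs_std, ys)

# convert the weights back to the original units (Euro per square meter)
w1 = w1_std / std_x
w0 = w0_std - w1_std * mean_x / std_x
print(w1, w0)
```

<p>Without the standardization, the step size would have to be far smaller, because the raw inputs (up to 120) make the loss very steep along $w_1$ and very flat along $w_0$.</p>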
<p>If we imagine the loss function to be a “bowl” and every $\mathbf{w}$ to be a
point in this bowl, the negative gradient points downhill towards the bottom of this bowl, and
thus towards the minimum of the loss. The update rule then takes small steps towards
this minimum.
This is a cool visualization which is also accompanied by a <a href="http://www.mathworks.com/matlabcentral/fileexchange/35389-gradient-descent-visualization">Matlab
script</a>:
<img src="http://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/35389/versions/1/screenshot.png" style="width:30%" alt="Obtained from http://www.mathworks.com/matlabcentral/fileexchange/35389-gradient-descent-visualization" /></p>
<p>Gradient descent is a very important technique, very popular especially for
training neural networks (see below).</p>
<h3 id="ordinary-least-squares">Ordinary least squares</h3>
<p>In the linear regression case, however, there is a more direct solution. If we
stack all training inputs into a matrix $\mathbf{X}$ and all labels into a vector
$\mathbf{y}$, we can write the loss (dropping the constant factor $\frac{1}{N}$, which does
not change the minimizer) in the following way:</p>
<script type="math/tex; mode=display">\mathcal{L}_\textrm{MSE} = \frac{1}{2} (\mathbf{X}\mathbf{w} - \mathbf{y})^T
(\mathbf{X}\mathbf{w} - \mathbf{y})</script>
<p>where $\mathbf{X}$ contains in each $i$-th <em>row</em> one training example
$\mathbf{x}^{(i)}$, and $\mathbf{y}$ in each $i$-th row a label $y^{(i)}$.</p>
<p>Now we can compute the derivative of $\mathcal{L}_\textrm{MSE}$, set it to 0,
and solve for $\mathbf{w}$. As you can check yourself, the derivative is given
by:</p>
<script type="math/tex; mode=display">\frac{d \mathcal{L}_\textrm{MSE}}{d \mathbf{w}} = \mathbf{X}^T (\mathbf{X}\mathbf{w} -
\mathbf{y}) \\
\mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y}) = 0\\
\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}</script>
<p>Note that we cannot simply solve $\mathbf{X}\mathbf{w} = \mathbf{y}$ by inverting $\mathbf{X}$, which is
usually not possible because $\mathbf{X}$ is not square in the general case. But
$\mathbf{X}^T\mathbf{X}$ is square, and if it is invertible we obtain the solution
via the <em>pseudo-inverse</em>:</p>
<script type="math/tex; mode=display">\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}</script>
<p>where the <em>pseudo-inverse</em> is given by $(\mathbf{X}^T\mathbf{X})^{-1}
\mathbf{X}^T$.</p>
<h3 id="computational-example">Computational example</h3>
<p>Let’s now implement this in python.</p>
<p><strong>In [2]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># append a column of ones to the training data (bias term)</span>
<span class="n">Xhat</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">))])</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span><span class="n">Xhat</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xhat</span><span class="p">))</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Xhat</span><span class="o">.</span><span class="n">T</span><span class="p">)</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="n">w</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([ 10.04581152, 58.78926702])
</code></pre></div></div>
<p><strong>In [3]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Xhat</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">));</span></code></pre></figure>
<p><img src="http://shoefer.github.io/notebooks/yet-another-machine-learning-101_files/yet-another-machine-learning-101_5_0.png" alt="png" /></p>
<p>Luckily, there are libraries that do all that for us. One of the most popular ML
libraries in Python is <em>scikit-learn</em>.
We can easily verify that it computes the same function:</p>
<p>(we see that sklearn automatically adds a bias, stored in the variable
“intercept_”)</p>
<p><strong>In [4]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">X_rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">X_rng</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_rng</span><span class="p">));</span>
<span class="n">lr</span><span class="o">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">intercept_</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([ 10.04581152]), 58.789267015706741)
</code></pre></div></div>
<p><img src="http://shoefer.github.io/notebooks/yet-another-machine-learning-101_files/yet-another-machine-learning-101_7_1.png" alt="png" /></p>
<p>We see that the weights and our prediction (the blue line) are identical. And
you see that it fits the data okay, but not perfectly. Still, it looks like a
reasonable guess. Most importantly, it also makes a prediction for inputs $x$
for which we did not have any training data.</p>
<h2 id="overfitting-1">Overfitting</h2>
<p>What if we don’t want to fit a line, but some more complex model, e.g. a
polynomial? This is easily done by <em>augmenting our input</em> (also called <em>feature
expansion</em>) by different powers of the input:</p>
<p>$\tilde{\mathbf{x}} = [\mathbf{x}, \mathbf{x}^2, \mathbf{x}^3, \ldots ]$</p>
<p>Let’s try that out:</p>
<p><strong>In [5]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">polynomial_feature_expansion</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">X</span><span class="o">**</span><span class="mi">3</span><span class="p">,</span> <span class="n">X</span><span class="o">**</span><span class="mi">4</span><span class="p">,</span> <span class="n">X</span><span class="o">**</span><span class="mi">5</span><span class="p">,])</span>
<span class="n">X_rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">140</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">polynomial_feature_expansion</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">X_rng</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">polynomial_feature_expansion</span><span class="p">(</span><span class="n">X_rng</span><span class="p">)));</span>
<span class="n">ylim</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">2000</span><span class="p">)</span>
<span class="n">lr</span><span class="o">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">intercept_</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([ 14413.72962393, -394.63014472, 5.20499616, -0.03319069,
0.00008213]), -201202.688457913)
</code></pre></div></div>
<p><img src="http://shoefer.github.io/notebooks/yet-another-machine-learning-101_files/yet-another-machine-learning-101_9_1.png" alt="png" /></p>
<p>We see that the training data is fitted almost perfectly; but the function
hallucinates weird values in between and before/after the training data! Also,
the weights are actually very high. This is a classical example of overfitting:
we used a function that is too “powerful”, as it has many more parameters than
the linear model. There are different ways to remedy this problem:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Big_data">More training data</a>!</li>
<li>Restricting the function to a simpler one (e.g. fewer parameters)<br />(The
danger can be <em>underfitting</em>, i.e. choosing a too simple model)</li>
<li>Model selection, e.g. using <a href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29">cross-validation</a></li>
<li>Incorporating prior knowledge about the problem, e.g. by <a href="https://en.wikipedia.org/wiki/Feature_engineering">feature
engineering</a></li>
<li>Regularization</li>
</ul>
<p>At this point we will only talk about
<a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a>
which is a special way to force a powerful function not to overfit. The idea is to
put additional terms into the loss function. A popular regularization is <em>L2</em>
which puts an <a href="http://mathworld.wolfram.com/L2-Norm.html">L2-norm</a> penalty on the
weights, i.e. $||\mathbf{w}||^2$, into the loss. The optimizer then has to
not only minimize the initial loss, e.g. the mean-squared error, but also
keep the regularization term small.</p>
<p>A linear regression with L2 regularization is called <em>ridge regression</em>. The
loss then becomes:
<script type="math/tex">\mathcal{L}_\textrm{Ridge} = \frac{1}{2} (\mathbf{X}\mathbf{w} -
\mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y}) + \alpha ||\mathbf{w}||^2</script></p>
<p>where $\alpha$ weights the influence of the regularizer.</p>
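<p>To see what the penalty does, consider the simplest possible case: a single weight on centered data. Setting the derivative of $\frac{1}{2}\sum_i (w x^{(i)} - y^{(i)})^2 + \alpha w^2$ to zero gives $w = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i (x^{(i)})^2 + 2\alpha}$. The sketch below evaluates this for the apartment data; the helper name <code>ridge_slope</code> and the values of $\alpha$ are made up, and leaving the intercept unpenalized by centering the data is an assumption of this sketch.</p>

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing 0.5 * sum((w * x - y)^2) + alpha * w^2 on centered data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # centering removes the intercept, so it is not penalized (assumption of this sketch)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # setting the derivative to zero gives w * (sxx + 2 * alpha) = sxy
    return sxy / (sxx + 2 * alpha)

xs = [40, 65, 80, 81, 100, 120]
ys = [500, 620, 855, 910, 1100, 1250]

slopes = [ridge_slope(xs, ys, alpha) for alpha in [0.0, 100.0, 1000.0, 10000.0]]
print(slopes)
```

<p>With $\alpha = 0$ we recover the ordinary least squares slope; the larger $\alpha$ becomes, the more the weight is pulled towards zero.</p>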
<p>Ridge regression is also implemented in scikit-learn:</p>
<p><strong>In [6]:</strong></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">Ridge</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">20.</span><span class="p">)</span>
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">polynomial_feature_expansion</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">X_rng</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">polynomial_feature_expansion</span><span class="p">(</span><span class="n">X_rng</span><span class="p">)));</span>
<span class="n">ylim</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">2000</span><span class="p">)</span>
<span class="n">lr</span><span class="o">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lr</span><span class="o">.</span><span class="n">intercept_</span></code></pre></figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([-0.05208327, -2.0080791 , 0.05283245, -0.00047523, 0.00000145]),
1401.9265643035942)
</code></pre></div></div>
<p><img src="http://shoefer.github.io/notebooks/yet-another-machine-learning-101_files/yet-another-machine-learning-101_11_1.png" alt="png" /></p>
<p>We see that the weights are lower and the values in between are much smoother;
but still for values $&gt;120$ and $&lt;40$ the linear model reflects our intuition
about the real $f$ better. Therefore, we might choose to increase the
regularization, collect more data or choose a different function
parametrization, e.g. a polynomial with lower degree.</p>
<h2 id="regression-vs-classification">Regression vs. Classification</h2>
<p>Before we talk about more sophisticated supervised learning methods, we should
clarify the terms <em>regression</em> and <em>classification</em>. The only difference between
these two concepts is whether $y$ is <em>discrete</em> or <em>continuous</em>. In the previous
example we used regression, i.e. we treated the price as a continuous variable.
In classification, we are usually given a discrete, finite set of <em>classes</em>, and
we are only interested in predicting the class of a new input. The only thing
that changes because of this is the loss. We won’t bother about these loss
functions now, but in case you are interested, common choices are the
<a href="https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression">categorical cross-entropy</a> loss or the <a href="https://en.wikipedia.org/wiki/Hinge_loss">hinge
loss</a>.</p>
<p>But watch out, the terminology is not always fully consistent: training a linear
model with a categorical cross-entropy loss is called <em>logistic regression</em> –
although it is actually a <em>classification</em> method!</p>
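<p>For the curious, here is a rough sketch of the cross-entropy loss for the two-class case, i.e. the loss behind logistic regression. The task (labeling apartments as "large" or "small"), the labels and the two weight settings are invented for illustration only.</p>

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def binary_cross_entropy(w1, w0, xs, labels):
    """Average cross-entropy of a linear model squashed through a sigmoid."""
    total = 0.0
    for x, label in zip(xs, labels):
        p = sigmoid(w1 * x + w0)  # predicted probability of class 1
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(xs)

# hypothetical task: is an apartment "large" (1) or "small" (0)?
sizes = [40, 65, 80, 81, 100, 120]
labels = [0, 0, 0, 1, 1, 1]

# inputs are shifted by 80.5 so that the decision boundary sits between 80 and 81
shifted = [s - 80.5 for s in sizes]
loss_uninformed = binary_cross_entropy(0.0, 0.0, shifted, labels)  # p = 0.5 everywhere
loss_separating = binary_cross_entropy(0.5, 0.0, shifted, labels)  # separates the classes
print(loss_uninformed, loss_separating)
```

<p>The model still computes a linear function of the input; only the sigmoid and the loss differ from linear regression.</p>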
<h2 id="deep-neural-networks">(Deep) Neural Networks</h2>
<p>Currently, they are probably the most popular approach in supervised learning.
The idea is to compose the function $f$ of small, slightly nonlinear functions
and connect them. These small nonlinear functions are called <em>neurons</em>,
and together they form a neural network that can be visualized as follows:</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/e/e4/Artificial_neural_network.svg" alt="Artificial neural network (https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg)" style="width: 20%;" /></p>
<p>The picture (image taken from <a href="https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg">wikipedia</a>) depicts a network with an input layer
($=\mathbf{x}$), an output layer (whose values should match our labels $y$) and a hidden
layer. This hidden layer can learn some representation of $\mathbf{x}$ that is
suitable for predicting $y$. For the record, this network structure is also
sometimes called <em>multi-layer perceptron</em>.</p>
<p>What does a (non-input and non-output) neuron look like? In fact, a neuron
basically multiplies a linear weight vector with its input (sounds exactly like
linear regression, right?) and then applies some nonlinearity on the output of
this operation (that’s different from linear regression). Let’s make this
formal; a neuron $h_i$ (in the hidden layer), given input $\mathbf{z}$ (in the
network above $\mathbf{z} = \mathbf{x}$), computes its output as follows:</p>
<script type="math/tex; mode=display">h_i(\mathbf{z}) = \sigma(\mathbf{w}_{h_i}^T\mathbf{z})</script>
<p>where $\sigma$ denotes some nonlinear <em>activation function</em>, often something
like the <em>sigmoid</em>-function:</p>
<p><script type="math/tex">\sigma(t) = \frac{1}{1 + e^{-t}}</script>,</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" alt="Sigmoid function (https://commons.wikimedia.org/wiki/File:Sigmoid-function-2.svg)" style="width: 40%;" /></p>
<p>although the <a href="http://mathworld.wolfram.com/HyperbolicTangent.html">hyperbolic
tangent</a> and
<a href="https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29">rectifiers</a> are
more en vogue in state-of-the-art neural networks.</p>
<p>A hidden layer $f_h$ composed of $H$ neurons then computes its output as
follows:</p>
<script type="math/tex; mode=display">f_h(\mathbf{z}) = \sigma(\mathbf{W}_{h} \mathbf{z}),</script>
<p>where ${\mathbf{W} _ h}$ is an $H \times \mathrm{dim}(\mathbf{z})$ matrix
composed of the stacked (transposed) weight vectors $\mathbf{w}_{h_i}^T, i=1
\ldots H$, and the activation function $\sigma$ is applied separately to each
output dimension of $\mathbf{W}_h \mathbf{z}$.</p>
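<p>As a small illustration, a hidden layer can be written in a few lines of plain Python. In this sketch the weight matrix is stored as a list of $H$ rows, one weight vector per neuron; the concrete weights and the input are hypothetical.</p>

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def hidden_layer(W, z):
    """Compute sigma(W z) for a weight matrix W given as a list of H rows."""
    return [sigmoid(sum(w_ij * z_j for w_ij, z_j in zip(row, z))) for row in W]

# hypothetical weights: H = 3 neurons, each looking at a 2-dimensional input
W_h = [[0.5, -0.2],
       [0.0, 0.0],
       [-1.0, 1.0]]
z = [1.0, 2.0]

h = hidden_layer(W_h, z)
print(h)
```

<p>Each entry of <code>h</code> is the output of one neuron: a weighted sum of the input, just like in linear regression, followed by the sigmoid.</p>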
<p>So what are <em>deep</em> networks? The idea is to stack multiple hidden layers – the
more hidden layers there are, the “deeper” the network is. Mathematically, it is
just a repeated application of multiplying a linear weight with the output of
the previous layer, then computing the activation, passing it to the next layer,
and so on. The advantage of these networks is that they can learn more
sophisticated representations of the input data, and thus solve more challenging
learning problems.</p>
<p>Finally, we have to say how to train neural networks. We can use the same loss
functions as for linear regression (or classification, of course), and apply
gradient descent. However, if we have multiple layers, we need to apply some
tricks to adapt gradient descent. The first trick is <em>backpropagation</em>; it
basically says that to compute the gradient for multiple layers, we apply the
chain rule and pass the error backwards through the network layer by layer
until we reach the beginning. For this to work well, we must also apply some
additional tricks, e.g. setting the initial values of all weights appropriately
and so on. But in principle, training a neural network is not much different
from performing a linear or logistic regression.</p>
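<p>The following sketch shows backpropagation for the smallest interesting case: one hidden sigmoid layer followed by a single linear output neuron, trained with the squared error on one example. The architecture, the weights and the training example are assumptions chosen for illustration; the finite-difference comparison at the end is only a sanity check of the computed gradient.</p>

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W, v, x):
    """One hidden sigmoid layer followed by a linear output neuron."""
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x))) for row in W]
    y_hat = sum(vk * hk for vk, hk in zip(v, h))
    return h, y_hat

def loss(W, v, x, y):
    return 0.5 * (forward(W, v, x)[1] - y) ** 2

def backprop(W, v, x, y):
    """Gradients of the squared error wrt. v and W via the chain rule."""
    h, y_hat = forward(W, v, x)
    delta_out = y_hat - y  # error at the output
    grad_v = [delta_out * hk for hk in h]
    # pass the error backwards through the output weights and the sigmoid
    delta_hidden = [delta_out * vk * hk * (1.0 - hk) for vk, hk in zip(v, h)]
    grad_W = [[dk * xj for xj in x] for dk in delta_hidden]
    return grad_W, grad_v

# hypothetical weights and a single training example
W = [[0.1, -0.3], [0.4, 0.2]]
v = [0.5, -0.7]
x, y = [1.0, 2.0], 1.0

grad_W, grad_v = backprop(W, v, x, y)

# sanity check: compare one entry with a finite-difference estimate
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus = [row[:] for row in W]
W_minus[0][1] -= eps
numeric = (loss(W_plus, v, x, y) - loss(W_minus, v, x, y)) / (2 * eps)
print(grad_W[0][1], numeric)
```

<p>Once the gradients are available, the weights are updated exactly as in gradient descent for linear regression.</p>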
<p>Of course there is a lot more to know about deep learning, and if you want a
more in-depth treatment of this topic, check out <a href="http://www.deeplearningbook.org/">this recent
book</a>.</p>
<h1 id="unsupervised-learning">Unsupervised Learning</h1>
<p>Unsupervised learning differs from supervised learning in that we are only given
$\{x^{(i)}\}_{i=1 \ldots N}$, and no labels. This obviously means that the loss
functions we’ve seen so far will not work. Instead the loss functions impose
some “statistical” constraints on $y$. A good example is <em>Principal component
analysis</em> (PCA): here we want to learn a low-dimensional variant of $x$
which still contains roughly the same “information” as the original
$x$. The question is how to quantify “information”. This is very complicated and
highly depends on the task; but PCA takes a pragmatic stance by defining
information as <em>high variance</em> (in a statistical sense). Therefore, the loss for
PCA is roughly equivalent to:</p>
<script type="math/tex; mode=display">{\mathcal{L}}_\textrm{PCA} \approx -\textrm{Var}(f(\mathbf{x}))</script>
<p>where $f$ is indeed a linear function as used above.
The $\approx$ sign reflects that formally the loss is a bit different: we need
some additional constraints to make this problem well-defined, i.e. to avoid
simply finding infinitely large weights (PCA, for example, restricts the weight
vectors to unit length). I will not go into details here, but you should
understand that one can formulate learning objectives without any supervised
signal, just by formulating some desired properties of the result of $f$ in the
loss function.</p>
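<p>As a rough illustration of the variance objective, the following sketch finds the single direction of highest variance for some toy data. It uses the standard eigendecomposition of the covariance matrix and constrains the weight vector to unit length; the function name <code>first_principal_direction</code> and the toy data are assumptions for this example.</p>

```python
import numpy as np


def first_principal_direction(x):
    """Return the unit-length w that maximizes Var(x @ w), and that variance."""
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / (len(x) - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the last eigenvector belongs to the largest variance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1], eigvals[-1]


# Hypothetical 2-D data that is stretched along the diagonal.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
x = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

w, var_along_w = first_principal_direction(x)
# The learned low-dimensional representation f(x): one number per data point.
projected = (x - x.mean(axis=0)) @ w
```

<p>Here <code>w</code> points (up to its sign) roughly along the diagonal, and no other unit-length direction gives a projection with higher variance than <code>var_along_w</code>.</p>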
<p>Note that PCA is somewhat the “regression” variant of unsupervised learning.
There are also methods that map data into discrete representations, e.g. in
clustering. The most popular and yet simplest method is probably
<a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a>.</p>
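<p>To give an idea of how simple k-means is, here is a minimal sketch of the usual alternating procedure (assign every point to its closest center, then move every center to the mean of its points). The function name <code>kmeans</code>, the fixed number of iterations and the toy data are assumptions for this example.</p>

```python
import numpy as np


def kmeans(x, k, steps=20, seed=0):
    """Cluster the rows of x into k groups; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start with k distinct data points as centers (fancy indexing copies them).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(steps):
        # Squared distance of every point to every center, shape (n, k).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = x[labels == j].mean(axis=0)
    # Final assignment with the final centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Hypothetical data: two well-separated blobs with 50 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
x = np.vstack([blob_a, blob_b])

centers, labels = kmeans(x, k=2)
```

<p>The output <code>labels</code> is the discrete representation: every data point is mapped to the index of its cluster.</p>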
<p>Also note that unsupervised learning has somewhat different applications than
supervised learning. Often it is used for pre-processing the input data, in
order to then feed its output to a supervised learning method. Another important
application is exploratory data analysis, i.e. studying and finding patterns in
your data.</p>
<h1 id="reinforcement-learning">Reinforcement Learning</h1>
<p>In reinforcement learning our outputs $Y$ are <em>actions</em> that an agent should
take, and our input $X$ is the state. Therefore, we also rename $X$ and $Y$:
the state is denoted by $\mathbf{s}$, the actions by $\mathbf{a}$, and the
function we want to learn is called <em>policy</em> $\pi$ (instead of $f$):</p>
<script type="math/tex; mode=display">\pi(\mathbf{s}) = \mathbf{a}</script>
<p>A crucial difference to supervised learning is that we do not know the correct
actions $\mathbf{a}$. Rather, we only get a <em>reward signal</em> $r(\mathbf{s},
\mathbf{a})$ for every action we take (in a certain state). The reward is a real
number that is high if the action was good, and low if it was bad.</p>
<p>Obviously, this problem is much harder as learning becomes indirect – you need
to figure out the right action only from getting a reward. The biggest problem
is that rewards are usually sparse and <em>delayed</em>; that means the agent might
receive a positive reward only after it has executed a whole sequence of actions. However,
it might be that actually the first action in this sequence was the most
important one.</p>
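<p>One common way to give earlier actions credit for a delayed reward is the <em>discounted return</em>: every time step is credited with its own reward plus a discounted share of all later rewards. This concept is not introduced above, so take the following as a small sketch of a standard idea rather than as part of this tutorial; the function name <code>discounted_returns</code> and the discount factor of 0.9 are assumptions.</p>

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that every step can reuse the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: only the fourth action is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards)
# returns is roughly [0.729, 0.81, 0.9, 1.0]
```

<p>Although the first three actions received no reward themselves, they still get a positive return, which a learning method can use as a signal that they contributed to the final reward.</p>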
<p>There is a wide variety of different techniques, such as <em>policy search</em>,
<em>value-based methods</em> and <em>model-based reinforcement learning</em> to tackle this
problem. We will not look at these techniques at this point, but it is important
that you at least understand the general setting of reinforcement learning, and
its difference to supervised learning (and unsupervised learning).</p>
Thu, 12 May 2016 00:00:00 +0000
http://shoefer.github.io/2016/05/12/yet-another-machine-learning-101.html
http://shoefer.github.io/2016/05/12/yet-another-machine-learning-101.html
pythonnotebook
The Curse of Dimensionality
<p>In the last post we have looked at one of the big problems of machine learning: when we want to learn <a href="/intuitivemi/2015/12/28/functions.html">functions</a> from data, we have to fight <a href="/intuitivemi/2015/08/07/overfitting.html">overfitting</a>. In this post we will look at another archenemy of learning: dimensionality.</p>
<h3 id="parameters">Parameters</h3>
<p>Let’s briefly recap our stock price prediction example. In an <a href="/intuitivemi/2015/12/30/learning-functions.html">earlier post</a> we used random search to find the parameters of a line that explains the training data examples well.
The algorithm learned two numbers, namely the <em>parameters</em> p<sub>1</sub> and p<sub>2</sub>:</p>
<div class="pseudoformula">
<b>Stock price</b> = f(<b>Revenue</b>) = p<sub>1</sub> * <b>Revenue</b> + p<sub>2</sub>,
</div>
<p>where visually, p<sub>1</sub> changes the slope of the line and p<sub>2</sub> its shift along the y-axis.</p>
<p>The random search learned the numbers p<sub>1</sub>=0.00015 and p<sub>2</sub>=58:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-12-29-learning-random_guess.png" class="" width="65%" height="" />
</div>
<p>These values are not very far away from the parameters of the “true” function, p<sub>1</sub><sup>true</sup>=0.00013 and p<sub>2</sub><sup>true</sup>=70. Interestingly, for the learned function the slope p<sub>1</sub> is a bit higher, but this is “compensated” by a lower shift p<sub>2</sub>. Intuitively, this makes sense: the higher the slope, the more we need to shift the line downwards in order to approximately hit the training examples. We will see in a second why this observation is important.</p>
<h3 id="adding-dimensions">Adding dimensions</h3>
<p>Now, I would like to reason about the influence of the number of parameters of a function on the difficulty of learning that function. You can think about the parameters of the line, but in fact I will formulate it in a general way.</p>
<p>For the sake of the argument let’s assume that we do not consider all possible numbers as possible parameter values, but that we restrict ourselves to a fixed list of numbers (in mathematics that’s called <em>discretization</em>), and we make this list finite. To make it really simple, we will only use the numbers from 1 to 10.</p>
<p>If we have a function that has only <em>one</em> parameter (for example, only the slope of a line), we immediately see that there are 10 possible values that the parameter can take. We visualize each value of the parameter by a blue box:</p>
<table border="0" style="border-collapse: collapse; margin: 0 0 15px 25px;">
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
1
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
2
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
3
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
4
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
5
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
6
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
7
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
8
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
9
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
10
</td>
</tr>
</table>
<p>What happens if we add a second parameter that can take 10 values? Well, you might think that results in 10 values for parameter one and another 10 for parameter two, which makes 20. But unfortunately that’s wrong: we need to consider all possible combinations of the parameter values! The reason is that the parameters are usually dependent as we have seen in the line example: when changing parameter one (here the slope), we can still improve how well the line fits the data by changing parameter two (shift). Let’s visualize all possible combinations:</p>
<table border="0" style="border-collapse: collapse; margin: 0 0 15px 25px;">
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(1, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(2, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(3, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(4, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(5, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(6, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(7, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(8, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(9, 10)
</td>
</tr>
<tr>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 1)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 2)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 3)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 4)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 5)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 6)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 7)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 8)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 9)
</td>
<td style="width:40px; height:30px; border:1px solid blue; font-size: 8pt; color: blue; " align="center">
(10, 10)
</td>
</tr>
</table>
<p>We see that we have 10 <em>times</em> 10, i.e. 100 possible parameter pairs. What happens if we add a third parameter? We get a cube with 10 times 10 times 10 equals 1000 parameter combinations! And with 10 parameters each taking 10 values, we get 10.000.000.000, that is 10 billion different combinations of parameter values!</p>
<p>You probably see the formula behind our reasoning: If we have <em>n</em> values a parameter can take, and <em>m</em> parameters, we end up with <em>n</em><sup><em>m</em></sup> possible parameter value assignments. We say that the number of parameter value assignments grows <em>exponentially</em> with the number of parameters.</p>
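<p>If you want to convince yourself of this counting argument, the following sketch enumerates all combinations for two parameters and computes the count for more parameters. The helper name <code>count_combinations</code> is just an assumption for this example.</p>

```python
from itertools import product


def count_combinations(n_values, n_params):
    """Number of ways to assign one of n_values to each of n_params parameters."""
    return n_values ** n_params


values = range(1, 11)                      # the numbers from 1 to 10
pairs = list(product(values, repeat=2))    # (1, 1), (1, 2), ..., (10, 10)

print(len(pairs))                          # 100
print(count_combinations(10, 3))           # 1000
print(count_combinations(10, 10))          # 10000000000
```

<p>Enumerating the combinations with <code>product</code> is only feasible for very few parameters, which is exactly the point: the list itself quickly becomes too large to build.</p>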
<p>Now how big of a problem is it? Well, it is very big indeed, which is why this problem is called <em>the curse of dimensionality</em>. The problem is that data are usually high-dimensional, and that each parameter usually has significantly more than 10 possible values.</p>
<p>For example, the tiny pictures we played around with in an <a href="/intuitivemi/2015/07/25/vector-spaces.html">earlier post</a> had 27x35, that is 945 pixels. If we were to learn a function that has a parameter for every pixel, and again every parameter could only take 1 out of 10 values, we would still end up with 10<sup>945</sup> parameter combinations. This is a number consisting of a 1 with 945 trailing zeros, and it is many orders of magnitude larger than the <a href="http://www.quora.com/How-many-particles-are-there-in-the-universe">number of particles in the entire universe</a>! We will never be able to try out even a tiny fraction of all possible parameter combinations.</p>
<p>So we see that the curse of dimensionality forces us to find smarter ways of finding the parameters of functions. We will save this for later articles.</p>
<h3 id="hughes-effect">Hughes effect</h3>
<p>Irrespective of how smart we are about searching for parameters, high dimensionality has another very problematic (and somewhat unintuitive) implication, namely for classification. This problem is also known as the <em>Hughes effect</em>.
I will only sketch the idea very briefly, but you can find a more elaborate explanation with nice illustrative figures <a href="http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/">in this blog article</a>.</p>
<p>The problem is stated as follows: recall that in classification, we aim to find a function that discriminates between two or more categories of input data. We presented a way of doing so, namely by finding <a href="/intuitivemi/2015/07/25/vector-spaces.html">hyperplanes in space</a> that separate the categories. So how much easier or simpler does it get to find a hyperplane if the dimensionality of the data is higher?</p>
<p>Here comes the problem: the more dimensions the data has, the <em>easier</em> it is to find a hyperplane separating categories in the <em>training data</em>; but at the same time, the <em>harder</em> it gets to also perform well on the <em>unseen data</em> (for example, <em>test data</em>). The reason is that because we have more dimensions that we can choose from to lay the hyperplane through, we are much more prone to <a href="/intuitivemi/2015/08/07/overfitting.html">overfitting</a>. Thus, having higher-dimensional data has the same effect as allowing our function to have more “wrinkles”. And the less training data we have, the less sure we are that we have identified the dimensions that really matter for discriminating between the categories.</p>
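<p>The effect can be reproduced with a small experiment. In the following sketch the labels are pure coin flips, so there is nothing to learn at all; still, with more dimensions than training examples we find a hyperplane that separates the training data perfectly. Fitting the hyperplane with least squares, the function name <code>train_and_test_accuracy</code> and all sizes are assumptions I chose for illustration.</p>

```python
import numpy as np


def train_and_test_accuracy(n_dims, n_train=20, n_test=1000, seed=0):
    """Fit a hyperplane to random labels and report (train, test) accuracy."""
    rng = np.random.default_rng(seed)
    # Inputs and labels are independent, so no hyperplane can generalize.
    x_train = rng.normal(size=(n_train, n_dims))
    y_train = rng.choice([-1.0, 1.0], size=n_train)
    x_test = rng.normal(size=(n_test, n_dims))
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    # Least-squares weights of a hyperplane through the origin.
    w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    train_acc = float(np.mean(np.sign(x_train @ w) == y_train))
    test_acc = float(np.mean(np.sign(x_test @ w) == y_test))
    return train_acc, test_acc


low_dim = train_and_test_accuracy(n_dims=2)
high_dim = train_and_test_accuracy(n_dims=50)
print("2 dimensions  (train, test):", low_dim)
print("50 dimensions (train, test):", high_dim)
```

<p>With 50 dimensions and only 20 training examples the training accuracy is perfect, while the accuracy on unseen data stays around chance level. That is overfitting caused purely by the number of dimensions.</p>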
<h3 id="summary">Summary</h3>
<p>In this post, we got to know another nemesis of learning: high dimensionality. When looking at learning from the perspective of searching for the right parameters, each additional dimension multiplies the number of parameter combinations we must search, so the search space grows exponentially. And in classification, every additional dimension makes it harder to find the right hyperplane to discriminate between the categories.</p>
<p>But another curse is already on its way; and it has to do with (no) free lunch.</p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Curse of dimensionality refers to the dimensionality of the data and parameters to be learned</li>
<li>Searching for the right parameter values becomes exponentially harder with every dimension</li>
<li>Similarly, in classification every dimension makes us more prone to overfitting by choosing the wrong dimension as discriminatory</li>
</ul>
<h3 id="further-reading"><a name="further"></a>Further reading:</h3>
<ol>
<li><a name="[1]"></a>Excellent <a href="http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/">blog article</a> by Vincent Spruyt on the curse of dimensionality in classification, aka the Hughes effect.</li>
</ol>
Sun, 03 Jan 2016 13:45:00 +0000
http://shoefer.github.io/intuitivemi/2016/01/03/dimensionality.html
http://shoefer.github.io/intuitivemi/2016/01/03/dimensionality.html
intuitivemi
Learning Functions from Data: A Primer
<p>In the introductory articles we have learned that <a href="/intuitivemi/2015/07/19/data-numbers-representations.html">data is a bunch of numbers</a> encoding some information, and that data can be multi-dimensional which makes them live in <a href="/intuitivemi/2015/07/25/vector-spaces.html">vector spaces</a>.
We have also looked at the core competence of machine intelligence: applying <a href="/intuitivemi/2015/12/28/functions.html">functions</a> to data. In this and the following posts we will look at the most powerful tool of machine intelligence: <em>learning functions from data</em>.</p>
<p>The roadmap is as follows. In this article, we will understand why learning functions from data is in principle rather straightforward. Indeed, at the end of this article we will have developed a very simple learning method.</p>
<p>The next few posts will then bring us back down to earth and explain some fundamental problems that learning from data has, and discuss solutions to these problems. This will endow you with a powerful intuition of how learning from data works and what its limitations are.</p>
<h4 id="learning-functions-from-data">Learning Functions from Data</h4>
<p>In order to understand how learning from data works, let’s use as our running example the stock price prediction problem introduced <a href="/intuitivemi/2015/12/28/functions.html">earlier</a>: given the annual revenue of a company, we want to predict the company’s stock price. We have learned that such a prediction is represented by a function. We discussed two representations of functions; first, tabular functions:</p>
<table class="data-table">
<tr>
<th><i>input</i>: Annual Revenue in January (Euro)</th>
<th><i>output</i>: Price (Euro)</th>
</tr>
<tr>
<td>40.000</td>
<td>122</td>
</tr>
<tr>
<td>50.000</td>
<td>135</td>
</tr>
<tr>
<td>60.000</td>
<td>148</td>
</tr>
<tr>
<td>80.000</td>
<td>174</td>
</tr>
<tr>
<td>100.000</td>
<td>200</td>
</tr>
</table>
<p>These functions had the drawback that we could not extrapolate from them, that is, we do not know from this table what the stock price would be if, for example, the company’s revenue was 120.000 Euros per year.</p>
<p>Therefore, we were interested in finding a different function representation, and I suggested using linear functions, namely this one:</p>
<div class="pseudoformula">
<b>Stock price</b> = f(<b>Revenue</b>) = 0.00013 * <b>Revenue</b> + 70
</div>
<p>I have told you that since the function is <em>linear</em>, we can visualize it with a line:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-20-functions_sizeprice.png" class="" width="65%" height="" />
</div>
<p>More importantly, the regularity enabled us to predict the stock price given any annual revenue. Intuitively, when the revenue gets higher, the stock price moderately increases.</p>
<h4 id="how-to-extrapolate-from-data">How to extrapolate from data?</h4>
<p>By drawing the line through the bunch of dots, I actually did the task that the computer is supposed to do: I looked at the data, and I found out that there is a linear relationship between stock price and annual revenue. But what I really would like to do is to give the machine the data from the table above and let it figure out <em>on its own</em> how to make stock price predictions into the future! How do we do that?</p>
<p>Automating exactly this process is the core of <em>machine learning</em>. In machine learning, you are given a set of <em>training examples</em> - which correspond to the table given above (and are visualized by the blue <em>dots</em>).
These training examples are a bunch of data that need to be in the form <em>(input, output)</em> where output is the true, known output for the given input. The computer should then spit out an appropriate function, namely the blue line.
Luckily, the stock price training data are exactly in the form (input, output). So how to learn from them?</p>
<p>I will now show you the simplest learning procedure I could come up with. And it is indeed <em>very</em> simple! Intuitively, the computer will do the following: we tell the computer to use some line which should relate input and output, and the computer will then fiddle around with this line until it comes as close as possible to as many training examples as possible. That’s it!</p>
<p>Problem solved?</p>
<h3 id="defining-the-learning-problem">Defining the learning problem</h3>
<p>Well, almost. If we asked our computer to do that, it would reply with some of the following questions:</p>
<ol>
<li>What function should I find? How is it represented?</li>
<li>What does “as close as possible to as many training examples as possible” mean?</li>
<li>What does “fiddle around” mean?</li>
</ol>
<p>So let’s give our eager computer some answers:</p>
<ol>
<li>Start off with a single line. A line is represented by two basic <em>parameters</em>: how much it is shifted up or down (the 70 in the example above), and its slope (the 0.00013).</li>
<li>Calculate the difference (ignoring its sign, so that errors cannot cancel each other out) between the true output of every training example and the value predicted by the function the computer guesses. When we sum all of these differences up we obtain a single number which we call the <em>training error</em>. The training error tells us how well any line fits the training examples.</li>
<li>Do random guessing; we randomly guess a line (more precisely, its <em>parameters</em>) and see how well it fulfils the criterion defined in 2.</li>
</ol>
<p>That sounds like a plan, doesn’t it?</p>
<p>Let’s first get an intuition what the <em>training error</em> looks like for some randomly guessed function:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-12-29-learning-random_guess_training_error.png" class="" width="65%" height="" />
</div>
<p>Here, the blue dots are the training examples, the red curve is the random guess, the dotted lines indicate the discrepancy between the guessed line and the training data, and the numbers next to the lines indicate how big the discrepancy is. By summing over all these differences, we obtain the red number at the bottom. This number is the training error - we want to get it as low as possible in order to find a highly predictive function.</p>
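<p>In code, the training error for the table of training examples from above can be sketched as follows. Summing <em>absolute</em> differences is my assumption about how the discrepancies are added up, and the names <code>predict</code> and <code>training_error</code> as well as the guessed line are made up for this example.</p>

```python
# Training examples from the table: (annual revenue, stock price).
REVENUE = [40_000, 50_000, 60_000, 80_000, 100_000]
PRICE = [122, 135, 148, 174, 200]


def predict(revenue, slope, shift):
    """The line: stock price = slope * revenue + shift."""
    return slope * revenue + shift


def training_error(slope, shift):
    """Sum of absolute differences between predicted and true stock prices."""
    return sum(abs(predict(r, slope, shift) - p) for r, p in zip(REVENUE, PRICE))


# A hypothetical guess for the two parameters of the line.
guess_error = training_error(slope=0.001, shift=100.0)
print(guess_error)  # about 51 (18 + 15 + 12 + 6 + 0)
```

<p>The lower this number, the closer the guessed line is to the training examples.</p>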
<p>So how does random guessing perform? The following video shows what this looks like for 100 random guesses. The green line is the one with the lowest training error so far. We see that it takes some guesses, but eventually we get a close match of the green line and the blue dots, the guess and the training data.</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-12-29-learning-random_guess_animation.png" id="learning-random-guess-animation" class="" width="500" height="375" />
</div>
<script>
$('#learning-random-guess-animation').gifplayer({label:'play'});
</script>
<div>
<img style="display:none; width:0px; height:0px; visibility:hidden; border:0; margin:0; padding:0;" src="/intuitivemi/images/2015-12-29-learning-random_guess_animation.gif" />
</div>
<p>Congratulations: you have witnessed your first machine learning algorithm <a href="#[1]">[1]</a>!</p>
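<p>For the curious, random guessing can be sketched in a few lines of Python. This is not the code that produced the animation; the bounds from which slope and shift are drawn, the seed and the function names are assumptions for illustration, and the training error is again taken to be the sum of absolute differences.</p>

```python
import random

# Training examples from the table: (annual revenue, stock price).
REVENUE = [40_000, 50_000, 60_000, 80_000, 100_000]
PRICE = [122, 135, 148, 174, 200]


def training_error(slope, shift):
    """Sum of absolute differences between predicted and true stock prices."""
    return sum(abs(slope * r + shift - p) for r, p in zip(REVENUE, PRICE))


def random_search(n_guesses=100, seed=0):
    """Guess lines at random and remember the one with the lowest error."""
    rng = random.Random(seed)
    best = None       # (error, slope, shift) of the best line so far
    history = []      # best error after every guess
    for _ in range(n_guesses):
        slope = rng.uniform(0.0, 0.003)   # assumed bounds for the slope
        shift = rng.uniform(0.0, 150.0)   # assumed bounds for the shift
        err = training_error(slope, shift)
        if best is None or best[0] > err:
            best = (err, slope, shift)
        history.append(best[0])
    return best, history


best, history = random_search()
print("best error, slope, shift:", best)
```

<p>The list <code>history</code> corresponds to the green line in the animation: it only changes when a new guess beats the best line found so far.</p>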
<h4 id="caveats">Caveats</h4>
<p>You might be a bit disappointed now. Random guessing? That’s machine intelligence? That doesn’t seem intelligent at all! Fair enough, that is the simplest thing you can do.
And in practice, it won’t work very well. Yet, some of the most powerful algorithms do apply random guessing (which is officially called <em>random search</em>) but combine it with some “tricks” to make the guessing smarter. So random search is an important building block for creating learning algorithms, although it’s not very powerful when used on its own.</p>
<p>At this point, we can already think about why random search does not work so well on its own. Here are a couple of thoughts.</p>
<p>First of all, you don’t know how big you should go. Should you shift the line only between -1 and +1, or from -1000 to +1000? Should you try out both 1.1 and 1.10001? That is, at which granularity should you vary the parameters? In fact, in the animation above, I have set the bounds of the parameters very tightly around the real (but in the usual case unknown) parameters. Otherwise, the search would have taken much longer.</p>
<p>Second, we don’t learn from our mistakes. Say I have guessed line 1, which got a training error of 1000, and next I guessed line 2 with a training error of 800. Intuitively, I should try to find out what made line 2 better than line 1, and progress in that direction. Instead, random search will just look for a new line which might be much worse than both line 1 and line 2.</p>
<p>Can you think of more issues? <a href="#[2]">[2]</a></p>
<h4 id="terminology">Terminology</h4>
<p>Before we close, a note on terminology. Some paragraphs ago, we had to specify three things for our learning algorithm to work. I want to quickly state what these things are called in machine intelligence.</p>
<ol>
<li><b>Model</b>, <b>function representation</b> (sometimes also <b>hypothesis space</b>): What kind of functions do I look for? <br />(In the example: single straight lines)</li>
<li><b>Objective</b> or <b>loss function</b> <a href="#[3]">[3]</a>: Criterion for evaluating whether the learned function is good. Often this is the <b>training error</b>, a measure of how well the learned function accounts for the training data. <br />(In the example: distance between samples and function)</li>
<li><b>Learning algorithm</b>: In which way to come up with functions (or parameters) in order to get a good value for the loss function.<br />(In the example: random search)</li>
</ol>
<p>Moreover, if you learn from training data of the form (input, output), the approach is called <em>supervised learning</em>. The picture is that the pupil comes up with different inputs, and a supervisor gives the pupil a hint by telling her the right output, and the pupil tries to figure out the regularity (aka function) from these hints.</p>
<p>Finally, there are two general types of supervised learning, which depend on the type of function you want to learn. If you learn to map an input to a continuous number (as done in the stock price example), the learning task is called a <em>regression task</em>; in the last post we called this fitting a line <em>through</em> the data. In contrast, if you map the input to a category (as in the <a href="/intuitivemi/2015/07/19/data-numbers-representations.html">image classification example</a>), it is called a <em>classification task</em>. We solve this kind of task by finding a (hyper)plane or line that <em>separates</em> the data.</p>
<p>In the next post, we will critically assess what we have done, and find out that there is a fundamental issue in learning from data which we have neglected so far. It is coarsely related to the question: why did we actually use a straight line, and not something with the shape of a curve or a snake?</p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Machine intelligence learns functions from data</li>
<li>Data are given as training examples, that is input-output pairs</li>
<li>This type of learning is called supervised learning</li>
<li>To allow a computer to learn, we have to define a model, an objective and a learning algorithm</li>
<li>Random guessing is a learning algorithm that works for very simple problems</li>
</ul>
<h3 id="footnotes"><a name="further"></a>Footnotes:</h3>
<ol>
<li><a name="[1]"></a>An <a href="https://en.wikipedia.org/wiki/Algorithm">algorithm</a> is the machine intelligence people name for “recipe”. It is a list of steps that you have to execute in a certain order; it has to be detailed enough so that the machine has no doubt about what it has to do.</li>
<li><a name="[2]"></a>Think, for example, what happens if your function has more parameters, let’s say twice as many. How much harder does the learning problem then get?</li>
<li><a name="[3]"></a>Note the term loss <em>function</em>. The criterion itself is again formulated as a function - but this is <em>not</em> the function we want to learn! It is an auxiliary function that takes as input three arguments: a set of (1) inputs and (2) outputs of all the training data and (3) the corresponding outputs of the learned function. It outputs a single number, assessing the quality of the learned function for the training data. In contrast, the learned function takes only one input, namely the one it requires to predict the output.</li>
</ol>
Tue, 29 Dec 2015 23:00:00 +0000
http://shoefer.github.io/intuitivemi/2015/12/29/learning-functions.html
http://shoefer.github.io/intuitivemi/2015/12/29/learning-functions.htmlmachinelearning,statisticallearningintuitivemiFunctions as Data Translators<p>If you have read the previous posts carefully you should now be familiar with high-dimensional data. In this last introductory article we are now going to look at what machine intelligence people mean when they think about manipulating data: they apply <em>functions</em> to data.</p>
<p>We will learn what functions are and look at different examples. Once we know what they are, we can in the next post see how to learn them - which is essentially what learning machines do.</p>
<h2 id="mathematical-functions">Mathematical Functions</h2>
<p>Mathematicians use the term function rather differently from the everyday definition. In everyday language we talk about the function of things in the sense “what is the purpose of the thing” or “how does this thing work”. A function in the mathematical sense is rather different. It can be best thought of as a translation from one type of data to another type. Or in other words, you give some input (data) to the function, and get some output (data):</p>
<div class="imgcenter imgcap">
<img src="https://upload.wikimedia.org/wikipedia/commons/3/3b/Function_machine2.svg" class="" width="35%" height="" />
</div>
<p>(This figure and the first example are taken from the highly recommended <a href="https://en.wikipedia.org/wiki/Function_(mathematics)">Wikipedia page on functions</a>)</p>
<p>Let me give you a couple of examples of these things called functions.</p>
<h4 id="object-description-functions">Object description functions</h4>
<p>Let’s assume a couple of primitive shapes, such as triangles, squares, etc., in various colors. We can now define a function that, given a shape, outputs the color of the shape:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-20-functions-color_example_wp.svg" class="" width="35%" height="" />
</div>
<p>This visualization shows how to map <em>input data</em> X (the shape) to some <em>output</em>, the shape’s color Y. All the arrows between X and Y are the defining elements of the function.</p>
<p>Note that for each object in X there is only one arrow pointing away from it. This is indeed a requirement because we want a function to be a unique mapping: for each input we get exactly one output. The inverse is not required, though: the square and the triangle have the same color, so the red color in Y has two arrows pointing to it. You also notice that some colors, namely blue and purple, have no arrows pointing to them. All of these things are very common in functions.</p>
<!--
We can also come up with a whole bunch of other functions, defined on shapes: for example, we could define functions that counts the edges of each shape; or a function that calculates the area of the shape, and so on.
-->
<!-- You see, many concepts and things can be cast as function. However,
-->
<p>So far we have only talked about functions conceptually but we have not stated how a function <em>can automatically compute</em> the output from the input. Let us therefore look at a simpler example to shed some light on this.</p>
<h4 id="stock-price-prediction-function">Stock price prediction function</h4>
<p>Let us now re-introduce our simple example from the <a href="/intuitivemi/2015/07/28/datascience-showoff.html">introductory post</a>. We want to assess the stock price of some company. For the sake of simplicity, we will only use information about the annual revenue of the company (in Euro) and try to predict the stock price (also in Euro). We can describe this relationship in a big table:</p>
<table class="data-table">
<tr>
<th style="color: gray">Year</th>
<th>Annual Revenue in January (Euro)</th>
<th>Price (Euro)</th>
</tr>
<tr>
<td style="color: gray">2010</td>
<td>40.000</td>
<td>122</td>
</tr>
<tr>
<td style="color: gray">2011</td>
<td>50.000</td>
<td>135</td>
</tr>
<tr>
<td style="color: gray">2012</td>
<td>60.000</td>
<td>148</td>
</tr>
<tr>
<td style="color: gray">2013</td>
<td>80.000</td>
<td>174</td>
</tr>
<tr>
<td style="color: gray">2014</td>
<td>100.000</td>
<td>200</td>
</tr>
</table>
<p>This table can be considered a function in at least three ways: either we use the year as input, we use the annual revenue as input, or we use the stock price as input. We choose the annual revenue as the input and the stock price as the output because we believe that the annual revenue is more predictive for the stock price than the year - if the company was founded 10 years earlier, we would still expect the relationship of annual revenue and stock price to be similar (unless there was something like a global crisis, but we ignore that for now).</p>
<p>So the input of our function is <em>annual revenue</em> and the output <em>stock price</em>. And “computing” this function is very simple: given a number for the annual revenue, we look up the row in the table containing this number and return the corresponding stock price.</p>
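<p>In Python, such a lookup could be sketched with a dictionary. The names <code>price_table</code> and <code>tabular_price</code> are made up for this illustration; the numbers are the rows of the table above.</p>

```python
# The table as a dictionary: annual revenue (Euro) -> stock price (Euro).
price_table = {40000: 122, 50000: 135, 60000: 148, 80000: 174, 100000: 200}


def tabular_price(revenue):
    """'Compute' the function by looking up the matching row."""
    return price_table[revenue]


print(tabular_price(60000))  # 148

# A revenue that is not in the table has no row to look up.
try:
    tabular_price(90000)
except KeyError:
    print("no row for a revenue of 90000")
```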
<p>Such tabular functions are common but they have a severe drawback: we cannot <em>extrapolate</em> - from this table alone we do not know how to get the value for the stock price for an annual revenue of 90.000 Euros or 200.000 Euros!
However, the example data I have given exhibits a regularity (surprise, surprise!), namely that the stock price values in the right column are exactly 0.0013 times the annual revenue plus 70 (check this for yourself: 0.0013 * 40.000 + 70 = 52 + 70 = 122). This gives us a much more concise way of describing this function:</p>
<div class="pseudoformula">
<b>Stock price</b> = 0.0013 * <b>Revenue</b> + 70
</div>
<p>You see that the function gets as input the revenue, and gives as output the stock price. To make this even clearer, we usually denote the function by the symbol <em>f</em> and write:</p>
<div class="pseudoformula">
f(<b>Revenue</b>) = 0.0013 * <b>Revenue</b> + 70
</div>
<p>The parentheses behind the <em>f</em> contain the inputs of the function, sometimes called <em>arguments</em>.</p>
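<p>The "check this for yourself" step can also be done in a few lines of Python. The sketch below uses a slope of 0.0013 and an offset of 70, which are the values that reproduce every row of the table (for example 0.0013 * 40000 + 70 = 122); the variable names are assumptions made for this illustration.</p>

```python
import math

# Rows of the table: annual revenue (Euro) -> stock price (Euro).
price_table = {40000: 122, 50000: 135, 60000: 148, 80000: 174, 100000: 200}


def f(revenue):
    """Linear stock price function: slope 0.0013, offset 70."""
    return 0.0013 * revenue + 70


# Every row of the table is reproduced (up to floating point rounding).
for revenue, price in price_table.items():
    assert math.isclose(f(revenue), price)

# Unlike the table, the formula also answers for revenues without a row:
# roughly 187 Euro for 90000 and roughly 330 Euro for 200000.
print(round(f(90000), 2), round(f(200000), 2))
```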
<p>Let us visualize this function by plotting a graph which has on one axis (the <em>x-axis</em>) the annual revenue and on the other (the <em>y-axis</em>) the stock price:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-20-functions_sizeprice.png" class="" width="65%" height="" />
</div>
<p>We see two interesting things here: first, we can now <em>predict</em> the stock price value from the annual revenue, no matter which revenue. Secondly, the relationship between revenue and stock price in our example turns out to be a line. The fact that such relationships / functions can be drawn as a line results in them being called <em>linear functions</em>. Linear functions are amongst the most important types of relationships in mathematics - and actually they are one of the few that mathematics can really deal with <a href="#[2]">[2]</a>. Therefore, the majority of methods in machine intelligence are based on linear functions such as the one given here.
<!-- We will talk about them in more detail in the next article. --></p>
<!--
##### Truthfulness of a function
In the previous paragraph, I have come up with a function to predict the stock price, by postulating a linear function that relates annual revenue to stock price. Now the question is: Who guarantees this function is actually "true" in the real world?
To be honest: *no one* can guarantee you that! I have come up with this function because I saw the regularity in the data; but I might be totally off. Maybe the stock price drops at 1100000 Euros annual revenue?
-->
<p>In this example, we have talked about functions that take a single number as input and output another single number. Let’s now look at how functions can be applied to vectors.</p>
<h4 id="vectorial-functions">Vectorial functions</h4>
<p>In the <a href="/intuitivemi/2015/07/25/vector-spaces.html">previous post</a> we have dealt with the question of representing high-dimensional images as vectors. Of course, we can also define functions on these vectors. Recall that a vector is merely a list of numbers of fixed size, e.g. a 3-dimensional vector looking like this:</p>
<table class="data-table">
<tr>
<td style="background-color: #000; opacity: 0.909; width: 30px">0.909</td>
<td style="background-color: #000; opacity: 1.0; width: 30px">1.000</td>
<td style="background-color: #000; opacity: 0.860; width: 30px">0.860</td>
</tr>
</table>
<p>So let’s define a <em>linear</em> function on 3-dimensional <em>vectors</em>:</p>
<div class="pseudoformula">
f<sub>a</sub>(<b>Image</b>) = f<sub>a</sub>(<b>Image</b><sub>1</sub>, <b>Image</b><sub>2</sub>, <b>Image</b><sub>3</sub>) = 2*<b>Image</b><sub>1</sub> + 5*<b>Image</b><sub>2</sub> - 1*<b>Image</b><sub>3</sub>
</div>
<p>What does this function do? With the little subscript we denote the individual dimensions of the input image. The function therefore computes the sum of the individual dimensions of the input vector, each dimension multiplied with some number. These numbers (here 2, 5 and -1, which I have chosen arbitrarily in this example) are called the <em>parameters</em> of a function. The result for this function applied to the example vector above is:</p>
<div class="pseudoformula">
f<sub>a</sub>(<b>Image</b>) = 2*0.909 + 5*1.0 - 1*0.86 = 5.958
</div>
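<p>The same computation as a small Python sketch, with the parameters 2, 5 and -1 from the formula above and the example vector as input (the function and argument names are chosen for this illustration):</p>

```python
def f_a(image, parameters=(2, 5, -1)):
    """Multiply each dimension by its parameter and sum the results."""
    return sum(p * x for p, x in zip(parameters, image))


image = [0.909, 1.000, 0.860]
print(round(f_a(image), 3))  # 5.958
```

<p>Changing the parameters changes how strongly each dimension of the input counts towards the result.</p>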
<p>At first sight this function does not really seem to make much sense. Why should we sum up pixel values of an image? For one thing, it allows us to draw some conclusions on whether the image is rather dark (low value of f<sub>a</sub>) or light (high value); and the different parameters of f<sub>a</sub> allow us to emphasize certain regions of the image more than others.
We will see later that this is actually very useful for recognizing things in images.
<!--
But wait a second: we have as many parameters (2, 5 and 1) as input dimensions. This means, if the input data is an image, the parameters are in principle also an image! If you now think of input that are real 945-dimensional images, the parameters of the function can be used to "weigh" certain areas of the images higher. If you don't quite see this now, don't worry, we will talk about this later.
--></p>
<p>You might have noticed that the previous function mapped a vectorial input to a single number. But we can also define functions that map vectors to vectors:</p>
<div class="pseudoformula">
f<sub>b</sub>(<b>Image</b>) = [ 2*<b>Image</b><sub>1</sub> + 4*<b>Image</b><sub>2</sub>,
3*<b>Image</b><sub>2</sub> + 1*<b>Image</b><sub>3</sub>
]
</div>
<p>This function maps the 3-dimensional input vector to a <em>2-dimensional</em> output vector by summing over subparts of the input (the brackets indicate that the output is a vector, and the comma separates the two output dimensions). These types of functions will be very useful as they allow us to transform data into different representations, for example lower-dimensional ones.</p>
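<p>One possible way to write f<sub>b</sub> in Python is to keep one row of parameters per output dimension, with a 0 for every input dimension that does not appear in that row of the formula. The helper <code>linear_map</code> and the name <code>F_B_ROWS</code> are assumptions made for this sketch.</p>

```python
def linear_map(rows, vector):
    """Apply one row of parameters per output dimension."""
    return [sum(p * x for p, x in zip(row, vector)) for row in rows]


# Parameters of f_b; a 0 means that this input dimension is ignored.
F_B_ROWS = [
    [2, 4, 0],  # first output:  2*Image_1 + 4*Image_2
    [0, 3, 1],  # second output: 3*Image_2 + 1*Image_3
]


def f_b(image):
    """Map a 3-dimensional vector to a 2-dimensional vector."""
    return linear_map(F_B_ROWS, image)


image = [0.909, 1.000, 0.860]
print([round(value, 3) for value in f_b(image)])  # [5.818, 3.86]
```

<p>Written this way, the vector-to-number function f<sub>a</sub> is just the special case with a single row.</p>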
<h4 id="image-classification-function">Image classification function</h4>
<p>I will now introduce a last type of function which I call decision functions or <em>classification functions</em>. In a nutshell, these functions look like this:</p>
<div class="pseudoformula">
f<sub>c</sub>(<b>Input</b>) = 1 if <b>Input</b> &gt; 10 <br />
f<sub>c</sub>(<b>Input</b>) = 0 otherwise
</div>
<p>Classification functions map input data to 2 (or more) categories which we simply enumerate from 0 to the number of categories (minus one).</p>
<p>If we put together classification functions with our knowledge about vectorial functions, we can reconsider the example from the <a href="/intuitivemi/2015/07/25/vector-spaces.html">previous post</a>, where we saw that we can draw (hyper)planes to separate two categories of objects, namely blobfish and Sebastians:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-21-vector_spaces-arrow-plane.png" id="vector-spaces-arrow-plane" class="" width="500" height="375" />
</div>
<script>
$('#vector-spaces-arrow-plane').gifplayer({label:'play'});
</script>
<div>
<img style="display:none; width:0px; height:0px; visibility:hidden; border:0; margin:0; padding:0;" src="/intuitivemi/images/2015-07-21-vector_spaces-arrow-plane.gif" />
</div>
<p>We would now assign the category Sebastian to 0 and blobfish to 1, and make the if-else-part of f<sub>c</sub> such that it takes into account whether an input sample lies on one or the other side of the line. I will not write that out explicitly, but in fact the left-of-or-right-of-line can also be cast as a multiplication of the input with a bunch of numbers. So classification functions are exactly what we want if we want to solve classification tasks - surprised?</p>
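<p>To make the idea a bit more tangible, here is a hedged sketch of how a linear function and the if-else rule of f<sub>c</sub> can be combined. The weights and the threshold below are hypothetical numbers picked for illustration, not the ones behind the figure; only the labels (0 for Sebastian, 1 for blobfish) follow the text.</p>

```python
def f_c(value, threshold=10):
    """The decision function from above: 1 if the input exceeds the threshold, else 0."""
    return 1 if value > threshold else 0


def linear_score(image, weights):
    """Multiply each dimension by a weight and sum: a linear function of the input."""
    return sum(w * x for w, x in zip(weights, image))


def classify(image, weights, threshold):
    """Which side of the line the sample lies on: 0 (Sebastian) or 1 (blobfish)."""
    return f_c(linear_score(image, weights), threshold)


# Hypothetical parameters describing the line Image_1 - Image_2 = 0.
weights = (1.0, -1.0)
threshold = 0.0

print(classify((0.9, 0.2), weights, threshold))  # 1, since 0.9 - 0.2 > 0
print(classify((0.1, 0.8), weights, threshold))  # 0, since 0.1 - 0.8 < 0
```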
<h4 id="relationship-between-functions">Relationship between functions</h4>
<p>Before closing this article, I would like to point out the relationship between the stock price and the image classification function.</p>
<p>For image classification, the trick for visualizing the hyperplane that separated Sebastians from blobfish was to shrink the 27x35 images to a 3x1 image which allowed us to treat images as points (vectors) in 3D space.
What does this look like if we shrink the images even further, namely to a 2x1 image? We can then visualize the images as vectors in 2D. As before, we can now ask how to separate the shrunken Sebastians and blobfish. The answer is that we have to find a “2D hyperplane” - which turns out to be just a line!</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-20-functions_image_class_2d.png" id="2015-07-20-functions_image_class_2d" class="" width="65%" height="" />
</div>
<p>Interestingly, this looks very much like the stock price prediction above. The main difference is that we do not draw the line <em>through</em> the data but in such a way as to <em>separate</em> the data. Still, in both applications we use a <em>linear</em> function.</p>
<!--
Therefore, we can down write the Sebastian-vs.-Blobfish discriminator in the following way:
<div class="pseudoformula">
<b>Category = <i>Sebastian</i></b> <b>IF</b> { -0.9 * <b>gray value 1</b> + 0.00013 * <b>gray value 2</b> + 1 } <b> > x </b><br/>
<b>Category = <i>Blobfish</i></b> <b>IF</b> { -0.1 * <b>gray value 1</b> + 0.00013 * <b>gray value 2</b> + 1 } <b> < x </b><br/>
</div>
The expression in
The two expressions only differ with respect to the > or < after the linear expression.
-->
<p>Maybe it’s now a bit clearer how these linear functions translate into higher-dimensional spaces: the plane is a generalization of the line to 3D, and the hyperplane a generalization of the line/plane to higher dimensions! So in the future if we think about discriminating hyperplanes, we will usually visualize this as a line or a plane separating two sets of 2D or 3D points. And as before the mathematical field of <a href="https://en.wikipedia.org/wiki/Linear_algebra">linear algebra</a> describes how to deal with these lines, planes and hyperplanes.</p>
<h4 id="summary">Summary</h4>
<p>For now, you should have gotten a feeling for what functions do: translating input to output data. Functions can take different inputs and return different outputs, such as numbers and vectors. (In fact, they can even take other functions as inputs as well! But we won’t bother with these insane cases now.)</p>
<p>In the next post, we will get to the real meat: how to learn functions automatically.</p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Functions in the mathematical sense translate an input to an output</li>
<li>The simplest and best understood functions are linear functions</li>
<li>Functions map between numbers or, more generally, vectors.</li>
<li>Functions can map to single numbers, but also to vectors</li>
<li>If functions map onto a limited set of categories/classes, we call them classification functions</li>
</ul>
<h3 id="footnotes"><a name="further"></a>Footnotes:</h3>
<ol>
<li><a name="[1]"></a>Of course this is largely oversimplified. The actual price would also depend on the size of the company at the time, economic health of the country, stock market crashes, etc. But for the sake of the example let’s assume none of these factors plays a role.</li>
<li><a name="[2]"></a>To cite the <a href="http://books.google.de/books?id=dUhMAQAAQBAJ&pg=PA182&lpg=PA182&dq=Classical+mathematics+concentrated+on+linear+equations+for+a+sound+pragmatic+reason:+it+could+not+solve+anything+else.&source=bl&ots=PuRT666z3D&sig=YBZtoUP_y0siL0RUXfC14keMGe4&hl=de&sa=X&ei=upteVPDfBIysPJChgZgE&ved=0CCsQ6AEwAQ#v=onepage&q=Classical%20mathematics%20concentrated%20on%20linear%20equations%20for%20a%20sound%20pragmatic%20reason%3A%20it%20could%20not%20solve%20anything%20else.&f=false">mathematician Ian Stewart</a>: “Classical mathematics concentrated on linear equations for a sound pragmatic reason: it could not solve anything else.”</li>
</ol>
<!--
3. <a name="[3]"></a>You might remember that in the post on [vector spaces](intuitivemi/2015/07/25/vector-spaces) we mentioned that *linear algebra* is the mathematical discipline taking care of moving or rotating vectors. It is not a coincidence that the term *linear* appears in the title of this field: every possible movement or rotation of objects can be formalized by exactly the same kind of linear functions that we have seen above! Only that they are defined on vectors not on numbers.
4. <a name="[4]"></a> Indeed, finding a function for the image captioning problem is a very difficult task. Only recently, a [very complicated type of functions](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) has been found which shows promising performance at this task. These functions - called recurrent neural networks - are actually not that different from the linear functions we have just learned above, and we will hopefully cover them at a later point.
-->
Mon, 28 Dec 2015 18:10:00 +0000
http://shoefer.github.io/intuitivemi/2015/12/28/functions.html
http://shoefer.github.io/intuitivemi/2015/12/28/functions.htmlintuitivemiStop Teaching Math in School?<p>Today, I watched the <a href="https://www.youtube.com/watch?v=xyowJZxrtbg">TEDx talk by John Bennett</a> who advocates that we should stop teaching math to middle and high school students.
I find it great if people make controversial claims and defend them argumentatively. And indeed, there seems to be a problem with how math is taught in school, given the infamous “math anxiety” that Bennett is alluding to. Yet, I strongly disagree with Bennett’s conclusion.</p>
<p>In his talk, he tells us that at some point he realized that 99% of the American population don’t need math. Therefore, you shouldn’t teach math to them. He also rejects the argument that you need math in school at least to get a high SAT score - which is important for getting a good job. He aims to disprove this argument by citing Alfie Kohn with the following quote:</p>
<blockquote>
<p>The SAT is a measure of resources more than of reasoning. Year after year, College Board’s own statistics depict a virtually linear correlation between SAT scores and family income.</p>
</blockquote>
<p>Whether or not you buy the argument, my question is: how are you going to understand this argument and this quote, if you don’t know what a <em>linear correlation</em> is - which you only get with a basic education in higher math?
Even worse, how are you going to understand other claims based on correlations unless you understand the relationship between correlation and causation? Are you going to understand that any argument based on a <a href="http://www.tylervigen.com/spurious-correlations">single correlation</a> is prone to be false?</p>
<p>There is a deeper point that I want to make here. In my opinion, Bennett misconceives education as <em>preparation for professional life</em>. Of course, if you are not becoming an engineer, you might never see an integral sign again in your life. However, education is a <em>preparation for being a member of the society</em> - you should be able to critically read the newspaper, question claims and lines of argumentation, which are nowadays often backed up by some sort of statistics.
And sound argumentation, sharp thinking and - obviously - statistics are at the very core of math. All of this is required because you have the right to vote for your leaders - if you are well-informed and able to make up your own opinion on valid logical ground, you are less likely to fall for populists and propaganda.</p>
<p>I do agree that my math curriculum did not do a great job at teaching me all of this - but it should. We most definitely should change the way math is taught, and there are various ideas of doing so (I have <a href="/intuitivemi/2015/07/19/why-no-formulas.html">some thoughts</a> about what is going wrong in university courses, and <a href="https://www.youtube.com/watch?v=fu-gFkuls_c">some people</a> have ideas on how to fix them). However, dropping math from the curriculum does not sound like a viable solution to me.
The number of people supporting <a href="http://www.der-postillon.com/2016/11/erster-clown.html">clowns</a> like Donald Trump shows that we have an education problem. Let’s try to make education better, and let’s make teaching better math part of this endeavour.</p>
Mon, 28 Dec 2015 13:43:00 +0000
http://shoefer.github.io/education/2015/12/28/stop-teaching-math.html
http://shoefer.github.io/education/2015/12/28/stop-teaching-math.htmleducation,matheducationThe Curse of Overfitting<p>In the last post we obtained an understanding of how to <a href="/intuitivemi/2015/12/30/learning-functions.html">learn functions from data</a>, and we developed our first learning method. In this post, we will start building an understanding of why learning from data is actually pretty hard. The first problem we are facing is <em>the curse of overfitting</em>. Let’s see what that is.</p>
<!--
QUESTION: better explain by intuitive example, e.g. correlating the hypothesis that it is raining to the
hmm, but isn't that more about priors?
-->
<p>Recall the <a href="/intuitivemi/2015/12/28/functions.html">stock price prediction problem</a>: given the annual revenue of a company, we want to predict the company’s stock price. I have given you a set of example data and we have found the following <em>linear</em> regularity in the data:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-20-functions_sizeprice.png" class="" width="65%" height="" />
</div>
<p>Although I told you that these <em>linear</em> functions are really cool and mathematics can cope with them quite well, you have probably already thought of many cases where linear functions do not suffice and we need different ones. And indeed, even in the stock price problem, there seems to be no reason for disregarding different functions, for example this one (red curve):</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-08-07-overfitting-sizeprice.png" class="" width="65%" height="" />
</div>
<p>It is not as nice as the line, but it perfectly passes through all of the data points. And
from an economic perspective this function looks much better as it predicts that if annual revenues exceed 1000000 Euro the stock price will go up much more quickly! So better go and buy some shares now!</p>
<h3 id="occams-razor">Occam’s Razor</h3>
<p>Let me give you a reason in defense of the linear function: the function’s simplicity. Although both functions perfectly fit the data (both the blue line and the red curve perfectly coincide with the blue dots) the line has far fewer wrinkles than the curve - namely zero rather than six! For the line we do not have to choose whether the wrinkles go up or down, how high/low they should go etc.</p>
<p>In fact, this type of reasoning is so common that it has already been brought up over 700 years ago by William of Ockham. Therefore, this principle is called <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam’s razor</a> (although Ockham wasn’t the first to bring it up, nor was he known for being particularly passionate about shaving). It states that “among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected”.</p>
<h3 id="overfitting">Overfitting</h3>
<p>Although this explanation of Occam’s Razor sounds reasonable, there is an even better way of arguing why we should prefer the line in this stock price example. Occam’s razor rejects the wrinkled function because we have to make too many choices about the direction and the size of the wrinkles. In fact, there are infinitely many wrinkled functions going through all of the points, for instance this one (green line):</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-08-07-overfitting-sizeprice2.png" class="" width="65%" height="" />
</div>
<p>The problem with that is that the chance of picking the <em>wrong</em> wrinkled function is much higher than picking the wrong straight line - because (at least in our example) there is only <em>one line</em> that goes exactly through all of the points. I cannot rotate or shift the line up without it losing contact with one or more of the points. Therefore, we should prefer the line. So in a way, we have given a justification of Occam’s razor by realizing that using a more complex function for predictions forces us to take more arbitrary decisions, which makes us more prone to making mistakes.</p>
<h3 id="random-fluctuations">Random fluctuations</h3>
<p>In the case where the data lies exactly on a line the answer was clear that the line is the best choice. However, in reality, data is not perfect but almost always undergoes random fluctuations and noise. So let us make the stock prediction data a bit more realistic and then look at how well line and wrinkled function explain the data:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-08-07-overfitting-sizeprice_noise.png" class="" width="65%" height="" />
</div>
<p>We see that no line will be able to pass directly through all the data points, but we have easily found a wrinkled function doing so. Still the line shown here seems to capture the main trend quite well. Which of the two functions do you prefer for making predictions about future stock prices?</p>
<p>Since I have told you that the fluctuations are random, you will probably prefer the line - even though it does not fit the data exactly. Although the wrinkled function explains the individual data points well, it does not explain the more moderate trend of the revenue-stock price relationship well; it focuses too much on explaining the random fluctuations in the data. Data scientists call this general effect <em>overfitting</em>, and we say that the wrinkled function <em>overfits</em> the data.</p>
<h3 id="battling-overfitting">Battling Overfitting</h3>
<p>Out there in the real world, discovering whether your function overfits or not is really hard. Probably, it is <em>the</em> problem of machine learning.
It is so prominent that two machine learning professors have even written a song about it. Seriously:</p>
<div class="imgcenter">
<iframe width="560" height="315" src="https://www.youtube.com/embed/DQWI1kvmwRg" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>Luckily, there are a few things you can do to counteract overfitting apart from singing about it. I want to give you a few examples.</p>
<p>First, you can retain some preference for simpler explanations, by making a similar argument as before: even for the randomly fluctuating data there are still only a few lines that are not too far away from any of the points. But since there are infinitely many wrinkled functions passing directly through all of the points, the chance of picking the wrong wrinkled function is very high. So using a simpler function is better.</p>
<p>Secondly, you can hold back some of your data and not use it for finding the function - you only use it <em>after</em> having computed the function to test how your function copes with that unseen data. If your function really sucks at explaining the data you have held out (called <em>test data</em>) you should consider using a different type of function. In fact, using test data is a standard procedure in machine learning and everyone has to do it, because fitting your training data well does not tell you at all whether you learned the right thing.</p>
<p>Third, you can use <em>more data</em>. In fact, this is why everyone is so excited about <em>big data</em>, which heavily relies on lots of data being available. It does work well, but we will see in the next post that there are some limitations on how much data you can get.</p>
<p>Fourth, you can get more knowledge about which shape of function you actually expect. If, for example, you know that the stock price varies in cycles, that is, it goes up, slightly down, then up again etc., you can try to find a function that reflects this knowledge.</p>
<p>Although all of these things are good ideas and they definitely help to counteract overfitting, in the end you still have to keep your fingers crossed and hope your function predicts the right thing. There is no guarantee that your function doesn’t overfit although all of the things suggested make it less likely.</p>
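<p>The second suggestion, holding back test data, can be sketched in a few lines of Python. Again, the numbers are invented for illustration, and the wrinkled function is assumed to be the polynomial through all training points. The function is computed from the training data only and then judged on the held-out points.</p>

```python
# Hypothetical noisy data, split by hand into training and held-out test data.
train_x = [1.0, 3.0, 5.0, 7.0, 9.0]
train_y = [3.5, 6.4, 11.6, 14.5, 19.3]
test_x = [2.0, 4.0, 6.0, 8.0]
test_y = [5.2, 8.8, 13.1, 16.9]

def fit_line(xs, ys):
    """Least-squares line: returns a function x -> prediction."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_wrinkled(xs, ys):
    """Polynomial through all training points (Lagrange interpolation)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def squared_error(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))

line = fit_line(train_x, train_y)
wrinkled = fit_wrinkled(train_x, train_y)

for name, f in [("line", line), ("wrinkled", wrinkled)]:
    print(name,
          "training error:", round(squared_error(f, train_x, train_y), 3),
          "test error:", round(squared_error(f, test_x, test_y), 3))
```

<p>On the training data the wrinkled function looks perfect, but on the held-out points it does clearly worse than the line. That gap between training and test error is the warning sign you are looking for.</p>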
<p>I hope that by looking at this very simple example, and by understanding how overfitting can occur here, you have already come to think more critically about decision making - both in machines and in humans.</p>
<h3 id="underfitting">Underfitting</h3>
<p>Last but not least, let me say that of course also the opposite thing can happen: <em>underfitting</em>. If we choose a function that is too simple to explain a relationship we will not get a good prediction, either. I leave it as an exercise for you to think about a good example of underfitting.</p>
<!-- Quadratic function -->
<p>That’s basically it - you have now understood the biggest problem of learning from data.
And in case you want to brag in front of your friends, here is some more terminology: the topic of this post, that a function learned from data always has to balance under- and overfitting, is also called the <em>bias-variance trade-off</em>. <em>Variance</em> relates to how much all possible different wrinkled functions vary - the more wrinkles you allow, the higher the variance of your learning method and hence the more prone the learner is to overfitting. <em>Bias</em> refers to underfitting, stating that a function that is too simple adds a systematic bias to the prediction which cannot be overcome by the learner, not even when given more data. The bias-variance trade-off is a smart way of stating that if you don’t consider the simplest yet not too simple hypothesis, your hypothesized function is more likely to be wrong – too simple and wrong or too complicated and wrong.</p>
<p>This post elucidated the first big problem of machine learning: overfitting. Next, we will look at <em>the curse of dimensionality</em>.</p>
<!--In the next post we will look at the problem of overfitting in the more complex image classification scenario, and we will see how the dimensionality aggrevates the problem of learning and overfitting even more. -->
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Finding the right predictive function for data is hard</li>
<li>Even if the function perfectly fits the data it might be totally wrong: an effect called overfitting</li>
<li>Occam’s razor states that one should prefer simpler solutions</li>
<li>A justification for this is: when using simpler functions we are less likely to pick the wrong one</li>
<li>Underfitting occurs when choosing too simple functions</li>
<li>Singing about data science is fun</li>
</ul>
<h3 id="further-reading"><a name="further"></a>Further Reading:</h3>
<ol>
<li><a name="[1]"></a><a href="http://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12LinearRegression.pdf">Lecture notes</a> by Luke Zettlemoyer from University of Washington on overfitting.</li>
<li><a name="[2]"></a><a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Wikipedia article on Occam’s razor</a>.</li>
<li><a name="[3]"></a><a href="http://mste.illinois.edu/exner/java.f/leastsquares/">Simple Java applet</a> for playing around with data fitting. The switch “degree of polynomial” corresponds to the “wrinkles” in the function.</li>
</ol>
Fri, 07 Aug 2015 15:00:00 +0000
http://shoefer.github.io/intuitivemi/2015/08/07/overfitting.html
http://shoefer.github.io/intuitivemi/2015/08/07/overfitting.htmlintuitivemiThe Power of Machine Intelligence<p>Thinking machines have revolutionized our everyday life, maybe even more than industrialization or the invention of the car. But although every one of us has an intuition about how a car works, thinking machines remain obscure and magical to most of us.</p>
<p>I believe that everyone should have at least a “car intuition” about thinking machines. Unfortunately, common introductions to artificial intelligence and machine learning require quite some knowledge about programming and math and scare laymen off. Luckily, none of these skills is required to grasp the core ideas behind thinking machines. And these core ideas are what I would like to tell you about in the forthcoming series of articles in this <em>introduction to machine intelligence</em>.</p>
<p>To whet your appetite I would like to present four prototypical applications of machine intelligence. We will get back to these applications later and get a more detailed understanding of how machine intelligence approaches each of these problems.</p>
<h2 id="image-understanding">Image Understanding</h2>
<p>For decades, researchers have been trying to make machines see the world as we do. In 1966, one of the most famous artificial intelligence researchers, Marvin Minsky, hired an undergrad to solve the following problem as a summer project: connect a TV camera to a computer and make the computer describe what it sees. Now, 50 years later, we still haven’t fully solved the problem, but at least we have made some progress. In the <em>image classification</em> task computers are now able to <a href="https://gigaom.com/2015/02/13/microsoft-says-its-new-computer-vision-system-can-outperform-humans/">outperform humans</a> in certain regards. In this task, the computer has to decide for any given image what kind of object is depicted in that image. As manually programming the computer to do so has failed, machine intelligence tries to solve the problem by learning from data.
To learn successfully, huge amounts of <em>labeled data</em> (images together with descriptions of what object category is depicted in each image) are required, which is why the community created a <a href="http://www.image-net.org/">database of over 15 million images</a> containing 21841 object types. By learning from these images, using a method called <em>deep neural networks</em>, machine intelligence is now able to predict the object category correctly in over 95% of all cases - which is above human performance. These neural networks will be one of the topics covered later.</p>
<div class="imgcenter imgcap">
<a href="/images/image-classification-msrdl1.jpg" target="_blank">
<img src="/images/image-classification-msrdl1.jpg" id="image-understanding" class="" width="300" height="" />
</a>
<div class="thecap">Top 5 predictions by an image classification approach. Credit: Microsoft Research</div>
</div>
<p>Note however that the general problem of seeing machines hasn’t been solved yet since the image classification task is limited to pictures where only one object is present. Describing entire scenes is much more complicated, but recently <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">progress has been made there</a>, too.</p>
<h2 id="algorithmic-trading-stock-price-prediction">Algorithmic Trading: Stock Price Prediction</h2>
<p>Machines are slowly taking over the stock markets using <a href="https://en.wikipedia.org/wiki/Algorithmic_trading">algorithmic trading</a>. One important thing these machine traders have to do is predict stock prices. How would machine intelligence approach this problem? In fact, the task is very similar to the image classification problem, only that, instead of an object type, the computer should predict a continuous-valued number: the stock price. But what data should the stock price predictor use for learning?
In the image classification task this was obvious: the image itself. Here, it is not clear what influences the stock price. It could be last month’s revenue of the company, the size of the company, the stock price from last month or last year, and much more.
A data scientist will approach this problem by collecting as many of these pieces of information, so-called <em>features</em>, as possible and trying to weigh the features in the best way to predict the stock price. We will look later in more detail at how the computer can automatically weigh features to make predictions, and take a look at why the task of “weighing in the best way” is much more difficult than it sounds.</p>
<h2 id="spam-classification">Spam Classification</h2>
<p>Everyone knows the problem of <a href="http://www.stevenburgess.net/wp-content/uploads/2014/12/Spam-Can.jpg">spam</a>. It tastes weird and, even worse, it fills your mailbox with juicy offers for <a href="http://www.mensjournal.com/health-fitness/health/the-hard-truth-about-penis-enlargement-20141027">penis enlargement</a> or earning money from <a href="http://www.419eater.com/img/news.pdf">fake Nigerian royalty</a>. But thanks to machine intelligence, nowadays most spam automatically finds its way into your junk folder. This application of machine intelligence is called <em>spam classification</em> and - in its simplest form - works as follows: your mail program receives an email and extracts features (as in the example before) from the email, such as what words the email contains, whether you know the author of the email, and many more. It then weighs each of these features and makes a binary decision: spam or no spam.</p>
<p>We see a pattern emerging here. Spam classification is very similar to image classification and stock price prediction: data scientists collect many spam emails, extract features from these emails, and use fancy tools to find the best weighting of all the features. The main difference is that the output is neither a continuous number (stock price) nor an object category, but a binary decision (spam / no spam).</p>
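<p>If you do happen to read code, the shared pattern fits in a few lines of Python. This is only a sketch: the features, the weights and the bias values below are hand-picked and entirely hypothetical, whereas a real system would learn the weights from data.</p>

```python
def weighted_sum(features, weights, bias=0.0):
    """Weigh each feature and add everything up."""
    return bias + sum(w * f for w, f in zip(weights, features))

# Spam classification: binary output.
# Hypothetical features: [mentions "enlargement", mentions "prince", sender is known]
spam_weights = [2.0, 1.5, -3.0]

def is_spam(features):
    # A positive weighted sum means "spam", otherwise "no spam".
    return weighted_sum(features, spam_weights, bias=-1.0) > 0

# Stock price prediction: continuous output.
# Hypothetical features: [last month's revenue in millions, last month's stock price]
price_weights = [0.5, 0.9]

def predict_price(features):
    return weighted_sum(features, price_weights, bias=2.0)

print(is_spam([1, 1, 0]))            # suspicious words, unknown sender
print(is_spam([1, 0, 1]))            # one suspicious word, but a known sender
print(predict_price([10.0, 100.0]))  # a continuous number instead of yes/no
```

<p>The weighing step is identical in both cases. Only the last step differs: the spam classifier compares the sum to a threshold, while the price predictor returns the sum itself.</p>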
<h2 id="robot-skill-learning">Robot Skill Learning</h2>
<p>The last and maybe most exciting application of machine intelligence is robotics. Interestingly, you can (to some extent) apply exactly the same techniques - as before - to make a robot do cool things. The following video shows how to teach a robot to play table tennis.</p>
<div class="imgcenter">
<iframe width="280" height="155" src="https://www.youtube.com/embed/SH3bADiB7uQ" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>The type of learning differs slightly from the approaches seen before. Instead of telling the robot exactly what the right motion is, the researchers let the robot try out motions on its own and give the robot a reward if it performs a good stroke, or a penalty if it performs badly (in fact, the robot doesn’t get lollipops as rewards, but delicious large numbers). The robot will then repeatedly generate motions, collect rewards/penalties and step by step correct its motions in order to get more reward and finally execute an excellent table tennis stroke.</p>
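<p>The try-out-and-collect-reward loop can be sketched as a toy Python program. To be clear, this is not the method used by the researchers in the video: it is plain trial and error on a single made-up number (a "racket angle"), with an invented reward function, and only meant to illustrate the idea of learning from rewards.</p>

```python
import random

TARGET_ANGLE = 30.0  # hypothetical "perfect" racket angle, unknown to the robot

def reward(angle):
    """Invented reward: larger (less negative) numbers mean a better stroke."""
    return -(angle - TARGET_ANGLE) ** 2

def learn_stroke(trials=200, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    best_angle = 0.0            # the robot's first, clumsy attempt
    best_reward = reward(best_angle)
    for _ in range(trials):
        # Try a slightly different motion ...
        candidate = best_angle + rng.gauss(0.0, 5.0)
        r = reward(candidate)
        # ... and keep the change only if it earned more reward.
        if r > best_reward:
            best_angle, best_reward = candidate, r
    return best_angle, best_reward

angle, final_reward = learn_stroke()
print("learned angle:", round(angle, 2), "reward:", round(final_reward, 4))
```

<p>Note that the robot is never told the target angle. It only ever sees the reward for what it tried, and that is enough to improve step by step.</p>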
<!--
However, you can also formulate the problem In a nutshell, the researchers make something similar as in stock price prediction: from the position of the ball you try to "predict", or rather decide for, a motor command to send to the robot. Executing this motion will then allow the robot to hit the table tennis ball.
-->
<p>I hope I was able to get you excited about machine intelligence! In the next couple of posts I will try to convey the basic ideas of how machine intelligence approaches all of these problems.</p>
<p>The next post will tell you in more detail why you don’t need to understand complex math and programming to understand machine intelligence.</p>
Tue, 28 Jul 2015 15:00:00 +0000
http://shoefer.github.io/intuitivemi/2015/07/28/datascience-showoff.html
http://shoefer.github.io/intuitivemi/2015/07/28/datascience-showoff.htmlintuitivemiHigh-dimensional Spaces<p>In the <a href="/intuitivemi/2015/07/19/data-numbers-representations.html">last post</a> we have seen examples for how numbers can encode information, becoming data. In this post we will talk about a very important way to look at data. This view will allow us to play around with data in a powerful way, and this view lies at the core of machine intelligence and the science of data.</p>
<h3 id="data-samples-as-points-in-space">Data samples as points in space</h3>
<p>I have previously told you that any <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_picture.png">image</a> can be represented by a <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_numbers.png">table of numbers</a> where each number encodes the gray scale value of the image. For the trick that I would like to teach you, this image is actually a bit too large. So for the sake of the argument, let us shrink the original 27x35 pixel image drastically until it gets really really tiny, namely 3x1, that means only 3 pixels wide and 1 pixel high - bear with me for a second even if this sounds silly to you. We won’t see anything on this image but that doesn’t matter right now. Our 3x1 image expressed in gray scale values now looks like this:</p>
<table class="data-table">
<tr>
<td style="background-color: #000; opacity: 0.909; width: 30px">0.909</td>
<td style="background-color: #000; opacity: 1.0; width: 30px">1.000</td>
<td style="background-color: #000; opacity: 0.860; width: 30px">0.860</td>
</tr>
</table>
<p>I have set the background color of each table cell to correspond to the actual gray scale value the number encodes. Here comes the trick: although we know that these three numbers encode gray scale values we can <em>pretend they were encoding the location of a point in 3D space</em>. So instead of encoding “luminosity” we think of the numbers as being in “meters” or “centimeters”. As you might remember from high school we draw a location as an arrow in a coordinate system. So let’s draw our 3x1 “image” in a 3D coordinate system:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-21-vector_spaces-arrow.png" id="vector-spaces-arrow" class="" width="500" height="375" />
</div>
<script>
$('#vector-spaces-arrow').gifplayer({label:'play'});
</script>
<div>
<img style="display:none; width:0px; height:0px; visibility:hidden; border:0; margin:0; padding:0;" src="/intuitivemi/images/2015-07-21-vector_spaces-arrow.gif" />
</div>
<p>So you might think: “drawing arrows is really fun but why the heck are we doing this?” There are two reasons for that: the first reason is that we humans are really good at manipulating objects in 3D space. We know how to move objects, we know how to rotate them, how to distort and mirror them, how to project them on a 2D planar surface (by taking a picture of them), and much more. Thus, if we treat data as locations in space we can apply all our spatial 3D knowledge to it.
In fact, <a href="https://en.wikipedia.org/wiki/Linear_algebra">linear algebra</a> has got all the math worked out to simulate these 3D operations on computers (as you can admire in Toy Story, Madagascar, and so on) <a href="#[1]">[1]</a>.</p>
<p>Now the question is, what does the 3D arrow sketched above have to do with the original 27x35 pixel wide image of my face? Here’s the gist: linear algebra is so general that it does not care about the dimensionality of the data - it works for 3-dimensional spaces in the same way as for 5-dimensional or 500-dimensional spaces - even though our brains are not capable of imagining 500-dimensional data visually! This allows us to treat the 27x35 <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_picture.png">picture of my face</a> as a point in some crazy unimaginable 945-dimensional space <a href="#[2]">[2]</a>! We will see later that applying such operations as moving and rotating a point in this huge space will be the basis for extracting information from it, as for example detecting what is in the image represented by this point <a href="#[3]">[3]</a>.</p>
<p>So just to make this clear: I will now continue drawing 3D arrows, but these arrows are in fact just sketches of 945-dimensional arrows (since they are impossible to draw). Therefore I will <em>write</em> about 945-dimensional, but <em>draw</em> 3-dimensional arrows.</p>
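<p>For readers who like to see this in code, here is a small Python sketch of turning a 27x35 image into a 945-dimensional point. The pixel values are placeholders (a uniform gray), since the actual numbers of the picture are not reproduced here, and the assumption is that the image is 27 pixels wide and 35 pixels high.</p>

```python
WIDTH, HEIGHT = 27, 35

# Placeholder image: a table of gray scale values, one row per image row.
image = [[0.5 for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Remove the row/column structure: write all numbers in one very long row.
vector = [value for row in image for value in row]
print(len(vector))  # the number of dimensions of our point

# The tiny 3x1 image from above is a point in ordinary 3D space.
tiny_image = [0.909, 1.0, 0.860]
print(len(tiny_image))

# "Moving" the point means adding a number to one or more coordinates.
moved = [value + 0.1 for value in vector]
```

<p>Nothing in the code cares whether the list has 3 or 945 entries, which is exactly why we can sketch in 3D and still mean 945 dimensions.</p>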
<h3 id="discriminating-between-categories--finding-separating-planes">Discriminating between categories = finding separating planes</h3>
<p>The really cool thing is that we can now understand how the computer can discriminate between images that depict different objects. Imagine we have another image of, say, a <a href="https://teara.govt.nz/en/photograph/5281/blobfish-and-snailfish">blobfish</a>. By the same procedure as before we can make the blobfish a point in 945-dimensional space. Now let’s assume we do not only have one picture of me and one of the blobfish, but a whole bunch of pictures for each category, Sebastians and blobfish.</p>
<p>How could a computer automatically discriminate between these two categories? Geometrically, of course! By treating each image as a point in a 945-dimensional space, we can now come up with a geometrical interpretation of discriminating between categories: we find a plane which separates these two sets of points (blue is me, red the blobfish):</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-21-vector_spaces-arrow-plane.png" id="vector-spaces-arrow-plane" class="" width="500" height="375" />
</div>
<script>
$('#vector-spaces-arrow-plane').gifplayer({label:'play'});
</script>
<div>
<img style="display:none; width:0px; height:0px; visibility:hidden; border:0; margin:0; padding:0;" src="/intuitivemi/images/2015-07-21-vector_spaces-arrow-plane.gif" />
</div>
<p>Using this plane, the computer can now automatically discriminate between blobfish and Sebastians by checking <em>on which side of the plane</em> the point lies.</p>
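<p>Checking on which side of a plane a point lies is a short computation. The Python sketch below uses 3D points to match the drawing. The plane itself (its direction <code>normal</code> and its shift <code>offset</code>) is invented for illustration, and so is the second example point; finding a good plane is the actual learning problem.</p>

```python
def side_of_plane(point, normal, offset):
    """Return the category assigned to the side of the plane the point lies on."""
    # The sign of this score tells us the side: positive on one, negative on the other.
    score = sum(n * p for n, p in zip(normal, point)) + offset
    return "Sebastian" if score > 0 else "blobfish"

# Hypothetical plane, described by a direction (normal) and a shift (offset).
normal = [1.0, -1.0, 0.5]
offset = -0.2

print(side_of_plane([0.909, 1.0, 0.860], normal, offset))  # the 3x1 image from above
print(side_of_plane([0.1, 0.9, 0.2], normal, offset))      # another made-up image
```

<p>The same function works unchanged for 945-dimensional points, as long as <code>normal</code> has 945 entries as well.</p>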
<p>Sit back for a second and make sure that you have understood this. Because if so you can be proud - you have just understood the key principle behind <a href="/intuitivemi/2015/07/28/datascience-showoff.html">image classification</a>, the problem that I have presented in the intro article! The majority of all approaches in machine intelligence draws such planes through the high-dimensional image space, assigns a category to each side of the plane, and then checks on which side of the plane a new, previously unseen image lies.</p>
<h4 id="caveats">Caveats</h4>
<p>Of course, there are some details that I have omitted.
As mentioned before, the sketch is in 3 dimensions, although we should draw 945 dimensions, which is not possible; and also the thing separating the two object categories in 945 dimensions is not really a plane, but a <em>hyperplane</em> (that’s what a plane is called in four or more dimensions - not to be confused with <a href="http://starwars.wikia.com/wiki/Hyperspace">hyperspace</a>).
But still, the math - which I have not shown you, so you need to trust me on this one - assures us that the concepts of points and planes also exist in 945-dimensional space, so it is totally valid to think of it in 3 dimensions only.</p>
<p>Another important detail that I have omitted is how to find this (hyper-)plane. And in fact this is <em>the</em> central problem of machine intelligence, or more precisely <em>machine learning</em>. We will talk about how this learning works in the post after the next one.</p>
<p>Finally, I have to lower your expectations about this method a bit: discriminating Sebastians and blobfish by finding a plane in this 945-dimensional space does not really work well in practice. The reason is that pictures of Sebastians and blobfish are not as nicely separable as I have suggested in the 3D arrow picture above, but rather scattered all around the space. But the general idea of finding discriminating hyperplanes in high-dimensional spaces still holds. And we can get it to work by first bringing the image into a different <em>representation</em>, that is a different space, and then finding the hyperplane in this new space (at this point you should remember our discussion about representations in <a href="/intuitivemi/2015/07/19/data-numbers-representations.html">the previous post</a>). You can either try to transform images into these representations explicitly, i.e. think of a good representation and program it yourself, or implicitly, trying to find a way of doing this automatically. We will talk about both approaches later, too.</p>
<h4 id="terminology">Terminology</h4>
<p>Some last notes on terminology.</p>
<p>Often, a sequence of numbers, such as the 3D point or the 945-dimensional representation of the image, is called a <em>vector</em>, and it then lives in a <em>vector space</em> of some dimensionality. Although being basically the same thing as a point, the term vector has a slightly different connotation which we will ignore for now. I will mostly use the term vector, mainly because it sounds cooler.</p>
<p>Secondly, mathematics tries to be parsimonious with concepts, mainly because it makes talking about things easier. Therefore, we get rid of the concept “numbers” by treating them as 1-dimensional vectors. So in the future if I talk about vectors, you can often think of them as just being numbers.</p>
<h4 id="summary">Summary</h4>
<p>To summarize, the approach of machine intelligence is the following: we transform some input, for example an <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_picture.png">image</a> into a <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_numbers.png">table of numbers</a>, call this a <em>vector</em> and then treat this vector as a point in a high-dimensional space. We then apply our knowledge of how to manipulate points in 3D to transform this high-dimensional vector in order to extract information from it. Moreover, we can classify between different types of data, for example object categories, by finding discriminating hyperplanes in this space.</p>
<p>In the next post we will look at how to be more precise about what we mean by transformations and how to describe these hyperplanes by introducing the concept of functions.</p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Data can be viewed as points in high-dimensional spaces</li>
<li>We can apply the same transformations to points in this space as in 3D</li>
<li>Such points are called vectors</li>
<li>Hyperplanes in space discriminate between vectors, and hence categories</li>
</ul>
<h3 id="footnotes"><a name="further"></a>Footnotes:</h3>
<ol>
<li><a name="[1]"></a>Actually, linear algebra can do even a bit more than what is “physically possible” in the 3D world, e.g. mirroring and skewing objects.</li>
<li><a name="[2]"></a>To turn images into an arrow/point, we need to remove the column and row structure of the image and write all of the numbers in one very long row. And since 27 times 35 = 945, the original image becomes a 945-dimensional point. This has indeed been common practice when learning from images (although more recent algorithms exploit the spatial, i.e. “grid” structure of the image, rather than just stitching the columns together).</li>
<li><a name="[3]"></a>It is important to notice that moving or rotating a point in 945-dimensional space which represents an image <em>is not the same</em> as shifting or rotating the image! Since moving in 3D is equivalent to “adding a number to one or more dimensions”, this is also the definition of moving in 945-dimensional space. If you push a 3D object 3m up and 5m to the right you effectively add these two numbers to some coordinates of the object. Hence, in the image example, moving is equivalent to changing the gray scale value of certain pixels. Neither do vector rotations actually rotate the image.
<!-- Rotations look even weirder: -->
<!-- TODO rotated image --></li>
</ol>
Sat, 25 Jul 2015 16:02:00 +0000
http://shoefer.github.io/intuitivemi/2015/07/25/vector-spaces.html
http://shoefer.github.io/intuitivemi/2015/07/25/vector-spaces.htmlintuitivemiData, Numbers and Representations<p>I would like to start the intuitive introduction to machine intelligence by looking at the term <em>data</em>. Let us try to gain an informal but sufficient understanding of how we could define data.</p>
<h3 id="data-are-numbers">Data are numbers</h3>
<p>Big data, data mining, data analysis, data, data, data - everyone wants data. But what <em>is</em> data? The philosophical answer could be: information. The data scientist’s answer would probably be: a bunch of numbers.
Indeed, we are going to stick to the latter definition, namely that <em>data is everything that can be encoded by numbers</em> <a href="#numbersencode">[1]</a>.</p>
<p>Some examples: a picture you have taken with your digital camera is data. You can look at it, see the things depicted on it. But your digital camera uses numbers to store the pictures. Also the MP3 sound file that you are currently listening to is data. Even this text here is data! So how is data numbers?
<!-- However, information is also quite an obscure term. So let's define data as "everything you might consider doing something with". --></p>
<p>Let’s look at the concrete example of a picture:</p>
<div class="imgcenter imgcap">
<img src="/intuitivemi/images/2015-07-19-data-numbers-representations_numbers.png" class="" width="85%" height="" />
</div>
<p>What do we see here? It’s a picture of <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_picture.png">my face</a> of course! Ok, if you don’t see it you should definitely watch <a href="https://en.wikipedia.org/wiki/The_Matrix">The Matrix</a> more often. In case that doesn’t work, how do we get the picture back? The trick is that I have taken the gray scale picture of <a href="/intuitivemi/images/2015-07-19-data-numbers-representations_picture.png">my face</a> and replaced every dot in the original picture with a number between 0 and 1 where 0 means “black” and 1 means “white”, and the numbers in between represent the different <a href="https://en.wikipedia.org/wiki/Fifty_Shades_of_Grey">shades of gray</a> between these black and white tones. When showing the image on the screen the computer translates these numbers back to gray values.</p>
<!--
(If you wonder why it is so difficult to recognize my face in the table of numbers while it is not in the gray scale picture read [this short post](/intuitivemi/2015/07/23/difficult-cv.html).)
-->
<h3 id="data-is-numbers--what--encoding">Data is numbers + what + encoding</h3>
<p>However, the way I have chosen the numbers to substitute gray values in the picture is entirely arbitrary. We could have done the same by representing black with 0.5 and white with 12, reversing the order, or even mixing up the order of the numbers, etc. The way you choose the numbers is completely up to you. But it is very important to know that somebody else will not know how to make sense of the numbers unless you tell her that it is a picture and which numbers correspond to which gray scale values. The latter information, of how to translate gray scale values to numbers and back, is what we call the <em>representation</em>, and is also sometimes called the <em>encoding</em> of the data.</p>
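<p>Here is a small Python sketch of that idea, using the alternative encoding mentioned above (black is 0.5, white is 12). The function names <code>encode</code> and <code>decode</code> and the three example pixel values are chosen just for this illustration.</p>

```python
def encode(gray, black=0.5, white=12.0):
    """Translate a gray value between 0 (black) and 1 (white) into our own number range."""
    return black + gray * (white - black)

def decode(number, black=0.5, white=12.0):
    """Translate back. This only works if you know which encoding was used."""
    return (number - black) / (white - black)

pixels = [0.0, 0.5, 1.0]                 # black, medium gray, white
encoded = [encode(p) for p in pixels]
decoded = [decode(n) for n in encoded]

print(encoded)  # the same picture, written down with different numbers
print(decoded)  # and translated back to the original gray values
```

<p>The numbers in <code>encoded</code> alone are meaningless. Only together with the knowledge of how to decode them do they become a picture again.</p>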
<!--
Moreover, some representation might be better or worse with respect to some criterion. For example, when encoding gray scale values between 0 and 1 is immediately know that 0.5 is an average gray; we wouldn't have this intuitive understanding if we encoded the values between 41543 and 341125.
-->
<p>What about the sound file example? This one is a bit more tricky. You have to know quite a lot about the physics of sound. Let’s take a 10-second sound file. The easy way to think about how to encode sound by numbers is to divide the 10-second sound file into, say, 10000 equally sized parts of 1 millisecond each and assign a number to each part. Every number corresponds to the “loudness” of a part. That’s basically it! You might wonder now how a human voice, a guitar and the engine of a car can all be broken down to these few numbers. Well, it’s really just that! The real “magic” is happening in your brain which collects these pieces of information and generates a certain sensation which you identify as an opera singer or Jimi Hendrix playing his guitar.
If you want to learn more about these things I recommend you look into <a href="#brainonmusic">this book [2]</a> which does a great job at explaining what sound and music really are and how the brain comes to perceive them the way it does.</p>
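<p>As a rough sketch of the sound example in Python: the snippet below creates the numbers for 10 seconds of a pure, artificial tone, with one number per millisecond. The tone (100 vibrations per second) is an arbitrary choice for illustration, and real audio formats store far more numbers per second, for example 44100 on an audio CD.</p>

```python
import math

DURATION_SECONDS = 10
PARTS_PER_SECOND = 1000   # one number per millisecond
TONE_FREQUENCY = 100.0    # hypothetical tone: 100 vibrations per second

samples = [
    math.sin(2 * math.pi * TONE_FREQUENCY * i / PARTS_PER_SECOND)
    for i in range(DURATION_SECONDS * PARTS_PER_SECOND)
]

print(len(samples))   # how many numbers describe the 10 seconds of sound
print(samples[:5])    # the first five milliseconds
```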
<p>What is finally left is the text example. This one is easy, too. I leave it as an exercise to you to think of how to encode or represent text as numbers. (A hint: think about how many different letters and characters we have in our alphabet).</p>
<h4 id="summary">Summary</h4>
<p>To summarize, we have learned that data is numbers plus information about what these numbers mean, called the representation or encoding.</p>
<p>As a final remark, I cannot overemphasize how crucial the representation of data is.
The representation of a problem can drastically influence the ability to solve it (as exemplified very well in <a href="#funwithrepr">[3]</a>), and corresponds directly to the question of how to represent data.
Lots of the stuff you will read about in future articles deals with transforming numbers from one representation into another. In the next article we will come up with an interesting way of looking at data in order to extract information from it.</p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Data is numbers + representation (what these numbers stand for)</li>
<li>How data is represented is crucial</li>
</ul>
<h3 id="further-reading-and-thinking">Further reading and thinking:</h3>
<ol>
<li><a name="numbersencode">Is there something that cannot be encoded by numbers?</a> Some people would argue there isn’t. Do you agree? What would be the consequences of it?</li>
<li><a name="brainonmusic"><a href="https://en.wikipedia.org/wiki/This_Is_Your_Brain_on_Music">This Is Your Brain on Music</a> by Daniel J. Levitin</a></li>
<li><a name="funwithrepr"><a href="https://catenary.wordpress.com/2006/08/19/fun-with-representations-i-nine-numbers/">Fun with representations</a></a> is a fantastic series of articles about representations, and how different representations enable us to solve complex problems more easily than others.</li>
</ol>
Sun, 19 Jul 2015 15:07:27 +0000
http://shoefer.github.io/intuitivemi/2015/07/19/data-numbers-representations.html
intuitivemi
Understanding Machine Intelligence without Formal Math
<p>In this post I want to defend the basic idea of this series of articles: explaining data science while avoiding mathematical formalism.</p>
<p>If you don’t care about math anyway, you can skip this post and start reading about <a href="/intuitivemi/2015/07/19/data-numbers-representations.html">data</a>.</p>
<p>I do not have anything against formulas. Quite the contrary: mathematical formalisms are a fantastic way to condense and sort out the necessary ingredients of a method. Without formalism, we could not program a computer to operationalize any method. Think of a formula as a packing list for your camping trip. It has to be complete, with everything you need on it, and only on site will you realize whether you got something wrong (“Oh no, I forgot to bring my <a href="http://consumerist.com/2010/11/29/go-hiking-in-style-with-these-teva-high-heels/">hiking heels</a>”). In a similar way, a formula has to contain everything required to describe your mathematical concept or method, and only when applying it, that is, when executing your method, do you realize whether something in your mathematical description is wrong or missing.</p>
<p>However, I find formulas a very bad way to understand things. I see three main reasons for that.</p>
<p>The first reason is that after studying and working in academia for a couple of years, I realized that the magic “ahhh, THAT’S how it works!” moments while trying to grasp a complex method always happened when I could find analogies in “bodily” or linguistic terms: while rotating three-dimensional structures in my head, while visualizing probabilities as a contour (like a mountainous landscape), while breaking down complex relationships to sets of causal if-then statements, and so on. Maybe not everybody thinks of mathematics like this. It is possible that some mathematical geniuses think differently when reasoning about math. But the vast majority of researchers and students that I have met so far strive for analogies in order to explain or grasp complex mathematical concepts. I think there is a lot of empirical evidence for that: for example, scientific papers plot graphs, bar charts and the like to visualize results instead of just printing some numbers. Two famous cognitive scientists have even attempted to argue that the holy grail of mathematics, Euler’s formula, can be derived purely from the way we perceive the world (in contrast to a Platonic view of universal mathematical truths) <a href="#further">[1,2]</a>.</p>
<p>The second reason is that formulas often do not reflect the way people figured out the concept or method underlying the formula <a href="#further">[3]</a>. It is like not telling the whole story but just the end of it. You will have a hard time reconstructing the rest of the story from that little information. With formulas it is at least not impossible, since they contain all the information, only in a very condensed form; but you will still have a hard time figuring out their meaning without some context.</p>
<p>The third reason is much simpler: Formulas scare many people. I believe one reason for this lies in the condensed nature of formulas that I have mentioned before. Formulas use a very precise and non-redundant language, which is far less verbose than natural language. Therefore, you can spend an entire day looking at one single formula without getting it. I believe this is why some people think that they are not good at math, although they probably could be. They get disappointed too quickly because they cannot grasp the formula in the blink of an eye. Relax guys, that holds true for most of us.</p>
<p>So let us now jump right into the topic and talk a bit about <em>data</em>!</p>
<p><i><b>Update 09/11/2016:</b> I have received a series of similar comments regarding this article, which I would like to respond to. The first critique was that I am claiming that all researchers think in intuitions, rather than in formulas, when devising mathematical equations and algorithms. That is of course not true; there are people who prefer to and are very good at thinking in abstract formulas, rather than in intuitions. And indeed, certain problems can be more easily solved “syntactically”, that means only by applying mathematical rules rather than thinking in intuitions (but we won’t do that in this series of articles). Secondly, I want to admit that, in some cases, the full aha-that’s-how-it-works moment comes when reconciling the intuitive and the formal, mathematical levels. But - at least for me - the intuitive level was always required to make full understanding happen.</i></p>
<h3 id="tldr"><a href="http://de.urbandictionary.com/define.php?term=tl%3Bdr">TL;DR</a>:</h3>
<ul>
<li>Formulas are good for condensing and bullet-proofing concepts and methods, but less suited to understanding them</li>
<li>Most people conceptualize complex things in analogies, relating them to perception (e.g. vision) and language</li>
<li>Formulas scare people</li>
</ul>
<h3 id="further-reading-and-thinking"><a name="further"></a>Further reading and thinking:</h3>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Euler's_formula">Euler’s formula on Wikipedia</a></li>
<li><a href="https://en.wikipedia.org/wiki/Where_Mathematics_Comes_From">Where Mathematics Comes From</a> by Lakoff and Núñez</li>
<li>Of course, this might not be true for everybody. A colleague of mine argued that real understanding only happens if you understand both the high-level concept or analogy and the mathematical formula. I agree that to get a full understanding you need to care about the details – but to get a rough idea of how and why something works, I will argue that the mathematical details are not required.</li>
<li><a href="http://www.r2d3.us/visual-intro-to-machine-learning-part-1/">A Visual Introduction to Machine Learning</a> takes a similar stance to this blog and features cool visualizations of data and machine learning algorithms, as well as brief technical descriptions.</li>
</ol>
Sun, 19 Jul 2015 15:00:00 +0000
http://shoefer.github.io/intuitivemi/2015/07/19/why-no-formulas.html
intuitivemi