Записки программиста, обо всем и ни о чем. Но, наверное, больше профессионального.

2016-08-11

Machine Learning (Andrew Ng, Coursera, Stanford)

В далеком 2014 году я открыл для себя новое измерение: возможность учиться у лучших.
Все, что нужно, это компьютер, интернет и знание английского языка.

Первый курс, который я прошел (с отличием!), назывался
Machine Learning
читал его Andrew Ng
на платформе Coursera, Stanford

Вот тут есть подробное изложение изученного:

Сегодня, с целью взбодрить память, я скомпилял типа дайджест курса.
Енджой.



Week 1
    Introduction
    Linear Regression with One Variable
    (Optional) Linear Algebra Review
Week 2
    Linear Regression with Multiple Variables
    Octave Tutorial
Week 3
    Logistic Regression
    Regularization
Week 4
    Neural Networks: Representation
Week 5
    Neural Networks: Learning
Week 6
    Advice for Applying Machine Learning
    Machine Learning System Design
Week 7
    Support Vector Machines (SVMs)
Week 8
    Clustering
    Dimensionality Reduction
Week 9
    Anomaly Detection
    Recommender Systems
Week 10
    Large-Scale Machine Learning
    Example of an application of machine learning



Week 1

– Introduction
– Linear Regression with One Variable
– (Optional) Linear Algebra Review

Introduction

Tom Mitchell:
we say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.

Algorithms classification:
supervised, unsupervised. Also: reinforcement learning, recomender systems.

Supervised learning
– you have set of answers and you feed it to machine for training/learning
– regression problem: continous valued output, polynomial function (house price)
– classification problem: 0/1 with probability, discrete valued output (is that tumor malignant?)

Unsupervised learning
– you have no answers, only data: have that data any structure? Clusterization.
– cocktail party problem: having a set of microphones, extract different sound streams/sources.

Linear Regression with One Variable (univariate linear regression)

Example: house price derived from house size.
Input dataset: m – number of records; x – variable – vector of features (square foots); y – target (price);
(x, y) – some record from dataset;
(xi, yi) – i-th record from dataset.

The job of a learning algorithm: output a function h (hypothesis):
y = h(x)
h representation:
h(x) = theta0 + theta1 * x // model
learning algorithm should find right vector theta (model parameters).

How to find theta?
Cost function: squared error, find min(J) changing parameters theta somehow.
min sum ((h(x) — y)^2 for each record 0..m)
Cost function:
J(theta) = 1/2m * sum (h(x) — y)^2 // bowl shaped function, convex function
aka: square error cost function.
Optimization objective: minimize it by selecting the right theta.

Gradient Descent (batch GD): find optimum/minimum for cost function J.
batch: on every step we're looking at all of the training records.
repeat updating theta until convergence (theta don't change in minimum point):
theta(j) = theta(j) — a * (d_J(theta)/d_theta(j))
j = 0..1 for our example
a: alpha, learning rate (hyperparameter), step size: not too small/big;
d: derivative term, function slope; ~ delta J / delta theta
d_J(theta) / d_theta0 = 1/m * sum (h(x) — y)
d_J(theta) / d_theta1 = 1/m * sum (h(x) — y)*x

Variables matrix MxN: m rows/records, n columns/variables;
target vector y: matrix Mx1.
parameters vector theta: matrix 1xN+1

(Optional) Linear Algebra Review

Matrix i-rows x j-columns;
Vector = matrix i-rows x 1 column;

Matrix addition: C11 = A11 + B11
Multiply matrix by scalar: C11 = A11 * b
Matrix 3x2 * vector 2x1 = vector 3x1: c1 = A11*b1 + A12*b2

predictionVector Mx1 = dataMatrix MxN+1 * modelParametersVector N+1x1
dataMatrix extra column (first column) = 1, 1, ...
M: num of data records
N: num of variables

Matrix multiplication: MxN * NxO = MxO // associative, not commutative
C = A * B: column i of C = A * vector i of B

Identity matrix: I(i,i) = 1 // diagonal == 1:
I * A = A * I = A if A has square shape

Inverse:
– inverse number for real numbers: num * inv = 1 // 3 * 1/3 = 1
– inverse matrix: square matrix
A * Ainv = Ainv * A = I
– if A have not Ainv: singular or degenerate matrix

Transpose:
– Bij = Aji if B = A.transpose

Week 2

– Linear Regression with Multiple Variables
– Octave Tutorial

Linear Regression with Multiple Variables (multivariate linear regression)

House: size, num of bedrooms, floors, age, …
x1, x2, … : vector x, variables.
N: number of features/columns
M: number of examples/rows
xi: i-th row of training set
xij: value of j-th feature in i-th row

Hypothesis:
let x0 = 1, then
h_theta(x) = theta_transpose * x // parameters vector transpose * variables vector 

Gradient Descent for multivariate LR
– cost function
Jtheta = 1/2m * sum ((h_theta(x) — y)^2  for each row)
– minimize it using Batch GD: repeat until convergence:
theta_j = theta_j — a * 1/m * sum ((h_theta(x) — y)*x_j for each row)

Feature scaling
– make shure that the features are on a similar scale
e.g. number of bedrooms: 1..5; size 0..2000 sq.feet
– rule of thumb: scale features to -1 .. +1 scale
– and do mean normalization: make mean ~ 0
x = (x — mu) / si 
mu: average value of x
si: standard deviation of x (si = max — min)

Learning rate
– GD is working if cost (Jtheta) is lowering on each iteration, build a graph #iter vs Jtheta
– if GD is not working: decrease alpha: learning rate

Choosing features, polynomial regression
– e.g. replace width, length with size = width * length
– polynomial model, e.g:
theta0 + theta1*x + theta2*x^2 + theta3*x^3
– you can add to model higher order variables: x^2, x^3 (x1^2, x1*x2, …)
and do feature scaling

Normal Equations: solve min(Jtheta) analytically
– find where derivative = 0
Octave: pinv(X' * X) * X' * y
X': Xtranspose
pinv: inverted, pseudo-inverse, can do inversion for degenerate matrices
– if variables n > 10000 better not to use Normal Equations, (X' * X) is n-by-n matrix, pinv(X) cost O(n^3)
– no need for feature scaling, learning rate choosing.
– X'*X can be degenerate if we have dubled features (size in foots and size in meters);
or we have too many features (m <= n).
Remove doubled features; apply regularization

Octave tutorial

'%' is a comment
1 == 2 % 0
1 ~= 2 % 1
1 && 0 % and
1 || 0 % or
xor(1, 0)
a = 3 % assignment
a = 3; % semicolon supressing output
b = 'hi'; % string
c = (3 >= 1) % 1 for true
disp(a); % print
sprintf('foobar: %0.2f', pi) % print PI a with 2 decimals
format long % swith output to long number format — 14 decimals
format short % … 4 decimals

matrices
A = [1 2; 3 4; 5 6] % matrix 3x2
A = [ 1 2;
 3 4;
 5 6 ] % the same
size(A) % 3 2 matrix size which is matrix 1x2
size(A, 1) % first dimention, num rows = 3
length(v) % size of the longest dimension
V = [1 2 3] % matrix 1x3
V = [1; 2; 3] % vector 3 or matrix 3x1
v = 1:0.1:2 % matrix 1x11 inited by range from 1 to 2 increment 0.1
v = 1:6 % matrix 1x6 by range 1 to 6 increment 1
ones(2, 3) % matrix 2x3 with '1'
zeros... % matrix with '0'
w = rand... % matrix with random numbers (normal distribution) 0 <= x <= 1
w = randn... % gaussian distribution
hist(w) % draw histogram
hist(w, 50) % draw histogram with 50 bins
eye(4) % Identity matrix 4x4 (Diagonal Matrix)
help cmd % print help text

loading data
pwd % work dir
cd 'path/to/dir' % change work dir
ls % dir list
load('featuresX.dat') % load the file
who % show workspace variables
whos % detail view
featuresX % mx2 matrix
pricesY % mx1 matrix (vector)
size(featuresX) % 47 1
size(priceY) % 47 1
clear featuresX % remove variable from workspace
clear % clear all workspace!

v = priceY(1:10) % first 10 elements
save hello.mat v % save variable v to file hello.mat — binary format
save hello.txt v -ascii % save to ascii text file

indexing operations
A(3, 2) % row 3 column 2 value
A(2, :) % row 2
A(:, 2) % column 2
A([1 3], :) % row 1 and row 3
A(:, 2) = [10; 11; 12] % assign to second column, replace column 2 vith new vector
A = [A, [100; 101; 103]] % append another column to A
A(:) % put all elements of A into a single vector
C = [A B] % concatenate A and B, add columns
C = [A; B] % concatenate, add rows

matrices ops
% let
size(A) % 3 2
size(B) % 3 2
size(C) % 2 2

A*C % 3x2 matrix
A .* B % like A + B only Aij * Bij % element-wise multiplication 
A .^ 2 % (A dot carry 2) element-wise squaring (power2)
1 ./ v % element-wise reciprocal (1/x) — обратная величина
log(v) % element-wise logarithm
exp(v) % element-wise base E exponentiation
abs(v) % element-wise absolute value
-v % element-wise negative (-1*v)
v + ones(length(v), 1) % element-wise increment % == v + 1
A' % transpose A
max(A) % maximum value column-wise
[val, ind] = max(A) % val — max.value, ind — index in matrix
A < 3 % element-wise comparison
find(A < 3) % return indexes where value < 3
A = magic(3) % magic matrix — sums by rows, columns, diagonals is equal
[r, c] = find(A >= 7) % return rows indexes and columns indexes % A(r, c)
sum(A) % return sum of all elements
prod(A) % return product of all elements
floor(A) % rounds down elements 
ceil(A) % rounds up all elements 
max(rand(3), rand(3)) % max.values for two matrices
max(A, [], 1) % returns max.values for each column % 1 means '1-st dimension of A' % == max(A)
max(max(A)) % absolute maximum == max(A(:))
sum(A, 1) % sums per column 
sum(A, 2) % sums per row, returns vector
A .* eye(9) % returns diagonal from A 
sum(sum(A .* eye(9))) % sum for diagonal 
flipud(eye(9)) % another diagonal % flip upside down, Permutation Matrix 
pinv(magic(3)) % pseudo-inverse magic matrix 
pinv(A) * A == I

visualisation
t = [0:0.01:0.98]
y1 = sin(2*pi*4*t)
plot(t, y1)
y2 = cos(2*pi*4*t)
plot(t, y2)

plot(t, y1)
hold on % don't remove prev.plot
plot(t, y2, 'r') % red color
xlabel('time') % label for x
ylabel('value') % label for y
legend('sin', 'cos') % draw a legend
title('my plot') % draw a title
print -dpng 'filename.png' % save the plot to a file 
close % close plot window

% open two plot windows
figure(1); plot(t, y1);
figure(2); plot(t, y2);

% define draw cell in plot grid
subplot(1, 2, 1) % divides plot to a 1x2 grid, access first grid element 

axis([o.5, 1, -1, 1]) % change axis ranges

clf % clear the figure

%let
A = magic(5)
imagesc(A) % plot grid 5x5 colors
imagesc(a), colorbar, colormap gray; % draw grid in gray tone and draw legend
% comma chaining of commands/function calls

for, while, if; functions
% let
v = zeros(10, 1) % vector 10

% replace vector elements 
for i = 1:10,
 v(i) = 2 ^ i;
end;

% replace first 5 elements
i = 1;
while i <= 5,
 v(i) = 100;
 i = i + 1;
end;

% break, continue works
i = 1;
while true,
 v(i) = 999;
 i = i + 1;
 if i == 6,
  break;
 end;
end;

if v(1) == 1,
 disp('one');
elseif v(1) == 2,
 disp('2')
else
 disp('not 1 nor 2')
end;

% functions
% create file 'functionName.m'
% write to it:
function y = functionName(x)
y = x ^ 2;
% this function returns squares of argument
addpath('path/to/dir/with/function file')

% functions with multiple returns
function [y1, y2] = funcName(x)
y1 = x ^ 2;
y2 = x ^ 3;
% returns row vector

[a, b] = funcName(3)

% compute Cost Function 
X = [1 1; 1 2; 1 3] % design matrix
y = [1; 2; 3] % results vector
theta = [0; 1] % theta vector

% costFunctionJ.m
function J = costFunctionJ(X, y, theta)
% X is the 'design matrix' containing our training examples.
% y is the class labels
m = size(X, 1) % number of training examples
predictions = X * theta % predictions of hypothesis on all m examples
sqrErrors = (predictions — y) .^ 2 % squared errors
J = 1 / (2 * m) * sum(sqrErrors)
% easy

j = costFunctionJ(X, y, theta) % returns 0 for given theta 

% unvectorize implementation
prediction = 0.0
for j = 1:n+1,
 prediction = prediction + theta(j) * x(j)

% vectorize implementation
prediction = theta' * x

linear regression with regularization
% call: J = linearRegCostFunction([ones(m, 1) X], y, theta, 1);
% Parameters:
%   X - train.set variables, matrix 12x2 with column of '1' added
%   y - train.set results, vector 12x1
%   theta - model parameters, vector 2x1
%   lambda - regularization parameter, real number
% Returns:
%   J - cost value for given parameters, real number
%   grad - gradients for cost function, vector 2x1

% Linear Regression Cost Function Formulae
%   J = 1/(2m) * (sumErr) + lambda/(2m) * reg
%   sumErr = sum((hyp - y)^2)
%   hyp = theta' * x
%   reg = sum(theta^2) % skip first parameter!

hyp = X * theta; % vector 12x1 = 12x2 * 2x1
sumErr = sum(power(hyp - y, 2)); % real number = sum(vector)
reg = sum(power(theta(2:end, :), 2)); % real number; skip first parameter!
J = (sumErr / (2*m)) + ((lambda * reg) / (2*m));

% Linear Regression Cost Function Gradient Formulae
%   for j == 0
%      dJ/dt0 = 1/m * sumPartErr
%   for j > 0
%       dJ/dtj = 1/m * sumPartErr + (lambda/m * tj)
%   tj - theta(j)
%   sumPartErr = sum((hyp - y) * X)

grad = (1/m) * X' * (hyp - y); % 2x12 * 12x1 = 2x1 vector size n,1
temp = theta;  % vector size n,1
temp(1) = 0;   % because we don't add anything for j = 0
grad = grad + ((lambda/m) * temp); % vector size n,1

Week 3

– Logistic Regression
– Regularization

Logistic Regression

Binary classification problems, algorithm: Logistic Regression
– y: result, discrete values;
– y E {0, 1}: binary classification problems
– y E {1, 2, 3}: multiclass classification problems
– 0 <= h_theta(x) <= 1 : logistic regression

Hypotesis
h_theta(x) = g(theta.transpose * x)
– g: sigmoid function (logistic)
g(z) = 1 / (1 + e^-z)
– h_theta(x) : P(y=1 | x;theta) : probability that y = 1, given x parametrized by theta.

Decision Boundary
– sigmoid graph shows that g(z) > 0.5 if z > 0
– from that follows: predict y = 1 if theta.transpose * x > 0
– decision boundary equation: theta.transpose * x = - theta0
– decision boundary can be in any form if we have polynomial functions (variables of higher order)

Logistic Regression Cost Function
– to get convex shape (global minimum) we need different cost function
cost(h_theta(x), y) = -log(h_theta(x)) if y == 1 or -log(1 — h_theta(x)) if y == 0
– Cost Function J for Logistic Regression
J_theta = 1/m * sum(cost(h_theta(x), y) for each record)
// where:
cost(h_theta(x), y) = -y * log(h_theta(x)) — (1 — y) * log(1 — h_theta(x))

Gradient Descent looks the same
– repeat until convergence
theta_j = theta_j — a * (delta J / delta theta_j) 
// unwrapped:
theta_j = theta_j — a/m * sum((h_theta(x) — y) * x_j for each record)

And don't forget Features Scaling/Normalization.

Advanced Optimization
– given theta, we have code that compute: J_theta; delta_J/delta_theta for each column;
– there exists a few optimization algorithms:
gradient descent
conjugate gradient
BFGS
L-BFGS
– easy to use, more complex, faster
- to use them you should write and pass a function
(errCost, gradient) = costFunction(theta): errCost = J value; gradient = delta J / delta theta 

Multiclass Classification: One-vs-all (one-vs-rest)
– example: email labels/tags, work, friends, family, …
– find out (for 3 classes) 3 hypothesis for binary classification:
work / rest; friends / rest; family / rest
– on a new input (prediction), pick the hypothesis (class) that give max h_theta(x)

Regularization

Overfitting problem :
accurate predictions on a training set but, not generalize well to make accurate predictions on new, unseen examples
– in case of small set of features / low order polynom we can get an underfitting (high bias):
straight line as a decision boundary
– or, if we have too many features / high order polynomial h function, we can get an overfitting (high variance):
too curvy line as a decision boundary
– what to do? drop less important features; apply regularization, penalizing model parameters.
don't penalize parameter theta0

cost function, regularization (linear regression)
– add to cost function a new summand
J_theta = 1/2m * ( sum( (h(x) — y)^2 for each record) + lambda * sum( theta^2 for each column excluding col[0]))
lambda * sum(theta^2): regularization term (L2 regularization: Ridge regression – theta^2)
lambda: regularization parameter, hyperparameter
– if lambda is too large: we get underfitting (straight line?)

For regression models, the two widely used regularization methods are L1 and L2 regularization, also called lasso and ridge regression when applied in linear regression.
L2 … models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularized or L1 models) … a predictive feature will get a non-zero coefficient, which is often not the case with L1

Gradient descent with regularization (linear regression)
theta = theta*(1 — alpha * lambda/m) — alpha * (1/m * sum( (h(x) — y)*x for each record) )
theta: for each column except 0
1 — alpha * lambda/m < 1 : penalize parameter theta

Normal equation with regularization
– w/o regularization
theta = pinv(X.transpose * X) * X.transpose * y 
– with regularization
theta = pinv(X.transpose * X + lambda * specialMatrix) * X.transpose * y 
specialMatrix: n+1 x n+1 dimensions, diagonal, (0,0) = 0
– n.b. regularization take care of degenerate X.transp * X // if m < n matrix may be degenerate

Logistic Regression + Regularization
– cost function with regularization
J(theta) = -[1/m*sumi=1m(y(i)*log(htheta(x(i)) + (1-y(i))*log(1-htheta(x(i)))))] + lambda/2m * sumj=1n(thetaj2)
don't penalize parameter theta0!
– Gradient descent with regularization (logistic regression)
thetaj := thetaj*(1-alpha*lambda/m) — alpha*1/m * sumi=1m(htheta(x(i) — y(i))*xj(i))
(1-alpha*lambda/m) : regularization

advanced optimization with regularization
– you should provide function (Jval, gradient) = costFunc(theta)
– Jval = value of cost function with regularization
– gradient = delta J / delta theta
= 1/m sum( (h(x) — y) * x for each record) // for theta0
= 1/m sum( (h(x) — y) * x for each record) + lambda/m * theta // for theta1..n

Week 4

– Neural Networks: representation

problem: too many features in high-order polynomial models
– computer vision: 50x50 pixels, 2500 pixels in image, n features = 2500 grayscale, 7500 RGB,
polynome 2-nd degree would have ~ 3M parameters
– what can we do?

Neural Networks : algorithms that try to mimic the brain
neuron model: logistic unit
input: vector x; parameters vector theta (weight);
output: scalar h_theta(x) // sigmoid (logistic) activation function (on/off)
h_theta(x) = 1 / (1 + e^-z); z = theta.transpose * x
// or 
h_theta(x) = g(z)
– x[0]: bias unit = 1 (remember linear regression?)
– layers: input layer, hidden l., output l.
– a(i,j): activation of unit i in layer j
– Theta(j): matrix of weigths controlling function mapping from layer j to layer j+1
– Theta(j) size: n2 x n1+1 ; n1: #units in layer j; n2: #units in layer j+1

Forward propagation
– a(i,j) = g(z(i,j)) ; from that follows
vector a(layer_i) = g(Theta(layer_i-1) * a(layer_i-1)) 
– layer-by-layer: forward propagation:
layer 0: a(0) = x; add bias unit to a(0): a(0,0) = 1 
layer 1: a(1) = g(Theta(0) * a(0)) 
add bias unit to a(1): a(1,0) = 1 
a(2) = g(Theta(1) * a(1))

Multiclass classification, neural networs
– multiple output units: ove-vs-all
– result is a vector; y in training set also a vector
– only one elem in vector == 1, other == 0

Week 5

– Neural Networks: learning

L: number of layers
sl: number of units (excluding bias unit) in layer l
K: number of classes

– binary classification problem: one output unit, y = 0|1; K <= 2
– multi-class classification problem: K >= 3, output vector y is size K

NN Cost Function
– alike logistic regression, only added new dimensions for sums
J(Theta) = - 1/m * loss + regularization
loss = sum (for_each_record_i for_each_class_k ( y(k,i) * log(h(theta,x,i,k)) + (1 — y(k,i)*log(1 — h(theta, x, i, k))) ) ) 
regularization = lambda/2m * sum (for_each_layer_l for_each_unit_i_in_l for_each_unit_j_in_l+1 Theta(l,j,i)^2)
– don't penalize theta0

Backpropagation algorithm
– for CostFunction optimization we need to compute:
cost value J(Theta)
gradient (partial derivative) delta J / delta Theta(i, j, l)
– Theta(i,j,l): real number
– for each layer Theta size = variables size x units size

let delta(j,l) = 'error' of node j in layer l
let L = 4 ; j: unit number
delta(j,4) = a(j,4) — y(j) = h(theta, x, j) — y(j) 
in layer 3, vector form:
delta(3) = Theta(3).transpose * delta(4) .* g'(a(3))
g'(a(3)) = g(a(3)) .* (1 - g(a(3)))
.* : element-wise multiplication
delta(1) does not exists, it's an input layer
bottom line: gradient (partial derivative) dJ/dtheta(i, j, l) = a(j,l) * delta(i, l+1) if lambda = 0 // no regularization

set Delta(i, j, i) = 0 for all i, j, l
for i =1 to m:
set a(1) = x(i) // input vector, variables
perform forward propagation, compute a(l) for l = 2,3...L
using y(i), compute delta(L) = a(L) — y(i)
compute delta(L-1), delta(L-2, ...delta(2)
aggregate Delta(l) = Delta(l) + (delta(l+1) * a(l).transpose)
gradient D:
D(i,j,i) = 1/m * Delta(i,j,l) + ( regularization  if j != 0 )
regularization = lambda * Theta(i,j,l) 
gradient (partial derivative) dJ/dtheta(i, j, l) = D(i,j,l)

Theta(1), Theta(2), Theta(3) : parameter matrices for layers;
D(1), D(2), D(3) : gradient matrices for layers.
We need vectors to use in optimization functions.
Unroll parameters
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];
DVec = [ D1(:); D2(:); D3(:) ];

Gradient checking (numerical gradient), very slow:
gradApprox = (J(theta + EPSILON) — J(theta — EPSILON)) / 2 * EPSILON
delta J / delta theta
for i = 1:n, // number of parameters
 thetaPlus = theta;
 thetaPlus(i) = thetaPlus(i) + EPSILON;
 thetaMinus = theta;
 thetaMinus(i) = thetaMinus(i) — EPSILON;
 gradApprox(i) = (J(thetaPlus) — J(thetaMinus)) / (2*EPSILON);
end;
check that gradApprox ~ DVec

Theta random initialization
Theta1 = rand(10,11) * (2*INIT_EPSILON) — INIT_EPSILON;
breaks the weights symmetry

NN, altogether
– select NN architecture: number of hidden layers, number of units in layers
number of input units = #variables; #output units = #classes
by default: one hidden layer; #hidden units ~ #features
– init Theta (random)
– implement FP to get h(Theta, x(i)) for any x(i)
– implement cost function J(Theta)
– implement BP to compute partial derivatives delta J / delta Theta
– for i = 1:m // for each record in dataset
perform FP, BP (x(i), y(i))
// activations a(l) and delta(l) for l = 1:L; accumulated D
– use gradient checking to compare D and gradApprox; disable checking
– use optimization method to try to minimize J(Theta)
– make shure J(Theta) is decreasing with every iteration

Example (Octave):
– cost function
% Parameters:
%   Theta1 is a model parameters for first layer, matrix size (hidden_layer_size, input_layer_size+1);
%   Theta2 is a m.p. for second layer, matrix size (num_labels, hidden_layer_size+1);
%   X is a training set variables, matrix size (m, input_layer_size)
%   y is a t.s. results filled with labels values, vector size (m, 1)
%   lambda is a regularization parameter, real number
% Returns
%   J is cost value for given Theta and training set, real number
%   Theta1_grad, Theta2_grad is a gradients of cost function for each theta, matrices size(Theta1), size(Theta2)

% FP, h(theta, x); from ex3
A1 = [ones(m, 1) X]; % matrix size m,401
Z2 = A1 * Theta1'; % matrix size m,401 * 401,25 = m,25
A2 = sigmoid(Z2); % matrix size m,25
A2 = [ones(m, 1) A2];
Z3 = A2 * Theta2'; % matrix size m,25 * 26,10 = m,10
A3 = sigmoid(Z3); % matrix size m,10

costVector = zeros(m, 1);
for i = 1:m % K = num_labels
    h = A3(i, :)'; % vector size (K, 1)
    yi = zeros(num_labels, 1); % yi is a vector size (K, 1)
    yi(y(i)) = 1;
    outCost = ((-1 * yi) .* log(h)) - ((1 - yi) .* log(1 - h)); % vector size (K, 1)
    costVector(i) = sum(outCost); % real number
end
J = (1/m) * sum(costVector); % real number

% Cost Regularization
subSum1 = sum(sum(power(Theta1(:, 2:end), 2)));
subSum2 = sum(sum(power(Theta2(:, 2:end), 2)));
reg = (lambda / (2 * m)) * (subSum1 + subSum2); % scalar
J = J + reg;

– backprop
% Forward Propagation, h(theta, x); from ex3
A1 = [ones(m, 1) X]; % matrix size m,401
Z2 = A1 * Theta1'; % matrix size m,401 * 401,25 = m,25
A2 = sigmoid(Z2); % matrix size m,25
A2 = [ones(m, 1) A2]; % m,26
Z3 = A2 * Theta2'; % matrix size m,26 * 26,10 = m,10
A3 = sigmoid(Z3); % matrix size m,10

% Cost Value, Gradient
costVector = zeros(m, 1);
deriv1 = zeros(hidden_layer_size, input_layer_size+1); % 25x401
deriv2 = zeros(num_labels, hidden_layer_size+1); % 10x26
for i = 1:m % K = num_labels = 10
    % cost values
    h = A3(i, :)'; % vector size (K, 1)
    yi = zeros(num_labels, 1); % yi is a vector size (K, 1)
    yi(y(i)) = 1;
    outCost = ((-1 * yi) .* log(h)) - ((1 - yi) .* log(1 - h)); % vector size (K, 1)
    costVector(i) = sum(outCost); % real number

    % Back Propagation
    delta3 = h - yi; % vector size (K, 1)
    % remove bias unit from delta calc!
    % delta2 = Theta2' * delta3 .* sigmoidGradient(z2)
    delta2 = Theta2(:, 2:end)' * delta3 .* sigmoidGradient(Z2(i, :)'); % vector size (25, 1) : 25x10 * 10x1 .* 25x1
    % delta1 not exists

    % gradient for all units, yes, bias too!
    % deriv2 = delta3 * a2'
    deriv2 = deriv2 + (delta3 * A2(i, :)); % matrix 10x26 : 10x26 + 10x1 * 1x26
    deriv1 = deriv1 + (delta2 * A1(i, :)); % matrix 25x401 : 25x401 + 25x1 * 1x401
end

% Cost
J = (1/m) * sum(costVector); % real number

% Gradient
Theta1_grad = deriv1 / m; % 25x401
Theta2_grad = deriv2 / m; % 10x26

% Cost Regularization
subSum1 = sum(sum(power(Theta1(:, 2:end), 2)));
subSum2 = sum(sum(power(Theta2(:, 2:end), 2)));
reg = (lambda / (2 * m)) * (subSum1 + subSum2); % scalar
J = J + reg;

% Gradient Regularization
% reg = 0 if j == 0 (j is a number of unit in left/input layer)
% reg = (lambda / m) * Theta if j != 0
% grad = deriv /m + reg
Reg1 = (lambda / m) * Theta1; % 25x401
Reg2 = (lambda / m) * Theta2; % 10x26
Reg1(:, 1) = 0;
Reg2(:, 1) = 0;
Theta1_grad = Theta1_grad + Reg1;
Theta2_grad = Theta2_grad + Reg2;

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
% the end

– sigmoid gradient
% sigmoid(z) = g(z)
gz = 1 ./ (1 + exp(-z));
% gradient = dg/dz = g(z) * (1 - g(z))
g = gz .* (1 - gz);

- randomly initialize weights
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init — epsilon_init;
einit = sqrt(6) / sqrt(Lin + Lout)

Week 6

– Advice for Applying ML
– ML Systed Design

Advice for Applying ML

Deciding what to try next
– well trained linear regression make a unacceptably large errors in its predictions
– what can you do?
get more training examples (if overfitting)
try smaller sets of features (if overfitting)
try getting additional features (if high bias)
try adding polynomial features (if high bias)
try decreasing/increasing lambda (underfitting/overfitting)
– use ML diagnostic to check/select next step

Evaluating a hypothesis
– split data to training set (70%) and test set (30%)
– overfitting: training set is OK (low error), test set is bad (high error)
testErr = 1/2m * sum (h - y)^2

Model selection and train/validation/test sets
– split data set to train (60%), cross-validation (20%), test set (20%)
– select models (hyperparameters): polynomial degree (d), regularization (lambda), threshold (for logistic regression)…
– different models, different parameters Theta
– find model with min. error on cross-validation set; check selected model on test set

Diagnosing Bias vs Variance
– first thing to do
– underfitting vs overfitting
– plot error / degree of polynomial : two lines: training error, cross-validation error
training error is lowering monotonically
cv error have u-shape
– left part of plot: bias problem, both errors are hight;
– right part: variance problem, training error is low, cv error is hight

Regularization and Bias/Variance
– large lambda: high bias; small lambda: high variance
– select set of x lambdas: lambda = 0, 0.01, 0.02, 0.04, …
– train x models, get x theta parameters
– check x models on cv set, select one with min err
– check model on test set
– plot a graph lambda/errors for train/cv tests
train error grow with lambda
cv error have u-shape
left part: overfitting; right part: underfitting

Learning curves
– plot graph: error/training set size; cv error and train error
– high bias/underfitting: both errors are high and close to each other
if you have high bias problem, getting more training data will not help much
– high variance/overfitting: training error are small, cv error are significantly larger
getting more data is likely to help

Neural networks and overfitting
– 'small' NN more prone to underfitting
– 'large' NN more prone to overfitting, regularization can help
– number of layers, number of units, … : hyperparameters, can tune using train/cv/test subsets

ML Systed Design

Prioritizing what to work on
– start with a simple algorithm that you can implement quickly
implement it, test it on cv data
– plot learning curves to decide if more data/features/etc are likely to help
error analysis: manually examine the examples (in CV set) that your algorithm made errors on
see if you spot any systematic trend
select new features
use good numerical metrics to evaluate models/algorithm
cross-validation error

Error metrics for skewed classes
– in that case use Precision/Recall/F1 metrics
e.g. your logistic regression predict cancer with 1% error;
but only 0.5% of patients have cancer.
function h = 0 give as 0.5% error, wicked!
– true positive: y=1, h=1
– false positive: y=1, h=0
– true negative: y=0, h=0
– false negative: y=0, h=1
precision: truePositive of all positive: tp/(tp + fp)
recall: truePositive of real positive: tp/(tp + fn)
accuracy = (true positives + true negatives) / (total examples)

Trading off precision and recall
– logistic regression: 0 <= h <= 1
predict 1 if h >= 0.5, 0 otherwise
– what if threshold set to 0.7?
high precision, low recall
– if we want to detect most cases of cancer: set high recall: threshold = 0.3, for example
select threshold with max F1 Score
F1 Score = 2 * (P * R) / (P + R)

Data for ML
– if human expert can use data for prediction, and if we can collect more data: collect it
– given the input x, can a human expert confidently predict y?
– if we have many features, we need larger training set (unlikely to overfit)

example (Octave)
% call: [error_train, error_val] = learningCurve([ones(m, 1) X], y, [ones(size(Xval, 1), 1) Xval], yval, lambda);
for i = 1:m
    % find theta % [theta] = trainLinearReg([ones(m, 1) X], y, lambda);
    [theta] = trainLinearReg(X(1:i, :), y(1:i, :), lambda);

    % training error % [J, grad] = linearRegCostFunction(X, y, theta, lambda)
    [J grad] = linearRegCostFunction(X(1:i, :), y(1:i, :), theta, 0);
    etr = J;

    % cross validation error
    [J grad] = linearRegCostFunction(Xval, yval, theta, 0);
    ecv = J;

    error_train(i, 1) = etr;
    error_val(i, 1) = ecv;
end

Week 7

– Support Vector Machines (SVM)

Support Vector Machines (SVM) : large margin classification
– logistic regression has errorCost > 0 if z >=1
z = theta.transpose * x // polynome
h = 1 / (1 + e^-z) // hypotesis = 1 if z >= 0, 0 otherwise
– we want cost1(z) = 0 if z >= 1, y=1; cost0(z) = 0 if z <= -1, y=0
– then min cost (cost function optimization) over theta:
min C * sum (for each record: y * cost1(z) + (1 — y) * cost0(z)) + reg 
reg = ½ sum (for each parameter except first: theta^2)
C = 1/lambda // you need to select lower C to fight overfitting
Informally, the C parameter is a positive value that controls the penalty for misclassified training examples.
A large C parameter tells the SVM to try to classify all the examples correctly (overfitting).

Mathematics Behind Large Margin Classification
– vector inner product: u.T * v = u1*v1 + u2*v2 = v.T*u // real number
– ||u||: is a norm of vector u, length(u) = sqrt(u1^2 + u2^2) // real number
– p: length of projection of vector v onto vector u
– inner product = p * ||u||
SVM decision boundary:
– min ½ sum theta^2 = ½ (theta1^2 + theta2^2) = min ||theta|| // norm of vector theta
minimize length of theta
– theta.T * x = p * ||theta|| // projection x to theta
maximize projection x to theta
– SVM: large margin classification

Kernels
Non-linear decision boundary
– try to replace high-degree polynomial features with synthetic features
f1 = similarity(x, l1) = exp( - ||x — l1||^2 / 2sigma^2)
…
fm = ...
l1: landmark: l1 = x1, … lm = xm
– similarity: kernel function (Gaussian Kernels)
if x ~ l1: f1 ~ 1
if x is far from l1: f1 ~ 0
sigma: hyperparameter, control kernel shape
predict y = 1 if theta.T * f >= 0
– given m lendmarks, compute m synthetic features f
fi = similarity(x, li)
– training set: m x m matrix F
– use SVN, not logistic regression (too many features)
– parameter sigma: larger sigma: lover variance, increase sigma to fight overfitting

Using an SVM
– use packages: liblinear, libsvm, …
– specify: param C, kernel (similarity function)
no kernel: 'linear kernel': f = x
– if we have many features & not so many records: overfitting, no need for kernel
– if many records & small #features: use Gaussian kernel, choose sigma
– do perform feature scaling/normalization before using Gaussian kernel
function f = kernel(x1, x2)
 f = exp( -1 * ||x1 — x2||^2 / 2sigma^2)
return
– n.b. not all similarity functions make valid kernels; need to satisfy 'Mercer's Theorem'
– misc. kernels: polynomial, string, chi-square, …
– multi-class classification: embedded functionality of: one-vs-all, K SVMs

Logistic regression vs SVM
– n: #features, m: #records/examples
– if n is large (relative to m, 10000 vs 1000): use logistic regression or SVM w/o a kernel
– if n is small, m is intermediate (1000 vs 10000): use SVM with Gaussian kernel
– if n is small, m is large (1000 vs 50000): create/add more features, then use logistic regression or SVM w/o a kernel
– Neural network likely to work well for most of these settings, but slower to train

Example (Octave)
– You can think of the Gaussian kernel as a similarity function that measures the “distance” between a pair of examples
% Kgaussian = exp(-1 * norm(x1-x2)^2 / 2*sigma^2)
sim = exp(-1 * ( power(norm(x1 - x2), 2) / (2 * power(sigma, 2)) ));
– select C and sigma
if ~exist('C_vect', 'var') || isempty(C_vect)
    C_vect = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30]; % by default
end

if ~exist('sigma_vect', 'var') || isempty(sigma_vect)
    sigma_vect = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30]; % by default
end

% for each pair of C and sigma from C_vect, sigma_vect
% find model parameters (theta)
% find model error on cross validation set
% select pair C,sigma with min(error)

minErr = 1000000;
for currC = C_vect'
    for currSigma = sigma_vect'
        model = svmTrain(X, y, currC, @(x1, x2) gaussianKernel(x1, x2, currSigma));
        predictions = svmPredict(model, Xval);
        err = mean(double(predictions ~= yval));
        if err <= minErr
            minErr = err;
            C = currC;
            sigma = currSigma;
        end
    end
end

Week 8

– Clustering
– Dimensionality reduction

Clustering: unsupervised learning
– unlabeled data
– find structure in the data

K-Means
– clustering most used algorithm
– K: number of clusters
– training set: x size m,n: m rows, n columns (drop x0 = 1 by convention)
1. cluster assignment
2. move centroids
– algorithm:
randomly initialize K cluster centroids w1, w2, …, wK; each size n
repeat: for i = 1..m: c(i) = index of centroid closest to x(i)
for k = 1..K: w(k) = average (mean) of points assigned to cluster k
– K-Means for non-separated clusters
T-shirt sizing (L, M, S)

Optimization objective (distortion)
– c(i): index of cluster to which example x(i) is currently assigned
– w(k): cluster k centroid, size n
– w(c, i): centroid of cluster c, to which example i has been assigned
– optimization objective (cost function)
J(c, w) = 1/m * sum (for each record: ||x — w(c)||^2)
min J(c, w) 
– K-means algorithm minimize function in two steps on each iteraton

Random initialization
– K < m
– randomly pick K training examples
– set w1, …, wK to these examples
for i = 1..100 { random init; run K-means, get c,w; compute cost J(c,w) }
select min cost.

Choosing the number of clusters
– elbow method
– in a graph: cost J vs #clusters K you can see an elbow point
but, sometimes not
– more #clusters gives lower cost J, always (if computations done right)
– choose #clusters K based on a metric for how well it performs for your business

Dimensionality reduction

Data compression
– example: two features: len inches, len centimeters: compress to one dimension
– 3D => 2D, projecting points to 2D plane
– for dataset: n features, m examples, data compression changes n => k, k < n

Visualization
– to plot a graph, you need to compress your data to 2D or 3D dataset

Principal Component Analysis (PCA)
– optimization task: find a surface with k dimensions with min sum (dist^2)
dist: distance between original point and projection point
– reduce from n-dimensions to k-dimensions: find k vectors u1, u2, … uk
onto which to project the data, so as to minimize the projection error
– you need to scale and normalize your data first!
– PCA is not linear regression: LR minimize error, not distance; PCA have not prediction 'y'

Data preprocessing
– feature scaling/mean normalization
x = (x — mu) / si 
mu: average value of x = 1/m sum (for each record: x)
si: standard deviation of x (si = max — min)

PCA algorithm
– compute 'covariance matrix'
Sigma = 1/m sum (for each feature: x * x.transpose) // matrix n,n
 or: Sigma = 1/m * X.T * X
– compute 'eigenvectors' of sigma
[U, S, V] = svd(Sigma) // singular value decomposition 
– U is matrix n,n; we need firs k vectors from it => Ureduce matrix n,k // n rows, k columns
Ureduce = U(:, 1:k)
– project data: z = Ureduce.T * x // [k,1] = [k,n] * [n,1] 
 Z = X * U(:, 1:K);
or: [k,m] = [k,n] * [n,m]

Choosing the number of principal components (k)
– aspe: average squared projection error: 1/m sum (for each record: norm(x — xapprox)^2)
– tvd: total variaton in the data: 1/m sum (for each record: norm(x)^2)
– choose k to be smallest value so that: aspe/tvd <= 0.01 // 1%
saving 99% of variance : 99% of variance is retained
– try PCA with k = 1, compute... check … increase k …
there is a faster method: use matrix S from eigenvectors
(sum (for i in 1..k: S(i,i)) / sum (for i in 1..n: S(i,i))) >= 0.99

Reconstruction from compressed representation
– compress: z = Ureduce.T * x
– reconstruct: xapprox = Ureduce * z // [n,1] = [n,k] * [k,1]
X_rec = Z * U(:, 1:K)';

Advice for applying PCA
– reduce memory/disk needed for data
– visualisation
– speedup supervised learning: reduce #features
– mapping x => z should be defined by running PCA only on the training set
this mapping can be applied to cross-validation and test sets
– bad use of PCA: instead of regularization, to fight overfitting
– first try running your ML with original/raw data; only if that doesn't do what you want, then consider using PCA

Example (Octave)
– K-Means, find closest centroids
% Parameters
%   X is a 300x2 matrix, training set
%   centroids is a 3x2 matrix, initial centroids
%   K is a number of clusters
% Returns
%   idx is a 300x1 vector, cluster indices for each rec.from training set

% cost = norm(x - mu)^2
% idx = min(cost)

m = size(X, 1);
for i = 1:m
    x = X(i, :)';
    costMatrix = [];

    for j = 1:K
        mu = centroids(j, :)';
        cost = power(norm(x - mu), 2);
        costMatrix = [costMatrix; cost j];
    end

    [c, j] = min(costMatrix);
    minCostIdx = costMatrix(j(1), 2);
    idx(i) = minCostIdx;
end
– updated centroids
for i = 1:K
    subset = find(idx == i);
    Xi = X(subset, :);
    centroids(i, :) = mean(Xi);
end

Week 9

– Anomaly Detection
– Recommender Systems

Anomaly Detection
– build a model P(X) on training set
– if P(test) < epsilon => anomaly detected
– fraud detection: x(i) = features of user(i) activities; check if p(x) < epsilon
– monitoring computers in datacenter: x(i) = features of machine(i): x1: memory usage, x2: CPU load, …
– too many false positives? decrease epsilon

Gaussian (normal) distribution
– x distributed with mean mu, variance (standard deviation) sigma^2
bell shaped curve
p(x, mu, sigma^2) = (1 / (sqrt(2Pi) * sigma)) * exp(-1 * (x — mu)^2 / 2sigma^2)
– bell shape area = 1
– having a training set X
mu = 1/m * sum (for each record: x)
sigma^2 = 1/m * sum (for each record: (x — mu)^2)

Density estimation
– training set: x1, x2, … xm
matrix m,n // n: #features/columns
– p(x) = p(x1)*p(x2)...p(xn)
– each feature have normal distribution
– find n parameters: mu, sigma
– p(x) = product (for each feature j: p(xj,muj,sigmaj^2))

Given new example x, compute p(x)
– if p(x) < epsilon: anomaly detected

Developing and evaluating an anomaly detection system
– for evaluation we need a real-number metrics
– apply cross-validation, test techique to feature/epsilon selection
Example:
– 10000 good engines; 20 flawed/anomalous
– training set: 6000 good engines
– CV: 2000 good (y=0); 10 anomalous (y=1)
– test: 2000 good, 10 anomalous
– fit model p(x) on training set
– on cv/test predict y=1 if p(x) < epsilon; y=0 otherwise
– metrics: true positive, false positive, false negative, true negative; precision/recall; F1-score
– use cross-validation dataset to choose parameter epsilon and/or features for model

Anomaly detection vs. Supervised learning
– anomaly: very small #positive examples/anomalies, 0-20; large #negative examples
many different types of anomalies, unknown anomalies
fraud detection; manufacturing; monitoring (computers)
– supervised: large #positive and negative examples
well defined positive examples, future examples likely to be similar to ones in training set
email spam; weather prediction; cancer classification

What features to use
– is feature histogram normal/Gauss?
– can you transform feature data to normal distribution?
x1 = log(x1)
x2 = log(x2 + c)
x3 = x3^1/2
– we want p(x) large for good examples; p(x) small for 'bad'
what if p(x) is comparable for both?
– look at records and find features that differ
– compose synthetic features: x5 = x4/x3 = CPUload/networktraffic

Multivariate Gaussian distribution
– consider not single feature, but a combination of a few
– #features = n; vector mu size n; matrix Sigma size n,n // covariance matrix
mu; Sigma: model parameters
p(x, mu, Sigma) = (1/(2Pi^n/2 * Sigma.det^1/2) )* exp(-1/2 * (x — mu).T * Sigma.inv * (x — mu))
Sigma.det: determinant of Sigma
Sigma.inv: inversion of Sigma
– changing Sigma values you can change distribution bell-shape form arbitrarily

parameter fitting
– given training set X m,n
mu = 1/m sum (for i in 1..m: x(i))
Sigma = 1/m sum (for i in 1..m: (x(i) — mu)*(x(i) — mu).T)

detect anomaly
– given a new example x, compute
p(x, mu, Sigma)
if p(x) < epsilon: flag anomaly

original model vs multivariate
– original model is a partial case of multivariate model
only Sigma diagonal values can change
bell-shape can change only along axes
– original model computationally cheaper, scales better for large n
have to manually create features to capture anomalies where x1,x2 take unusual combinations
OK even if m is small
– multivariate Gaussian automatically captures correlations between features
must have m > n or else Sigma is non-invertible

Recommender Systems

example: predicting movie ratings
– movie 1: users 1,2,3 rate it to 5,5,0
– movie 2: … rate to ?,4,0
nu: #users
nm: #movies
r(i,j) = 1 if user j has rated movie i
y(i,j) = rating given by user j to movie i (only if r(i,j) = 1)
– predict rating for undefined combinations movie,user

Content based recommendations
– for movies we have some features n=2: x1: romance; x2: action: real numbers
adding intercept feature = 1, got a vector size 3
– for each user j, learn a parameter theta(j): vector[3];
predict rating (i,j) = theta(j).T * x(i) 
theta(j): parameter vector for user j
x(i): feature vector for movie i
m(j): #of movies rated by user j
– to learn theta(j):
minimize 1/2m(j) * sum (for i in rated_by_j: (predict — y(i,j))^2) + reg 
predict = theta(j).T * x(i) 
– regularization:
reg = lambda/2m(j) * sum (for k in 1..n: theta(j,k)^2) // don't penalize theta0
– for all users, matrix Theta
min J = ½ sum(for j in 1..nu: sum(for i in rated_by_j: (predict — y(i,j))^2) ))  + reg
predict = Theta(j).T * x(i) 
reg = lambda/2 * sum(for j in 1..nu: sum(for k in 1..n: Theta(j,k)^2)) // don't penalize theta0
– gradient descent update: k in 0..n
Theta(j,k) = Theta(j,k) — alpha * (sum(for i in rated_by_j: gradient (+ reg if k != 0)))
gradient = (Theta(j).T * x(i) — y(i,j)) * x(i,k)
reg = lambda * Theta(j,k)

Collaborative filtering
– we don't know values of movie features (romance x1, action x2: undefined)
– but we have ratings and Theta
– find/learn X that minimize J
min J = ½ sum(for i in 1..nm: sum(for j in rated_by_j: (predict — y(i,j))^2) ))  + reg
predict = Theta(j).T * x(i) 
reg = lambda/2 * sum(for i in 1..nm: sum(for k in 1..n: x(i,k)^2)) // don't penalize x0
– gradient descent update: k in 0..n
x(i,k) = x(i,k) — alpha * (sum(for j in rated_by_j: gradient (+ reg if k != 0)))
gradient = (Theta(j).T * x(i) — y(i,j)) * Theta(j,k)
reg = lambda * x(i,k)

– given features x for each movie, and movie ratings y: can estimate Theta
– given theta for each user, and movie ratings y: can estimate X
– w/o theta: only ratings y?
– minimize J on both Theta and X:
J(Theta,X) = ½ sum(for i,j in ratings: (predict — y(i,j))^2) + regX + regTheta
predict = Theta(j).T * x(i)
regX = lambda/2 * sum(for i in 1..nm: sum(for k in 1..n: x(i,k)^2))
regTheta = lambda/2 * sum(for j in 1..nu: sum(for k in 1..n: Theta(j,k)^2))
– and theta0,x0 can be penalized by regularization!

collaborative filtering algorithm
– initialize x foreach movie, theta foreach user to small random values (symmetry breaking, like in NN)
– minimize J(Theta,X) using optimization algorithm (gradient descent) // no need to add intercept items to theta and x!
for j in 1..nu, i in 1..nm:
X(k,i) = x(k,i) — alpha * gradientTheta 
Theta(k,j) = Theta(k,j) — alpha * gradientX 
gradientTheta = sum(for j in ratings(i,j): (predict — y(i,j)) * Theta(k,j)) + lambda * x(k,j)
gradientX = sum(for i in ratings(i,j): (predict — y(i,j)) * x(k,j)) + lambda * Theta(k,j)
– for a movie with learned features x, and user with parameters theta: predict rating theta.T * x

Vectorization: low rank matrix factorization
– Y(i, j): i in 1..nm, movies; j in 1..nu, users
– Y(i,j) = theta(j).T * x(i) = real number
– X: matrix [nm,n]
– Theta: matrix [nu,n]
– Y = X * Theta.T // nm,n * n,nu = nm rows, nu columns

find the 5 movies j, most similar to movie i:
– smallest ||x(i) — x(j)|| will do // distance, norm

implementation detail: ratings mean normalization
– if for user j no ratings exists: user theta will be = 0
no good
– calculate mean rating for movie, mu
– for user j, on movie i, predict rating: Theta(j).T * X(i) + mu
– if movie have not ratings, better drop that movie
– you don't do scaling for ratings: 0..5 stars is all right

Example (Octave)
– anomaly detection, gaussian distribution
% mu = 1/m * sum(x) 
% sigma2 = 1/m * sum( (x - mu)^2 ) 
mu = mean(X, 1)'; % help mean 
sigma2 = var(X, 1, 1)'; % help var

– find max F1-score, select threshold
bestEpsilon = 0; 
bestF1 = 0; 
F1 = 0; 
 % Parameters 
%   yval is a vector m with labels 1=anomalous, 0=normal 
%   pval is a vector m with computed probabilities - predictions 
% Returns 
%   bestEpsilon is a real number, selected treshold epsilon 
%   bestF1 is a real number, F1 metrica for selected treshold, max(f1) 
% If an example x has a low probability p(x) < ε, then it is considered to be an anomaly 
% tp is the number of true positives: the ground truth label says it’s an anomaly and our algorithm correctly classified it as an anomaly. 
% fp is the number of false positives: the ground truth label says it’s not an anomaly, but our algorithm incorrectly classified it as an anomaly. 
% fn is the number of false negatives: the ground truth label says it’s an anomaly, but our algorithm incorrectly classified it as not being anomalous. 
% F1 = (2 * prec * rec) / (prec + rec); % F1 score 
% prec = tp / (tp + fp); % precision 
% rec = tp / (tp + fn); % recall 
% tp = sum((cvPredictions == 1) & (yval == 1)) % true positives 
% fp = sum((cvPredictions == 1) & (yval == 0)) % false positives 
% fn = sum((cvPredictions == 0) & (yval == 1)) % false negatives 

stepsize = (max(pval) - min(pval)) / 1000; 
for epsilon = min(pval):stepsize:max(pval) 
    % calc F1 for current epsilon 
    cvPredictions = (pval < epsilon); 
   tp = sum((cvPredictions == 1) & (yval == 1)); % true positives 
    fp = sum((cvPredictions == 1) & (yval == 0)); % false positives 
    fn = sum((cvPredictions == 0) & (yval == 1)); % false negatives 
   prec = tp / (tp + fp); % precision 
    rec = tp / (tp + fn); % recall 
   F1 = (2 * prec * rec) / (prec + rec); % F1 score 
   if F1 > bestF1 
       bestF1 = F1; 
       bestEpsilon = epsilon; 
    end 
end % for each epsilon 

– collaborative filtering costFunc
% w/o regularization
% J = 0.5 * sum( (Theta' * X - Y)^2 ) for each rated movie % C = A * B; total = sum(sum(C(R == 1)));
Errs = power((X * Theta') - Y, 2);
errsSum = sum(sum(Errs(R == 1)));
J = 0.5 * errsSum;
+ gradient
% Theta 5x2 - users preferences
% X 6x2 - movies features
% Y 6x5 - movies ratings
% R 6x5 - rated flags
% X_grad = 6x2 = sum( (Theta' * X - Y) * Theta ) for rated movies only
% Theta_grad = 5x2 = sum( (Theta' * X - Y) * X ) for rated movies only
H = (X * Theta') .* R; % 6x5
Yr = (Y .* R); % 6x5
X_grad = (H - Yr) * Theta; % 6x2
Theta_grad = (H - Yr)' * X; % 5x2
+ regularization
% regularization
reg1 = (lambda / 2) * sum(sum(power(Theta, 2)));
reg2 = (lambda / 2) * sum(sum(power(X, 2)));
J = J + reg1 + reg2;
% Returns
%   J is a real number, cost value for given Theta, X, Y, R
%   X_grad is a 6x2 (nm,nf) matrix, gradient vectors for movies
%   Theta_grad is a 5x2 (nu,nf) matrix, gradient vectors for users
gradReg1 = lambda * X;
gradReg2 = lambda * Theta;
X_grad = X_grad + gradReg1;
Theta_grad = Theta_grad + gradReg2;

Week 10

– Large-Scale Machine Learning
– Example of an application of machine learning

Large-Scale Machine Learning

more data: more accuracy
– take low bias/high variancy/overfitting algorithm/model and train it on huge dataset
– problem: min costFunc, every iteration perform … sum(for i in 1..m: … ); m » 1M examples
– you should check: is it enough to take small subset of data? Say, 1000, 10K records?
– plot a learning curve: Jtrain, Jcv on axes: m/error; only in case of overfitting bigdata can help

bigdata processing in ML
– stochastic gradient descent
– map-reduce

stochastic gradient descent
– batch gradient descent, essentials: use m examples in each iteration
J = 1/2m * sum(for i in 1..m: (h(x(i) - y(i))^2)
theta(j) = theta(j) — alpha * gradient 
gradient = 1/m * sum(for i in 1..m: (h(x(i)) — y(i))*x(i,j))
– stochastic GD: use 1 example in each iteration
1. randomly shuffle training examples
2. repeat (until converge, 1..10 times?)
for i in 1..m: foreach j: theta(j) = theta(j) — alpha * (h(x(i) — y(i)) * x(i,j)

mini-batch gradient descent: use b examples in each iteration
– b: batch size, 2..100
say b = 10, m = 1000
repeat (until converge)
for i in 1, 11, 21, 31, ...991: foreach j: theta(j) = theta(j) — alpha * grad
grad = 1/10 * sum(for k in i..i+9: (h(x(k)) - y(k)) * x(k,j))
– can be faster than stochastic GD, vectorization rulezz
– should spend time to tune hyperparameter b

stochastic GD convergence (learning rate alpha selection)
– for batch GD: plot error Jtrain over #iterations, J should descend monotonically until convergence
– for stochastic GD: compute cost = ½ (h — y)^2 before updating theta;
every k iterations (say 1000) plot average(cost) / #iterations
to get more smooth line, increase k
cost should descend
– you can slowly decrease alpha over time: alpha = const1 / (#iteration + const2)

Online learning (stream of data)
– example: shipping web-service, user comes, specifies origin, destination, size, weight of package, …
you offer to ship their package for some asking price,
users sometimes agree (y=1), sometimes not (y=0)
– we want to know optimized price
– features X capture properties of user, package, origin/destination, asking price
we want to learn p(y=1|x,theta) to optimize price
– algorithm (like stochastic GD w/o iterations):
repeat forever // for each next deal:
get x,y corresponding to user;
update theta using x,y: foreach j: theta(j) = theta(j) — alpha * (h - y)*x(j)
use updated theta to predict.
– record come, being used, dropped: good for big services, where you can't save all data

– example: product search
– user searches for 'android phone full-hd'
– have 100 phones, will return 10 records, which one?
– x: features of phone, how many words in query match name of phone, … description of phone, …
– y = 1 if user clicks on link
– learn p(y=1|x,theta)
– show 10 phones with max p

Map-Reduce, data parallelizm
– lots of computing nodes
– say, we have 4 nodes, m = 400 examples/records
temp1(j) = sum(for i in 1..100: (h - y)*x(j))
temp2(j) = sum(for i in 101..200: (h - y)*x(j))
temp3(j) = sum(for i in 201..300: (h - y)*x(j))
temp4(j) = sum(for i in 301..400: (h - y)*x(j))
– then, main node can combine:
theta(j) = theta(j) — alpha 1/400 (temp1 + temp2 + temp3 + temp4)
– many learning algorithms can be expressed as computing sums over the training set
for NN: compute forward/back propagation on 1/#nodes of the data to compute the derivative with respect to that 1/#nodes of the data

Example of an application of machine learning

Photo OCR : problem description and pipeline
– normalize pictures
– text detection (regions with text)
– character segmentation (one char at a time)
– character classification (is that a 'A'? or 'B'?)

pipeline: a system with many stages/components

text detection: sliding window
– first, lets do pedestrian detection
– prepare images, say, 82x36 pix to teach NN classifier; create NN classifier
– sliding window detection: take a picture, apply a 82x36 rectangle (patch, window), ask classifier;
– shift (step size aka stride) patch and ask again;
– scale window, repeat process (before asking classifier you need to scale picture back to 82x36)
with text it's the same: you get a classifier, that can detect presence of chars/letters; apply sliding window.

now you have a picture with marked regions, where chars where detected
– black/white image, white means: text (probably: higher p => whiter color)
– apply expansion operator: if near pixel p was found white pixel: p also is white
– enclose white regions in frames/boxes
– remove boxes with wrong aspect ratio

character segmentation
– prepare classifier that detects (for image): it's a gap between chars or not?
– apply sliding window + classifier to find all gaps in image (enclosed in box on prev.step)
now you can do character classification/recognition
– having extracted char image, you can recognize it using NN classifier

getting lots of data, artificial data
– low biased model/algorithm + massive training set = success
– you can expand initial dataset (transformations)
synthesizing data by introducing distortions
– you can generate artificial data
print text with different fonts, on different backgrounds => lots of artifical data

speech recognition
– original audio
– on bad cellphone connection
– noisy background: crowd, machinery, …

usually does not help to add purely random noise to your data

make shure you have a low bias classifier before expending the effort.
– plot learning curves
– keep increasing the number of features/number of hidden units in NN

how much work would it be to get 10x as much data as we currently have?
– artificial data synthesis
– collect/label it yourself

Ceiling analysis: what part of the pipeline to work on next
– real number metrics to evaluate system/model/pipeline modules
say, accuracy is the system metric
– overall system accuracy: 72%
– fake text detector (100%): system 89% (+17%)
– fake char segmentation: system 90% (+1%)
– fake char recognition: system 100% (+10%)
you should work on text detector, obviously.

example: face recognition
– pipeline: camera image, preprocessing (remove background, brightness, black/white,...),
face detection, (eyes segmentation, nose segm, mouth segm, …), logistic regression, label.
– do ceiling analysis: work face detection, eyes segmentation)






original post http://vasnake.blogspot.com/2016/08/machine-learning-andrew-ng-coursera.html

1 комментарий:

Архив блога

Ярлыки

linux (241) python (191) citation (186) web-develop (170) gov.ru (159) video (124) бытовуха (115) sysadm (100) GIS (97) Zope(Plone) (88) бурчалки (84) Book (83) programming (82) грабли (77) Fun (76) development (73) windsurfing (72) Microsoft (64) hiload (62) internet provider (57) opensource (57) security (57) опыт (55) movie (52) Wisdom (51) ML (47) driving (45) hardware (45) language (45) money (42) JS (41) curse (40) bigdata (39) DBMS (38) ArcGIS (34) history (31) PDA (30) howto (30) holyday (29) Google (27) Oracle (27) tourism (27) virtbox (27) health (26) vacation (24) AI (23) Autodesk (23) SQL (23) humor (23) Java (22) knowledge (22) translate (20) CSS (19) cheatsheet (19) hack (19) Apache (16) Manager (15) web-browser (15) Никонов (15) Klaipeda (14) functional programming (14) happiness (14) music (14) todo (14) PHP (13) course (13) scala (13) weapon (13) HTTP. Apache (12) SSH (12) frameworks (12) hero (12) im (12) settings (12) HTML (11) SciTE (11) USA (11) crypto (11) game (11) map (11) HTTPD (9) ODF (9) Photo (9) купи/продай (9) benchmark (8) documentation (8) 3D (7) CS (7) DNS (7) NoSQL (7) cloud (7) django (7) gun (7) matroska (7) telephony (7) Microsoft Office (6) VCS (6) bluetooth (6) pidgin (6) proxy (6) Donald Knuth (5) ETL (5) NVIDIA (5) Palanga (5) REST (5) bash (5) flash (5) keyboard (5) price (5) samba (5) CGI (4) LISP (4) RoR (4) cache (4) car (4) display (4) holywar (4) nginx (4) pistol (4) spark (4) xml (4) Лебедев (4) IDE (3) IE8 (3) J2EE (3) NTFS (3) RDP (3) holiday (3) mount (3) Гоблин (3) кухня (3) урюк (3) AMQP (2) ERP (2) IE7 (2) NAS (2) Naudoc (2) PDF (2) address (2) air (2) british (2) coffee (2) fitness (2) font (2) ftp (2) fuckup (2) messaging (2) notify (2) sharepoint (2) ssl/tls (2) stardict (2) tests (2) tunnel (2) udev (2) APT (1) CRUD (1) Canyonlands (1) Cyprus (1) DVDShrink (1) Jabber (1) K9Copy (1) Matlab (1) Portugal (1) VBA (1) WD My Book (1) autoit (1) bike (1) cannabis (1) chat (1) concurrent (1) dbf (1) ext4 (1) idioten (1) join (1) krusader (1) license (1) life (1) migration (1) mindmap (1) navitel (1) pneumatic weapon (1) quiz (1) regexp (1) robot (1) science (1) serialization (1) spatial (1) tie (1) vim (1) Науру (1) крысы (1) налоги (1) пианино (1)