В далеком 2014
году я открыл для себя новое измерение:
возможность учиться у лучших.
Все, что нужно,
это компьютер, интернет и знание
английского языка.
Первый курс,
который я прошел (с отличием!), назывался
Machine Learning
читал его
Andrew Ng
на платформе
Coursera, Stanford
Вот тут есть
подробное изложение изученного:
Сегодня, с
целью взбодрить память, я скомпилял
типа дайджест курса.
Енджой.
Week 1
Introduction
Linear Regression with One Variable
(Optional) Linear Algebra Review
Week 2
Linear Regression with Multiple Variables
Octave Tutorial
Week 3
Logistic Regression
Regularization
Week 4
Neural Networks: Representation
Week 5
Neural Networks: Learning
Week 6
Advice for Applying Machine Learning
Machine Learning System Design
Week 7
Support Vector Machines (SVMs)
Week 8
Clustering
Dimensionality Reduction
Week 9
Anomaly Detection
Recommender Systems
Week 10
Large-Scale Machine Learning
Example of an application of machine learning
Week 1
– Introduction
– Linear Regression
with One Variable
– (Optional) Linear
Algebra Review
Introduction
Tom Mitchell:
… we say that a
machine learns with respect to a particular task T, performance
metric P, and type of experience E, if the system reliably improves
its performance P at task T, following experience E.
Algorithms
classification:
supervised,
unsupervised. Also: reinforcement learning, recomender systems.
Supervised learning
– you have set of
answers and you feed it to machine for training/learning
– regression problem:
continous valued output, polynomial function (house price)
– classification
problem: 0/1 with probability, discrete valued output (is that tumor
malignant?)
Unsupervised learning
– you have no
answers, only data: have that data any structure? Clusterization.
– cocktail party
problem: having a set of microphones, extract different sound
streams/sources.
Linear Regression
with One Variable (univariate linear regression)
Example: house price
derived from house size.
Input dataset: m –
number of records; x – variable – vector of features (square
foots); y – target (price);
(x, y) – some record
from dataset;
(xi, yi) – i-th
record from dataset.
The job of a learning
algorithm: output a function h (hypothesis):
y = h(x)
h representation:
h(x) = theta0 + theta1 * x // model
learning algorithm should find right vector theta (model parameters).
How to find theta?
Cost function: squared
error, find min(J) changing parameters theta somehow.
min sum ((h(x) — y)^2 for each record 0..m)
Cost function:
J(theta) = 1/2m * sum (h(x) — y)^2 // bowl shaped function, convex function
aka: square error cost function.
Optimization objective:
minimize it by selecting the right theta.
Gradient Descent
(batch GD): find optimum/minimum for cost function J.
batch: on every step
we're looking at all of the training records.
repeat updating theta
until convergence (theta don't change in minimum point):
theta(j) = theta(j) — a * (d_J(theta)/d_theta(j))
j = 0..1 for our example
a: alpha, learning
rate (hyperparameter), step size: not too small/big;
d: derivative term,
function slope; ~ delta J / delta theta
d_J(theta) / d_theta0 = 1/m * sum (h(x) — y) d_J(theta) / d_theta1 = 1/m * sum (h(x) — y)*x
Variables matrix MxN: m
rows/records, n columns/variables;
target vector y: matrix
Mx1.
parameters vector
theta: matrix 1xN+1
(Optional) Linear
Algebra Review
Matrix i-rows x
j-columns;
Vector = matrix i-rows
x 1 column;
Matrix addition: C11 =
A11 + B11
Multiply matrix by
scalar: C11 = A11 * b
Matrix 3x2 * vector 2x1
= vector 3x1: c1 = A11*b1 + A12*b2
predictionVector Mx1 =
dataMatrix MxN+1 * modelParametersVector N+1x1
dataMatrix extra column
(first column) = 1, 1, ...
M: num of data records
N: num of variables
Matrix multiplication:
MxN * NxO = MxO // associative, not commutative
C = A * B: column i of
C = A * vector i of B
Identity matrix: I(i,i)
= 1 // diagonal == 1:
I * A = A * I = A if A has square shape
Inverse:
– inverse number for
real numbers: num * inv = 1 // 3 * 1/3 = 1
– inverse matrix:
square matrix
A * Ainv = Ainv * A = I
– if A have not Ainv: singular or degenerate matrix
Transpose:
– Bij = Aji if B =
A.transpose
Week 2
– Linear Regression
with Multiple Variables
– Octave Tutorial
Linear Regression
with Multiple Variables (multivariate linear regression)
House: size, num of
bedrooms, floors, age, …
x1, x2, … : vector x,
variables.
N: number of
features/columns
M: number of
examples/rows
xi: i-th row of
training set
xij: value of j-th
feature in i-th row
Hypothesis:
let x0 = 1, then
h_theta(x) = theta_transpose * x // parameters vector transpose * variables vector
Gradient Descent for
multivariate LR
– cost function
Jtheta = 1/2m * sum ((h_theta(x) — y)^2 for each row)
– minimize it using Batch GD: repeat until convergence:
theta_j = theta_j — a * 1/m * sum ((h_theta(x) — y)*x_j for each row)
Feature scaling
– make shure that the
features are on a similar scale
e.g. number of
bedrooms: 1..5; size 0..2000 sq.feet
– rule of thumb:
scale features to -1 .. +1 scale
– and do mean
normalization: make mean ~ 0
x = (x — mu) / si
mu: average value of x
si: standard deviation
of x (si = max — min)
Learning rate
– GD is working if
cost (Jtheta) is lowering on each iteration, build a graph #iter vs
Jtheta
– if GD is not
working: decrease alpha: learning rate
Choosing features,
polynomial regression
– e.g. replace width,
length with size = width * length
– polynomial model,
e.g:
theta0 + theta1*x + theta2*x^2 + theta3*x^3
– you can add to model higher order variables: x^2, x^3 (x1^2,
x1*x2, …)
and do feature scaling
Normal Equations:
solve min(Jtheta) analytically
– find where
derivative = 0
Octave: pinv(X' * X) * X' * y
X': Xtranspose
pinv: inverted,
pseudo-inverse, can do inversion for degenerate matrices
– if variables n >
10000 better not to use Normal Equations, (X' * X) is n-by-n matrix,
pinv(X) cost O(n^3)
– no need for feature
scaling, learning rate choosing.
– X'*X can be
degenerate if we have dubled features (size in foots and size in
meters);
or we have too many
features (m <= n).
Remove doubled
features; apply regularization
Octave tutorial
'%' is a comment
1 == 2 % 0 1 ~= 2 % 1 1 && 0 % and 1 || 0 % or xor(1, 0) a = 3 % assignment a = 3; % semicolon supressing output b = 'hi'; % string c = (3 >= 1) % 1 for true disp(a); % print sprintf('foobar: %0.2f', pi) % print PI a with 2 decimals format long % swith output to long number format — 14 decimals format short % … 4 decimals
matrices
A = [1 2; 3 4; 5 6] % matrix 3x2 A = [ 1 2; 3 4; 5 6 ] % the same size(A) % 3 2 matrix size which is matrix 1x2 size(A, 1) % first dimention, num rows = 3 length(v) % size of the longest dimension V = [1 2 3] % matrix 1x3 V = [1; 2; 3] % vector 3 or matrix 3x1 v = 1:0.1:2 % matrix 1x11 inited by range from 1 to 2 increment 0.1 v = 1:6 % matrix 1x6 by range 1 to 6 increment 1 ones(2, 3) % matrix 2x3 with '1' zeros... % matrix with '0' w = rand... % matrix with random numbers (normal distribution) 0 <= x <= 1 w = randn... % gaussian distribution hist(w) % draw histogram hist(w, 50) % draw histogram with 50 bins eye(4) % Identity matrix 4x4 (Diagonal Matrix) help cmd % print help text
loading
data
pwd % work dir cd 'path/to/dir' % change work dir ls % dir list load('featuresX.dat') % load the file who % show workspace variables whos % detail view featuresX % mx2 matrix pricesY % mx1 matrix (vector) size(featuresX) % 47 1 size(priceY) % 47 1 clear featuresX % remove variable from workspace clear % clear all workspace! v = priceY(1:10) % first 10 elements save hello.mat v % save variable v to file hello.mat — binary format save hello.txt v -ascii % save to ascii text file
indexing
operations
A(3, 2) % row 3 column 2 value A(2, :) % row 2 A(:, 2) % column 2 A([1 3], :) % row 1 and row 3 A(:, 2) = [10; 11; 12] % assign to second column, replace column 2 vith new vector A = [A, [100; 101; 103]] % append another column to A A(:) % put all elements of A into a single vector C = [A B] % concatenate A and B, add columns C = [A; B] % concatenate, add rows
matrices
ops
% let size(A) % 3 2 size(B) % 3 2 size(C) % 2 2 A*C % 3x2 matrix A .* B % like A + B only Aij * Bij % element-wise multiplication A .^ 2 % (A dot carry 2) element-wise squaring (power2) 1 ./ v % element-wise reciprocal (1/x) — обратная величина log(v) % element-wise logarithm exp(v) % element-wise base E exponentiation abs(v) % element-wise absolute value -v % element-wise negative (-1*v) v + ones(length(v), 1) % element-wise increment % == v + 1 A' % transpose A max(A) % maximum value column-wise [val, ind] = max(A) % val — max.value, ind — index in matrix A < 3 % element-wise comparison find(A < 3) % return indexes where value < 3 A = magic(3) % magic matrix — sums by rows, columns, diagonals is equal [r, c] = find(A >= 7) % return rows indexes and columns indexes % A(r, c) sum(A) % return sum of all elements prod(A) % return product of all elements floor(A) % rounds down elements ceil(A) % rounds up all elements max(rand(3), rand(3)) % max.values for two matrices max(A, [], 1) % returns max.values for each column % 1 means '1-st dimension of A' % == max(A) max(max(A)) % absolute maximum == max(A(:)) sum(A, 1) % sums per column sum(A, 2) % sums per row, returns vector A .* eye(9) % returns diagonal from A sum(sum(A .* eye(9))) % sum for diagonal flipud(eye(9)) % another diagonal % flip upside down, Permutation Matrix pinv(magic(3)) % pseudo-inverse magic matrix pinv(A) * A == I
visualisation
t = [0:0.01:0.98] y1 = sin(2*pi*4*t) plot(t, y1) y2 = cos(2*pi*4*t) plot(t, y2) plot(t, y1) hold on % don't remove prev.plot plot(t, y2, 'r') % red color xlabel('time') % label for x ylabel('value') % label for y legend('sin', 'cos') % draw a legend title('my plot') % draw a title print -dpng 'filename.png' % save the plot to a file close % close plot window % open two plot windows figure(1); plot(t, y1); figure(2); plot(t, y2); % define draw cell in plot grid subplot(1, 2, 1) % divides plot to a 1x2 grid, access first grid element axis([o.5, 1, -1, 1]) % change axis ranges clf % clear the figure %let A = magic(5) imagesc(A) % plot grid 5x5 colors imagesc(a), colorbar, colormap gray; % draw grid in gray tone and draw legend % comma chaining of commands/function calls
for,
while, if; functions
% let v = zeros(10, 1) % vector 10 % replace vector elements for i = 1:10, v(i) = 2 ^ i; end; % replace first 5 elements i = 1; while i <= 5, v(i) = 100; i = i + 1; end; % break, continue works i = 1; while true, v(i) = 999; i = i + 1; if i == 6, break; end; end; if v(1) == 1, disp('one'); elseif v(1) == 2, disp('2') else disp('not 1 nor 2') end; % functions % create file 'functionName.m' % write to it: function y = functionName(x) y = x ^ 2; % this function returns squares of argument addpath('path/to/dir/with/function file') % functions with multiple returns function [y1, y2] = funcName(x) y1 = x ^ 2; y2 = x ^ 3; % returns row vector [a, b] = funcName(3) % compute Cost Function X = [1 1; 1 2; 1 3] % design matrix y = [1; 2; 3] % results vector theta = [0; 1] % theta vector % costFunctionJ.m function J = costFunctionJ(X, y, theta) % X is the 'design matrix' containing our training examples. % y is the class labels m = size(X, 1) % number of training examples predictions = X * theta % predictions of hypothesis on all m examples sqrErrors = (predictions — y) .^ 2 % squared errors J = 1 / (2 * m) * sum(sqrErrors) % easy j = costFunctionJ(X, y, theta) % returns 0 for given theta
% unvectorize implementation prediction = 0.0 for j = 1:n+1, prediction = prediction + theta(j) * x(j) % vectorize implementation prediction = theta' * x
linear regression with
regularization
% call: J = linearRegCostFunction([ones(m, 1) X], y, theta, 1); % Parameters: % X - train.set variables, matrix 12x2 with column of '1' added % y - train.set results, vector 12x1 % theta - model parameters, vector 2x1 % lambda - regularization parameter, real number % Returns: % J - cost value for given parameters, real number % grad - gradients for cost function, vector 2x1 % Linear Regression Cost Function Formulae % J = 1/(2m) * (sumErr) + lambda/(2m) * reg % sumErr = sum((hyp - y)^2) % hyp = theta' * x % reg = sum(theta^2) % skip first parameter! hyp = X * theta; % vector 12x1 = 12x2 * 2x1 sumErr = sum(power(hyp - y, 2)); % real number = sum(vector) reg = sum(power(theta(2:end, :), 2)); % real number; skip first parameter! J = (sumErr / (2*m)) + ((lambda * reg) / (2*m)); % Linear Regression Cost Function Gradient Formulae % for j == 0 % dJ/dt0 = 1/m * sumPartErr % for j > 0 % dJ/dtj = 1/m * sumPartErr + (lambda/m * tj) % tj - theta(j) % sumPartErr = sum((hyp - y) * X) grad = (1/m) * X' * (hyp - y); % 2x12 * 12x1 = 2x1 vector size n,1 temp = theta; % vector size n,1 temp(1) = 0; % because we don't add anything for j = 0 grad = grad + ((lambda/m) * temp); % vector size n,1
Week 3
– Logistic Regression
– Regularization
Logistic Regression
Binary classification
problems, algorithm: Logistic Regression
– y: result, discrete
values;
– y E {0, 1}: binary
classification problems
– y E {1, 2, 3}:
multiclass classification problems
– 0 <= h_theta(x)
<= 1 : logistic regression
Hypotesis
h_theta(x) = g(theta.transpose * x)
– g: sigmoid function (logistic)
g(z) = 1 / (1 + e^-z)
– h_theta(x) : P(y=1 | x;theta) : probability that y = 1, given x
parametrized by theta.
Decision Boundary
– sigmoid graph shows
that g(z) > 0.5 if z > 0
– from that follows:
predict y = 1 if theta.transpose * x > 0
– decision boundary
equation: theta.transpose * x = - theta0
– decision boundary
can be in any form if we have polynomial functions (variables of
higher order)
Logistic Regression
Cost Function
– to get convex shape
(global minimum) we need different cost function
cost(h_theta(x), y) = -log(h_theta(x)) if y == 1 or -log(1 — h_theta(x)) if y == 0
– Cost Function J for Logistic Regression
J_theta = 1/m * sum(cost(h_theta(x), y) for each record) // where: cost(h_theta(x), y) = -y * log(h_theta(x)) — (1 — y) * log(1 — h_theta(x))
Gradient Descent looks
the same
– repeat until
convergence
theta_j = theta_j — a * (delta J / delta theta_j) // unwrapped: theta_j = theta_j — a/m * sum((h_theta(x) — y) * x_j for each record)
And don't forget
Features Scaling/Normalization.
Advanced Optimization
– given theta, we
have code that compute: J_theta; delta_J/delta_theta for each column;
– there exists a few
optimization algorithms:
gradient descent
conjugate gradient
BFGS
L-BFGS
– easy to use, more
complex, faster
- to use them you
should write and pass a function
(errCost, gradient) = costFunction(theta): errCost = J value; gradient = delta J / delta theta
Multiclass
Classification: One-vs-all (one-vs-rest)
– example: email
labels/tags, work, friends, family, …
– find out (for 3
classes) 3 hypothesis for binary classification:
work / rest; friends /
rest; family / rest
– on a new input
(prediction), pick the hypothesis (class) that give max h_theta(x)
Regularization
Overfitting problem :
accurate predictions on
a training set but, not generalize well to make accurate predictions
on new, unseen examples
– in case of small
set of features / low order polynom we can get an underfitting
(high bias):
straight line as a
decision boundary
– or, if we have too
many features / high order polynomial h function, we can get an
overfitting (high variance):
too curvy line as a
decision boundary
– what to do? drop
less important features; apply regularization, penalizing
model parameters.
don't penalize
parameter theta0
cost function,
regularization (linear regression)
– add to cost
function a new summand
J_theta = 1/2m * ( sum( (h(x) — y)^2 for each record) + lambda * sum( theta^2 for each column excluding col[0]))
lambda * sum(theta^2): regularization term (L2 regularization: Ridge
regression – theta^2)
lambda:
regularization parameter, hyperparameter
– if lambda is too
large: we get underfitting (straight line?)
For regression
models, the two widely used regularization methods are L1 and L2
regularization, also called lasso and ridge regression when
applied in linear regression.
L2 … models are
much more stable (coefficients do not fluctuate on small data changes
as is the case with unregularized or L1 models) … a predictive
feature will get a non-zero coefficient, which is often not the case
with L1
Gradient descent with
regularization (linear regression)
theta = theta*(1 — alpha * lambda/m) — alpha * (1/m * sum( (h(x) — y)*x for each record) )
theta: for each column except 0
1 — alpha * lambda/m
< 1 : penalize parameter theta
Normal equation with
regularization
– w/o regularization
theta = pinv(X.transpose * X) * X.transpose * y
– with regularization
theta = pinv(X.transpose * X + lambda * specialMatrix) * X.transpose * y
specialMatrix: n+1 x n+1 dimensions, diagonal, (0,0) = 0
– n.b. regularization
take care of degenerate X.transp * X // if m < n matrix may be
degenerate
Logistic Regression +
Regularization
– cost function with
regularization
J(theta)
= -[1/m*sumi=1m(y(i)*log(htheta(x(i))
+ (1-y(i))*log(1-htheta(x(i)))))] +
lambda/2m * sumj=1n(thetaj2)
don't penalize
parameter theta0!
– Gradient descent
with regularization (logistic regression)
thetaj
:= thetaj*(1-alpha*lambda/m) — alpha*1/m *
sumi=1m(htheta(x(i) —
y(i))*xj(i))
(1-alpha*lambda/m) :
regularization
advanced optimization
with regularization
– you should provide
function (Jval, gradient) = costFunc(theta)
– Jval = value of
cost function with regularization
– gradient = delta J
/ delta theta
= 1/m sum( (h(x) —
y) * x for each record) // for theta0
= 1/m sum( (h(x) —
y) * x for each record) + lambda/m * theta // for theta1..n
Week 4
– Neural Networks:
representation
problem: too many
features in high-order polynomial models
– computer vision:
50x50 pixels, 2500 pixels in image, n features = 2500 grayscale, 7500
RGB,
polynome 2-nd degree
would have ~ 3M parameters
– what can we do?
Neural Networks :
algorithms that try to mimic the brain
– neuron model:
logistic unit
input: vector x;
parameters vector theta (weight);
output: scalar
h_theta(x) // sigmoid (logistic) activation function (on/off)
h_theta(x) = 1 / (1 + e^-z); z = theta.transpose * x // or h_theta(x) = g(z)
– x[0]: bias unit = 1 (remember linear regression?)
– layers: input
layer, hidden l., output l.
– a(i,j): activation
of unit i in layer j
– Theta(j): matrix of
weigths controlling function mapping from layer j to layer j+1
– Theta(j) size: n2 x
n1+1 ; n1: #units in layer j; n2: #units in layer j+1
Forward propagation
– a(i,j) = g(z(i,j))
; from that follows
vector a(layer_i) = g(Theta(layer_i-1) * a(layer_i-1))
– layer-by-layer: forward propagation:
layer 0: a(0) = x; add bias unit to a(0): a(0,0) = 1 layer 1: a(1) = g(Theta(0) * a(0)) add bias unit to a(1): a(1,0) = 1 a(2) = g(Theta(1) * a(1))
Multiclass
classification, neural networs
– multiple output
units: ove-vs-all
– result is a vector;
y in training set also a vector
– only one elem in
vector == 1, other == 0
Week 5
– Neural Networks:
learning
L: number of layers
sl: number of units
(excluding bias unit) in layer l
K: number of classes
– binary
classification problem: one output unit, y = 0|1; K <= 2
– multi-class
classification problem: K >= 3, output vector y is size K
NN Cost Function
– alike logistic
regression, only added new dimensions for sums
J(Theta) = - 1/m * loss + regularization loss = sum (for_each_record_i for_each_class_k ( y(k,i) * log(h(theta,x,i,k)) + (1 — y(k,i)*log(1 — h(theta, x, i, k))) ) ) regularization = lambda/2m * sum (for_each_layer_l for_each_unit_i_in_l for_each_unit_j_in_l+1 Theta(l,j,i)^2)
– don't penalize theta0
Backpropagation
algorithm
– for CostFunction
optimization we need to compute:
cost value J(Theta)
gradient (partial
derivative) delta J / delta Theta(i, j, l)
– Theta(i,j,l): real
number
– for each layer
Theta size = variables size x units size
let delta(j,l) =
'error' of node j in layer l
let L = 4 ; j: unit
number
delta(j,4) = a(j,4) — y(j) = h(theta, x, j) — y(j)
in layer 3, vector form:
delta(3) = Theta(3).transpose * delta(4) .* g'(a(3)) g'(a(3)) = g(a(3)) .* (1 - g(a(3)))
.* : element-wise multiplication
delta(1) does not
exists, it's an input layer
bottom line: gradient
(partial derivative) dJ/dtheta(i, j, l) = a(j,l) * delta(i, l+1) if
lambda = 0 // no regularization
set Delta(i, j, i) = 0
for all i, j, l
for i =1 to m:
set a(1) = x(i) //
input vector, variables
perform forward
propagation, compute a(l) for l = 2,3...L
using y(i), compute
delta(L) = a(L) — y(i)
compute delta(L-1),
delta(L-2, ...delta(2)
aggregate Delta(l) =
Delta(l) + (delta(l+1) * a(l).transpose)
gradient D:
D(i,j,i) = 1/m * Delta(i,j,l) + ( regularization if j != 0 ) regularization = lambda * Theta(i,j,l)
gradient (partial derivative) dJ/dtheta(i, j, l) = D(i,j,l)
Theta(1), Theta(2),
Theta(3) : parameter matrices for layers;
D(1), D(2), D(3) :
gradient matrices for layers.
We need vectors to use
in optimization functions.
Unroll parameters
thetaVec = [ Theta1(:);
Theta2(:); Theta3(:) ];
DVec = [ D1(:); D2(:);
D3(:) ];
Gradient checking
(numerical gradient), very slow:
gradApprox = (J(theta + EPSILON) — J(theta — EPSILON)) / 2 * EPSILON
delta J / delta theta
for i = 1:n, // number of parameters thetaPlus = theta; thetaPlus(i) = thetaPlus(i) + EPSILON; thetaMinus = theta; thetaMinus(i) = thetaMinus(i) — EPSILON; gradApprox(i) = (J(thetaPlus) — J(thetaMinus)) / (2*EPSILON); end;
check that gradApprox ~ DVec
Theta random
initialization
Theta1 = rand(10,11) * (2*INIT_EPSILON) — INIT_EPSILON;
breaks the weights symmetry
NN, altogether
– select NN
architecture: number of hidden layers, number of units in layers
number of input units
= #variables; #output units = #classes
by default: one hidden
layer; #hidden units ~ #features
– init Theta (random)
– implement FP to get
h(Theta, x(i)) for any x(i)
– implement cost
function J(Theta)
– implement BP to
compute partial derivatives delta J / delta Theta
– for i = 1:m // for
each record in dataset
perform FP, BP (x(i),
y(i))
// activations a(l)
and delta(l) for l = 1:L; accumulated D
– use gradient
checking to compare D and gradApprox; disable checking
– use optimization
method to try to minimize J(Theta)
– make shure J(Theta)
is decreasing with every iteration
Example (Octave):
– cost function
% Parameters: % Theta1 is a model parameters for first layer, matrix size (hidden_layer_size, input_layer_size+1); % Theta2 is a m.p. for second layer, matrix size (num_labels, hidden_layer_size+1); % X is a training set variables, matrix size (m, input_layer_size) % y is a t.s. results filled with labels values, vector size (m, 1) % lambda is a regularization parameter, real number % Returns % J is cost value for given Theta and training set, real number % Theta1_grad, Theta2_grad is a gradients of cost function for each theta, matrices size(Theta1), size(Theta2) % FP, h(theta, x); from ex3 A1 = [ones(m, 1) X]; % matrix size m,401 Z2 = A1 * Theta1'; % matrix size m,401 * 401,25 = m,25 A2 = sigmoid(Z2); % matrix size m,25 A2 = [ones(m, 1) A2]; Z3 = A2 * Theta2'; % matrix size m,25 * 26,10 = m,10 A3 = sigmoid(Z3); % matrix size m,10 costVector = zeros(m, 1); for i = 1:m % K = num_labels h = A3(i, :)'; % vector size (K, 1) yi = zeros(num_labels, 1); % yi is a vector size (K, 1) yi(y(i)) = 1; outCost = ((-1 * yi) .* log(h)) - ((1 - yi) .* log(1 - h)); % vector size (K, 1) costVector(i) = sum(outCost); % real number end J = (1/m) * sum(costVector); % real number % Cost Regularization subSum1 = sum(sum(power(Theta1(:, 2:end), 2))); subSum2 = sum(sum(power(Theta2(:, 2:end), 2))); reg = (lambda / (2 * m)) * (subSum1 + subSum2); % scalar J = J + reg;
– backprop
% Forward Propagation, h(theta, x); from ex3 A1 = [ones(m, 1) X]; % matrix size m,401 Z2 = A1 * Theta1'; % matrix size m,401 * 401,25 = m,25 A2 = sigmoid(Z2); % matrix size m,25 A2 = [ones(m, 1) A2]; % m,26 Z3 = A2 * Theta2'; % matrix size m,26 * 26,10 = m,10 A3 = sigmoid(Z3); % matrix size m,10 % Cost Value, Gradient costVector = zeros(m, 1); deriv1 = zeros(hidden_layer_size, input_layer_size+1); % 25x401 deriv2 = zeros(num_labels, hidden_layer_size+1); % 10x26 for i = 1:m % K = num_labels = 10 % cost values h = A3(i, :)'; % vector size (K, 1) yi = zeros(num_labels, 1); % yi is a vector size (K, 1) yi(y(i)) = 1; outCost = ((-1 * yi) .* log(h)) - ((1 - yi) .* log(1 - h)); % vector size (K, 1) costVector(i) = sum(outCost); % real number % Back Propagation delta3 = h - yi; % vector size (K, 1) % remove bias unit from delta calc! % delta2 = Theta2' * delta3 .* sigmoidGradient(z2) delta2 = Theta2(:, 2:end)' * delta3 .* sigmoidGradient(Z2(i, :)'); % vector size (25, 1) : 25x10 * 10x1 .* 25x1 % delta1 not exists % gradient for all units, yes, bias too! % deriv2 = delta3 * a2' deriv2 = deriv2 + (delta3 * A2(i, :)); % matrix 10x26 : 10x26 + 10x1 * 1x26 deriv1 = deriv1 + (delta2 * A1(i, :)); % matrix 25x401 : 25x401 + 25x1 * 1x401 end % Cost J = (1/m) * sum(costVector); % real number % Gradient Theta1_grad = deriv1 / m; % 25x401 Theta2_grad = deriv2 / m; % 10x26 % Cost Regularization subSum1 = sum(sum(power(Theta1(:, 2:end), 2))); subSum2 = sum(sum(power(Theta2(:, 2:end), 2))); reg = (lambda / (2 * m)) * (subSum1 + subSum2); % scalar J = J + reg; % Gradient Regularization % reg = 0 if j == 0 (j is a number of unit in left/input layer) % reg = (lambda / m) * Theta if j != 0 % grad = deriv /m + reg Reg1 = (lambda / m) * Theta1; % 25x401 Reg2 = (lambda / m) * Theta2; % 10x26 Reg1(:, 1) = 0; Reg2(:, 1) = 0; Theta1_grad = Theta1_grad + Reg1; Theta2_grad = Theta2_grad + Reg2; % Unroll gradients grad = [Theta1_grad(:) ; Theta2_grad(:)]; % the end
– sigmoid gradient
% sigmoid(z) = g(z) gz = 1 ./ (1 + exp(-z)); % gradient = dg/dz = g(z) * (1 - g(z)) g = gz .* (1 - gz);
- randomly initialize
weights
epsilon_init = 0.12; W = rand(L_out, 1 + L_in) * 2 * epsilon_init — epsilon_init; einit = sqrt(6) / sqrt(Lin + Lout)
Week 6
– Advice for Applying
ML
– ML Systed Design
Advice for Applying
ML
Deciding what to try
next
– well trained linear
regression make a unacceptably large errors in its predictions
– what can you do?
get more training
examples (if overfitting)
try smaller sets of
features (if overfitting)
try getting additional
features (if high bias)
try adding polynomial
features (if high bias)
try
decreasing/increasing lambda (underfitting/overfitting)
– use ML diagnostic
to check/select next step
Evaluating a hypothesis
– split data to
training set (70%) and test set (30%)
– overfitting:
training set is OK (low error), test set is bad (high error)
testErr = 1/2m * sum (h
- y)^2
Model selection and
train/validation/test sets
– split data set to
train (60%), cross-validation (20%), test set (20%)
– select models
(hyperparameters): polynomial degree (d), regularization
(lambda), threshold (for logistic regression)…
– different models,
different parameters Theta
– find model with
min. error on cross-validation set; check selected model on test set
Diagnosing Bias vs
Variance
– first thing to do
– underfitting vs
overfitting
– plot error / degree
of polynomial : two lines: training error, cross-validation error
training error is
lowering monotonically
cv error have u-shape
– left part of plot:
bias problem, both errors are hight;
– right part:
variance problem, training error is low, cv error is hight
Regularization and
Bias/Variance
– large lambda: high
bias; small lambda: high variance
– select set of x
lambdas: lambda = 0, 0.01, 0.02, 0.04, …
– train x models, get
x theta parameters
– check x models on
cv set, select one with min err
– check model on test
set
– plot a graph
lambda/errors for train/cv tests
train error grow with
lambda
cv error have u-shape
left part:
overfitting; right part: underfitting
Learning curves
– plot graph:
error/training set size; cv error and train error
– high
bias/underfitting: both errors are high and close to each other
if you have high bias
problem, getting more training data will not help much
– high
variance/overfitting: training error are small, cv error are
significantly larger
getting more data is
likely to help
Neural networks and
overfitting
– 'small' NN more
prone to underfitting
– 'large' NN more
prone to overfitting, regularization can help
– number of layers,
number of units, … : hyperparameters, can tune using train/cv/test
subsets
ML Systed Design
Prioritizing what to
work on
– start with a simple
algorithm that you can implement quickly
implement it, test it
on cv data
– plot learning
curves to decide if more data/features/etc are likely to help
– error analysis:
manually examine the examples (in CV set) that your algorithm made
errors on
see if you spot any
systematic trend
select new features
– use good
numerical metrics to evaluate models/algorithm
cross-validation error
Error metrics for
skewed classes
– in that case use
Precision/Recall/F1 metrics
e.g. your logistic
regression predict cancer with 1% error;
but only 0.5% of
patients have cancer.
function h = 0 give as
0.5% error, wicked!
– true positive: y=1,
h=1
– false positive:
y=1, h=0
– true negative: y=0,
h=0
– false negative:
y=0, h=1
– precision:
truePositive of all positive: tp/(tp + fp)
– recall:
truePositive of real positive: tp/(tp + fn)
– accuracy =
(true positives + true negatives) / (total examples)
Trading off precision
and recall
– logistic
regression: 0 <= h <= 1
predict 1 if h >=
0.5, 0 otherwise
– what if threshold
set to 0.7?
high precision, low
recall
– if we want to
detect most cases of cancer: set high recall: threshold = 0.3, for
example
– select threshold
with max F1 Score
F1 Score = 2 * (P * R)
/ (P + R)
Data for ML
– if human expert can
use data for prediction, and if we can collect more data: collect it
– given the input x,
can a human expert confidently predict y?
– if we have many
features, we need larger training set (unlikely to overfit)
example (Octave)
% call: [error_train, error_val] = learningCurve([ones(m, 1) X], y, [ones(size(Xval, 1), 1) Xval], yval, lambda); for i = 1:m % find theta % [theta] = trainLinearReg([ones(m, 1) X], y, lambda); [theta] = trainLinearReg(X(1:i, :), y(1:i, :), lambda); % training error % [J, grad] = linearRegCostFunction(X, y, theta, lambda) [J grad] = linearRegCostFunction(X(1:i, :), y(1:i, :), theta, 0); etr = J; % cross validation error [J grad] = linearRegCostFunction(Xval, yval, theta, 0); ecv = J; error_train(i, 1) = etr; error_val(i, 1) = ecv; end
Week 7
– Support Vector
Machines (SVM)
Support Vector Machines
(SVM) : large margin classification
– logistic regression
has errorCost > 0 if z >=1
z = theta.transpose *
x // polynome
h = 1 / (1 + e^-z) //
hypotesis = 1 if z >= 0, 0 otherwise
– we want
cost1(z) = 0 if z >= 1, y=1; cost0(z) = 0 if z <= -1, y=0
– then min cost (cost
function optimization) over theta:
min C * sum (for each record: y * cost1(z) + (1 — y) * cost0(z)) + reg reg = ½ sum (for each parameter except first: theta^2) C = 1/lambda // you need to select lower C to fight overfitting
Informally, the C parameter is a positive value that controls the
penalty for misclassified training examples.
A large C parameter
tells the SVM to try to classify all the examples correctly
(overfitting).
Mathematics Behind
Large Margin Classification
– vector inner
product: u.T * v = u1*v1 + u2*v2 = v.T*u // real number
– ||u||: is a norm of
vector u, length(u) = sqrt(u1^2 + u2^2) // real number
– p: length of
projection of vector v onto vector u
– inner product = p *
||u||
SVM decision boundary:
– min ½ sum theta^2
= ½ (theta1^2 + theta2^2) = min ||theta|| // norm of vector theta
minimize length of
theta
– theta.T * x = p *
||theta|| // projection x to theta
maximize projection x
to theta
– SVM: large margin
classification
Kernels
Non-linear decision
boundary
– try to replace
high-degree polynomial features with synthetic features
f1 = similarity(x, l1) = exp( - ||x — l1||^2 / 2sigma^2) … fm = ...
l1: landmark: l1 = x1, … lm = xm
– similarity: kernel
function (Gaussian Kernels)
if x ~ l1: f1 ~ 1
if x is far from l1:
f1 ~ 0
– sigma:
hyperparameter, control kernel shape
predict y = 1 if
theta.T * f >= 0
– given m lendmarks,
compute m synthetic features f
fi = similarity(x, li)
– training set: m x m
matrix F
– use SVN, not
logistic regression (too many features)
– parameter sigma:
larger sigma: lover variance, increase sigma to fight overfitting
Using an SVM
– use packages:
liblinear, libsvm, …
– specify: param C,
kernel (similarity function)
no kernel: 'linear
kernel': f = x
– if we have many
features & not so many records: overfitting, no need for kernel
– if many records &
small #features: use Gaussian kernel, choose sigma
– do perform feature
scaling/normalization before using Gaussian kernel
function f = kernel(x1, x2) f = exp( -1 * ||x1 — x2||^2 / 2sigma^2) return
– n.b. not all similarity functions make valid kernels; need to
satisfy 'Mercer's Theorem'
– misc. kernels:
polynomial, string, chi-square, …
– multi-class
classification: embedded functionality of: one-vs-all, K SVMs
Logistic regression
vs SVM
– n: #features, m:
#records/examples
– if n is large
(relative to m, 10000 vs 1000): use logistic regression or SVM w/o a
kernel
– if n is small, m is
intermediate (1000 vs 10000): use SVM with Gaussian kernel
– if n is small, m is
large (1000 vs 50000): create/add more features, then use logistic
regression or SVM w/o a kernel
– Neural network
likely to work well for most of these settings, but slower to train
Example (Octave)
– You can think of
the Gaussian kernel as a similarity function that measures the
“distance” between a pair of examples
% Kgaussian = exp(-1 * norm(x1-x2)^2 / 2*sigma^2) sim = exp(-1 * ( power(norm(x1 - x2), 2) / (2 * power(sigma, 2)) ));
– select C and sigma
if ~exist('C_vect', 'var') || isempty(C_vect) C_vect = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30]; % by default end if ~exist('sigma_vect', 'var') || isempty(sigma_vect) sigma_vect = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30]; % by default end % for each pair of C and sigma from C_vect, sigma_vect % find model parameters (theta) % find model error on cross validation set % select pair C,sigma with min(error) minErr = 1000000; for currC = C_vect' for currSigma = sigma_vect' model = svmTrain(X, y, currC, @(x1, x2) gaussianKernel(x1, x2, currSigma)); predictions = svmPredict(model, Xval); err = mean(double(predictions ~= yval)); if err <= minErr minErr = err; C = currC; sigma = currSigma; end end end
Week 8
– Clustering
– Dimensionality
reduction
Clustering:
unsupervised learning
– unlabeled data
– find structure in
the data
K-Means
– clustering most
used algorithm
– K: number of
clusters
– training set: x
size m,n: m rows, n columns (drop x0 = 1 by convention)
1. cluster assignment
2. move centroids
– algorithm:
randomly initialize K
cluster centroids w1, w2, …, wK; each size n
repeat: for i = 1..m:
c(i) = index of centroid closest to x(i)
for k = 1..K: w(k) =
average (mean) of points assigned to cluster k
– K-Means for
non-separated clusters
T-shirt sizing (L, M,
S)
Optimization objective
(distortion)
– c(i): index of
cluster to which example x(i) is currently assigned
– w(k): cluster k
centroid, size n
– w(c, i): centroid
of cluster c, to which example i has been assigned
– optimization
objective (cost function)
J(c, w) = 1/m * sum (for each record: ||x — w(c)||^2) min J(c, w)
– K-means algorithm minimize function in two steps on each iteraton
Random initialization
– K < m
– randomly pick K
training examples
– set w1, …, wK to
these examples
for i = 1..100 { random
init; run K-means, get c,w; compute cost J(c,w) }
select min cost.
Choosing the number of
clusters
– elbow method
– in a graph: cost J
vs #clusters K you can see an elbow point
but, sometimes not
– more #clusters
gives lower cost J, always (if computations done right)
– choose #clusters K
based on a metric for how well it performs for your business
Dimensionality
reduction
Data compression
– example: two
features: len inches, len centimeters: compress to one dimension
– 3D => 2D,
projecting points to 2D plane
– for dataset: n
features, m examples, data compression changes n => k, k < n
Visualization
– to plot a graph,
you need to compress your data to 2D or 3D dataset
Principal Component
Analysis (PCA)
– optimization task:
find a surface with k dimensions with min sum (dist^2)
dist: distance between
original point and projection point
– reduce from
n-dimensions to k-dimensions: find k vectors u1, u2, … uk
onto which to project
the data, so as to minimize the projection error
– you need to scale
and normalize your data first!
– PCA is not linear
regression: LR minimize error, not distance; PCA have not prediction
'y'
Data preprocessing
– feature
scaling/mean normalization
x = (x — mu) / si mu: average value of x = 1/m sum (for each record: x) si: standard deviation of x (si = max — min)
PCA algorithm
– compute 'covariance
matrix'
Sigma = 1/m sum (for
each feature: x * x.transpose) // matrix n,n
or: Sigma = 1/m * X.T * X
– compute 'eigenvectors' of sigma
[U, S, V] = svd(Sigma) // singular value decomposition
– U is matrix n,n; we need firs k vectors from it => Ureduce
matrix n,k // n rows, k columns
Ureduce = U(:, 1:k)
– project data: z = Ureduce.T * x // [k,1] = [k,n] * [n,1] Z = X * U(:, 1:K);
or: [k,m] = [k,n] * [n,m]
Choosing the number of
principal components (k)
– aspe: average
squared projection error: 1/m sum (for each record: norm(x —
xapprox)^2)
– tvd: total variaton
in the data: 1/m sum (for each record: norm(x)^2)
– choose k to be
smallest value so that: aspe/tvd <= 0.01 // 1%
saving 99% of variance
: 99% of variance is retained
– try PCA with k = 1,
compute... check … increase k …
there is a faster
method: use matrix S from eigenvectors
(sum (for i in 1..k: S(i,i)) / sum (for i in 1..n: S(i,i))) >= 0.99
Reconstruction from
compressed representation
– compress: z =
Ureduce.T * x
– reconstruct:
xapprox = Ureduce * z // [n,1] = [n,k] * [k,1]
X_rec = Z * U(:, 1:K)';
Advice for applying PCA
– reduce memory/disk
needed for data
– visualisation
– speedup supervised
learning: reduce #features
– mapping x => z
should be defined by running PCA only on the training set
this mapping can be
applied to cross-validation and test sets
– bad use of PCA:
instead of regularization, to fight overfitting
– first try running
your ML with original/raw data; only if that doesn't do what you
want, then consider using PCA
Example (Octave)
– K-Means, find
closest centroids
% Parameters % X is a 300x2 matrix, training set % centroids is a 3x2 matrix, initial centroids % K is a number of clusters % Returns % idx is a 300x1 vector, cluster indices for each rec.from training set % cost = norm(x - mu)^2 % idx = min(cost) m = size(X, 1); for i = 1:m x = X(i, :)'; costMatrix = []; for j = 1:K mu = centroids(j, :)'; cost = power(norm(x - mu), 2); costMatrix = [costMatrix; cost j]; end [c, j] = min(costMatrix); minCostIdx = costMatrix(j(1), 2); idx(i) = minCostIdx; end
– updated centroids
for i = 1:K subset = find(idx == i); Xi = X(subset, :); centroids(i, :) = mean(Xi); end
Week 9
– Anomaly Detection
– Recommender Systems
Anomaly Detection
– build a model P(X)
on training set
– if P(test) <
epsilon => anomaly detected
– fraud detection:
x(i) = features of user(i) activities; check if p(x) < epsilon
– monitoring
computers in datacenter: x(i) = features of machine(i): x1: memory
usage, x2: CPU load, …
– too many false
positives? decrease epsilon
Gaussian (normal)
distribution
– x distributed with
mean mu, variance (standard deviation) sigma^2
bell shaped curve
p(x, mu, sigma^2) = (1 / (sqrt(2Pi) * sigma)) * exp(-1 * (x — mu)^2 / 2sigma^2)
– bell shape area = 1
– having a training
set X
mu = 1/m * sum (for each record: x) sigma^2 = 1/m * sum (for each record: (x — mu)^2)
Density estimation
– training set: x1,
x2, … xm
matrix m,n // n:
#features/columns
– p(x) =
p(x1)*p(x2)...p(xn)
– each feature have
normal distribution
– find n parameters:
mu, sigma
– p(x) = product (for
each feature j: p(xj,muj,sigmaj^2))
Given new example x,
compute p(x)
– if p(x) <
epsilon: anomaly detected
Developing and
evaluating an anomaly detection system
– for evaluation we
need a real-number metrics
– apply
cross-validation, test techique to feature/epsilon selection
Example:
– 10000 good engines;
20 flawed/anomalous
– training set: 6000
good engines
– CV: 2000 good
(y=0); 10 anomalous (y=1)
– test: 2000 good, 10
anomalous
– fit model p(x) on
training set
– on cv/test predict
y=1 if p(x) < epsilon; y=0 otherwise
– metrics: true
positive, false positive, false negative, true negative;
precision/recall; F1-score
– use
cross-validation dataset to choose parameter epsilon and/or features
for model
Anomaly detection vs.
Supervised learning
– anomaly: very small
#positive examples/anomalies, 0-20; large #negative examples
many different types
of anomalies, unknown anomalies
fraud detection;
manufacturing; monitoring (computers)
– supervised: large
#positive and negative examples
well defined positive
examples, future examples likely to be similar to ones in training
set
email spam; weather
prediction; cancer classification
What features to use
– is feature
histogram normal/Gauss?
– can you transform
feature data to normal distribution?
x1 = log(x1)
x2 = log(x2 + c)
x3 = x3^1/2
– we want p(x) large
for good examples; p(x) small for 'bad'
what if p(x) is
comparable for both?
– look at records and
find features that differ
– compose synthetic
features: x5 = x4/x3 = CPUload/networktraffic
Multivariate Gaussian
distribution
– consider not single
feature, but a combination of a few
– #features = n;
vector mu size n; matrix Sigma size n,n // covariance matrix
mu; Sigma: model
parameters
p(x, mu, Sigma) = (1/(2Pi^n/2 * Sigma.det^1/2) )* exp(-1/2 * (x — mu).T * Sigma.inv * (x — mu))
Sigma.det: determinant of Sigma
Sigma.inv: inversion of
Sigma
– changing Sigma
values you can change distribution bell-shape form arbitrarily
parameter fitting
– given training set
X m,n
mu = 1/m sum (for i in 1..m: x(i)) Sigma = 1/m sum (for i in 1..m: (x(i) — mu)*(x(i) — mu).T)
detect anomaly
– given a new example
x, compute
p(x, mu, Sigma)
if p(x) < epsilon:
flag anomaly
original model vs
multivariate
– original model is a
partial case of multivariate model
only Sigma diagonal
values can change
bell-shape can change
only along axes
– original model
computationally cheaper, scales better for large n
have to manually
create features to capture anomalies where x1,x2 take unusual
combinations
OK even if m is small
– multivariate
Gaussian automatically captures correlations between features
must have m > n or
else Sigma is non-invertible
Recommender Systems
example: predicting
movie ratings
– movie 1: users
1,2,3 rate it to 5,5,0
– movie 2: … rate
to ?,4,0
…
nu: #users
nm: #movies
r(i,j) = 1 if user j
has rated movie i
y(i,j) = rating given
by user j to movie i (only if r(i,j) = 1)
– predict rating for
undefined combinations movie,user
Content based
recommendations
– for movies we have
some features n=2: x1: romance; x2: action: real numbers
adding intercept
feature = 1, got a vector size 3
– for each user j,
learn a parameter theta(j): vector[3];
predict rating (i,j) = theta(j).T * x(i)
theta(j): parameter vector for user j
x(i): feature vector
for movie i
m(j): #of movies rated
by user j
– to learn theta(j):
minimize 1/2m(j) * sum (for i in rated_by_j: (predict — y(i,j))^2) + reg predict = theta(j).T * x(i)
– regularization:
reg = lambda/2m(j) * sum (for k in 1..n: theta(j,k)^2) // don't penalize theta0
– for all users, matrix Theta
min J = ½ sum(for j in 1..nu: sum(for i in rated_by_j: (predict — y(i,j))^2) )) + reg predict = Theta(j).T * x(i) reg = lambda/2 * sum(for j in 1..nu: sum(for k in 1..n: Theta(j,k)^2)) // don't penalize theta0
– gradient descent update: k in 0..n
Theta(j,k) = Theta(j,k) — alpha * (sum(for i in rated_by_j: gradient (+ reg if k != 0))) gradient = (Theta(j).T * x(i) — y(i,j)) * x(i,k) reg = lambda * Theta(j,k)
Collaborative
filtering
– we don't know
values of movie features (romance x1, action x2: undefined)
– but we have ratings
and Theta
– find/learn X that
minimize J
min J = ½ sum(for i in 1..nm: sum(for j in rated_by_j: (predict — y(i,j))^2) )) + reg predict = Theta(j).T * x(i) reg = lambda/2 * sum(for i in 1..nm: sum(for k in 1..n: x(i,k)^2)) // don't penalize x0
– gradient descent update: k in 0..n
x(i,k) = x(i,k) — alpha * (sum(for j in rated_by_j: gradient (+ reg if k != 0))) gradient = (Theta(j).T * x(i) — y(i,j)) * Theta(j,k) reg = lambda * x(i,k)
– given features x
for each movie, and movie ratings y: can estimate Theta
– given theta for
each user, and movie ratings y: can estimate X
– w/o theta: only
ratings y?
– minimize J on both
Theta and X:
J(Theta,X) = ½ sum(for i,j in ratings: (predict — y(i,j))^2) + regX + regTheta predict = Theta(j).T * x(i) regX = lambda/2 * sum(for i in 1..nm: sum(for k in 1..n: x(i,k)^2)) regTheta = lambda/2 * sum(for j in 1..nu: sum(for k in 1..n: Theta(j,k)^2))
– and theta0,x0 can be penalized by regularization!
collaborative filtering
algorithm
– initialize x
foreach movie, theta foreach user to small random values (symmetry
breaking, like in NN)
– minimize J(Theta,X)
using optimization algorithm (gradient descent) // no need to add
intercept items to theta and x!
for j in 1..nu, i in 1..nm: X(k,i) = x(k,i) — alpha * gradientTheta Theta(k,j) = Theta(k,j) — alpha * gradientX gradientTheta = sum(for j in ratings(i,j): (predict — y(i,j)) * Theta(k,j)) + lambda * x(k,j) gradientX = sum(for i in ratings(i,j): (predict — y(i,j)) * x(k,j)) + lambda * Theta(k,j)
– for a movie with learned features x, and user with parameters
theta: predict rating theta.T * x
Vectorization: low rank
matrix factorization
– Y(i, j): i in
1..nm, movies; j in 1..nu, users
– Y(i,j) = theta(j).T
* x(i) = real number
– X: matrix [nm,n]
– Theta: matrix
[nu,n]
– Y = X * Theta.T //
nm,n * n,nu = nm rows, nu columns
find the 5 movies j,
most similar to movie i:
– smallest ||x(i) —
x(j)|| will do // distance, norm
implementation detail:
ratings mean normalization
– if for user j no
ratings exists: user theta will be = 0
no good
– calculate mean
rating for movie, mu
– for user j, on
movie i, predict rating: Theta(j).T * X(i) + mu
– if movie have not
ratings, better drop that movie
– you don't do
scaling for ratings: 0..5 stars is all right
Example (Octave)
– anomaly detection,
gaussian distribution
% mu = 1/m * sum(x) % sigma2 = 1/m * sum( (x - mu)^2 ) mu = mean(X, 1)'; % help mean sigma2 = var(X, 1, 1)'; % help var
– find max F1-score,
select threshold
bestEpsilon = 0; bestF1 = 0; F1 = 0; % Parameters % yval is a vector m with labels 1=anomalous, 0=normal % pval is a vector m with computed probabilities - predictions % Returns % bestEpsilon is a real number, selected treshold epsilon % bestF1 is a real number, F1 metrica for selected treshold, max(f1) % If an example x has a low probability p(x) < ε, then it is considered to be an anomaly % tp is the number of true positives: the ground truth label says it’s an anomaly and our algorithm correctly classified it as an anomaly. % fp is the number of false positives: the ground truth label says it’s not an anomaly, but our algorithm incorrectly classified it as an anomaly. % fn is the number of false negatives: the ground truth label says it’s an anomaly, but our algorithm incorrectly classified it as not being anomalous. % F1 = (2 * prec * rec) / (prec + rec); % F1 score % prec = tp / (tp + fp); % precision % rec = tp / (tp + fn); % recall % tp = sum((cvPredictions == 1) & (yval == 1)) % true positives % fp = sum((cvPredictions == 1) & (yval == 0)) % false positives % fn = sum((cvPredictions == 0) & (yval == 1)) % false negatives stepsize = (max(pval) - min(pval)) / 1000; for epsilon = min(pval):stepsize:max(pval) % calc F1 for current epsilon cvPredictions = (pval < epsilon); tp = sum((cvPredictions == 1) & (yval == 1)); % true positives fp = sum((cvPredictions == 1) & (yval == 0)); % false positives fn = sum((cvPredictions == 0) & (yval == 1)); % false negatives prec = tp / (tp + fp); % precision rec = tp / (tp + fn); % recall F1 = (2 * prec * rec) / (prec + rec); % F1 score if F1 > bestF1 bestF1 = F1; bestEpsilon = epsilon; end end % for each epsilon
– collaborative
filtering costFunc
% w/o regularization % J = 0.5 * sum( (Theta' * X - Y)^2 ) for each rated movie % C = A * B; total = sum(sum(C(R == 1))); Errs = power((X * Theta') - Y, 2); errsSum = sum(sum(Errs(R == 1))); J = 0.5 * errsSum;
+ gradient
% Theta 5x2 - users preferences % X 6x2 - movies features % Y 6x5 - movies ratings % R 6x5 - rated flags % X_grad = 6x2 = sum( (Theta' * X - Y) * Theta ) for rated movies only % Theta_grad = 5x2 = sum( (Theta' * X - Y) * X ) for rated movies only H = (X * Theta') .* R; % 6x5 Yr = (Y .* R); % 6x5 X_grad = (H - Yr) * Theta; % 6x2 Theta_grad = (H - Yr)' * X; % 5x2
+ regularization
% regularization reg1 = (lambda / 2) * sum(sum(power(Theta, 2))); reg2 = (lambda / 2) * sum(sum(power(X, 2))); J = J + reg1 + reg2; % Returns % J is a real number, cost value for given Theta, X, Y, R % X_grad is a 6x2 (nm,nf) matrix, gradient vectors for movies % Theta_grad is a 5x2 (nu,nf) matrix, gradient vectors for users gradReg1 = lambda * X; gradReg2 = lambda * Theta; X_grad = X_grad + gradReg1; Theta_grad = Theta_grad + gradReg2;
Week 10
– Large-Scale Machine
Learning
– Example of an
application of machine learning
Large-Scale Machine
Learning
more data: more
accuracy
– take low bias/high
variancy/overfitting algorithm/model and train it on huge dataset
– problem: min
costFunc, every iteration perform … sum(for i in 1..m: … ); m »
1M examples
– you should check:
is it enough to take small subset of data? Say, 1000, 10K records?
– plot a learning
curve: Jtrain, Jcv on axes: m/error; only in case of overfitting
bigdata can help
bigdata processing in
ML
– stochastic gradient
descent
– map-reduce
stochastic gradient
descent
– batch gradient
descent, essentials: use m examples in each iteration
J = 1/2m * sum(for i in 1..m: (h(x(i) - y(i))^2) theta(j) = theta(j) — alpha * gradient gradient = 1/m * sum(for i in 1..m: (h(x(i)) — y(i))*x(i,j))
– stochastic GD: use 1 example in each iteration
1. randomly shuffle
training examples
2. repeat (until
converge, 1..10 times?)
for i in 1..m: foreach j: theta(j) = theta(j) — alpha * (h(x(i) — y(i)) * x(i,j)
mini-batch gradient
descent: use b examples in each iteration
– b: batch size,
2..100
say b = 10, m = 1000
repeat (until converge)
for i in 1, 11, 21, 31, ...991: foreach j: theta(j) = theta(j) — alpha * grad grad = 1/10 * sum(for k in i..i+9: (h(x(k)) - y(k)) * x(k,j))
– can be faster than stochastic GD, vectorization rulezz
– should spend time
to tune hyperparameter b
stochastic GD
convergence (learning rate alpha selection)
– for batch GD: plot
error Jtrain over #iterations, J should descend monotonically until
convergence
– for stochastic GD:
compute cost = ½ (h — y)^2 before updating theta;
every k iterations
(say 1000) plot average(cost) / #iterations
to get more smooth
line, increase k
cost should descend
– you can slowly
decrease alpha over time: alpha = const1 / (#iteration + const2)
Online learning
(stream of data)
– example: shipping
web-service, user comes, specifies origin, destination, size, weight
of package, …
you offer to ship
their package for some asking price,
users sometimes agree
(y=1), sometimes not (y=0)
– we want to know
optimized price
– features X capture
properties of user, package, origin/destination, asking price
we want to learn
p(y=1|x,theta) to optimize price
– algorithm (like
stochastic GD w/o iterations):
repeat forever // for
each next deal:
get x,y corresponding
to user;
update theta using x,y:
foreach j: theta(j) = theta(j) — alpha * (h - y)*x(j)
use updated theta to
predict.
– record come, being
used, dropped: good for big services, where you can't save all data
– example: product
search
– user searches for
'android phone full-hd'
– have 100 phones,
will return 10 records, which one?
– x: features of
phone, how many words in query match name of phone, … description
of phone, …
– y = 1 if user
clicks on link
– learn
p(y=1|x,theta)
– show 10 phones with
max p
Map-Reduce, data
parallelizm
– lots of computing
nodes
– say, we have 4
nodes, m = 400 examples/records
temp1(j) = sum(for i in
1..100: (h - y)*x(j))
temp2(j) = sum(for i in
101..200: (h - y)*x(j))
temp3(j) = sum(for i in
201..300: (h - y)*x(j))
temp4(j) = sum(for i in
301..400: (h - y)*x(j))
– then, main node can
combine:
theta(j) = theta(j) —
alpha 1/400 (temp1 + temp2 + temp3 + temp4)
– many learning
algorithms can be expressed as computing sums over the training set
for NN: compute
forward/back propagation on 1/#nodes of the data to compute the
derivative with respect to that 1/#nodes of the data
Example of an
application of machine learning
Photo OCR : problem
description and pipeline
– normalize pictures
– text detection
(regions with text)
– character
segmentation (one char at a time)
– character
classification (is that a 'A'? or 'B'?)
pipeline: a system with
many stages/components
text detection: sliding
window
– first, lets do
pedestrian detection
– prepare images,
say, 82x36 pix to teach NN classifier; create NN classifier
– sliding window
detection: take a picture, apply a 82x36 rectangle (patch, window),
ask classifier;
– shift (step size
aka stride) patch and ask again;
– scale window,
repeat process (before asking classifier you need to scale picture
back to 82x36)
with text it's the
same: you get a classifier, that can detect presence of
chars/letters; apply sliding window.
now you have a picture
with marked regions, where chars where detected
– black/white image,
white means: text (probably: higher p => whiter color)
– apply expansion
operator: if near pixel p was found white pixel: p also is white
– enclose white
regions in frames/boxes
– remove boxes with
wrong aspect ratio
character segmentation
– prepare classifier
that detects (for image): it's a gap between chars or not?
– apply sliding
window + classifier to find all gaps in image (enclosed in box on
prev.step)
now you can do
character classification/recognition
– having extracted
char image, you can recognize it using NN classifier
getting lots of
data, artificial data
– low biased
model/algorithm + massive training set = success
– you can expand
initial dataset (transformations)
synthesizing data by
introducing distortions
– you can generate
artificial data
print text with
different fonts, on different backgrounds => lots of artifical
data
speech recognition
– original audio
– on bad cellphone
connection
– noisy background:
crowd, machinery, …
usually does not help
to add purely random noise to your data
make shure you have a
low bias classifier before expending the effort.
– plot learning
curves
– keep increasing the
number of features/number of hidden units in NN
how much work would it
be to get 10x as much data as we currently have?
– artificial data
synthesis
– collect/label it
yourself
Ceiling analysis:
what part of the pipeline to work on next
– real number metrics
to evaluate system/model/pipeline modules
say, accuracy is the
system metric
– overall system
accuracy: 72%
– fake text detector
(100%): system 89% (+17%)
– fake char segmentation: system 90% (+1%)
– fake char segmentation: system 90% (+1%)
– fake char
recognition: system 100% (+10%)
you should work on text
detector, obviously.
example: face
recognition
– pipeline: camera
image, preprocessing (remove background, brightness,
black/white,...),
face detection, (eyes
segmentation, nose segm, mouth segm, …), logistic regression,
label.
– do ceiling
analysis: work face detection, eyes segmentation)
original post http://vasnake.blogspot.com/2016/08/machine-learning-andrew-ng-coursera.html
Any idea on coursera Coupons ?
ОтветитьУдалить