Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML -- all that stuff.
Continuing my notes on the course. Week 5, part two. Last time we started going through the theory of PCA. Up next: week five lectures, algorithms for computing PCA in a distributed setting, big data.
WEEK 5: Principal Component Analysis and Neuroimaging.
PCA ALGORITHM
First, a refresher on what orthogonal and orthonormal vectors are:
Orthogonal vectors are simply perpendicular vectors, and one nice property of orthogonal vectors is that their dot product always equals 0. Note that a unit vector is simply a vector whose Euclidean norm equals one. Orthonormal … These are vectors that are orthogonal and also have unit norm. Going back to our example, since a and b are both unit norm and orthogonal, they are also orthonormal vectors. In contrast, b and d, even though they're orthogonal, are not orthonormal since the Euclidean norm of d is greater than one.
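A quick NumPy check of these definitions (the vectors below are made up to match the description, not taken from the lecture's figure):

import numpy as np

# hypothetical vectors, chosen only to illustrate the definitions above
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
d = np.array([2.0, 0.0])

print(np.dot(a, b))       # 0.0 -> a and b are orthogonal
print(np.linalg.norm(a))  # 1.0
print(np.linalg.norm(b))  # 1.0 -> both unit norm, so a and b are orthonormal
print(np.dot(b, d))       # 0.0 -> b and d are orthogonal...
print(np.linalg.norm(d))  # 2.0 -> ...but not orthonormal, since the norm of d > 1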
Now that we're equipped with a better understanding of these concepts, we could discuss the interpretation of PCA as an iterative algorithm.
Right, orthonormal vectors, got it. Now, how the iterative PCA algorithm is formulated:
Imagine that we want a k-dimensional representation of our data. ... we can imagine taking the following iterative approach.

At the i-th iteration we aim to find the direction of maximal variance in the data, subject to the constraint that this direction is of unit norm and that it is orthogonal to all directions we chose in previous iterations. We can then project our data onto this direction, and the locations along this direction become the i-th feature in our new representation.

As a result of this algorithm, we have found k unit vectors that are pairwise orthogonal, and thus, these k directions are orthonormal.
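A minimal local sketch of this greedy procedure in NumPy, just to make the idea concrete (it uses deflation, assumes the data is already centered and small enough to fit in memory; an illustration, not how it is done at scale):

import numpy as np

def greedy_pca(X, k):
    # X: n x d, already centered
    directions = []
    residual = X.copy()
    for i in range(k):
        # direction of maximal variance in what is left of the data;
        # it has unit norm and is orthogonal to the previous directions,
        # because the variance they explain has already been removed
        scatter = residual.T.dot(residual)
        _, eigvecs = np.linalg.eigh(scatter)
        w = eigvecs[:, -1]                            # top eigenvector, unit norm
        directions.append(w)
        scores_i = residual.dot(w)                    # projections: the i-th new feature
        residual = residual - np.outer(scores_i, w)   # deflate before the next pass
    P = np.column_stack(directions)                   # d x k, orthonormal columns
    return P, X.dot(P)                                # directions and the new representation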
DISTRIBUTED PCA
So how do we find the principal components and get the reduced dataset? Four steps: center the data, compute the covariance matrix, find the eigenvectors, multiply the data by the eigenvectors:
We'll assume that our raw data is not centered. And so the first step in PCA involves centering our data. Or in other words, computing the mean of each feature and subtracting these feature means from each original data point. We'll assume that our centered data is stored in an n by d matrix which we'll call X.

Once we've centered our data, we next need to compute the sample covariance matrix, or the scatter matrix. Note that the scatter matrix is simply the sample covariance matrix without dividing by n. PCA returns the same solution in either case, as discussed in the previous segment. And we'll work with the scatter matrix in this segment to simplify our notation.

As we previously discussed, the principal components of our data equal the eigenvectors of the sample covariance matrix. So in the third step, we compute these eigenvectors by performing an eigendecomposition.

Finally, in order to obtain our k-dimensional representation, we need to multiply our original data by the top k eigenvectors to compute the PCA scores.
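The whole recipe in a few lines of NumPy, as a non-distributed sketch (assumes the data fits on one machine):

import numpy as np

def pca_local(raw_data, k):
    # raw_data: n x d NumPy array
    X = raw_data - raw_data.mean(axis=0)         # 1. center the data
    scatter = X.T.dot(X)                         # 2. scatter matrix (covariance without 1/n)
    eigvals, eigvecs = np.linalg.eigh(scatter)   # 3. eigendecomposition (symmetric matrix)
    P = eigvecs[:, ::-1][:, :k]                  #    top k eigenvectors, d x k
    scores = X.dot(P)                            # 4. PCA scores, n x k
    return P, scores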
Distributed PCA, Big n, small d
How do we compute PCA in practice, given that the number of features is small (small d)?
… we must center our data. And to do this, we must compute the mean of each feature. There are d features, and thus d feature means. And we define the vector m to be the d-dimensional vector whose i-th component is the mean of the i-th feature. We can compute the mean vector via a simple reduce operation whereby we sum all of the data points together. After we have performed this sum, we can compute the mean vector, m, by simply dividing by the number of data points, n.

After computing m on the driver, we must send it back to the workers so that each data point can be centered. Together, the reduce operation to compute m and the subsequent communication are inexpensive, as they are linear in d in terms of local storage, computation, and communication. Once each worker has access to m, we can then perform a map operation to create the centered data points, which simply involves subtracting m from each original data point.
… we next need to compute the scatter matrix. As in the case of closed-form linear regression, we can efficiently perform this computation in a distributed fashion by using outer products.
The covariance matrix (up to the 1/n factor, i.e. the scatter matrix) is X transpose times X, and that is a sum of outer products of the dataset's rows. This maps very naturally onto the distributed computation model.
We'll represent X visually by its rows or data points, and then we can express this matrix multiplication as a sum of outer products where each outer product involves only a single row of X, or a single data point. Also recall that, in the previous step, we computed our centered data and we stored it in a data-parallel fashion. So with this context, we can now compute the scatter matrix as a simple MapReduce operation.

In the map step, we take each point and compute its outer product with itself. This is a computational bottleneck in our PCA computation, but it's distributed across multiple workers. In the reduce step, we simply sum over all of these outer products. This requires quadratic storage and computation in d, both of which are feasible since we're assuming that d is small.
Once we've computed the scatter matrix, we need to perform its eigendecomposition. Since we want to compute a k-dimensional representation for our data, we're interested in the top k principal components of our data, which we know are the top k eigenvectors of the scatter matrix. We represent these principal components by the d by k matrix P.

As a result of the eigendecomposition, we have access to the top k eigenvectors on the driver. But now we need to communicate them to the workers so that they can compute these k-dimensional representations for the data points. This requires O(dk) communication, which is the communication bottleneck in this algorithm. Additionally, the eigendecomposition generally requires cubic time and quadratic space. But if we only want the top k eigenvalues and eigenvectors, we can reduce this time complexity to O(d²k).
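On the driver this boils down to a few lines of NumPy (a sketch; the scatter matrix is symmetric, so eigh applies):

import numpy as np

eigvals, eigvecs = np.linalg.eigh(scatter)   # eigenvalues come back in ascending order
P = eigvecs[:, ::-1][:, :k]                  # d x k matrix of the top k principal components (k = target dimensionality)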
Finally, now that each worker has the principal components stored locally, we can compute the k-dimensional representation for each point via a simple matrix-vector multiply. This process requires O(dk) local computation, and can be performed via a simple map operation.
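For example (a sketch; P is captured in the map closure, so each worker gets its own copy):

scores = centered.map(lambda x: x.dot(P))    # each point mapped to its k-dimensional representation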
A simple algorithm, but you can run into a number of features that exceeds what the Spark cluster can handle. Not a problem: there is a way to do PCA in the big n and big d setting.
Distributed PCA: big d, big n
The key idea is a clever, iterative way of computing the eigenvectors.
Where both n and d are large, we can only afford storage, computation, and communication that are linear in n and d. So we can't locally store or operate on the scatter matrix. Instead, we'll introduce an iterative approach.

Our iterative approach relies on performing a sequence of matrix-vector products to compute the top k eigenvectors of the scatter matrix. The most common methods for doing this are Krylov subspace and random projection based methods. And Spark's MLlib in particular relies on Krylov subspace methods. In this approach, the driver provides the workers with a d-dimensional vector on each iteration, and requires that the workers left multiply this vector by the scatter matrix. Overall, the algorithm requires O(k) iterations, or passes over the data, and O(dk) local storage. And notably, this algorithm computes the PCA solution without ever explicitly computing the covariance or scatter matrix.
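To make the idea concrete, here is a simplified local stand-in: plain power iteration for a single top eigenvector (not the Krylov subspace machinery MLlib actually uses). Each pass needs only matrix-vector products with X and X transpose, never the d x d scatter matrix itself:

import numpy as np

def top_eigenvector(X, num_iters=50):
    v = np.random.rand(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        q = X.T.dot(X.dot(v))      # equals (X^T X) v without ever forming X^T X
        v = q / np.linalg.norm(q)
    return v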
The first step involves the algorithm communicating the d-dimensional vector, vi, to all workers. Next (step 2, see below), we need to multiply the scatter matrix by the vector vi in a distributed fashion. And we'll denote the result of this matrix multiplication as the d-dimensional vector qi. The driver then uses qi to update its estimate of P, which are the top k eigenvectors of the scatter matrix. We repeat this process for O(k) iterations, until we converge on our answer for P.

Also, note that we're using the letter i here to denote the iteration number. And so the vector vi communicated to the workers in step one changes on each iteration.
Step two is interesting. The challenge is that we want to perform this matrix multiplication without ever having to explicitly compute the scatter matrix, or even having to store copies of both X and X transpose. And so we need to be a bit clever in how we perform this computation. And it turns out that by carefully breaking this multiplication into two steps, we're able to achieve our desired goal. We first compute bi, which is an n-dimensional vector equal to X times vi. We then multiply X transpose by this intermediate result to obtain qi.
We can efficiently compute steps one and two by only storing X in a data-parallel fashion. Remember that bi is an n-dimensional vector, and each component of this vector is simply equal to the dot product between a row of X and the vector vi. Since each row of X is an observation, and since each worker is storing vi locally, we can simply compute the dot product between each data point and vi, and then concatenate the results to obtain bi.

We can first perform a map operation, in which we compute the dot product of each data point and the vector vi. This requires O(d) storage to store vi, and O(nd) distributed computation to compute the dot products. Next, in the reduce step, we can simply concatenate the results. Finally, each worker will need to store bi locally in order to compute the overall result, qi, in the next step. So we're going to need to communicate bi to each worker. So overall, this reduce step, combined with the communication of bi, requires linear time, space, and communication in terms of n.
The following Spark code snippet succinctly summarizes what we've done. Starting with an RDD of training data, we first compute a dot product in the map step, then collect the results, which creates a list. And finally, we convert this list into a NumPy array.
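The snippet itself did not make it into these notes; a sketch of what it might look like (assuming `trainData` is an RDD of NumPy arrays and `v` is the d-dimensional vector vi held by the driver):

import numpy as np

b = np.array(trainData.map(lambda x: x.dot(v)).collect())   # map: dot products; collect -> list; then NumPy array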
Now let's consider the second step of this two-step process. In this step, our goal is to compute the product of X transpose and bi. By inspecting this product, we can interpret this multiplication as the sum of rescaled data points. In particular, for the j-th data point, we can multiply it by the j-th component of the vector bi, which gives us a new d-dimensional vector. We can similarly rescale each of the other data points, which gives us a total of n rescaled d-dimensional vectors. Taking the sum of these rescaled vectors gives us qi, which is our desired result.
In the map step, we simply rescale each vector by its corresponding component of bi. Storing bi requires O(n) storage, and computing the rescaled vectors requires a pass over all of our data, and thus takes O(nd) distributed computation. In the reduce step, we simply take a sum of these rescaled vectors. Each worker can aggregate all of the rescaled vectors that it is storing locally. So each worker only needs to send a single d-dimensional vector to the driver, which the driver then must sum. Hence, this process is linear in d, in terms of storage, computation, and communication.
The following Spark code snippet summarizes this step, showing how we can rescale in the map step, and then sum the vectors in the reduce step.
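Again, the original snippet is not reproduced here; a sketch of the same idea (assuming `trainData` is the RDD of NumPy arrays and `b` is the n-dimensional vector bi from the previous step, captured in the closure):

q = (trainData
     .zipWithIndex()                              # pair each point with its row index j
     .map(lambda pair: pair[0] * b[pair[1]])      # rescale point j by the j-th component of b
     .reduce(lambda u, v: u + v))                 # sum the rescaled vectors -> qi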
That's it for the theoretical part. One lab remains, the fifth one (#5), and then these notes will be complete.
original post http://vasnake.blogspot.com/2015/12/week-5-part-2.html