The Scalable Machine Learning course. Hadoop, Apache Spark, Python, ML -- all that stuff.
Continuing my notes on the course I took. Week 5.
Last time it was about the theory of PCA (Principal Component Analysis).
Time to try the material out in practice. The lab.
LAB5 OVERVIEW
Principal Component
Analysis Lab
This lab delves into
exploratory analysis of neuroscience data, specifically using
principal component analysis (PCA) and feature-based aggregation.
We will use a dataset of light-sheet imaging recorded by the Ahrens
Lab at Janelia Research Campus, and hosted on the CodeNeuro data
repository.
Our dataset is
generated by studying the movement of a larval zebrafish, an animal
that is especially useful in neuroscience because it is transparent,
making it possible to record activity over its entire brain using a
technique called light-sheet microscopy. Specifically, we'll work
with time-varying images containing patterns of the zebrafish's
neural activity as it is presented with a moving visual pattern.
Different stimuli induce different patterns across the brain, and we
can use exploratory analyses to identify these patterns. Read
"Mapping brain activity at scale with cluster computing"
for more information about these kinds of analyses.
During this lab you
will learn about PCA, and then compare and contrast different
exploratory analyses of the same data set to identify which neural
patterns they best highlight.
Clear enough, I suppose. I'll stop setting quotes from the course materials in italics from here on; it's obvious anyway that everything in English has been deftly copy-pasted.
OK, grab the notebook.
Start the VM:
valik@snafu:~$ pushd ~/sparkvagrant/
valik@snafu:~/sparkvagrant$ vagrant up
And off we go: http://localhost:8001/tree
The rendered version of this notebook is on GitHub.
The plan:
- Part 1: Work through the steps of PCA on a sample dataset
- Part 2: Write a PCA function and evaluate PCA on sample datasets
- Part 3: Parse, inspect, and preprocess neuroscience data then perform PCA
- Part 4: Perform feature-based aggregation followed by PCA
You can see that the first two parts are devoted to walking through the PCA algorithm (the big n, small d case), and only then comes the actual practice: applying PCA to a real problem.
For reference:
Part 1: Work through the steps of PCA on a sample dataset
Visualization 1:
Two-dimensional Gaussians
Principal Component
Analysis, or PCA, is a strategy for dimensionality reduction.
To better understand
PCA, we'll work with synthetic data generated by sampling from the
two-dimensional Gaussian distribution.
This distribution takes
as input the mean and variance of each dimension, as well as the
covariance between the two dimensions.
In our visualizations
below, we will specify the mean of each dimension to be 50 and the
variance along each dimension to be 1.
We will explore two
different values for the covariance: 0 and 0.9.
When the covariance is
zero, the two dimensions are uncorrelated, and hence the data looks
spherical.
In contrast, when the
covariance is 0.9, the two dimensions are strongly (positively)
correlated and thus the data is non-spherical.
As we'll see in Parts 1
and 2, the non-spherical data is amenable to dimensionality reduction
via PCA, while the spherical data is not.
As a warm-up, two datasets have been prepared for us:
import numpy as np

def create2DGaussian(mn, sigma, cov, n):
    """Randomly sample points from a two-dimensional Gaussian distribution"""
    np.random.seed(142)
    return np.random.multivariate_normal(np.array([mn, mn]), np.array([[sigma, cov], [cov, sigma]]), n)

dataRandom = create2DGaussian(mn=50, sigma=1, cov=0, n=100)
dataCorrelated = create2DGaussian(mn=50, sigma=1, cov=.9, n=100)
(Scatter plots of dataRandom and dataCorrelated; the images are in the original post.)
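The plots themselves didn't survive the copy-paste, so here is a minimal sketch of how one might redraw them with matplotlib (my own code, not the lab's; it assumes dataRandom and dataCorrelated from the cell above):

import matplotlib.pyplot as plt

# Two side-by-side scatter plots: the cov=0 data looks spherical,
# the cov=0.9 data is stretched along the diagonal.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.scatter(dataRandom[:, 0], dataRandom[:, 1], c='#8cbfd0')
ax1.set_title('cov = 0 (spherical)')
ax2.scatter(dataCorrelated[:, 0], dataCorrelated[:, 1], c='#8cbfd0')
ax2.set_title('cov = 0.9 (correlated)')
plt.show()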
1a, center the dataset so that the mean equals 0
In the first step of
PCA, we must first center our data. Working with our correlated
dataset, first compute the mean of each feature (column) in the
dataset. Then for each observation, modify the features by
subtracting their corresponding mean, to create a zero mean dataset.
Note that
correlatedData is an RDD of NumPy arrays. This allows us to perform
certain operations more succinctly. For example, we can sum the
columns of our dataset using correlatedData.sum()
correlatedData = sc.parallelize(dataCorrelated)

meanCorrelated = correlatedData.sum()
meanCorrelated = [meanCorrelated[0] / correlatedData.count(), meanCorrelated[1] / correlatedData.count()]
correlatedDataZeroMean = correlatedData.map(lambda x: (x[0] - meanCorrelated[0], x[1] - meanCorrelated[1]))

print meanCorrelated
print correlatedData.take(1)
print correlatedDataZeroMean.take(1)

[49.957390367753533, 49.971804769900864]
[array([ 49.6717712 ,  50.07531969])]
[(-0.28561916832964585, 0.10351492197556666)]
1b, compute the covariance matrix, in a distributed fashion
Recalling the lecture, it's clear that each worker gets a record of the dataset and turns it into an outer product.
To compute this matrix,
compute the outer product of each data point,
add together these
outer products,
and divide by the
number of data points.
The data are two
dimensional, so the resulting covariance matrix should be a 2x2
matrix.
correlatedCov = (correlatedDataZeroMean
                 .map(lambda x: np.outer(x, x))
                 .sum()
                 ) / correlatedDataZeroMean.count()
print correlatedCov

[[ 0.99558386  0.90148989]
 [ 0.90148989  1.08607497]]
1c, warmed up; now let's wrap the covariance computation into a function
use the expressions
above to write a function to compute the sample covariance matrix for
an arbitrary data RDD
def estimateCovariance(data):
    """Compute the covariance matrix for a given rdd.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.
        Don't forget to normalize the data by first subtracting the mean.

    Args:
        data (RDD of np.ndarray): An `RDD` consisting of NumPy arrays.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input `RDD`.
    """
    count = data.count()
    mean = data.sum()
    mean = map(lambda x: x / count, mean)
    centered = data.map(lambda x: x - mean)
    res = (centered
           .map(lambda x: np.outer(x, x))
           .sum()
           ) / count
    return res

correlatedCovAuto = estimateCovariance(correlatedData)
print correlatedCovAuto

[[ 0.99558386  0.90148989]
 [ 0.90148989  1.08607497]]
1d, having the covariance matrix, we can find the eigenvectors and company (the principal components)
We'll compute them the simple way, since d is small.
The d eigenvectors of
the covariance matrix give us the directions of maximal variance, and
are often called the "principal components."
The associated
eigenvalues are the variances in these directions. In particular, the
eigenvector corresponding to the largest eigenvalue is the direction
of maximal variance (this is sometimes called the "top"
eigenvector).
Eigendecomposition of a
d×d covariance matrix has a (roughly) cubic runtime complexity with
respect to d. Whenever d is relatively small (e.g., less than a few
thousand) we can quickly perform this eigendecomposition locally.
Use a function from
numpy.linalg called eigh to perform the eigendecomposition.
Next, sort the
eigenvectors based on their corresponding eigenvalues (from high to
low), yielding a matrix where the columns are the eigenvectors (and
the first column is the top eigenvector).
Note that np.argsort
can be used to obtain the indices of the eigenvalues that correspond
to the ascending order of eigenvalues.
Finally, set the
topComponent variable equal to the top eigenvector or principal
component, which is a 2-dimensional vector (array with two values).
Note that the
eigenvectors returned by eigh appear in the columns and not the rows.
For example, the first
eigenvector of eigVecs would be found in the first column and could
be accessed using eigVecs[:,0].
from numpy.linalg import eigh

# Calculate the eigenvalues and eigenvectors from correlatedCovAuto
eigVals, eigVecs = eigh(correlatedCovAuto)
print 'eigenvalues: {0}'.format(eigVals)
print '\neigenvectors: \n{0}'.format(eigVecs)

# Use np.argsort to find the top eigenvector based on the largest eigenvalue
inds = np.argsort(eigVals)[::-1]
topComponent = eigVecs[:, inds[0]]
print '\ntop principal component: {0}'.format(topComponent)

eigenvalues: [ 0.13820481  1.94345403]

eigenvectors:
[[-0.72461254  0.68915649]
 [ 0.68915649  0.72461254]]

top principal component: [ 0.68915649  0.72461254]
1e, compress the data with PCA
That is, having the principal components as a matrix, we can reduce the dimensionality of our data by multiplying the data matrix by the component matrix.
We just computed the
top principal component for a 2-dimensional non-spherical dataset.
Now let's use this principal component to derive a one-dimensional
representation for the original data. To compute these compact
representations, which are sometimes called PCA "scores",
calculate the dot product between each data point in the raw data and
the top principal component.
# Use the topComponent and the data from correlatedData to generate PCA scores
correlatedDataScores = correlatedData.map(lambda x: x.dot(topComponent))
print 'one-dimensional data (first three):\n{0}'.format(np.asarray(correlatedDataScores.take(3)))

one-dimensional data (first three):
[ 70.51682806  69.30622356  71.13588168]
Part 2: Write a PCA function and evaluate PCA on sample datasets
2a, now we have everything we need to write a 'pca' function
def pca(data, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvector as a column.  This function should also return eigenvectors as columns.

    Args:
        data (RDD of np.ndarray): An `RDD` consisting of NumPy arrays.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
            of length `k`.  Eigenvalues is an array of length d (the number of features).
    """
    covMatrix = estimateCovariance(data)
    eigValues, eigVecs = eigh(covMatrix)
    inds = np.argsort(eigValues)
    revInds = inds[::-1]
    d = len(eigValues)
    topComponents = np.zeros((d, k))
    for idx in range(k):
        # insert columns
        topComponents[:, idx] = eigVecs[:, revInds[idx]]
    dataScores = data.map(lambda x: x.dot(topComponents))

    # Return the `k` principal components, `k` scores, and all eigenvalues
    return (topComponents, dataScores, eigValues[revInds])

# Run pca on correlatedData with k = 2
topComponentsCorrelated, correlatedDataScoresAuto, eigenvaluesCorrelated = pca(correlatedData, k=2)
2b, apply the pca function
randomData = sc.parallelize(dataRandom)

# Use pca on randomData
topComponentsRandom, randomDataScoresAuto, eigenvaluesRandom = pca(randomData, k=2)

print 'topComponentsRandom: \n{0}'.format(topComponentsRandom)
print ('\nrandomDataScoresAuto (first three): \n{0}'
       .format('\n'.join(map(str, randomDataScoresAuto.take(3)))))
print '\neigenvaluesRandom: \n{0}'.format(eigenvaluesRandom)

topComponentsRandom:
[[-0.2522559  -0.96766056]
 [ 0.96766056 -0.2522559 ]]

randomDataScoresAuto (first three):
[ 36.61068572 -61.3489929 ]
[ 35.97314295 -62.08813671]
[ 35.59836628 -60.61390415]

eigenvaluesRandom:
[ 1.4204546   0.99521397]
Let's see what our data looks like in plots.
Now let's add a third dimension: 3D data.
from mpl_toolkits.mplot3d import Axes3D

m = 100
mu = np.array([50, 50, 50])
r1_2 = 0.9
r1_3 = 0.7
r2_3 = 0.1
sigma1 = 5
sigma2 = 20
sigma3 = 20
c = np.array([[sigma1 ** 2, r1_2 * sigma1 * sigma2, r1_3 * sigma1 * sigma3],
              [r1_2 * sigma1 * sigma2, sigma2 ** 2, r2_3 * sigma2 * sigma3],
              [r1_3 * sigma1 * sigma3, r2_3 * sigma2 * sigma3, sigma3 ** 2]])
np.random.seed(142)
dataThreeD = np.random.multivariate_normal(mu, c, m)
2c, and compress it with PCA down to 2D
threeDData = sc.parallelize(dataThreeD)

componentsThreeD, threeDScores, eigenvaluesThreeD = pca(threeDData, k=2)

print 'componentsThreeD: \n{0}'.format(componentsThreeD)
print ('\nthreeDScores (first three): \n{0}'
       .format('\n'.join(map(str, threeDScores.take(3)))))
print '\neigenvaluesThreeD: \n{0}'.format(eigenvaluesThreeD)

componentsThreeD:
[[ 0.23952078  0.045635  ]
 [ 0.61699931  0.76409466]
 [ 0.74962768 -0.64348799]]

threeDScores (first three):
[ 85.25798606  -8.29694407]
[ 89.66337911  15.73381517]
[ 75.92616872 -20.5015709 ]

eigenvaluesThreeD:
[ 614.46863537  349.47737219    5.85043581]
See the 2D version of
the data that captures most of its original structure.
Note that darker blues
correspond to points with higher values for the original data's third
dimension.
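The plot itself is an image in the original post; here is a rough sketch of the described view (mine, not the lab's plotting code; it assumes threeDScores and dataThreeD from the cells above, and that collect() returns rows in their original order, which Spark does for a parallelized list):

import numpy as np
import matplotlib.pyplot as plt

# 2D PCA scores of the 3D dataset, colored by each point's value
# in the original third dimension (darker blue = higher value).
scores2D = np.asarray(threeDScores.collect())   # 100 x 2 array of scores
thirdDim = dataThreeD[:, 2]                     # original third coordinate
plt.scatter(scores2D[:, 0], scores2D[:, 1], c=thirdDim, cmap='Blues')
cbar = plt.colorbar()
cbar.set_label('original third dimension')
plt.show()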
2d, a sanity check: how much variance are we capturing
let's quantify how much
of the variance is being captured by PCA in each of the three
synthetic datasets we've analyzed. To do this, we'll compute the
fraction of retained variance by the top principal components.
Recall that the
eigenvalue corresponding to each principal component captures the
variance along this direction.
def varianceExplained(data, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        data (RDD of np.ndarray): An RDD that contains NumPy arrays which store the
            features for an observation.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(data, k)
    topEV = eigenvalues[:k]
    res = sum(topEV) / sum(eigenvalues)
    return res

varianceRandom1 = varianceExplained(randomData, 1)
varianceCorrelated1 = varianceExplained(correlatedData, 1)
varianceRandom2 = varianceExplained(randomData, 2)
varianceCorrelated2 = varianceExplained(correlatedData, 2)
varianceThreeD2 = varianceExplained(threeDData, 2)

print ('Percentage of variance explained by the first component of randomData: {0:.1f}%'
       .format(varianceRandom1 * 100))
print ('Percentage of variance explained by both components of randomData: {0:.1f}%'
       .format(varianceRandom2 * 100))
print ('\nPercentage of variance explained by the first component of correlatedData: {0:.1f}%'
       .format(varianceCorrelated1 * 100))
print ('Percentage of variance explained by both components of correlatedData: {0:.1f}%'
       .format(varianceCorrelated2 * 100))
print ('\nPercentage of variance explained by the first two components of threeDData: {0:.1f}%'
       .format(varianceThreeD2 * 100))

Percentage of variance explained by the first component of randomData: 58.8%
Percentage of variance explained by both components of randomData: 100.0%

Percentage of variance explained by the first component of correlatedData: 93.4%
Percentage of variance explained by both components of correlatedData: 100.0%

Percentage of variance explained by the first two components of threeDData: 99.4%
Part 3: Parse, inspect, and preprocess neuroscience data then perform PCA
3a, loading the data
we will load and do
some basic inspection of the data. The raw data are currently stored
as a text file. Every line in the file contains the time series of
image intensity for a single pixel in a time-varying image (i.e. a
movie). The first two numbers in each line are the spatial
coordinates of the pixel, and the remaining numbers are the time
series. We'll use first() to inspect a single row, and print just the
first 100 characters.
import os
baseDir = os.path.join('data')
inputPath = os.path.join('cs190', 'neuro.txt')
inputFile = os.path.join(baseDir, inputPath)

lines = sc.textFile(inputFile)
print lines.first()[0:100]

# Check that everything loaded properly
assert len(lines.first()) == 1397
assert lines.count() == 46460

0 0 103 103.7 103.2 102.7 103.8 102.8 103 103.3 103.8 103.2 102.1 103.5 103.2 102.7 103.1 102.2 102.
3b, parsing the data
Each record becomes a tuple, see below.
Parse the data into a
key-value representation. We want each key to be a tuple of
two-dimensional spatial coordinates and each value to be a NumPy
array storing the associated time series.
Write a function that
converts a line of text into a (tuple, np.ndarray) pair.
Then apply this
function to each record in the RDD, and inspect the first entry of
the new parsed data set.
Now would be a good
time to cache the data, and force a computation by calling count, to
ensure the data are cached.
def parse(line):
    """Parse the raw data into a (`tuple`, `np.ndarray`) pair.

    Note:
        You should store the pixel coordinates as a tuple of two ints and the elements of the
        pixel intensity time series as an np.ndarray of floats.

    Args:
        line (str): A string representing an observation.  Elements are separated by spaces.  The
            first two elements represent the coordinates of the pixel, and the rest of the elements
            represent the pixel intensity over time.

    Returns:
        tuple of tuple, np.ndarray: A (coordinate, pixel intensity array) `tuple` where coordinate is
            a `tuple` containing two values and the pixel intensity is stored in an NumPy array
            which contains 240 values.
    """
    lst = line.split(' ')
    lst = map(lambda x: float(x.strip()), lst)
    coords = (int(lst[0]), int(lst[1]))
    pixValues = lst[2:]
    res = (coords, np.asarray(pixValues))
    return res

rawData = lines.map(parse)
rawData.cache()
entry = rawData.first()
print 'Length of movie is {0} seconds'.format(len(entry[1]))
print 'Number of pixels in movie is {0:,}'.format(rawData.count())
print ('\nFirst entry of rawData (with only the first five values of the NumPy array):\n({0}, {1})'
       .format(entry[0], entry[1][:5]))

Length of movie is 240 seconds
Number of pixels in movie is 46,460

First entry of rawData (with only the first five values of the NumPy array):
((0, 0), [ 103.   103.7  103.2  102.7  103.8])
3c, find the bounds of the dataset: min and max
Next we'll do some
basic preprocessing on the data. The raw time-series data are in
units of image fluorescence, and baseline fluorescence varies somewhat
arbitrarily from pixel to pixel. First, compute the minimum and
maximum values across all pixels
#allMean = (rawData.map(lambda (xy, vals): sum(vals)/len(vals)).sum()) / rawData.count()
mn = (rawData
      .map(lambda (xy, vals): min(vals))
      .min()
      )
mx = (rawData
      .map(lambda (xy, vals): max(vals))
      .max()
      )
print mn, mx

100.6 940.8
How pixel brightness changes over time, a plot:
example = rawData.filter(lambda (k, v): np.std(v) > 100).values().first()
plt.plot(range(len(example)), example, c='#8cbfd0', linewidth='3.0')
3d, scaling the data
To convert from these
raw fluorescence units to more intuitive units of fractional signal
change, write a function that takes a time series for a particular
pixel and subtracts and divides by the mean. Then apply this function
to all the pixels. Confirm that this changes the maximum and minimum
values
def rescale(ts):
    """Take a np.ndarray and return the standardized array by subtracting and dividing by the mean.

    Note:
        You should first subtract the mean and then divide by the mean.

    Args:
        ts (np.ndarray): Time series data (`np.float`) representing pixel intensity.

    Returns:
        np.ndarray: The times series adjusted by subtracting the mean and dividing by the mean.
    """
    mn = np.mean(ts)
    #res = map(lambda x: (x - mn) / mn, ts)
    res = (ts - mn) / mn
    res = np.asarray(res)
    return res

scaledData = rawData.mapValues(lambda v: rescale(v))
mnScaled = scaledData.map(lambda (k, v): v).map(lambda v: min(v)).min()
mxScaled = scaledData.map(lambda (k, v): v).map(lambda v: max(v)).max()
print mnScaled, mxScaled

-0.271512880125 0.905448764348
The normalized data in a plot:
example = scaledData.filter(lambda (k, v): np.std(v) > 0.1).values().first()
plt.plot(range(len(example)), example, c='#8cbfd0', linewidth='3.0')
3e, compress the dataset from 240 seconds down to 3
We now have a
preprocessed dataset with n=46460 pixels and d=240 seconds of time
series data for each pixel.
We can interpret the
pixels as our observations and each pixel value in the time series as
a feature.
We would like to find
patterns in brain activity during this time series, and we expect to
find correlations over time.
We can thus use PCA to
find a more compact representation of our data and allow us to
visualize it.
Use the pca function
from Part (2a) to perform PCA on the preprocessed neuroscience data
with k=3, resulting in a new low-dimensional 46460 by 3 dataset.
The pca function takes
an RDD of arrays, but data is an RDD of key-value pairs, so you'll
need to extract the values.
# Run pca using scaledData
componentsScaled, scaledScores, eigenvaluesScaled = pca(scaledData.map(lambda (k, v): v), k=3)
How the compressed data looks in plots
Now, we'll view the
scores for the top two components as images.
Note that we reshape
the vectors by the dimensions of the original image, 230 x 202.
These graphs map the
values for the single component to a grayscale image. This provides
us with a visual representation which we can use to see the overall
structure of the zebrafish brain and to identify where high and low
values occur.
However, using this
representation, there is a substantial amount of useful information
that is difficult to interpret.
In the next
visualization, we'll see how we can improve interpretability by
combining the two principal components into a single image using a
color mapping
scoresScaled = np.vstack(scaledScores.collect())
imageOneScaled = scoresScaled[:, 0].reshape(230, 202).T
image = plt.imshow(imageOneScaled, interpolation='nearest', aspect='auto', cmap=cm.gray)

imageTwoScaled = scoresScaled[:, 1].reshape(230, 202).T
image = plt.imshow(imageTwoScaled, interpolation='nearest', aspect='auto', cmap=cm.gray)
And by mapping the components to colors, we can get an even more interesting picture:
brainmap = polarTransform(2.0, [imageOneScaled, imageTwoScaled])
image = plt.imshow(brainmap, interpolation='nearest', aspect='auto')
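Note that polarTransform is a helper that ships with the lab notebook and isn't defined in this post. Purely to illustrate the idea of merging two component images into one color picture (a naive sketch of my own, not the lab's implementation), one could put the two normalized score images into separate color channels:

import numpy as np
import matplotlib.pyplot as plt

def naiveComposite(imgA, imgB):
    """Combine two 2D score images into a single RGB image (hypothetical helper)."""
    normA = (imgA - imgA.min()) / (imgA.max() - imgA.min())  # rescale to [0, 1]
    normB = (imgB - imgB.min()) / (imgB.max() - imgB.min())
    rgb = np.zeros(imgA.shape + (3,))
    rgb[..., 0] = normA   # component 1 drives the red channel
    rgb[..., 2] = normB   # component 2 drives the blue channel
    return rgb

plt.imshow(naiveComposite(imageOneScaled, imageTwoScaled), interpolation='nearest', aspect='auto')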
Part 4: Perform feature-based aggregation followed by PCA
4a, the trick is that the 240 seconds of measurements in our dataset are responses to 12 visual patterns, 20 seconds per measurement.
So we need to learn to aggregate the data. In this case, via (who would have thought) matrix multiplication.
For this exercise,
you'll create several arrays that perform different types of
aggregation
vector = np.array([0., 1., 2., 3., 4., 5.])

# Create a multi-dimensional array that when multiplied (using .dot) against vector, results in
# a two element array where the first element is the sum of the 0, 2, and 4 indexed elements of
# vector and the second element is the sum of the 1, 3, and 5 indexed elements of vector.
# This should be a 2 row by 6 column array
sumEveryOther = np.array([[1, 0, 1, 0, 1, 0],
                          [0, 1, 0, 1, 0, 1]])

# Create a multi-dimensional array that when multiplied (using .dot) against vector, results in a
# three element array where the first element is the sum of the 0 and 3 indexed elements of vector,
# the second element is the sum of the 1 and 4 indexed elements of vector, and the third element is
# the sum of the 2 and 5 indexed elements of vector.
# This should be a 3 row by 6 column array
sumEveryThird = np.array([[1, 0, 0, 1, 0, 0],
                          [0, 1, 0, 0, 1, 0],
                          [0, 0, 1, 0, 0, 1]])

# Create a multi-dimensional array that can be used to sum the first three elements of vector and
# the last three elements of vector, which returns a two element array with those values when dotted
# with vector.
# This should be a 2 row by 6 column array
sumByThree = np.array([[1, 1, 1, 0, 0, 0],
                       [0, 0, 0, 1, 1, 1]])

# Create a multi-dimensional array that sums the first two elements, second two elements, and
# last two elements of vector, which returns a three element array with those values when dotted
# with vector.
# This should be a 3 row by 6 column array
sumByTwo = np.array([[1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 0, 0],
                     [0, 0, 0, 0, 1, 1]])

print 'sumEveryOther.dot(vector):\t{0}'.format(sumEveryOther.dot(vector))
print 'sumEveryThird.dot(vector):\t{0}'.format(sumEveryThird.dot(vector))
print '\nsumByThree.dot(vector):\t{0}'.format(sumByThree.dot(vector))
print 'sumByTwo.dot(vector): \t{0}'.format(sumByTwo.dot(vector))

sumEveryOther.dot(vector):	[ 6.  9.]
sumEveryThird.dot(vector):	[ 3.  5.  7.]

sumByThree.dot(vector):	[ 3.  12.]
sumByTwo.dot(vector): 	[ 1.  5.  9.]
4b, using np.tile and np.eye to build the matrices
In this exercise,
recreate sumEveryOther and sumEveryThird using np.tile and np.eye
# Use np.tile and np.eye to recreate the arrays
sumEveryOtherTile = np.tile(np.eye(2), 3)
sumEveryThirdTile = np.tile(np.eye(3), 2)

print sumEveryOtherTile
print 'sumEveryOtherTile.dot(vector): {0}'.format(sumEveryOtherTile.dot(vector))
print '\n', sumEveryThirdTile
print 'sumEveryThirdTile.dot(vector): {0}'.format(sumEveryThirdTile.dot(vector))

[[ 1.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  1.  0.  1.]]
sumEveryOtherTile.dot(vector): [ 6.  9.]

[[ 1.  0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  1.]]
sumEveryThirdTile.dot(vector): [ 3.  5.  7.]
4c, using np.kron to build the matrices
For this exercise,
you'll recreate the sumByThree and sumByTwo arrays using np.kron,
np.eye, and np.ones. Note that np.ones creates an array of all ones
# Use np.kron, np.eye, and np.ones to recreate the arrays
sumByThreeKron = np.kron(
    # np.array([[1, 0],
    #           [0, 1]]),
    np.eye(2),
    # np.array([1, 1, 1])
    np.ones((1, 3))
)
sumByTwoKron = np.kron(
    # np.array([[1, 0, 0],
    #           [0, 1, 0],
    #           [0, 0, 1]]),
    np.eye(3),
    # np.array([1, 1])
    np.ones((1, 2))
)

print sumByThreeKron
print 'sumByThreeKron.dot(vector): {0}'.format(sumByThreeKron.dot(vector))
print '\n', sumByTwoKron
print 'sumByTwoKron.dot(vector): {0}'.format(sumByTwoKron.dot(vector))

[[ 1.  1.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.  1.]]
sumByThreeKron.dot(vector): [ 3.  12.]

[[ 1.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  0.  0.]
 [ 0.  0.  0.  0.  1.  1.]]
sumByTwoKron.dot(vector): [ 1.  5.  9.]
4d, now we can aggregate our dataset
we'll first study the
temporal aspects of neural response, by aggregating our features by
time. In other words, we want to see how different pixels (and the
underlying neurons captured in these pixels) react in each of the 20
seconds after a new visual pattern is displayed, regardless of what
the pattern is.
Hence, instead of
working with the 240 features individually, we'll aggregate the
original features into 20 new features, where the first new feature
captures the pixel response one second after a visual pattern
appears, the second new feature is the response after two seconds,
and so on.
We can perform this
aggregation using a map operation.
First, build a
multi-dimensional array T that, when dotted with a 240-dimensional
vector, sums every 20-th component of this vector and returns a
20-dimensional vector.
Note that this exercise
is similar to (4b).
Once you have created
your multi-dimensional array T, use a map operation with that array
and each time series to generate a transformed dataset. We'll cache
and count the output, as we'll be using it again
# Create a multi-dimensional array to perform the aggregation
T = np.tile(
    np.kron(
        np.eye(20),
        np.ones((1, 1))
    ),
    12
)

# Transform scaledData using T.  Make sure to retain the keys.
timeData = scaledData.map(lambda (k, v): (k, T.dot(v)))
timeData.cache()
print timeData.count()
print timeData.first()

46460
((0, 0), array([ 0.00802155,  0.00607693, -0.0075354 ,  0.00121539,  0.02163388,  0.00121539, -0.03087082,  0.00510462,  0.01191079,  0.02455081, -0.0182308 ,  0.00802155, -0.00948002, -0.00948002,  0.02163388, -0.02212004,  0.00704924,  0.00121539, -0.01142464, -0.00850771]))
4e, compress the aggregated dataset with PCA
We now have a
time-aggregated dataset with n=46460 pixels and d=20 aggregated time
features, and we want to use PCA to find a more compact
representation. Use the pca function from Part (2a) to perform PCA on
this data with k=3, resulting in a new low-dimensional 46,460 by
3 dataset. As before, you'll need to extract the values from timeData
since it is an RDD of key-value pairs
componentsTime, timeScores, eigenvaluesTime = pca(timeData.map(lambda (k, v): v), k=3)

print 'componentsTime: (first five) \n{0}'.format(componentsTime[:5, :])
print ('\ntimeScores (first three): \n{0}'
       .format('\n'.join(map(str, timeScores.take(3)))))
print '\neigenvaluesTime: (first five) \n{0}'.format(eigenvaluesTime[:5])

componentsTime: (first five)
[[ 0.27392702 -0.16152431  0.01388556]
 [ 0.09941893 -0.31968127 -0.34738824]
 [-0.03376505 -0.32933108 -0.35606954]
 [-0.12092744 -0.2845482  -0.27232364]
 [-0.18219248 -0.22998061 -0.12248985]]

timeScores (first three):
[-0.00720617 -0.00292979 -0.00223645]
[ 0.02353076 -0.00197457  0.00362094]
[ 0.01310623  0.00123069 -0.00582974]

eigenvaluesTime: (first five)
[ 0.77528991  0.05038881  0.01173423  0.0059711   0.00138073]
Let's see what we got.
Let's view the scores
from the first two PCs as a composite image. When we preprocess by
aggregating by time and then perform PCA, we are only looking at
variability related to temporal dynamics. As a result, if neurons
appear similar -- have similar colors -- in the resulting image, it
means that their responses vary similarly over time, regardless of
how they might be encoding direction. In the image below, we can
define the midline as the horizontal line across the middle of the
brain. We see clear patterns of neural activity in different parts of
the brain, and crucially note that the regions on either side of the
midline are similar, which suggests that temporal dynamics do not
differ across the two sides of the brain
scoresTime = np.vstack(timeScores.collect())
imageOneTime = scoresTime[:, 0].reshape(230, 202).T
imageTwoTime = scoresTime[:, 1].reshape(230, 202).T
brainmap = polarTransform(3, [imageOneTime, imageTwoTime])
image = plt.imshow(brainmap, interpolation='nearest', aspect='auto')
4f, aggregate differently, into 12 (instead of 20) features
Next, let's perform a
second type of feature aggregation so that we can study the
direction-specific aspects of neural response, by aggregating our
features by direction.
In other words, we want
to see how different pixels (and the underlying neurons captured in
these pixels) react when the zebrafish is presented with 12
direction-specific patterns, ignoring the temporal aspect of the
reaction.
Hence, instead of
working with the 240 features individually, we'll aggregate the
original features into 12 new features, where the first new feature
captures the average pixel response to the first direction-specific
visual pattern, the second new feature is the response to the second
direction-specific visual pattern, and so on.
As in Part (4c), we'll
design a multi-dimensional array D that, when multiplied by a
240-dimensional vector, sums the first 20 components, then the second
20 components, and so on.
Note that this is
similar to exercise (4c).
First create D, then
use a map operation with that array and each time series to generate
a transformed dataset.
We'll cache and count
the output, as we'll be using it again.
# Create a multi-dimensional array to perform the aggregation
D = np.kron(
    np.eye(12),
    np.ones((1, 20))
)

# Transform scaledData using D.  Make sure to retain the keys.
directionData = scaledData.map(lambda (k, v): (k, D.dot(v)))
directionData.cache()
print directionData.count()
print directionData.first()

46460
((0, 0), array([ 0.03346365,  0.03638058, -0.02195799, -0.02487492,  0.00721129,  0.00332206, -0.02098568,  0.00915591, -0.00542873, -0.01029027,  0.0081836 , -0.01417951]))
4g, compress the result with PCA
We now have a
direction-aggregated dataset with n=46460 pixels and d=12 aggregated
direction features, and we want to use PCA to find a more compact
representation. Use the pca function from Part (2a) to perform PCA on
this data with k=3, resulting in a new low-dimensional 46460 by 3
dataset. As before, you'll need to extract the values from
directionData since it is an RDD of key-value pairs
componentsDirection, directionScores, eigenvaluesDirection = pca(directionData.map(lambda (k, v): v), k=3)

print 'componentsDirection: (first five) \n{0}'.format(componentsDirection[:5, :])
print ('\ndirectionScores (first three): \n{0}'
       .format('\n'.join(map(str, directionScores.take(3)))))
print '\neigenvaluesDirection: (first five) \n{0}'.format(eigenvaluesDirection[:5])

componentsDirection: (first five)
[[-0.25952179  0.16201941  0.24947433]
 [-0.31369506 -0.09185175  0.29464223]
 [-0.21716693 -0.35944645  0.35296454]
 [-0.11517273 -0.37356905  0.07169062]
 [ 0.02996577 -0.36272623 -0.14783897]]

directionScores (first three):
[-0.01622513  0.01322998  0.01322204]
[ 0.00999482  0.0652367  -0.04524758]
[ 0.004646    0.05751097  0.00756383]

eigenvaluesDirection: (first five)
[ 0.96411048  0.77613553  0.12762987  0.09775924  0.04333691]
A graphical view of the result
Again, let's view the
scores from the first two PCs as a composite image. When we
preprocess by averaging across time (group by direction), and then
perform PCA, we are only looking at variability related to stimulus
direction. As a result, if neurons appear similar -- have similar
colors -- in the image, it means that their responses vary similarly
across directions, regardless of how they evolve over time. In the
image below, we see a different pattern of similarity across regions
of the brain. Moreover, regions on either side of the midline are
colored differently, which suggests that we are looking at a
property, direction selectivity, that has a different representation
across the two sides of the brain
scoresDirection = np.vstack(directionScores.collect())
imageOneDirection = scoresDirection[:, 0].reshape(230, 202).T
imageTwoDirection = scoresDirection[:, 1].reshape(230, 202).T
brainmap = polarTransform(2, [imageOneDirection, imageTwoDirection])
image = plt.imshow(brainmap, interpolation='nearest', aspect='auto')
In lieu of a conclusion:
In the analyses above
we have successfully identified regions of the brain that encode
particular properties, e.g., a particular temporal pattern or
selectivity to a stimulus. However, this is only the first step!
These exploratory analyses are typically followed with more targeted
investigation, both through analysis and experiment. For example, we
might find all neurons that prefer one stimulus direction, and then
do an experiment in which we stimulate or inactivate only those
neurons and look at the effect on the animal's behavior.
Alternatively, we might subdivide neurons into groups based on simple
forms of stimulus selectivity like the ones analyzed here, and then
estimate coupling across different neuronal populations, i.e. can we
predict one population's response as a function of another. This can
be framed as a massive pair-wise regression problem, related to
techniques you learned earlier in the course, and demanding
large-scale implementations.
So there.
And that's how the course ended:
BerkeleyX: CS190.1x Scalable Machine Learning
All in all, nothing too serious. Just a taste of the tooling and the ways to use it.
original post http://vasnake.blogspot.com/2015/12/week-5-lab-5.html