SciPy

SciPy#

What is SciPy#

SciPy stands for (Sci)entific (Py)thon. It is an open-source Python library that contains modules for scientific computing such as linear algrbra, optimization, image/signal processing,, statistics, and many more. SciPy uses the NumPy array as it’s base data struture, but also builds upon Pandas, Matplotlib, and SymPy.

import numpy as np 
import pandas as pd 
from allensdk.core.cell_types_cache import CellTypesCache
from allensdk.api.queries.cell_types_api import CellTypesApi
import matplotlib.pyplot as plt
from scipy import stats 
from scipy import linalg
ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 3
      1 import numpy as np 
      2 import pandas as pd 
----> 3 from allensdk.core.cell_types_cache import CellTypesCache
      4 from allensdk.api.queries.cell_types_api import CellTypesApi
      5 import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'allensdk'

We will be demonstrating SciPy using the Allen Cell Types Dataset. We will discuss what this dataset contains and how to navigate through it in a later chapter.

# Download meatadata for only human cells 
human_cells = pd.DataFrame(ctc.get_cells(species=[CellTypesApi.HUMAN])).set_index('id')

# Download electrophysiology data 
ephys_df = pd.DataFrame(ctc.get_ephys_features()).set_index('specimen_id')

# Combine our ephys data with our metadata for human cells 
human_ephys_df = human_cells.join(ephys_df)
human_ephys_df.head()

	reporter_status	cell_soma_location	species	name	structure_layer_name	structure_area_id	structure_area_abbrev	dendrite_type	apical	...	trough_t_ramp	trough_t_short_square	trough_v_long_square	trough_v_ramp	trough_v_short_square	upstroke_downstroke_ratio_long_square	upstroke_downstroke_ratio_ramp	upstroke_downstroke_ratio_short_square	vm_for_sag	vrest
id
525011903	None	[273.0, 354.0, 216.0]	Homo Sapiens	H16.03.003.01.14.02	3	12113	FroL	spiny	intact	...	4.134987	1.375253	-53.968754	-59.510420	-71.197919	2.895461	2.559876	3.099787	-88.843758	-70.561035
528642047	None	[69.0, 254.0, 96.0]	Homo Sapiens	H16.06.009.01.02.06.05	5	12141	MTG	aspiny	NA	...	NaN	1.051160	-67.468758	NaN	-70.875002	1.891881	NaN	1.989616	-101.000000	-69.209610
537256313	None	[322.0, 255.0, 92.0]	Homo Sapiens	H16.03.006.01.05.02	4	12141	MTG	spiny	truncated	...	5.694547	1.389900	-52.125004	-51.520836	-72.900002	3.121182	3.464528	3.054681	-87.531250	-72.628105
519832676	None	[79.0, 273.0, 91.0]	Homo Sapiens	H16.03.001.01.09.01	3	12141	MTG	spiny	truncated	...	9.962780	1.211020	-53.875004	-52.416668	-73.693753	4.574865	3.817988	4.980603	-84.218758	-72.547661
596020931	None	[66.0, 220.0, 105.0]	Homo Sapiens	H17.06.009.11.04.02	4	12141	MTG	aspiny	NA	...	14.667340	1.336668	-63.593754	-63.239583	-75.518753	1.452890	1.441754	1.556087	-82.531250	-74.260269

5 rows × 70 columns

SciPy for two-sample statistics#

# Set up our two samples 
spiny_df = human_ephys_df[human_ephys_df['dendrite_type']== 'spiny']
aspiny_df = human_ephys_df[human_ephys_df['dendrite_type']== 'aspiny']

Most often, the goal of our hypothesis testing is to test whether or not two distributions are different, or if a distribution has a different mean than the underlying population distribution.

If we know our distributions are normal (i.e. they’re generated from a normal distribution!) we could use parametric statistics to test our hypothesis. To test for differences between normal populations, we can use the independent t-test in our stats package: stats.ttest_ind(). If we had paired samples, we would use a dependent t-test: stats.ttest_rel().

If one of our populations is skewed, however, we cannot use a t-test. A t-test assumes that the populations are normally distributed. For skewed populations, we can use either the Mann-Whitney U (for independent samples: stats.mannwhitneyu()) or the Wilcoxon Signed Rank Test (for dependent/paired samples: stats.wilcoxon()).

Below we will run a statistical test that compares aspiny_df['tau'] to spiny_df['tau']. Before we decide what test to run, we must first check the skewness of our data. To test for skewness, we can use stats.skewtest(). If the skew test gives us a p-value of less than 0.05, the population is skewed.

# Subselect our samples 
sample_1 = spiny_df['tau']
sample_2 = aspiny_df['tau']

# Run the skew test
stats_1, pvalue_1 = stats.skewtest(sample_1) 
stats_2, pvalue_2 = stats.skewtest(sample_2)

# Print the p-value of both skew tests
print('Spiny skew test pvalue: ' + str(pvalue_1))
print('Aspiny skew test pvalue: ' + str(pvalue_2))

# Plot our distributions 
plt.hist(sample_1)
plt.show()
plt.hist(sample_2)
plt.show()

Spiny skew test pvalue: 1.4148485388194806e-12
Aspiny skew test pvalue: 1.1827442418538926e-09

../_images/c60867b01f1a3f5cbcf43c553490e1af2b6aab440be48c1d897f1f09336de23c.png

../_images/950ef0f8555ebbebf5bc374538f52b1b44ef26d8d0e260a8ef001a1dae3009cc.png

Our pvalues indicate that both of our samples are skewed, therefore we will continure with the Mann-Whitney U test.

print(stats.mannwhitneyu(sample_1, sample_2))

MannwhitneyuResult(statistic=4759.0, pvalue=3.864465154121331e-18)

SciPy for Linear Algebra#

SciPy can be used to find solutions to mathematical algorithms as well. The package has a module with many tools that are helpful with linear algebra. The funtions availabe can help with solving the determinant of a matrix, solve for eigen values, and other liniar algebra problems. After importing the linear algebra module, linalg from the SciPy package, it is important to use the syntax linalg.function() when executing a function from the module.

The linalg.det() method is used to solve for the the determinant of a matrix. This function only takes in square matrices as an argument, inputting a non-square matrix will result in an error. Please visit here for the original documentation.

# Assign our square matrix
matrix = np.array([[1,3,5,2], [6,3,9,7], [2,7,8,5], [9,4,1,8]])
print(matrix)

[[1 3 5 2]
 [6 3 9 7]
 [2 7 8 5]
 [9 4 1 8]]

print(linalg.det(matrix))

-237.00000000000006

The linalg.lu() method is used to solve the LU decomposition of a matrix. It will return the permutation matrix, p, the lower triangular matrix with unit diagonal elements, l, and the upper triangular, u. You can look at the SciPy documentation for more information.

# Assign our matrix 
p, l, u = linalg.lu(matrix)
print('Permutation matrix:')
print(p)
print('\n')
print('Lower triangular:')
print(l)
print('\n')
print('Upper Triangle')
print(u)

Permutation matrix:
[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]


Lower triangular:
[[1.         0.         0.         0.        ]
 [0.22222222 1.         0.         0.        ]
 [0.66666667 0.05454545 1.         0.        ]
 [0.11111111 0.41818182 0.20689655 1.        ]]


Upper Triangle
[[ 9.          4.          1.          8.        ]
 [ 0.          6.11111111  7.77777778  3.22222222]
 [ 0.          0.          7.90909091  1.49090909]
 [ 0.          0.          0.         -0.54482759]]

Now that we have p, l, and u, we can used a funtion from numpy, np.dot() to execute the dot product of l and u to return our matrix.

print(np.dot(l,u))

[[9. 4. 1. 8.]
 [2. 7. 8. 5.]
 [6. 3. 9. 7.]
 [1. 3. 5. 2.]]

You can also find the eigen values and eigen vectors with SciPy. The method linalg.eig() takes in a complex or real matrix and returns its eigenvalues. Please visit the SciPy documentation for more help.

eigen_values, eigen_vectors = linalg.eig(matrix) 
print(eigen_values)
print('\n')
print(eigen_vectors) 

[20.15126564+0.j          1.64808546+0.83387563j  1.64808546-0.83387563j
 -3.44743656+0.j        ]


[[-2.94498514e-01+0.j         -4.63909706e-01+0.08050438j
  -4.63909706e-01-0.08050438j -2.00900213e-04+0.j        ]
 [-5.98355615e-01+0.j         -1.69258696e-01+0.00537595j
  -1.69258696e-01-0.00537595j -8.69603794e-01+0.j        ]
 [-5.83737468e-01+0.j         -2.96214773e-01-0.07015942j
  -2.96214773e-01+0.07015942j  4.14828984e-01+0.j        ]
 [-4.63132542e-01+0.j          8.10533088e-01+0.j
   8.10533088e-01-0.j          2.67779977e-01+0.j        ]]

Lastly, SciPy can be used to solve linear systems of equations. Please visit the SciPy documentation for more help.

array = [[2],[4],[6],[8]]
print(linalg.solve(matrix, array))

[[ 0.04219409]
 [ 0.76793249]
 [-0.31223629]
 [ 0.60759494]]

SciPy for FFT#

Let’s first generate a sine wave. We’ll then generate a second sine wave and add these together to understand what a fourier transform of this data would look like. Sine waves are defined by their frequency, ampltitude, and and phase.

sampling_freq = 1024 # Sampling frequency 
duration = 0.3 # 0.3 seconds of signal
freq1 = 7 # 7 Hz signal
freq2 = 130 # 130 Hz signal

# Generate a time vector
time_vector = np.arange(0, duration, 1/sampling_freq)

# Generate a sine wave
signal_1 = np.sin(2 * np.pi * freq1 * time_vector) 
# Generate another sine wave, with double the power
signal_2 = np.sin(2 * np.pi * freq2 * time_vector) * 2 
                                                       
# Add the signals we created above into one signal
combined_signal = signal_1 + signal_2

print('You\'ve created a complex signal with two sin waves, it looks like this:')
print(combined_signal)

You've created a complex signal with two sin waves, it looks like this:
[ 0.          1.47439991  2.08519495  1.48970011  0.07282654 -1.28516247
 -1.73971525 -0.9915122   0.53292413  1.93848187  2.40138863  1.66610567
  0.19943724 -1.09111077 -1.40482347 -0.53084714  1.02457393  2.34344933
  2.64978051  1.77764376  0.27124849 -0.94338912 -1.11709493 -0.12950467
  1.43829796  2.65429147  2.79773083  1.79391154  0.25921309 -0.87075038
 -0.90555508  0.18351981  1.74615215  2.84526973  2.82232648  1.69456043
  0.14458863 -0.89138407 -0.78833049  0.39055267  1.93196998  2.90282491
  2.71311393  1.47185272 -0.07867731 -1.01069639 -0.77059328  0.48712609
  1.99306173  2.82707791  2.47339433  1.13175262 -0.40305218 -1.22060289
 -0.84403858  0.48231033  1.94034552  2.63174967  2.11992873  0.69342222
 -0.80812029 -1.5004171  -0.98795545  0.39745719  1.79689699  2.34251012
  1.68108704  0.18717967 -1.26280101 -1.81923201 -1.17176498  0.26350358
  1.59508858  1.99394982  1.19365596 -0.34884385 -1.72882574 -2.13951596
 -1.35872776  0.11715217  1.37265455  1.62552844  0.69867805 -0.87295939
 -2.16506855 -2.4215031  -1.51038607 -0.0036222   1.16814322  1.27697384
  0.23680688 -1.34459205 -2.53222567 -2.62786482 -1.59121986 -0.06421317
  1.01628945  0.98366972 -0.15628002 -1.72882537 -2.79729789 -2.72811589
 -1.57297081 -0.03762605  0.94384942  0.77257042 -0.4538759  -2.00028872
 -2.93735296 -2.7022398  -1.43812718  0.0921744   0.96638583  0.6591204
 -0.64118171 -2.1459351  -2.94212909 -2.54310883 -1.18216036  0.32820494
  1.08638016  0.64553743 -0.71684373 -2.16638653 -2.81517573 -2.25741512
 -0.81424972  0.65993593  1.29289322  0.7206598  -0.692914   -2.07569135
 -2.5733991  -1.86500303 -0.35640926  1.06435804  1.56281436  0.86137393
 -0.59324003 -1.89952345 -2.24506578 -1.39667963  0.15888389  1.50852037
  1.86355342  1.03545473 -0.45047402 -1.67203558 -1.86649794 -0.89075814
  0.69228366  1.95324482  2.15686085  1.20548496 -0.30205319 -1.43173593
 -1.47784682 -0.38873698  1.20197385  2.35758369  2.4033293   1.33339409
 -0.18562314 -1.21687133 -1.11843804  0.06938192  1.64835905  2.68350055
  2.56705006  1.38508442 -0.13444119 -1.06085796 -0.82223292  0.44958982
  1.99850132  2.9002282   2.61987956  1.33460187 -0.17329901 -0.98829474
 -0.61393553  0.72765888  2.22978557  2.98779512  2.54481636  1.16736204
 -0.31544305 -1.01202613 -0.50620012  0.89178     2.33238713  2.93930863
  2.33809309  0.88205263 -0.56085455 -1.13159852 -0.49826434  0.94374399
  2.3102546   2.76172856  2.00973769  0.49098778 -0.89609186 -1.33329087
 -0.57616702  0.89853024  2.18049461  2.47504112  1.5825372   0.01887107
 -1.29571509 -1.59171638 -0.71452375  0.78235267  1.97123209  2.1099297
  1.08952395 -0.49989026 -1.72512787 -1.87280792 -0.87965142  0.62939344
  1.7181973   1.70421498  0.57027654 -1.02500769 -2.14450529 -2.137838
 -1.03367474  0.47760828  1.46044012  1.29848001  0.06646661 -1.51482552
 -2.51334999 -2.34800292 -1.1391332   0.3640957   1.23567343  0.93139163
 -0.38282981 -1.93100089 -2.79514459 -2.46903474 -1.16354794  0.32057336
  1.07579109  0.63526422 -0.74550978 -2.24280425 -2.96155738 -2.47530093
 -1.08341299  0.36949224  1.00308503  0.43238392 -1.         -2.43053938
 -2.99571025 -2.35291175 -0.88714229  0.52124465  1.02760444  0.33252192
 -1.13757835 -2.4876847  -2.89412705 -2.10147081 -0.57662604  0.77279405
  1.14596598  0.33192635 -1.1631911  -2.4215072  -2.66713509 -1.73426303
 -0.16721248  1.1078885   1.34175427  0.41390957 -1.0946721  -2.25207847
 -2.3376726  -1.27685668  0.31388467  1.49883394  1.58746591  0.55096047
 -0.96045719 -2.00980935 -1.93864172 -0.76428128  0.83056896  1.90962138
  1.84776934  0.70813353 -0.796063   -1.73179237 -1.50911587 -0.23710996
  1.34188723  2.30004343  2.08369917  0.84731657 -0.63974373 -1.45738083
 -1.08984464  0.26309767]

What we have are the signal values for our complex signal composed of the two sin waves. To see our combined_signal we must plot it using plt.plot().

# Set up our figure
fig = plt.figure(figsize=(12, 4))

# Plot 0.5 seconds of data
plt.plot(time_vector, combined_signal) 
plt.ylabel('Signal',fontsize=14)
plt.xlabel('Time',fontsize=14)
plt.show()

../_images/f7addf432e91b21daa9f440462f9b48b27ae3f883910b048efef163763ffa1d9.png

# plt.plot(time_vector, signal_1)
# plt.show()
# plt.plot(time_vector, signal_2)
# plt.show()

Below, we’ll calculate the Fourier Transform, which will determine the frequencies that are in our sample. We’ll implement this with Welch’s Method, which consists in averaging consecutive Fourier transform of small windows of the signal, with or without overlapping. Basically, we calculate the fourier transform of a signal across a few sliding windows, and then calculate the mean PSD from all the sliding windows.

The freqs vector contains the x-axis (frequency bins) and the psd vector contains the y-axis values (power spectral density). The units of the power spectral density, when working with EEG data, is usually \(\mu\)V^2 per Hz.

# Import signal processing module 
from scipy import signal

# Define sliding window length (4 seconds, which will give us 2 full cycles at 0.5 Hz)
win = 4 * sampling_freq
freqs, psd = signal.welch(combined_signal, sampling_freq, nperseg=win)

# Plot our data
plt.plot(freqs,psd) # Plot a select range of frequencies
plt.ylabel('Power')
plt.xlabel('Frequency (Hz)')
plt.title('FFT of a complex signal')
plt.show()

/Users/VictorMagdaleno/opt/anaconda3/lib/python3.7/site-packages/scipy/signal/spectral.py:1966: UserWarning: nperseg = 4096 is greater than input length  = 308, using nperseg = 308
  .format(nperseg, input_length))

../_images/462939b9a9ede884916ecbe9d3c08f317f00db3dcd3b3bd5ad802e67e17f05fc.png

The Fourier Transformation shows us the signal frequencies that make up our combined signal.

SciPy for Signal Procesing#

Normal physiological data is never as regular as the data above – it’s usually chock full of lots of different waves, as well as noise. Now that we have a sense of the tools we need, let’s work with some real data.

The data we’ll import here is a real 30-seconds extract of slow-wave sleep from a young individual, collected by the Walker Lab at UC Berkeley. This data was collected at 100 Hz from channel ‘F3’. This sampling frequency is fine for EEG data, but wouldn’t be enough for high frequency spiking data. That kind of data is typically sampled at 40 kHz.

import urllib.request

# URL of data to download
data_url = 'https://raphaelvallat.com/images/tutorials/bandpower/data.txt'

# Get the data and save it locally as "sleep_data.txt"
sleep_data, headers = urllib.request.urlretrieve(data_url, './sleep_data.txt')

# Load the .txt file as a numpy array
data = np.loadtxt('sleep_data.txt')

Now that we have the data, let’s took a look at the raw signal.

sampling_freq = 100

time_vector = np.arange(0,30,1/sampling_freq)
fig = plt.figure(figsize=(15,5))
plt.plot(time_vector,data)
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()

../_images/f582d0ad462a493dc74f6af78760f7323ff9a9a7996a8b1ffbab7d9e8515df81.png

In this real EEG data, the underlying frequencies are much harder to see by eye. So, we’ll compute a bandpass by first applying a low-pass filter_, followed by a _high-pass filter (or vice versa).

Signal filtration is usually accomplished in 2 steps

Design a filter kernel
Apply the filter kernel to the data

We will use a Butterworth filter. The ideal filter would completely pass everything in the passband (i.e., allow through the parts of the signal we care about) and completely reject everything outside of it, but this cannot be achieved in reality—the Butterworth filter is a close approximation.

We design the filter in Python using scipy’s signal.butter function, with three arguments:

The filter order : (we’ll use a 4th order filter)
The filter frequency : (we must adjust for the sampling frequency, f_s, which is 100 Hz for these data, i.e. 100 data points were recorded per second)
The type of filter : ('lowpass' or 'highpass')

It returns 2 filter parameters, a and b. Then, the bandpass filter is applied using signal.filtfilt, which takes as its parameters b, a, and the signal to be processed

Below, an example bandpass computation is shown to extract the alpha rhythm from the channel 1 data, the results are stored in a dictionary called oscillations_filtered, with the oscillation name (e.g. 'alpha') as the key

# Define lower and upper limits of our bandpass
filter_limits = [0.5, 4]

# First, apply a lowpass filter
# Design filter with high filter limit
b, a = signal.butter(4, (filter_limits[1]/ (sampling_freq / 2)), 'lowpass') 

# Apply it forwards and backwards (filtfilt)
lowpassed = signal.filtfilt(b, a, data)

# Then, apply a high pass filter
# Design filter with low filter limit
b, a = signal.butter(4, (filter_limits[0] / (sampling_freq / 2)), 'highpass') 

# Apply it
bandpassed = signal.filtfilt(b, a, lowpassed) 

Now lets plot our bandpassed data.

# Plot the bandpassed data
fig = plt.figure(figsize=(15,5))
plt.plot(time_vector,bandpassed)
plt.ylabel('Voltage')

# Let's programmatically set the title here, using {} format
plt.title('N3 sleep EEG data (F3), {} band' .format(filter_limits))

plt.xlabel('Time (seconds)')
plt.show()

../_images/c887d69fcee5c578ef636228976869c6a13cd693a0ef5854183203ef71845893.png