NumPy#
NumPy is a useful package that can help store and wrangle homogeneous data. “Homogeneous” means that all values in the data are of the same data type.
We strongly recommend working through the NumPy Quickstart Tutorial or the NumPy beginners tutorial for a more comprehensive introduction to NumPy. Here, we’ll introduce some useful tools using the NumPy package to analyze large datasets.
Before we can use NumPy, we need to import the package. We can also nickname modules when we import them. The convention is to import numpy as np.
# Import packages
import numpy as np
# Use whos to see available modules
%whos
Variable Type Data/Info
------------------------------
np module <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
NumPy Arrays#
The basis of the NumPy package is the array. A NumPy array is similar to a list of lists or a grid of values. You can create a NumPy array from a list using np.array(), by reading in a file, or through functions built into the NumPy package such as arange, linspace, empty, etc.
# Create a list
list_1 = [2, 4, 6, 8, 10, 12]
# Store list as a numpy array
array1 = np.array(list_1)
print(array1)
[ 2 4 6 8 10 12]
What we have created is a one-dimensional array, which is similar to a normal list. NumPy arrays, however, can be multidimensional. If we input a list of lists into np.array(), the output is a multidimensional array (i.e., a grid/matrix).
# Create a 2nd list
list_2 = [1, 3, 5, 7, 9, 11]
# Store list of lists as a NumPy array
my_array = np.array([list_1, list_2])
print(my_array)
[[ 2 4 6 8 10 12]
[ 1 3 5 7 9 11]]
Accessing attributes of NumPy arrays#
We can return the shape and size of an array by using the attributes shape and size. The shape attribute returns a tuple giving the number of rows and columns of an array. The size attribute returns the total number of values stored within an array.
print('My array has a shape of:')
print(my_array.shape)
print('\nMy array has a size of:')
print(my_array.size)
My array has a shape of:
(2, 6)
My array has a size of:
12
Other attributes that might be of interest are ndim and dtype, which respectively return the number of dimensions of the array and the data type of the values stored in the array. You can see the full list of ndarray attributes in the NumPy ndarray documentation.
print('My array dimensions:')
print(my_array.ndim)
print('\nMy array contains values of data type:')
print(my_array.dtype)
My array dimensions:
2
My array contains values of data type:
int64
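Because an array holds a single data type, NumPy converts mixed input to one common type when the array is created. The snippet below is a small sketch of that behavior; the variable names are illustrative and are not used elsewhere in this lesson. Note also that the default integer type can differ by platform and NumPy version, so you may see int32 rather than int64 on some systems.

```python
import numpy as np

# Integers mixed with a float: every element is stored as a float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64

# Numbers mixed with a string: every element is stored as a string
mixed_text = np.array([1, 'two', 3])
print(mixed_text.dtype.kind)  # 'U' stands for a Unicode string type
print(mixed_text[0])          # the integer 1 has become the string '1'
```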
Indexing & Slicing Arrays#
You can index NumPy arrays using array_name[row,column] to select a single value. As with lists, indexing starts at 0, so the first row and the first column both have index 0. If you omit the column, it will give you the entire row. You can also use : in place of either row or column to indicate you want to return all those values. We will demonstrate by indexing my_array.
# Select the number 6 from our array
print('The value stored in row 1, column 3 is:')
print(my_array[0,2])
# Select the 2nd row from our array
print('The values stored in row 2 are:')
print(my_array[1])
The value stored in row 1, column 3 is:
6
The values stored in row 2 are:
[ 1 3 5 7 9 11]
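The text above mentions using : in place of the row, which the example does not show. As a short sketch, the block below rebuilds the same array so it can run on its own, then selects a whole column and uses a negative index to count from the end.

```python
import numpy as np

my_array = np.array([[2, 4, 6, 8, 10, 12],
                     [1, 3, 5, 7, 9, 11]])

# ':' in the row position returns every row of the third column
third_column = my_array[:, 2]
print(third_column)  # [6 5]

# Negative indices count from the end, as they do for lists
last_of_second_row = my_array[1, -1]
print(last_of_second_row)  # 11
```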
You may want to look at a slice of columns or a slice of rows. You can slice your array like the following: array[start_row:stop_row, start_col:stop_col]. As with lists, the stop index is not included in the slice.
# Print the first 3 columns of each row
print(my_array[:, 0:3])
[[2 4 6]
[1 3 5]]
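To illustrate slicing rows and columns at the same time, here is a brief sketch using the same values as my_array. It also shows an optional step value, which works the same way as in list slicing.

```python
import numpy as np

my_array = np.array([[2, 4, 6, 8, 10, 12],
                     [1, 3, 5, 7, 9, 11]])

# Row 0 only, columns 2 up to (but not including) 5
block = my_array[0:1, 2:5]
print(block)        # [[ 6  8 10]]
print(block.shape)  # (1, 3): slicing keeps both dimensions

# Every other column of every row, using a step of 2
every_other = my_array[:, ::2]
print(every_other)
```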
You can also select multiple, nonsequential columns by inputting a list as your columns. Let's try to index the first, third, and last columns in my_array.
# Choose your columns of interest
columns = [0, 2, -1]
print(my_array[:, columns])
[[ 2 6 12]
[ 1 5 11]]
We can also change values in an array similar to how we would change values in a list. The syntax we use is array[row,column] = new_desired_value.
# Change the entire first row of my_array to 100
my_array[0,:] = 100
print(my_array)
[[100 100 100 100 100 100]
[ 1 3 5 7 9 11]]
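The same assignment syntax works for a single value or for a slice. The sketch below uses a separate array, demo_array (a name chosen here for illustration), so that it does not alter my_array.

```python
import numpy as np

demo_array = np.array([[2, 4, 6, 8, 10, 12],
                       [1, 3, 5, 7, 9, 11]])

# Change one value: second row, first column
demo_array[1, 0] = -1

# Change a slice: the last two columns of every row
demo_array[:, -2:] = 0

print(demo_array)
```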
For further explanation of how to index NumPy arrays, please visit the NumPy indexing documentation.
Subsetting#
We can also subset our original array to only include data that meets our criteria. We do this by applying a condition to the array, which produces an array of True/False values that selects the matching elements. The syntax for this is new_array = original_array[condition].
# Reassign our original array
my_array = np.array([list_1, list_2])
# Return only values greater than 5 from our array
condition = (my_array > 5)
filtered_array = my_array[condition]
print(filtered_array)
[ 6 8 10 12 7 9 11]
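Conditions can also be combined. As a sketch, the block below prints the boolean array behind a condition and then combines two conditions with & (and). Each condition needs its own parentheses, and | can be used in the same way for "or".

```python
import numpy as np

my_array = np.array([[2, 4, 6, 8, 10, 12],
                     [1, 3, 5, 7, 9, 11]])

# A condition is itself an array of True/False values
greater_than_5 = my_array > 5
print(greater_than_5)

# Keep values that are greater than 5 AND even
even_and_large = my_array[(my_array > 5) & (my_array % 2 == 0)]
print(even_and_large)  # [ 6  8 10 12]
```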
Benefits of Using Arrays#
If you were trying to add the numbers of two lists together, simply adding the lists with + would only append one list to the end of the other. However, if you add two NumPy arrays of the same shape together, the values are summed element by element.
# Add two lists together
list_3 = [10, 20, 30, 40]
list_4 = [20, 40, 60, 80]
print(list_3 + list_4)
print('\n')
# Add two arrays together
array_1 = np.array([10, 20, 30, 40])
array_2 = np.array([20, 40, 60, 80])
print(array_1 + array_2)
[10, 20, 30, 40, 20, 40, 60, 80]
[ 30 60 90 120]
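The same difference shows up with other operators. The following sketch compares multiplication on a list and on an array, reusing the values from the example above.

```python
import numpy as np

list_3 = [10, 20, 30, 40]
array_1 = np.array([10, 20, 30, 40])
array_2 = np.array([20, 40, 60, 80])

# Multiplying a list by 2 repeats the list
repeated_list = list_3 * 2
print(repeated_list)  # [10, 20, 30, 40, 10, 20, 30, 40]

# Multiplying an array by 2 doubles every value
doubled_array = array_1 * 2
print(doubled_array)  # [20 40 60 80]

# Multiplying two arrays multiplies matching positions
products = array_1 * array_2
print(products)
```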
Alternatively, you can use the sum() method to add all values in an array together. You can also specify whether you want to sum the values across rows or columns in a grid/matrix by passing the axis argument. If you sum along rows or columns, the output will be an array of the sums.
# Create a 3 by 2 array
array_3 = np.array([[5, 10], [15, 20], [25, 30]])
print('Original array:\n', array_3)
# Sum all values in array
print('\nArray sum: ', array_3.sum())
# Sum down each column (axis = 0 collapses the rows)
print('\nColumn sums: ', array_3.sum(axis = 0))
# Sum across each row (axis = 1 collapses the columns)
print('\nRow sums: ', array_3.sum(axis = 1))
Original array:
[[ 5 10]
[15 20]
[25 30]]
Array sum: 105
Column sums: [45 60]
Row sums: [15 35 55]
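Other array methods accept the axis argument in the same way. As a sketch, here are mean() and max() applied to the same values as array_3.

```python
import numpy as np

array_3 = np.array([[5, 10], [15, 20], [25, 30]])

# Mean of each column (axis = 0 collapses the rows)
column_means = array_3.mean(axis=0)
print(column_means)  # [15. 20.]

# Largest value in each row (axis = 1 collapses the columns)
row_maxima = array_3.max(axis=1)
print(row_maxima)  # [10 20 30]
```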
For a full list of array methods, please visit the NumPy array methods documentation.
NumPy also includes some very useful array generating functions:#
arange: like range, but returns a NumPy array instead of a range object, and can use more than just integers
linspace: creates an array with given start and end points and a desired number of points
logspace: same as linspace, but the points are evenly spaced on a log scale
random: a module that can create arrays of random values (there are many different ways to use this)
concatenate: joins two or more arrays along an existing axis
hstack and vstack: stack arrays horizontally or vertically
save and load: store your arrays in a file and load them back
Whenever we call these, we need to use whatever name we imported numpy as (here, np). We will demonstrate some of these functions below. For a full list of functions used to create arrays, please visit the NumPy array creation documentation.
# When using linspace, both end points are included
print(np.linspace(0,147,10))
[ 0. 16.33333333 32.66666667 49. 65.33333333
81.66666667 98. 114.33333333 130.66666667 147. ]
# lin_array holds 10 evenly spaced numbers
# that range from exactly 1 to 100 (both end points included)
lin_array = np.linspace(1,100, 10)
# range_array holds numbers that begin at 0 and are
# exactly 10 apart, stopping before 100
range_array = np.arange(0,100,10)
print('Linspace array: ', lin_array)
print('Range array: ', range_array)
Linspace array: [ 1. 12. 23. 34. 45. 56. 67. 78. 89. 100.]
Range array: [ 0 10 20 30 40 50 60 70 80 90]
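The main difference between the two functions is how they treat the stop value: arange leaves it out, while linspace includes it by default. The sketch below also shows a non-integer step in arange and the endpoint argument of linspace.

```python
import numpy as np

# arange accepts a non-integer step; the stop value (1) is left out
quarter_steps = np.arange(0, 1, 0.25)
print(quarter_steps)  # [0.   0.25 0.5  0.75]

# With endpoint=False, linspace also leaves out the stop value,
# so these 10 points match np.arange(0, 100, 10)
open_interval = np.linspace(0, 100, 10, endpoint=False)
print(open_interval)
```

With a non-integer step, rounding can occasionally change how many values arange returns, so linspace is often the safer choice when you need an exact number of points.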
# Create an array that has two rows using np.vstack
big_array = np.vstack([lin_array, range_array])
print(big_array)
[[ 1. 12. 23. 34. 45. 56. 67. 78. 89. 100.]
[ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90.]]
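The remaining functions from the list above can be sketched in the same way. The array names and the seed value below are arbitrary choices for illustration. The random numbers use np.random.default_rng, one of several interfaces in the random module.

```python
import numpy as np

# logspace: 4 points evenly spaced on a log scale from 10**0 to 10**3
log_array = np.logspace(0, 3, 4)
print(log_array)  # 1, 10, 100, 1000 (as floats)

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

# concatenate and hstack both join 1-D arrays end to end
joined = np.concatenate([first, second])
stacked = np.hstack([first, second])
print(joined)  # [1 2 3 4 5 6]

# For 2-D arrays, concatenate takes an axis argument
grid = np.vstack([first, second])           # shape (2, 3)
wide = np.concatenate([grid, grid], axis=1)  # shape (2, 6)
print(wide.shape)

# random: a seeded generator gives reproducible random numbers
rng = np.random.default_rng(seed=0)
random_floats = rng.random(5)  # 5 floats in the interval [0, 1)
print(random_floats)
```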
NumPy also has built-in functions to save and load arrays: np.save() and np.load(). NumPy files have a .npy extension, which np.save() adds to the filename if it is missing. See the NumPy documentation for np.save and np.load for full details.
# Save method takes arguments 'filename' and then 'array':
np.save('big_array',big_array)
my_new_matrix = np.load('big_array.npy')
print(my_new_matrix)
[[ 1. 12. 23. 34. 45. 56. 67. 78. 89. 100.]
[ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90.]]
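To check that an array survives the round trip unchanged, you can compare the loaded copy with the original. The sketch below writes to a temporary folder so it leaves no files behind; the file name demo_array is an arbitrary choice for illustration.

```python
import os
import tempfile

import numpy as np

original = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_array')
    np.save(path, original)            # writes demo_array.npy
    saved_files = os.listdir(tmp_dir)
    restored = np.load(path + '.npy')  # the extension is needed when loading

print(saved_files)  # ['demo_array.npy']
print(np.array_equal(original, restored))  # True
```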
Additional Resources#
See the Python Data Science Handbook for a more in-depth exploration of NumPy, and of course, the original NumPy documentation.