Obtaining Datasets with DANDI#
DANDI is an open source data archive for neuroscience datasets, called Dandisets. DANDI allows scientists to submit and download neural datasets, promoting research collaboration as well as consistent, transparent data standards. DANDI also provides a solution to the difficulties that come from housing data on general-purpose platforms (e.g., Dropbox, Google Drive). Usefully for our purposes here, many of the datasets on DANDI are in NWB format. If you’d like to know more about DANDI, check out the DANDI handbook.
There are two primary ways to work with Dandisets:
You can download the datasets, either via the DANDI Web Application or using the DANDI Python client below. If you download via the website, you’ll need to create an account.
You can stream datasets directly from DANDI. We’ll show you how to do this online as well as on your local computer.
Below, we demonstrate how to do both of these. For additional information on either of these methods, please refer to the DANDI documentation.
Option 1: Downloading Dandisets using Python#
The cell below will download this dataset from DANDI. This dataset contains 32-channel extracellular recordings from mouse cortex. We’re using the download tool from the dandi package below.
Note: Downloading this dataset may take several minutes, depending on your internet connection.
Note #2: This step is only possible after completing the setup steps.
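If you haven’t completed the setup steps yet, installing the DANDI client (plus PyNWB, which we’ll use in Option 2) is usually enough for this section. Below is a minimal sketch for a notebook environment; it is an assumption on our part, and your setup may differ:
# A minimal install sketch -- not part of the official setup steps
!pip install dandi pynwb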
from dandi.download import download as dandi_download
import os

# Set the URL for the Dandiset
url = 'https://dandiarchive.org/dandiset/000006/draft'

# Download the files into the current working directory;
# existing="skip" tells dandi not to re-download files you already have
dandi_download([url], output_dir=os.getcwd(), existing="skip")
PATH SIZE DONE DONE% CHECKSUM STATUS MESSAGE
000006/dandiset.yaml skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170309.nwb skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170316.nwb skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170310.nwb skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170314.nwb skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170313.nwb skipped already exists
000006/sub-anm369962/sub-anm369962_ses-20170317.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170227.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170226.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170228.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170301.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170302.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170306.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170309.nwb skipped already exists
000006/sub-anm369963/sub-anm369963_ses-20170310.nwb skipped already exists
000006/sub-anm369964/sub-anm369964_ses-20170321.nwb skipped already exists
000006/sub-anm369964/sub-anm369964_ses-20170320.nwb skipped already exists
000006/sub-anm369964/sub-anm369964_ses-20170322.nwb skipped already exists
000006/sub-anm369964/sub-anm369964_ses-20170323.nwb skipped already exists
000006/sub-anm372793/sub-anm372793_ses-20170504.nwb skipped already exists
000006/sub-anm372793/sub-anm372793_ses-20170508.nwb skipped already exists
000006/sub-anm372793/sub-anm372793_ses-20170512.nwb skipped already exists
000006/sub-anm372793/sub-anm372793_ses-20170513.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170621.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170622.nwb skipped already exists
000006/sub-anm372793/sub-anm372793_ses-20170514.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170624.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170625.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170626.nwb skipped already exists
000006/sub-anm372794/sub-anm372794_ses-20170627.nwb skipped already exists
000006/sub-anm372795/sub-anm372795_ses-20170715.nwb skipped already exists
000006/sub-anm372795/sub-anm372795_ses-20170714.nwb skipped already exists
000006/sub-anm372795/sub-anm372795_ses-20170716.nwb skipped already exists
000006/sub-anm372795/sub-anm372795_ses-20170718.nwb skipped already exists
000006/sub-anm372797/sub-anm372797_ses-20170617.nwb skipped already exists
000006/sub-anm372904/sub-anm372904_ses-20170615.nwb skipped already exists
000006/sub-anm372797/sub-anm372797_ses-20170615.nwb skipped already exists
000006/sub-anm372904/sub-anm372904_ses-20170617.nwb skipped already exists
000006/sub-anm372904/sub-anm372904_ses-20170616.nwb skipped already exists
000006/sub-anm372904/sub-anm372904_ses-20170618.nwb skipped already exists
000006/sub-anm372904/sub-anm372904_ses-20170619.nwb skipped already exists
000006/sub-anm372905/sub-anm372905_ses-20170715.nwb skipped already exists
000006/sub-anm372905/sub-anm372905_ses-20170716.nwb skipped already exists
000006/sub-anm372905/sub-anm372905_ses-20170717.nwb skipped already exists
000006/sub-anm372906/sub-anm372906_ses-20170608.nwb skipped already exists
000006/sub-anm372906/sub-anm372906_ses-20170610.nwb skipped already exists
000006/sub-anm372906/sub-anm372906_ses-20170611.nwb skipped already exists
000006/sub-anm372906/sub-anm372906_ses-20170612.nwb skipped already exists
000006/sub-anm372907/sub-anm372907_ses-20170608.nwb skipped already exists
000006/sub-anm372907/sub-anm372907_ses-20170610.nwb skipped already exists
000006/sub-anm372907/sub-anm372907_ses-20170613.nwb skipped already exists
000006/sub-anm372907/sub-anm372907_ses-20170612.nwb skipped already exists
000006/sub-anm372909/sub-anm372909_ses-20170520.nwb skipped already exists
000006/sub-anm372909/sub-anm372909_ses-20170522.nwb skipped already exists
Summary: 0 Bytes 0 Bytes 54 skipped 54 already exists
+139.6 MB 0.00%
Once the cell above completes running, you will see a new folder 📁 "000006" wherever you’re running this notebook. The code above also prints a list of the individual NWB files that were downloaded into this folder.
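If you’d like to double-check the download yourself, the short sketch below (assuming you downloaded into the current working directory, as above) lists every NWB file in the new folder:
from pathlib import Path

# List all NWB files downloaded into the 000006 folder
data_dir = Path('000006')
for nwb_path in sorted(data_dir.glob('**/*.nwb')):
    print(nwb_path)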
Once the data is done downloading, you’re ready for the next step.
Option 2: Streaming the Dandiset#
The folks at NWB have also developed a clever way to stream Dandisets so that small pieces of them can be viewed without downloading the entire dataset. This is particularly useful for very large datasets! This step is optional, and may be a better option if you have limited hard drive space and/or are having issues with Option 1 above.
Streaming via the DANDI hub#
The easiest way to stream data is via the DANDI Jupyter Hub (https://hub.dandiarchive.org/). There are setup steps for this in Chapter 1. If you’re working on the DANDI Hub, the code below should work without a hitch.
Streaming locally, after configuring your environment#
With some configuration, you can stream data on your local computer. First, you need to set up your environment with a version of the h5py package that was built with ROS3 (read-only S3) support. There are instructions here for how to do that. Once you’re done, you can restart the kernel for this notebook and run the code below.
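Before trying to stream, you can check whether your local h5py build includes the ROS3 driver. A quick sketch:
import h5py

# 'ros3' appears among the registered drivers only if h5py was built with ROS3 support
print('ros3' in h5py.registered_drivers())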
Code for Data Streaming#
First, we need to figure out the correct URL for the dataset on the Amazon S3 storage system. There is a tool to do so within the dandi.dandiapi module, which we’ll use below to get the URL for one session from the dataset we downloaded above.
from dandi.dandiapi import DandiAPIClient

dandiset_id = '000006'  # ephys dataset from the Svoboda Lab
filepath = 'sub-anm372795/sub-anm372795_ses-20170718.nwb'  # 450 kB file

with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)
    s3_path = asset.get_content_url(follow_redirects=1, strip_query=True)

print(s3_path)
https://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35
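If you don’t know an asset’s exact path ahead of time, you can browse a Dandiset’s contents with the same client. A minimal sketch:
from dandi.dandiapi import DandiAPIClient

# Print the path of every asset (file) in the Dandiset
with DandiAPIClient() as client:
    dandiset = client.get_dandiset('000006', 'draft')
    for asset in dandiset.get_assets():
        print(asset.path)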
Now we can read from this path, streaming the data rather than downloading it! The cell below will print some of the data about this experiment. It uses another package, PyNWB, which is specifically designed to work with NWB files in Python. As you might expect, this won’t be the last time we see this package. Below, we’ll use the NWBHDF5IO class from this package, which allows us to read NWB files.
Note: The code below will not work unless you’re on the Dandihub or have properly configured your environment. See above.
from pynwb import NWBHDF5IO

with NWBHDF5IO(s3_path, mode='r', load_namespaces=True, driver='ros3') as io:
    nwbfile = io.read()
    print(nwbfile)
    print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 3
1 from pynwb import NWBHDF5IO
----> 3 with NWBHDF5IO(s3_path, mode='r', load_namespaces=True, driver='ros3') as io:
4 nwbfile = io.read()
5 print(nwbfile)
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
643 def func_call(*args, **kwargs):
644 pargs = _check_args(args, kwargs)
--> 645 return func(args[0], **pargs)
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/pynwb/__init__.py:233, in NWBHDF5IO.__init__(self, **kwargs)
230 raise ValueError("cannot load namespaces from file when writing to it")
232 tm = get_type_map()
--> 233 super().load_namespaces(tm, path, file=file_obj, driver=driver)
234 manager = BuildManager(tm)
236 # XXX: Leaving this here in case we want to revert to this strategy for
237 # loading cached namespaces
238 # ns_catalog = NamespaceCatalog(NWBGroupSpec, NWBDatasetSpec, NWBNamespace)
(...)
241 # tm.copy_mappers(get_type_map())
242 else:
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
643 def func_call(*args, **kwargs):
644 pargs = _check_args(args, kwargs)
--> 645 return func(args[0], **pargs)
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/hdmf/backends/hdf5/h5tools.py:149, in HDF5IO.load_namespaces(cls, **kwargs)
138 """Load cached namespaces from a file.
139
140 If `file` is not supplied, then an :py:class:`h5py.File` object will be opened for the given `path`, the
(...)
144 :raises ValueError: if both `path` and `file` are supplied but `path` is not the same as the path of `file`.
145 """
146 namespace_catalog, path, namespaces, file_obj, driver = popargs(
147 'namespace_catalog', 'path', 'namespaces', 'file', 'driver', kwargs)
--> 149 open_file_obj = cls.__resolve_file_obj(path, file_obj, driver)
150 if file_obj is None: # need to close the file object that we just opened
151 with open_file_obj:
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/hdmf/backends/hdf5/h5tools.py:124, in HDF5IO.__resolve_file_obj(cls, path, file_obj, driver)
122 if driver is not None:
123 file_kwargs.update(driver=driver)
--> 124 file_obj = File(path, 'r', **file_kwargs)
125 return file_obj
File ~/anaconda3/envs/jb/lib/python3.11/site-packages/h5py/_hl/files.py:518, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds)
515 raise ValueError(f'{name}: S3 location must begin with '
516 'either "https://", "http://", or "s3://"')
517 else:
--> 518 raise ValueError(
519 "h5py was built without ROS3 support, can't use ros3 driver")
521 if locking is not None and hdf5_version < (1, 12, 1) and (
522 hdf5_version[:2] != (1, 10) or hdf5_version[2] < 7):
523 raise ValueError("HDF5 version >= 1.12.1 or 1.10.x >= 1.10.7 required for file locking options.")
ValueError: h5py was built without ROS3 support, can't use ros3 driver
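If you hit the error above, your h5py build lacks ROS3 support. One workaround is to stream the file over HTTPS with fsspec instead of the ros3 driver. This is a sketch, not part of this lesson’s official setup: it assumes the fsspec and aiohttp packages are installed.
import fsspec
import h5py
from pynwb import NWBHDF5IO

# Open the remote file over HTTPS, then hand the h5py.File object to PyNWB
fs = fsspec.filesystem('http')
with fs.open(s3_path, 'rb') as f:
    with h5py.File(f, 'r') as file:
        with NWBHDF5IO(file=file, load_namespaces=True) as io:
            nwbfile = io.read()
            print(nwbfile)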
In addition, we can use a fancy widget to create an interactive display of this dataset while we are streaming it. More on this later!
from nwbwidgets import nwb2widget
nwb2widget(nwbfile)
The following section will go over the structure of an NWBFile and how to access data from this new file type.