Obtaining Datasets with DANDI#
DANDI is an open archive for neuroscience datasets, which are called Dandisets. DANDI allows scientists to submit and download neural datasets, promoting research collaboration and consistent, transparent data standards. It also avoids the difficulties that come with housing data on general-purpose platforms (e.g., Dropbox or Google Drive). Usefully for our purposes here, many of the datasets on DANDI are in NWB format. If you’d like to know more about DANDI, check out the DANDI handbook.
There are two primary ways to work with Dandisets:
You can download the datasets, either via the DANDI Web Application or using the DANDI Python client below. If you download via the website, you’ll need to create an account.
You can stream datasets directly from DANDI. We’ll show you how to do this online as well as on your local computer.
Below, we demonstrate how to do both of these. For additional information on either of these methods, please refer to the DANDI documentation.
Option 1: Downloading Dandisets using Python#
The cell below will download this dataset from DANDI. This dataset contains 32-channel extracellular recordings from mouse cortex. We’re using the download function from the dandi package.
Note: Downloading this dataset may take several minutes, depending on your internet connection.
Note #2: This step is only possible after completing the setup steps.
from dandi.download import download as dandi_download
import os

# Set the URL for the Dandiset
url = 'https://dandiarchive.org/dandiset/000006/draft'

# Download the dataset into the current working directory,
# skipping any files that have already been downloaded
dandi_download([url], output_dir=os.getcwd(), existing="skip")
PATH SIZE DONE DONE% CHECKSUM STATUS MESSAGE
000006/dandiset.yaml done updated
000006/sub-anm369962/sub-anm369962_ses-20170309.nwb 796.9 kB 796.9 kB 100% ok done
000006/sub-anm369962/sub-anm369962_ses-20170316.nwb 609.6 kB 609.6 kB 100% ok done
000006/sub-anm369962/sub-anm369962_ses-20170310.nwb 6.6 MB 6.6 MB 100% ok done
000006/sub-anm369962/sub-anm369962_ses-20170314.nwb 7.5 MB 7.5 MB 100% ok done
Once the cell above completes running, you will see a new folder 📁"000006" wherever you’re running this notebook. Usefully, the code above also prints a list of the individual NWB files that have been downloaded into this folder.
Once the data is done downloading, you’re ready for the next step.
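If you’d like to confirm what landed on disk, a short sketch like the one below will list the downloaded NWB files (assuming the 000006 folder was created in your current working directory, as above).
import os

# Walk the downloaded Dandiset folder and print each NWB file
data_folder = os.path.join(os.getcwd(), '000006')

for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith('.nwb'):
            print(os.path.join(root, file))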
Option 2: Streaming the Dandiset#
The folks at NWB have also developed a clever way to stream Dandisets so that small bits of them can be viewed without downloading the entire dataset. This is particularly useful for very large datasets! This step is optional, and may be a better option if you have limited hard drive space and/or are having issues with Option 1 above.
Streaming via the DANDI hub#
The easiest way to stream data is via the DANDI Jupyter Hub (https://hub.dandiarchive.org/). There are setup steps for this in Chapter 1. The code below should work without a hitch.
Streaming locally, after configuring your environment#
With some configuration, you can also stream data on your local computer. First, you need to set up your environment with the right version of the h5py package: one built with support for the ros3 (read-only S3) driver. There are instructions here for how to do that. Once you’re done, you can restart the kernel for this notebook and run the code below.
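Once your environment is configured, you can quickly check whether your h5py build actually includes the ros3 driver. This is just a sanity check; if ros3 is missing from the registered drivers, the streaming code below will fail.
import h5py

# Streaming from S3 requires an h5py build that includes
# the ros3 (read-only S3) driver
print('ros3 available:', 'ros3' in h5py.registered_drivers())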
Code for Data Streaming#
First, we need to figure out the correct URL for the dataset on the Amazon S3 storage system, where DANDI stores its data. There is a tool for this within dandi.dandiapi, which we’ll use below to get the URL for one session from the dataset we downloaded above.
from dandi.dandiapi import DandiAPIClient

dandiset_id = '000006'  # ephys dataset from the Svoboda Lab
filepath = 'sub-anm372795/sub-anm372795_ses-20170718.nwb'  # 450 kB file

with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)
    s3_path = asset.get_content_url(follow_redirects=1, strip_query=True)

print(s3_path)
https://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35
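As an aside, if you don’t already know the path of the file you want, the same client can list everything in a Dandiset. Here’s a minimal sketch that prints the path of each asset in the draft version of this Dandiset:
from dandi.dandiapi import DandiAPIClient

# Print the path of every asset, so you can choose a file to stream
with DandiAPIClient() as client:
    dandiset = client.get_dandiset('000006', 'draft')
    for asset in dandiset.get_assets():
        print(asset.path)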
Now we can read from this path, streaming the data rather than downloading it! The cell below will print some of the metadata about this experiment. It uses another package, PyNWB, which is specifically designed to work with NWB files in Python. As you might expect, this won’t be the last time we see this package. Below, we use the NWBHDF5IO class from this package, which allows us to read NWB files.
Note: The code below will not work unless you’re on the Dandihub or have properly configured your environment. See above.
from pynwb import NWBHDF5IO

with NWBHDF5IO(s3_path, mode='r', load_namespaces=True, driver='ros3') as io:
    nwbfile = io.read()
    print(nwbfile)
    print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
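One payoff of streaming is that indexing into a dataset transfers only the bytes you request, not the whole file. As a sketch reusing the s3_path from above, here’s how you could read just the first ten lick times:
from pynwb import NWBHDF5IO

# Slicing a remote dataset streams only the requested bytes
with NWBHDF5IO(s3_path, mode='r', load_namespaces=True, driver='ros3') as io:
    nwbfile = io.read()
    lick_left = nwbfile.acquisition['lick_times'].time_series['lick_left_times']
    print(lick_left.data[:10])  # first ten entries only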
In addition, we can use a fancy widget to create an interactive display of this dataset while we are streaming it. More on this later!
from nwbwidgets import nwb2widget
nwb2widget(nwbfile)
The following section will go over the structure of an NWBFile and how to access data from this new file type.