intake: take the pain out of data access on mistral

If so, then consider intake-xarray and intake-esm.

The idea behind intake

Defining and loading data-sets costs time and effort. The data scientist needs to know what data are available, and the characteristics of each data-set, before going to the effort of loading and beginning to analyze some specific data-set. Furthermore, they might need to learn the API of some Python package specific to the target format. The code to do such data loading often makes up the first block of every notebook or script, propagated by copy&paste.

Intake has been designed as a simple layer over other Python libraries to:

Source and further reading: https://www.anaconda.com/intake-taking-the-pain-out-of-data-access/

intake-xarray

intake-xarray combines intake with xarray. It gives you easy access to data stored in various locations and under various filenames, once you or someone else has predefined them in a YAML file.

Example: How to load different observations from ICDC hassle-free into xarray?

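import intake
# Open the YAML catalog that predefines the ICDC observation sources
cat = intake.open_catalog("/home/mpim/m300524/pymistral/intake/obs.yml")
# Load one source lazily as a dask-backed xarray object
ds = cat['HadCRUT3'].to_dask()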

Clone https://gitlab.dkrz.de/m300524/pymistral and install the conda environment pymistral to try out the notebooks yourself.

intake-esm

intake-esm combines intake-xarray with pandas to make Earth-System-Model output easily accessible. A builder creates a collection, which is a pandas.DataFrame, from a catalog, which is a JSON file. Luckily, a few collections are already available for mistral. These collections can be searched with queries, and the matching ESM output is loaded directly via dask into xarray. intake-esm is developed at NCAR.

Example: How to load CMIP6 hassle-free into xarray?

The same approach works for other common experiment collections: choose from CMIP5, CMIP6, MiKlip or MPI GE; see /work/ik1017/Catalogs.

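import intake
# Open the CMIP6 collection available on mistral
col_url = "/work/ik1017/Catalogs/mistral-cmip6.json"
col = intake.open_esm_datastore(col_url)
# Each keyword names a column of the collection; a list matches any of its values
query = dict(experiment_id='esm-piControl', table_id='Omon',
             variable_id='fgco2', grid_label=['gn', 'gr'])
cat = col.search(**query)
# Load all matches lazily, chunked into blocks of 50 years of monthly data
dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 12*50}})
# The result is a dict of xarray datasets; pick one by its key
ds = dset_dict['CMIP.CCCma.CanESM5.esm-piControl.Omon.gn']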
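
Conceptually, a search like the one above filters the rows of the collection table, and each resulting dataset key joins a few columns with dots. The following is a minimal standard-library sketch of that idea, not intake-esm's actual implementation. The helper names search and dataset_key are hypothetical, the key columns are inferred from the key used above, and all rows except the first are made up for illustration.

```python
# Illustrative rows only: the first mirrors the dataset key used above,
# the other two are made up so that the filter has something to reject.
rows = [
    {"activity_id": "CMIP", "institution_id": "CCCma", "source_id": "CanESM5",
     "experiment_id": "esm-piControl", "table_id": "Omon",
     "variable_id": "fgco2", "grid_label": "gn"},
    {"activity_id": "CMIP", "institution_id": "CCCma", "source_id": "CanESM5",
     "experiment_id": "historical", "table_id": "Omon",
     "variable_id": "fgco2", "grid_label": "gn"},
    {"activity_id": "CMIP", "institution_id": "CCCma", "source_id": "CanESM5",
     "experiment_id": "esm-piControl", "table_id": "Amon",
     "variable_id": "tas", "grid_label": "gn"},
]

# Assumed key layout, inferred from 'CMIP.CCCma.CanESM5.esm-piControl.Omon.gn'
KEY_COLUMNS = ["activity_id", "institution_id", "source_id",
               "experiment_id", "table_id", "grid_label"]


def search(rows, **query):
    """Keep rows whose value in every queried column equals the query value,
    or is one of the query values when a list is given."""
    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True
    return [row for row in rows if matches(row)]


def dataset_key(row):
    """Join the key columns with dots, like the keys of dset_dict."""
    return ".".join(row[column] for column in KEY_COLUMNS)


query = dict(experiment_id='esm-piControl', table_id='Omon',
             variable_id='fgco2', grid_label=['gn', 'gr'])
hits = search(rows, **query)
keys = sorted({dataset_key(row) for row in hits})
print(keys)
```

Passing a list, as for grid_label, accepts any of the listed values, which is why the query can cover both native ('gn') and regridded ('gr') output in one call.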

What next?

Do you like the capabilities of intake? Consider writing your own YAML files and sharing them with your peers.
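
As a starting point, here is a minimal sketch of what such a YAML file might contain, written out from Python. It assumes the classic intake catalog layout (a top-level sources mapping) and the netcdf driver provided by intake-xarray. The source name my_obs, the description and the urlpath are placeholders, not real files.

```python
import tempfile
from pathlib import Path

# Hypothetical catalog with one entry; name, description and urlpath are placeholders.
CATALOG_YAML = """\
sources:
  my_obs:
    description: Example observational dataset (placeholder)
    driver: netcdf
    args:
      urlpath: /path/to/my_obs.nc
"""


def write_catalog(directory):
    """Write the catalog text to <directory>/my_catalog.yml and return the path."""
    path = Path(directory) / "my_catalog.yml"
    path.write_text(CATALOG_YAML)
    return path


with tempfile.TemporaryDirectory() as tmp:
    catalog_path = write_catalog(tmp)
    catalog_text = catalog_path.read_text()
    print(catalog_text)
    # With intake and intake-xarray installed, the file could then be opened
    # in the same way as obs.yml in the intake-xarray example:
    #   cat = intake.open_catalog(str(catalog_path))
    #   ds = cat['my_obs'].to_dask()  # requires a real NetCDF file at urlpath
```

Keeping one entry per dataset, each with a short description, makes a shared catalog easier for peers to browse.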

A collection of ideas on how to use intake-esm for your own experiments: