In working with the DYAMOND output, a combination of shell/Python scripting and the CDOs was found to be effective. The general approach is illustrated in the code snippet below.
import glob
from subprocess import call
from concurrent.futures import ProcessPoolExecutor, as_completed, wait

nprocs = 12   # specify number of processors to use (depends on node)
pool = ProcessPoolExecutor(nprocs)
tasks = []    # list for references to submitted jobs

vname = 'ATHB_T'
for file in glob.glob('*.nc'):
    arg = 'cdo -P 2 selname,%s %s %s' % (vname, file, file[:-3] + vname + '.nc')
    task = pool.submit(call, arg, shell=True)
    tasks.append(task)

for task in as_completed(tasks):   # helps see as tasks are completed
    print(task.result())

wait(tasks)   # useful if further tasks depend on completion of all tasks above
The above logic is applied to a real, and more complex, processing task in the script below, which was executed on the post-processing nodes of Mistral and was used to extract, re-map, and re-compose files of single variables. A special step was required for some output variables whose means had been inadvertently accumulated over time and had to be de-accumulated to be useful.
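That de-accumulation step amounts to differencing consecutive time steps of the accumulated field (if the values are running means rather than running sums, each time step would first have to be weighted by its accumulation interval). A minimal sketch of the simple differencing case, using xarray and placeholder file and variable names rather than the actual DYAMOND output, might look as follows.

import xarray as xr

# hypothetical input: a file whose variable has been accumulated over time
ds = xr.open_dataset('accumulated.nc')   # placeholder file name
acc = ds['ACC_VAR']                      # placeholder variable name

# difference consecutive time steps to recover per-interval values;
# the first time step has no predecessor and is dropped by diff()
deacc = acc.diff(dim='time')

deacc.to_netcdf('deaccumulated.nc')      # placeholder output name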
In this example a function show_timestamp is defined and executed in parallel over a list of files. Here resources, in the form of cores and processes, are explicitly allocated. To use this approach for more general tasks one would need to define the appropriate function that is executed concurrently, i.e., redefine show_timestamp. In addition, the user-specific paths and project accounts would have to be set appropriately for the system you are working on.
import os
import glob
import subprocess as sp

# for distributed computing
from distributed import Client
from dask_jobqueue import SLURMCluster

files = sorted(glob.glob("/work/bm0834/k203095/ICON_LEM_DE/20130502-default-readccn_v2/DATA/2*.nc"))

def show_timestamp(filename):
    return sp.check_output(['cdo', '-s', 'showtimestamp', filename])

slurm_options = {
    'project': 'bm0834',
    'cores': 8,          # no. of threads
    'processes': 4,
    'walltime': '01:00:00',
    'queue': 'compute2,compute',
    'memory': '64GB',
    'python': '/work/bm0834/k202101/anaconda3/bin/python',
    'interface': 'ib0',
    'local_directory': '/home/dkrz/k202101/dask-worker-space',
}

cluster = SLURMCluster(**slurm_options)
cluster.start_workers(4)   # newer dask_jobqueue versions use cluster.scale() instead
client = Client(cluster)

# submit one task per file, then gather the results once they are done
# (in a notebook, %%time can be used to time the gather step)
futures = [client.submit(show_timestamp, f) for f in files]
res = client.gather(futures)

cluster.close()
client.close()
A poor-person's parallelization can also be accomplished by looping over cdo commands in a shell script and putting them in the background. This works best when subsequent commands do not depend on the completion of the backgrounded jobs, and when one knows how many processors are available on the node and matches this to the length of the loop. In reality this stub has been introduced in the hope that someone picks it up and contributes a parallel cdo implementation in bash, analogous to the Python example above.
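Until such a bash version is contributed, the pattern can at least be sketched in Python: each cdo command is launched without waiting for it to finish (the analogue of a trailing & in the shell), and all commands are waited for at the end (the analogue of wait). The variable name is reused from the first example; the file pattern is a placeholder.

import glob
from subprocess import Popen

vname = 'ATHB_T'
procs = []
for file in glob.glob('*.nc'):           # match the loop length to the available processors
    cmd = 'cdo selname,%s %s %s' % (vname, file, file[:-3] + vname + '.nc')
    procs.append(Popen(cmd, shell=True)) # returns immediately, like a trailing & in the shell

for proc in procs:                       # block until all backgrounded commands have finished, like wait
    proc.wait()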
For many tasks that are not too computationally intensive and are not performed over a grid (in which case preprocessing with the CDOs may be more effective), parallel post-processing can be performed efficiently using Dask and xarray.
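A minimal sketch of this approach is given below; the file pattern and chunk sizes are placeholders, the variable name is reused from the first example, and on Mistral one would typically attach the SLURMCluster from the previous example instead of a local client.

import xarray as xr
from dask.distributed import Client

client = Client()   # local cluster; on Mistral one could attach the SLURMCluster shown above

# open many files lazily as one dataset backed by dask arrays
ds = xr.open_mfdataset('/path/to/data/2*.nc',   # placeholder path
                       combine='by_coords',
                       parallel=True,
                       chunks={'time': 10})     # placeholder chunking

# the computation is only triggered by .compute() and runs in parallel on the dask workers
tmean = ds['ATHB_T'].mean(dim='time').compute() # variable name reused from the first example

client.close()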
Pavan Siligam of DKRZ has been developing use cases for this approach as part of the HD(CP)2 project. These are maintained here:
Guido Cioni provided an in-depth tutorial here, which may also be applicable to tasks usually reserved for the CDOs.