Generating data catalogs

There are a few ways to use the catalog builder.

Installation

Recommended approach: Install as a conda package

conda install catalogbuilder -c noaa-gfdl

Alternatively, clone the git repository and create your conda environment from the environment file it provides (environment_intake.yml in the command below).

git clone https://github.com/NOAA-GFDL/CatalogBuilder

conda env create -f environment_intake.yml

Expected output

A JSON catalog specification file and a CSV catalog, written to the specified output directory under the specified name.

Using conda package

1. Install the package using conda:

conda install catalogbuilder -c noaa-gfdl

If you are following these steps at GFDL, you will likely need some additional configuration, described below.

Add these to your ~/.condarc file

whitelist_channels:
  - noaa-gfdl
  - conda-forge
  - anaconda

channels:
  - noaa-gfdl
  - conda-forge
  - anaconda

You can also try:

conda config --add channels noaa-gfdl
conda config --append channels conda-forge

If you encounter the error "ChecksumMismatchError: Conda detected a mismatch between the expected..", do the following:

conda config --add pkgs_dirs /local2/home/conda/pkgs
conda config --add envs_dirs /local2/home/conda/envs

2. Add conda environment’s site packages to PATH

See example below.

setenv PATH ${PATH}:${CONDA_PREFIX}/lib/python3.1/site-packages/scripts/

3. Call the builder

Catalogs are generated by the following command: gen_intake_gfdl.py <INPUT_PATH> <OUTPUT_PATH>

The output path argument should end with the desired output filename WITHOUT a file extension. See the example below.

gen_intake_gfdl.py /archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp $HOME/catalog

This would create a catalog.csv and catalog.json in the user’s home directory.
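The naming rule can be sketched in Python as follows. The helper name expected_outputs is hypothetical and not part of the catalog builder; it only appends the two extensions to an extension-free output path, assuming the builder names its files this way as described above.

```python
import os

def expected_outputs(output_path):
    """Return the CSV and JSON paths implied by an extension-free output path.

    Hypothetical helper for illustration; not part of catalogbuilder.
    """
    # Expand "~" and environment variables such as $HOME before appending.
    base = os.path.expanduser(os.path.expandvars(output_path))
    return base + ".csv", base + ".json"

csv_path, json_path = expected_outputs("/home/user/catalog")
print(csv_path)   # /home/user/catalog.csv
print(json_path)  # /home/user/catalog.json
```

A check such as os.path.exists(csv_path) after running the builder is one way to confirm that both files were written.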

Catalog generation demonstration

See the Arguments/Options section below for the available flags.

Using a configuration file

We recommend using a configuration file to provide input to the catalog builder. This is necessary if you want to work with datasets and directories that do not follow the GFDL post-processing directory layout.

Here is an example configuration file.

Catalog headers (column names) are set with the headerlist variable. The output_path_template variable describes the expected directory structure of the input data.

#Catalog Headers
headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
                 "frequency", "realm", "table_id",
                 "member_id", "grid_label", "variable_id",
                 "time_range", "chunk_freq","platform","dimensions","cell_methods","standard_name","path"]

The headerlist contains the expected column names of your catalog CSV file. These are usually determined by users in conjunction with the ESM collection specification standards and the appropriate workflows.

#Directory structure information
output_path_template: ['NA','NA','source_id','NA','experiment_id','platform','custom_pp','realm','cell_methods','frequency','chunk_freq']

For a directory structure like /archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp, the output_path_template is set as above. Each list position corresponds to one directory level of the path. Where a directory maps to a catalog column, we put the associated header name in that position and make sure it is a valid value in headerlist; where it does not match any expected header (CSV column), we put NA. For example, the third directory in the PP path above is the model (source_id), so the third value in output_path_template is 'source_id'. The fourth directory, am5f3b1r0, does not map to an existing header value, so the fourth value is set to NA.
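The positional mapping can be illustrated with a short Python sketch. The function map_path_to_headers is hypothetical and the builder's own parsing may differ; the directories after pp/ in the sample path (atmos/ts/monthly/5yr) are assumed for illustration only.

```python
def map_path_to_headers(path, output_path_template):
    """Pair each directory in `path` with its template entry, skipping 'NA'.

    Hypothetical helper for illustration; the builder's own parsing may differ.
    """
    # Drop empty strings produced by leading or trailing slashes.
    parts = [p for p in path.split("/") if p]
    return {
        header: part
        for header, part in zip(output_path_template, parts)
        if header != "NA"
    }

output_path_template = ['NA', 'NA', 'source_id', 'NA', 'experiment_id', 'platform',
                        'custom_pp', 'realm', 'cell_methods', 'frequency', 'chunk_freq']

# The directories after pp/ are assumed for illustration.
example_path = ("/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/"
                "gfdl.ncrc5-deploy-prod-openmp/pp/atmos/ts/monthly/5yr")

mapping = map_path_to_headers(example_path, output_path_template)
print(mapping["source_id"])      # am5
print(mapping["experiment_id"])  # c96L65_am5f3b1r0_pdclim1850F
```

The first, second, and fourth directories (archive, am5, am5f3b1r0) are dropped because their template entries are NA.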

#Filename information
output_file_template: ['realm','temporal_subset','variable_id']

#Input directory and output info
input_path: "/archive/am5/am5/am5f7b10r0/c96L65_am5f7b10r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp/"
output_path: "/home/a1r/github/noaa-gfdl/catalogs/c96L65_am5f7b10r0_amip" # Name of the output CSV and JSON files, WITHOUT a file extension. This can be an absolute or a relative path
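As a rough illustration of how output_file_template could line up with a filename, the sketch below labels the dot-separated pieces of a name with the template entries. The helper parse_filename, the '.' separator, and the sample filename are all assumptions made for this example, not the builder's documented behavior.

```python
def parse_filename(filename, output_file_template):
    """Split a filename on '.' and label the pieces with the template entries.

    Hypothetical helper; the '.' separator and the sample name are assumptions.
    """
    # Keep only as many pieces as the template names, which drops the extension.
    pieces = filename.split(".")[:len(output_file_template)]
    return dict(zip(output_file_template, pieces))

output_file_template = ['realm', 'temporal_subset', 'variable_id']
info = parse_filename("atmos.000101-000512.tas.nc", output_file_template)
print(info["realm"], info["temporal_subset"], info["variable_id"])
```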

From a Python script

If you have a Python script or a notebook, you can include the steps to generate a data catalog there as well. See the example here.

Here is another example with a custom configuration:

import sys, os
git_package_dir = '/home/a1r/git/forkCatalogBuilder-/'
sys.path.append(git_package_dir)

import catalogbuilder
from catalogbuilder.scripts import gen_intake_gfdl
######USER input begins########

#User provides the input directory for which a data catalog needs to be generated.

input_path = "/archive/John.Krasting/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/"
#/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp/"

#USER inputs the output path. With the setting below, expect /home/a1r/tests/mycatalog-jpk-def.csv and /home/a1r/tests/mycatalog-jpk-def.json as output.

output_path = "/home/a1r/tests/mycatalog-jpk-def"
#NOTE: If your input_path does not follow the general layout shown above, you will need to pass a custom config

#This is an example call to run catalog builder using a yaml config file.
configyaml = os.path.join(git_package_dir, 'catalogbuilder/scripts/configs/config-example2.yml')
#input_path = "/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp"
#output_path = "sample-mdtf-catalog"

def create_catalog_from_config(input_path=input_path,output_path=output_path,configyaml=configyaml):
    # Build the catalog using the custom YAML config and return the CSV and JSON outputs.
    csv, json = gen_intake_gfdl.create_catalog(input_path=input_path,output_path=output_path,config=configyaml)
    return csv, json

if __name__ == '__main__':
    create_catalog_from_config(input_path,output_path) #,configyaml)

And an example with a default configuration:

import sys, os
git_package_dir = '/home/a1r/git/forkCatalogBuilder-/'
sys.path.append(git_package_dir)

import catalogbuilder
from catalogbuilder.scripts import gen_intake_gfdl
print(gen_intake_gfdl.__file__)

######USER input begins########

#User provides the input directory for which a data catalog needs to be generated.

input_path = "/archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/"
#/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp/"

#USER inputs the output path. With the setting below, expect /home/a1r/tests/static-catalog.csv and /home/a1r/tests/static-catalog.json as output.

output_path = "/home/a1r/tests/static-catalog"
#NOTE: If your input_path does not follow the general layout shown above, you will need to pass a custom config
####END OF user input ##########

#Optional: path to a yaml config file. It is not passed to the builder in this default-configuration example.

configyaml = os.path.join(git_package_dir, 'configs/config-template.yaml')
#input_path = "/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp"
#output_path = "sample-mdtf-catalog"

def create_catalog_from_config(input_path=input_path,output_path=output_path): #,configyaml=configyaml):
    # No config is passed, so the builder's default configuration is used.
    csv, json = gen_intake_gfdl.create_catalog(input_path=input_path,output_path=output_path)#,verbose=True,config=configyaml)
    return csv, json

if __name__ == '__main__':
    csv,json = create_catalog_from_config(input_path,output_path)#,configyaml)

From Jupyter Notebook

Refer to this notebook to see how you can generate catalogs from a Jupyter Notebook.

Screenshot of a notebook showing catalog generation

Using FRE-CLI (GFDL only)

1. Activate conda environment

conda activate /nbhome/fms/conda/envs/fre-cli

2. Call the builder

Catalogs are generated by the following command: fre catalog buildcatalog <INPUT_PATH> <OUTPUT_PATH>

OUTPUT_PATH should end with the desired output filename WITHOUT a file extension. See the example below.

fre catalog buildcatalog --overwrite /archive/path_to_data_dir ~/output

See the Arguments/Options section below for the available flags.

See Fre-CLI Documentation here

Arguments/Options

Input and output paths can be passed directly to the catalog builder on the command line. All methods of catalog generation support this.

The input path must be the first argument and the output path the second.

Ex. gen_intake_gfdl.py /archive/Some.User/input-path ./output_path

  • --config - Generates the catalog with a custom configuration. Requires the path to a YAML configuration file. (Ex. "--config custom_config.yaml")

  • --overwrite - Overwrite an existing catalog at the given output path

  • --append - Append (without headerlist) to an existing catalog at the given output path

  • --slow - Activates slow mode, which retrieves standard_name (or long_name) where possible. "Standard_name" must be in your output_path_template

  • --i - Optional method for passing input path

  • --o - Optional method for passing output path
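When driving the command-line tool from Python, the argument order and options above could be assembled as in the sketch below. The helper build_command is hypothetical; it assumes the flags use the usual double-hyphen spelling (--config, --overwrite, --append) and only builds the argument list without running anything.

```python
import shlex

def build_command(input_path, output_path, config=None, overwrite=False, append=False):
    """Assemble a gen_intake_gfdl.py argument list from the options listed above.

    Hypothetical helper for illustration; it builds the list but does not run it.
    """
    # Input path is the first positional argument, output path the second.
    cmd = ["gen_intake_gfdl.py", input_path, output_path]
    if config is not None:
        cmd += ["--config", config]
    if overwrite:
        cmd.append("--overwrite")
    if append:
        cmd.append("--append")
    return cmd

cmd = build_command("/archive/Some.User/input-path", "./output_path",
                    config="custom_config.yaml", overwrite=True)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) in an environment where gen_intake_gfdl.py is on PATH.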