Machine Learning

Once storm data are extracted to CSV files, Hagelslag can train and run sets of machine learning models that predict the probability of hail and the hail size distribution. Machine learning modeling is performed through the hsforecast program. Like hsdata, hsforecast uses a Python config file with many arguments to set up the models and data sources.

Config Options

The config object for hsforecast should contain the following keys:

ensemble_name:: Name of the ensemble forecast system. Supports “SSEF”, “NCAR”, and others.
ensemble_members:: List of ensemble member names.
num_procs:: Integer number of processors
start_dates:: Dictionary containing datetime objects associated with the start date for “train” and “forecast” modes.
end_dates:: Dictionary containing datetime objects associated with the end date (inclusive) for “train” and “forecast” modes.
start_hour:: First forecast hour extracted for training/evaluation.
end_hour:: Last forecast hour (inclusive).
train_data_path:: Path to directory of csv training data files.
forecast_data_path:: Path to directory of csv forecast data files.
member_files:: Dictionary of paths to csv files containing configuration information about each ensemble member. This information is used to group ensemble members into similar subsets for training.
group_col:: Column in the member csv file used to group ensemble members.
data_format:: Currently only “csv” is supported. Additional file formats supported by pandas could be added if there is interest.
condition_model_names:: List of long names for each machine learning model that predicts the probability of hail occurring.
condition_model_objs:: List of scikit-learn model objects for each probability of hail machine learning model.
condition_input_columns:: List of input variables used for probability of hail machine learning models.
condition_output_column:: Column in data files used as a binary label of whether hail is occurring or not. Should contain 1s and 0s.
condition_threshold:: Threshold on the “condition_output_column” data used to split storms into hail and no-hail events.
size_distribution_model_names:: List of long names for each machine learning model that predicts the hail size distribution parameters.
size_distribution_model_objs:: List of scikit-learn model objects for the size distribution hail models.
size_distribution_input_columns:: List of variable names used as input to the size distribution models.
size_distribution_output_columns:: List of output columns used to fit the size distribution model.
size_distribution_loc:: Specified value for location parameter of gamma distribution.
load_models:: Whether to load machine learning models from disk or use existing model output.
model_path:: Path to directory containing machine learning model pickle files
metadata_columns:: List of columns to be included in prediction output files
data_json_path:: Path to track data json files
forecast_json_path:: Path where track forecast files are output
forecast_csv_path:: Path where forecast csv files are output
netcdf_path:: Path where track data netCDF files are stored
ensemble_variables:: Forecast variables from ensemble system used to generate storm surrogate probabilities.
ensemble_variable_thresholds:: Dictionary where keys are ensemble variables and values are lists of thresholds for the hail forecasts.
ml_grid_method:: Currently only supports “gamma”. Other methods could be added in the future.
neighbor_condition_model:: Specifies which hail condition model is used to generate neighborhood probabilities
neighbor_radius:: List of radii in grid points over which events are aggregated
neighbor_sigma:: List of Gaussian filter standard deviations that are applied to neighborhood probability fields
ensemble_consensus_path:: Path to directory where ensemble consensus netCDF files are stored.
ensemble_data_path:: Path to top level directory of ensemble model output
model_map_file:: Path to map projection file for the ensemble, which should be in hagelslag/mapfiles.
ml_grid_percentiles:: List of percentiles from 1 to 99 or “mean” that are extracted from the sampled machine learning hail sizes.
grib_path:: Path to where machine learning grib2 files are output.
single_step:: Whether raw model output is stored in a single file per hour (True) or all hours are in one file (False).
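
As a rough illustration of how these keys fit together, the sketch below fills in a subset of them. It assumes the config file defines a dictionary named `config`. All paths, dates, member names, column names, dictionary keys for `member_files`, and model settings are placeholders chosen for this example, not values taken from Hagelslag. A real run would also need the remaining keys from the list above.

```python
from datetime import datetime

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical base directory; every path below is a placeholder.
scratch_path = "/path/to/hagelslag_output/"

config = dict(
    ensemble_name="SSEF",
    ensemble_members=["member_a", "member_b"],  # placeholder member names
    num_procs=4,
    # Separate date ranges for the "train" and "forecast" modes (end dates inclusive).
    start_dates={"train": datetime(2015, 5, 1), "forecast": datetime(2016, 5, 1)},
    end_dates={"train": datetime(2015, 8, 31), "forecast": datetime(2016, 5, 31)},
    start_hour=12,
    end_hour=36,
    train_data_path=scratch_path + "track_data_train_csv/",
    forecast_data_path=scratch_path + "track_data_forecast_csv/",
    # Assumption: one member-information file per mode.
    member_files={"train": scratch_path + "member_info_train.csv",
                  "forecast": scratch_path + "member_info_forecast.csv"},
    group_col="Microphysics",  # placeholder column name
    data_format="csv",
    # Probability-of-hail ("condition") models: names and objects are parallel lists.
    condition_model_names=["Random Forest"],
    condition_model_objs=[RandomForestClassifier(n_estimators=100, max_features="sqrt")],
    condition_input_columns=["input_var_1", "input_var_2"],  # placeholder columns
    condition_output_column="Hail_Label",  # placeholder column name
    condition_threshold=0.5,  # placeholder threshold
    # Hail size distribution models.
    size_distribution_model_names=["Random Forest"],
    size_distribution_model_objs=[RandomForestRegressor(n_estimators=100, max_features="sqrt")],
    size_distribution_input_columns=["input_var_1", "input_var_2"],
    size_distribution_output_columns=["Shape", "Scale"],  # placeholder columns
    load_models=False,
    model_path=scratch_path + "models/",
    forecast_csv_path=scratch_path + "forecasts_csv/",
    ml_grid_method="gamma",
    ml_grid_percentiles=["mean", 90],
    neighbor_radius=[14],  # in grid points
    neighbor_sigma=[1],
    single_step=True,
)
```

Keeping the name and object lists in the same order makes it easier to match each saved model to its long name.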

Running hsforecast

hsforecast features four operational modes, detailed below:

-t, --train:: Trains all machine learning models and saves the models to pickle files.
-f, --fore:: Generates forecasts from the machine learning models.
-e, --ens:: Generates ensemble neighborhood probabilities from machine learning and raw ensemble output.
-g, --grid:: Generates gridded machine learning forecasts and writes the grids to GRIB2 files.

When running hsforecast in training mode, none of the other modes should be activated. It is recommended to train all of the machine learning models offline rather than during real-time operations. Make sure every directory specified in the config file exists before running.
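
One way to prepare the directories ahead of time is a small helper like the sketch below. `ensure_output_dirs` and `OUTPUT_DIR_KEYS` are hypothetical names and not part of Hagelslag, and the choice of which config keys point at output directories is an assumption based on the key descriptions above.

```python
from pathlib import Path

# Assumption for this sketch: these config keys name directories that are written to.
OUTPUT_DIR_KEYS = (
    "model_path",
    "forecast_json_path",
    "forecast_csv_path",
    "grib_path",
    "ensemble_consensus_path",
)


def ensure_output_dirs(config, keys=OUTPUT_DIR_KEYS):
    """Create each configured output directory that does not exist yet.

    Returns the list of directories that were newly created.
    """
    created = []
    for key in keys:
        if key not in config:
            continue  # this key is not set in the given config
        path = Path(config[key])
        if not path.is_dir():
            # parents=True also creates any missing intermediate directories.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `ensure_output_dirs(config)` once before training leaves existing directories untouched and only creates the missing ones.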

The --fore, --ens, and --grid options can be run together to produce the machine learning forecasts and the products derived from them. If you only need the machine learning forecasts, --fore and --grid are the only options required.
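
The rules for combining modes can be captured in a small wrapper that assembles the command line. This is only a sketch: `build_hsforecast_command` is a hypothetical helper, and passing the config file path as a positional argument is an assumption that should be checked against `hsforecast --help`.

```python
def build_hsforecast_command(config_path, train=False, fore=False, ens=False, grid=False):
    """Assemble an hsforecast command line for the selected modes."""
    if train and (fore or ens or grid):
        # Training should be run on its own, without the other modes.
        raise ValueError("Training mode should not be combined with other modes.")
    if not (train or fore or ens or grid):
        raise ValueError("Select at least one mode.")
    # Assumption: the config file path is given as a positional argument.
    command = ["hsforecast", config_path]
    mode_flags = ((train, "--train"), (fore, "--fore"), (ens, "--ens"), (grid, "--grid"))
    for enabled, flag in mode_flags:
        if enabled:
            command.append(flag)
    return command
```

The returned list can be passed to `subprocess.run`. For example, `build_hsforecast_command("forecast_config.py", fore=True, grid=True)` covers the case where only the machine learning forecasts are needed (the config file name here is a placeholder).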

Currently, machine learning forecasts are output to CSV files. Older versions of Hagelslag output the forecasts to GeoJSON files, but that process was very time-consuming.

hseval also has the ability to generate coarse neighborhood probabilities for both machine learning and raw ensemble variables.