Supervised Classification with Hugging Face Transformers in Google Cloud Platform

February 10th, 2020

Hugging Face 🤗

This is a tutorial on how to take existing transformers infrastructure and run it in Google Cloud Platform. I made a GitHub repo that contains all the code in this post. If you are interested in running Hugging Face transformers on Google Cloud Platform to fine-tune supervised text classification models, keep reading.

Google Cloud Platform AI jobs

I work on a laptop, so while I can do local dev work to write Hugging Face code, I need cloud GPU resources to fine-tune models on any reasonable amount of training data. I am most comfortable with GCP for cloud resources.

Google offers a product that was once called “ML Engine” and is now called “AI Platform Jobs”. Their documentation goes in depth on how to get a Google Cloud Project set up and enable the required resources to use this service.

This entire process would have been a lot easier if AI jobs played as nicely with other ML libraries, like PyTorch, as it does with TensorFlow. TensorFlow has a Google Cloud Storage-specific IO wrapper, gfile, that allows for more efficient access to files and data stored in GCS. As you’ll see later on, I had to implement a pretty messy solution to get data from GCS available to this PyTorch-based AI job. A potential future enhancement is to just use TensorFlow for the data access component.
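For reference, here is roughly what that enhancement could look like. This is just a sketch, assuming TensorFlow 2.x and a placeholder bucket path, not code from the repo:

import tensorflow as tf

# tf.io.gfile treats gs:// paths like local files, so no gsutil copy step is needed.
# The bucket path below is a placeholder.
with tf.io.gfile.GFile("gs://my-bucket/data/train.tsv", "r") as f:
    lines = f.read().splitlines()
print(f"read {len(lines)} rows straight from GCS")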

transformers-ai-platform

The first step is to get set up with some training/testing data for the supervised task. This tutorial assumes you have labeled training data in CSV format with one column containing the text and another containing the label.
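For a concrete (made-up) example, a minimal input CSV could look like the one generated below; the column names are arbitrary placeholders:

import pandas as pd

# Toy example of the expected input: one column with the text, one with the label.
df = pd.DataFrame(
    {
        "message": ["the service was great", "this product broke in a week"],
        "sentiment": [1, 0],
    }
)
df.to_csv("example_training_data.csv", index=False)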

Data formatting

The repo has a library for transforming your raw CSV into the TSV format expected by the Hugging Face library. After installation, the bertutils call is simply:

format-csvs-bert --data_dir <DIRECTORY_WITH_TRAINING_DATA> \
    --text_col <COLUMN_NAME_WITH_TEXT> \
    --y_col <COLUMN_NAME_WITH_LABELS> \
    --split <PERCENTAGE_SPLIT_OR_NUM_TRAINING_ROWS>

The data will get dumped into two files, train.tsv and dev.tsv, in the input data_dir. The Hugging Face transformers can classify pairs of text sequences, but this tutorial focuses on a single sequence: a message, note, document, tweet, etc. To change this in the actual classification model, the text_b argument just needs to be tweaked in the _create_examples function (see the sketch below). That would most likely interfere with the changes for dealing with text longer than 512 tokens, so I’ve left it alone for now. More on the 512-token limit in the repo’s README. Future work would be to experiment with a Transformer-XL model for classifying longer sequences.
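For orientation, a run_glue-style _create_examples looks roughly like the sketch below. This is not the repo’s exact code, the column indices are illustrative, and it assumes the InputExample class exported by the transformers library; passing the second text column as text_b instead of None is what enables sequence-pair classification:

from transformers import InputExample

def _create_examples(lines, set_type):
    # Sketch of a run_glue-style processor method.
    examples = []
    for i, line in enumerate(lines):
        guid = "%s-%s" % (set_type, i)
        # text_b stays None for single-sequence classification; for sequence
        # pairs, set it to the second text column instead.
        examples.append(
            InputExample(guid=guid, text_a=line[0], text_b=None, label=line[1])
        )
    return examples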

Data access

Upload the newly formatted training and development data to Google Cloud Storage so that the job can access it. The code actually copies the contents of the bucket to the WORKDIR on the container by doing something like:

# If the data directory is a GCS path, copy its contents into the
# container's working directory before training starts.
if args.data_dir[0:5] == "gs://":
    if not args.data_dir[-1] == "/":
        raise ValueError("If using a bucket, dir should end in a slash")
    subprocess.check_call(
        [
            'gsutil',
            '-m',   # parallel copy
            'cp',
            '-r',
            args.data_dir + '*',
            '.'
        ]
    )

This implementation may have file size limitations but I haven’t run into them yet. One thing to note is that failures in this subprocess call result in somewhat misleading errors, which I highlight in the “Gotchas” section below.
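The output direction works the same way; here is a minimal sketch of pushing a local output directory back up to a bucket (both paths are placeholders, not the repo’s actual arguments):

import subprocess

# Mirror image of the download step: copy the local output directory
# back up to the output bucket once training finishes.
subprocess.check_call(
    ["gsutil", "-m", "cp", "-r", "./local_output", "gs://my-output-bucket/run-001/"]
)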

Docker image registration

We need to upload the transformers Docker image to the Container Registry in our Google Cloud Project. There’s a helper script in the repo for registering images. You can call it by doing the following:

./bin/register-image.sh <WHATEVER_YOU_WANT_TO_CALL_THE_IMAGE> <IMAGE_VERSION_TAG> .

The image name isn’t that important but the version is worth paying attention to. After you make updates to the codebase you will need to register the image as a new version. This version is then specified when submitting jobs.

The image itself is pretty heavy. It has a Miniconda version of Python, CUDA, PyTorch, NVIDIA Apex for PyTorch (so that we can use 16-bit mixed precision), and the Google Cloud SDK (in order to use gsutil for downloading data). I’ve found that using the -m flag is the fastest way to download/upload files to storage without using rsync.

Job submission

Read the pricing docs!

Hyperparameter tuning

I will typically start with a hyperparameter tuning job to try to get an idea of what the optimal parameters are for the task. An example config can be found in the repo. These arguments are based on the Hugging Face run_glue.py example args.

Google has documentation on running HP tuning jobs and how to configure the various args. It’s really a balance between the maxTrials configuration and the number of hyperparameter combinations you’d like to survey. A snippet of the hyperparameter config:

trainingInput:
  scaleTier: CUSTOM # Used when specifying a masterType that isn't in Google's standard tier list
  masterType: standard_p100 # Only one server for non-distributed or TPU training
  hyperparameters:
    goal: MAXIMIZE # We want to maximize F1, could be switched to minimize for loss
    hyperparameterMetricTag: f1 # The score we want to maximize
    maxTrials: 50 # How many trials to do in total, stop when reaching this
    maxParallelTrials: 10 # You will have 10x p100 GPUs at once in this example
    enableTrialEarlyStopping: True # Stop early if the trial has a score < other trials
    params:
      - parameterName: num_train_epochs # Example parameter
        type: INTEGER
        minValue: 2
        maxValue: 3
        scaleType: UNIT_LINEAR_SCALE
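AI Platform passes each trial’s sampled values to the container as command-line flags named after parameterName, so the task just needs matching arguments. A minimal sketch (the second flag is only an example of another parameter you might tune):

import argparse

# Each trial receives its sampled hyperparameters as flags,
# e.g. --num_train_epochs=3, matching parameterName in the config.
parser = argparse.ArgumentParser()
parser.add_argument("--num_train_epochs", type=int, default=3)
parser.add_argument("--learning_rate", type=float, default=2e-5)
args, _ = parser.parse_known_args()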

The hyperparameter score is recorded in the model.py file. It uses Google’s hypertune library:

import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag=SCORE_NAME,
    metric_value=SCORE,
)

The different scoring metrics available to tune by are the output of get_eval_report:

from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)


def simple_accuracy(preds, labels):
    # Helper defined alongside get_eval_report in the Hugging Face example code
    return (preds == labels).mean()


def get_eval_report(labels, preds):
    mcc = matthews_corrcoef(labels, preds)
    f1 = f1_score(y_true=labels, y_pred=preds)
    acc = simple_accuracy(preds, labels)
    tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
    prec = precision_score(labels, preds)
    recall = recall_score(labels, preds)
    return {
        "mcc": mcc,
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "f1": f1,
        "acc": acc,
        "acc_and_f1": (acc + f1) / 2,
        "prec": prec,
        "recall": recall,
    }
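As a quick sanity check, the report can be run on toy arrays (the numbers here are made up):

import numpy as np

labels = np.array([0, 1, 1, 0, 1])
preds = np.array([0, 1, 0, 0, 1])
print(get_eval_report(labels, preds))  # f1 of 0.8 for this toy example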

Use the hyperparameter tuning job helper script for actually submitting. Your command will look something like:

./bin/submit-aiplatform-hptune.sh <IMAGE_NAME> \
    <IMAGE_TAG_VERSION> \
    configs/hptune.yaml \
    <GCS_BUCKET_WITH_DATA> \
    <GCS_OUTPUT_BUCKET> \
    bert \
    bert-large-uncased \
    <TASK_NAME>

The third argument is important because it specifies the machine config that the AI Platform job will use. This is where you specify how many GPUs to use and what kind. The official GPU documentation covers which GPU types are available and how many of each are allowed.

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100

I’ve found that 1-2 K80s are more than enough for a dataset with 10-50k training rows. It’s worth noting that when using a hyperparameter tuning job, whatever configuration you use is scaled by the maximum number of simultaneous trials.

The AI platform product makes it really easy to monitor different trials, their performance and their logs. Just visit the AI Platform section in GCP and click your job name.

AI Jobs trial monitoring

Once the hyperparameter tuning job is complete, we’ll have an idea of what parameters to use. Each trial copies over its saved model, so the optimal model should be in GCS somewhere. That being said, I haven’t implemented an easy way to figure out which saved model in storage corresponds to which trial, so I’ll just run a one-off job to completely train the model once again with the optimal parameters, potentially on a larger training set depending on the learning curve.
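One low-tech way to at least see what each trial wrote is to list the output bucket; a rough sketch with the google-cloud-storage client (the bucket name and job prefix are placeholders):

from google.cloud import storage

# List saved model artifacts under the job's output prefix and print
# when each one was written.
client = storage.Client()
for blob in client.list_blobs("my-output-bucket", prefix="my-job-name/"):
    if blob.name.endswith("pytorch_model.bin"):
        print(blob.name, blob.updated)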

One-off

The transformers-ai-platform README has more instructions on how to run the actual job. It explains the various arguments and which are required vs. optional. The README also discusses various methods for dealing with the 512-token sequence limit of current BERT implementations when documents are longer. To run a job locally, your command (after installation) will look something like:

python3 task.py \
    --data_dir DIRECTORY_WITH_TSV_TRAINING_FILES \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name local \
    --output_dir ./local \
    --max_seq_length 512 \
    --do_train

When submitting a job to GCP your command will look something like:

./bin/submitaiplatform.sh \
    <IMAGE_NAME> \
    <IMAGE_VERSION> \
    machines/complex_model_m_gpu.yaml \
    <BUCKET_PATH_WITH_DATA> \
    <BUCKET_PATH_TO_WRITE_OUTPUT> \
    bert \
    bert-large-uncased \
    <TASK_NAME>

Monitoring

Monitoring the job is pretty simple with the Google Cloud Platform AI jobs dashboard.

Hypertuning GPU resource monitoring, one trial per tuning parameter combination

This shows GPU usage for a hyperparameter tuning job where each color is a separate trial. I try to get the GPU usage to nearly 100%. You are billed by hourly usage depending on the GPU, so it’s worthwhile to try to max out usage. After running the job you’ll get a TensorBoard log file copied over to Google Cloud Storage, titled something like: YOUR_JOB_NAME.out.tfevents.SOME_TIME.cmle-training-SOME_TIME.

TensorBoard output, logging steps on the x-axis vs. evaluation F1 score on the y-axis

This kind of display gives you a sense of whether adding more training data would improve performance or whether you have reached an upper bound. The increments on the x-axis can be adjusted by changing the logging_steps argument when submitting a job. To start TensorBoard locally, the command looks something like:

tensorboard --logdir=.

I will usually download the TensorBoard logs to my downloads folder and run that command from there. The logdir arg specifies where the logs are stored. Note: this can also be a path in GCS.

Gotchas

RuntimeError: CUDA out of memory.

The training and evaluation code is not currently distributed across multiple GPUs. Although CUDA allows for specifying a per-GPU batch size and some number of GPUs, it would be much more efficient to use a distributed version of training where resources can be allocated more efficiently. Another option would be to implement a TPU version of the same task, which may only require changes to the deployment portion rather than the task’s Python codebase.
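For context, the simplest multi-GPU pattern PyTorch offers out of the box is DataParallel, which replicates the model across the visible GPUs on one machine; a sketch is below (the model name is just an example), with true DistributedDataParallel training being the more efficient follow-up:

import torch
from transformers import BertForSequenceClassification

# Single-node multi-GPU replication; a stopgap compared to proper
# DistributedDataParallel training.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model.to("cuda" if torch.cuda.is_available() else "cpu")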

CommandException: 1 file/object could not be transferred.

subprocess.CalledProcessError: Command '['gsutil', '-m', 'cp', '-r', SOME_BUCKET_PATH ]'

The subprocess call that copies data to/from Google Cloud Storage failed, probably because it couldn’t find the storage location. Triple-check the bucket path you specified.
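A cheap way to fail with a clearer message is to check the path before kicking off the copy; a sketch, assuming gsutil is on the PATH and using a placeholder path:

import subprocess

# Fail fast with a readable error if the bucket path can't be listed,
# rather than letting the later gsutil cp call fail mid-job.
data_dir = "gs://my-bucket/data/"  # placeholder
result = subprocess.run(["gsutil", "ls", data_dir], capture_output=True, text=True)
if result.returncode != 0:
    raise ValueError("Cannot list %s: %s" % (data_dir, result.stderr.strip()))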

Conclusion

If you try this out and fix an issue you run into, feel free to open a merge request. If you encounter a problem and want me to fix it, open an issue in the repo. Lastly, potential future work includes using TensorFlow for the data access component, distributing training across GPUs (or a TPU version of the task), and experimenting with Transformer-XL for longer sequences.

Resources