Paperspace How-To


  • Paperspace is a low-cost alternative for running ML algorithms on GPUs in the cloud
  • Being a newer cloud computing platform, some of the documentation hasn’t caught up with new functionality
  • This post helps you to use Paperspace effectively

Introduction

I have a very fast computer, but unfortunately it has no NVIDIA GPU, so it is CPU-fast but slow for the purposes of Machine Learning algorithms.

I thought about building my own computer, which is possible for about $1000 and up. With even a medium compute utilization (percentage of time spent computing per day), this can be cost effective. [See Choosing Your Deep Learning Infrastructure: The Cloud vs. On-Prem Debate.] However, because I didn’t want to manage more hardware, put down the up-front costs, or commit before knowing how fast a machine I will ultimately want, I decided for now to run my ML experiments in the cloud.

In investigating the available cloud ML compute providers, I found a rough consensus (as of this writing) that people use the following:

  • Traditional Cloud Providers
    • Google Cloud Platform
    • Amazon Web Services
  • ML-Specific Cloud Providers
    • Paperspace
    • FloydHub

The traditional Cloud Providers are generally higher cost, and FloydHub seems to have a lot of collaborative and IDE features I didn’t want. So, since what I really wanted was just raw compute power that I could submit code to, I chose to pursue running my experiments on Paperspace.

This post describes how to get work done on Paperspace if you want to use it as a compute farm for your ML Python code.

Using Paperspace

The ML-specific portion of the Paperspace cloud offerings is called Gradient. For the rest of this post I’ll be describing how to make it work.

Choose a Machine Type

The Gradient machine offerings are split between East Coast and West Coast. Both sets of machines mount persistent storage at the same path, /storage, but the two regions use physically separate storage: files put in East Coast /storage will not show up in West Coast /storage, and vice-versa.

This Paperspace doc on Persistent Storage describes it in more detail. Note that the specific machine types change frequently, and the docs’ descriptions of which machine types belong to a particular region often miss newer machine types.

Choose a Docker image (or create one)

One of the key things you’ll need for a compute job is a Docker image containing most or all of the resources the job requires. You can install things as part of your job script, but it’s much more efficient to have a prebuilt Docker image with all of your requirements. (It’s also faster; you’re paying for time!)

Docker’s central repository of images is Docker Hub, which the Paperspace command syntax makes easy to use. To specify a Docker image from Docker Hub, you may omit the registry URL and just use a label of the form <namespace>/<image_name>:<tag>.
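For example, the official TensorFlow GPU image for Python 3 (the image used later in this post) is specified as:

tensorflow/tensorflow:latest-gpu-py3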

All of the Gradient machines run Ubuntu 16.04, so any Docker image you use should be based on Ubuntu 16.04. Not so coincidentally, all TensorFlow Docker containers are currently based on Ubuntu 16.04, so if you’re using TensorFlow, starting with a TensorFlow Docker image is an easy choice.

Docker’s documentation on creating your own Docker image is very useful. Paperspace also has a good tutorial on creating images: Creating & Using Custom Containers with Notebooks. The basic idea is to create a Dockerfile that starts with a FROM line naming the existing Docker image closest to what you want, followed by instructions for everything additional you wish to install in your image. Once you have an image that contains everything your code needs, upload it to Docker Hub, and you’ll be able to use it for any Paperspace job.
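Here is a minimal sketch of such a Dockerfile, starting from the TensorFlow GPU image; the extra packages are hypothetical placeholders for whatever your code actually needs:

# Start from the existing image closest to what you want
FROM tensorflow/tensorflow:latest-gpu-py3

# Add any system utilities your job needs (hypothetical examples)
RUN apt-get update && apt-get install -y --no-install-recommends unzip && \
    rm -rf /var/lib/apt/lists/*

# Add any extra Python packages (hypothetical examples)
RUN pip install --no-cache-dir pandas scikit-learn

Then build the image and push it to Docker Hub, substituting your own Docker Hub namespace and image name:

docker build -t <namespace>/my-ml-image:latest .
docker push <namespace>/my-ml-image:latest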

Use the Paperspace CLI for access

There are two flavors of command-line (CLI) utilities provided by Paperspace to control and monitor jobs:

  • paperspace-python
  • the Paperspace CLI (paperspace-node)

paperspace-python sounds as though it should be conveniently specific to Python scripts. But in reality it is the less mature tool (as of March 26, 2019), and you should really use the Paperspace CLI, which offers much more complete and robust operation. The paperspace-node implementation seems to be the reference for Paperspace CLI access.

Installing the Paperspace CLI

You can install the Paperspace CLI in a few different ways. The easiest ways are installing the paperspace-node package globally with npm, or downloading a pre-built standalone binary from the paperspace-node GitHub releases page.
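The npm route looks like this (a sketch; it assumes you have Node.js and npm installed):

# Install the Paperspace CLI globally via npm
npm install -g paperspace-node

# Confirm the CLI is on your PATH
paperspace --version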

Prepare your login to access Paperspace through the CLI

See Getting Started in Paperspace’s docs to prepare your login. This involves fetching your API key, which allows you to access Paperspace through the CLI. You only need to log in through the CLI once: logging in stores your API key in ~/.paperspace/config.json, where it stays until you explicitly log out through the CLI.
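In practice, the one-time login is just the following (the CLI prompts for your Paperspace credentials):

# Log in once; the API key is cached in ~/.paperspace/config.json
paperspace login

# If you later want to remove the cached key:
paperspace logout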

paperspace-node implements what Paperspace simply calls the “Paperspace API”, and the commands of the paperspace-node CLI map directly onto the Paperspace API. Here is the full documentation for the Paperspace API.

Get your data onto the cloud persistent storage

Before you run your job, you’ll need to put any needed data into Paperspace’s persistent storage in the same storage region as your machine type (either East Coast or West Coast). The easiest way to upload arbitrary files is through a Jupyter notebook, which is easy to spin up from Paperspace’s web console: go to “Notebooks” under “Gradient” in the sidebar, click “Create Notebook”, pick basically any public container, and select the cheapest machine type in the proper storage region. You can get a current listing of machine types, including which storage region each belongs to, from your local computer using the CLI:

paperspace jobs machineTypes

Once you start your Jupyter notebook and open it, simply click on the “Upload” button, which will open a File Open dialog where you can specify a local file to upload.

Jupyter Upload button

Another handy tip: if you need shell access for housekeeping like unzipping zip files, Jupyter Notebook also has a Terminal that you can open and use in your web browser. Open it from the “New” pull-down menu.

Jupyter New pull-down

Note: you may need to install any utilities you need to use in the Terminal with apt-get install.
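For example, to unzip a dataset into persistent storage (a sketch; the notebook container typically runs as root, so no sudo is needed, and the paths are hypothetical):

# Install unzip, then extract a dataset into /storage
apt-get update && apt-get install -y unzip
unzip /storage/mydata.zip -d /storage/mydata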

Jupyter Terminal

So simple!

Create a script to run your code

To run a python script with the Paperspace CLI, first create a bash script that will run on the remote machine. The last line should invoke python and your script, e.g.

python3 myscript.py my-arg --my-switch1

Before this line you can put any shell commands you need to execute to prepare your environment for the python script. (e.g. apt-get install ... or pip install ...)

I usually call my script cmd_paperspace.sh.
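Putting the pieces together, here is a sketch of the whole script; the pip line and the script arguments are placeholders for your own setup:

#!/bin/bash
# cmd_paperspace.sh: runs on the remote Paperspace machine

# Prepare the environment (hypothetical example)
pip install -r requirements.txt

# Run the experiment
python3 myscript.py my-arg --my-switch1

Make sure the script is executable (chmod +x cmd_paperspace.sh) before submitting, since the job will invoke it directly as ./cmd_paperspace.sh.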

Assemble your CLI command to launch your compute job

Now it’s time to compose a Paperspace CLI command to launch your job on the Paperspace cloud. I like to create a shell script to do this. That way I can make sure that all of the options are always set properly.

You can call your script something like do_paperspace, and write a script like the following:

#!/bin/bash
  
# send job to Paperspace, using Paperspace CLI
paperspace jobs create \
    --container  tensorflow/tensorflow:latest-gpu-py3 \
    --machineType K80 \
    --command ./cmd_paperspace.sh \
    --project my_experiments \
    --ignoreFiles "venv"

The above command:

  • Uses the Docker image tensorflow/tensorflow:latest-gpu-py3 from the Docker Hub
  • Sends the job to a “K80” machine
  • Runs the command cmd_paperspace.sh on the remote machine. cmd_paperspace.sh is the bash script we wrote in the previous section.
  • Puts this job under the project “my_experiments”; projects can be set and viewed on the Paperspace web console
  • Ignores the directory venv, where I have my virtual environment for my python script setup locally.

paperspace jobs create will also, by default, zip up your current local directory, send the zip file to the remote host, and use the unzipped directory as the run directory of your remote job. (This is also how it is able to find and run cmd_paperspace.sh.) Thus, if you have very large files in your current directory or a subdirectory, make sure to put those files or directories in the comma-separated --ignoreFiles list so they are not sent; this method of zipping up the workspace will fail if the zip file gets too large.
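For example, to exclude several directories at once (the names besides venv are hypothetical):

--ignoreFiles "venv,data,checkpoints"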

When your do_paperspace script is written, run it on your local machine. It should start your job running on a Paperspace machine if all goes well.

Supervise / Monitor your job

Monitor your Gradient compute jobs and resources from the Paperspace web console or via the CLI. For a list of all jobs, current and past, view the Jobs section of the web console.
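From the CLI, two handy commands are jobs list and jobs logs (a sketch; <jobId> is the ID printed by paperspace jobs create or shown in the web console):

# List your jobs, current and past
paperspace jobs list

# Stream the logs of a running job
paperspace jobs logs --jobId <jobId> --tail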

Advanced notes: Accessing your job via the internet

If your job has network ports that you might want to reach while it is running, you can expose them to the internet.

The public DNS address for your Paperspace job is <jobId>.gradient.paperspace.com .

When submitting your job, add a --port option to your paperspace jobs create command to expose a port to the outside world. The format is --port <external port>:<internal port>, where <external port> is the port number as seen from the internet and <internal port> is the port number as seen by the job. If your code listens on a port, make that the <internal port> and select an <external port> greater than 1023 for access from the outside world. See Public IPs and Port Forwarding.
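For example, here is a sketch that exposes TensorBoard’s default internal port 6006 as external port 8888 (an arbitrary choice above 1023):

# Same job as before, with internal port 6006 exposed as external port 8888
paperspace jobs create \
    --container tensorflow/tensorflow:latest-gpu-py3 \
    --machineType K80 \
    --command ./cmd_paperspace.sh \
    --project my_experiments \
    --ignoreFiles "venv" \
    --port 8888:6006

# Then browse to http://<jobId>.gradient.paperspace.com:8888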
