Problem: Automatic EMNIST download failed
Earlier today, I wanted to reproduce the results of a machine learning paper that uses the EMNIST digits data set to train a PyTorch model. Normally, PyTorch makes loading and even downloading data sets extremely easy. The torchvision.datasets module provides a handful of commonly used data sets with a user-friendly API. Most importantly for us right now, the data set loaders come with the convenient download=True argument to download a data set automatically:
import torchvision

train_data = torchvision.datasets.EMNIST(
    root="./",
    split="digits",
    train=True,
    download=True,
)
Unfortunately, that throws a RuntimeError:
RuntimeError: File not found or corrupted.
Next, I tried to download the data directly from a URL via torchvision.datasets.utils.download_url(...). I found a handful of EMNIST URLs on the internet, but each of them gave me either the same old File not found or corrupted error or an SSL error.
Fix: Manual download and directory adjustments
Here’s a brief list of steps for downloading the EMNIST data manually and then preparing the directory for torchvision.datasets.EMNIST(..., download=False).
Step 1: Download the files
Go to the official EMNIST website (Link) and head to the section "Binary format as the original MNIST dataset". Alternatively, here’s the link: EMNIST Direct Download Link
The archive, which goes by the great name gzip.zip, is approximately 500 MB in size.
Step 2: Unpack the gzip.zip archive
Head to your project’s data directory (or your global data directory if you have one) and unpack the previously downloaded gzip.zip archive there. You will get a folder gzip/ that contains a whole lot of *.gz files:
.
└── gzip
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte.gz
    ├── emnist-balanced-test-labels-idx1-ubyte.gz
    ├── emnist-balanced-train-images-idx3-ubyte.gz
    ├── emnist-balanced-train-labels-idx1-ubyte.gz
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte.gz
    ├── emnist-digits-test-labels-idx1-ubyte.gz
    ├── emnist-digits-train-images-idx3-ubyte.gz
    ├── emnist-digits-train-labels-idx1-ubyte.gz
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte.gz
    ├── emnist-mnist-test-labels-idx1-ubyte.gz
    ├── emnist-mnist-train-images-idx3-ubyte.gz
    └── emnist-mnist-train-labels-idx1-ubyte.gz
You’ll notice a structure: There are different splits, encoded in the filenames as emnist-<split>-... . This <split> corresponds to the split=... argument in torchvision.datasets.EMNIST. For this project, I only needed the digits split, so I deleted the files of all the other splits.
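If you prefer to script this step, here is a minimal sketch using Python's standard zipfile module. Note that it is a small variation on what I did by hand: instead of unpacking everything and deleting the unwanted splits afterwards, it extracts only the files of one split. The function name extract_split is made up for this example, and the sketch assumes that the members inside gzip.zip are named like gzip/emnist-<split>-..., as in the tree above.

```python
import zipfile
from pathlib import Path


def extract_split(archive_path, data_dir, split="digits"):
    """Extract only the files of one EMNIST split from the archive.

    Hypothetical helper: assumes member names such as
    "gzip/emnist-digits-mapping.txt" inside the archive.
    Returns the sorted list of extracted paths.
    """
    data_dir = Path(data_dir)
    prefix = f"emnist-{split}-"
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            # Compare against the bare filename, ignoring the "gzip/" folder.
            if Path(member).name.startswith(prefix):
                archive.extract(member, data_dir)
                extracted.append(data_dir / member)
    return sorted(extracted)
```

Calling extract_split("gzip.zip", "./", split="digits") from your data directory should leave you with a gzip/ folder that contains only the emnist-digits-* files.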
Step 3: Unpack the individual .gz files
Unpack all the *.gz files that you need. On macOS, the built-in archive tools can handle .gz files; on other systems, YMMV. Delete the *.gz files after you’re done unpacking. You should have the following structure now:
.
└── gzip
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte
    ├── emnist-balanced-test-labels-idx1-ubyte
    ├── emnist-balanced-train-images-idx3-ubyte
    ├── emnist-balanced-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte
    ├── emnist-digits-test-labels-idx1-ubyte
    ├── emnist-digits-train-images-idx3-ubyte
    ├── emnist-digits-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte
    ├── emnist-mnist-test-labels-idx1-ubyte
    ├── emnist-mnist-train-images-idx3-ubyte
    └── emnist-mnist-train-labels-idx1-ubyte
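If your system has no convenient tool for .gz files, the unpacking in Step 3 can also be sketched with Python's standard gzip and shutil modules. The helper name gunzip_all is my own invention for this example; it decompresses every *.gz file in a folder next to the original and, by default, deletes the archive afterwards.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(folder, delete_archives=True):
    """Decompress every *.gz file in `folder`, writing each result
    next to its archive. Returns the list of unpacked paths."""
    unpacked = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        # "emnist-...-ubyte.gz" becomes "emnist-...-ubyte".
        target = gz_path.with_suffix("")
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if delete_archives:
            gz_path.unlink()
        unpacked.append(target)
    return unpacked
```

For the layout above, that would be gunzip_all("gzip"). The *-mapping.txt files are not compressed, so the helper leaves them untouched.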
Step 4: Adjust the directory structure for PyTorch
If we try to load the data set into PyTorch with the download=False argument now,
train_data = torchvision.datasets.EMNIST(
    root="./",
    split="digits",
    train=True,
    download=False,
)
we get the following error:
RuntimeError: Dataset not found. You can use download=True to download it
Well, the whole point of downloading everything by hand was to circumvent the problematic download=True call.
As you might expect, we have to make PyTorch find our downloaded EMNIST data. That’s a two-step process: (1) we will make the EMNIST data fit the format that PyTorch expects; and (2) we will point PyTorch to where our EMNIST data lives.
(1) Required directory tree
PyTorch wants the following structure:
DATASET_NAME
└── raw
    ├── ...-mapping.txt
    └── ...-ubyte
To achieve this, we simply rename gzip to raw and wrap the entire raw folder into a parent folder called EMNIST. Now your file tree should look like this:
EMNIST
└── raw
    ├── emnist-balanced-mapping.txt
    ├── emnist-balanced-test-images-idx3-ubyte
    ├── emnist-balanced-test-labels-idx1-ubyte
    ├── emnist-balanced-train-images-idx3-ubyte
    ├── emnist-balanced-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-digits-mapping.txt
    ├── emnist-digits-test-images-idx3-ubyte
    ├── emnist-digits-test-labels-idx1-ubyte
    ├── emnist-digits-train-images-idx3-ubyte
    ├── emnist-digits-train-labels-idx1-ubyte
    ├── ...
    ├── emnist-mnist-mapping.txt
    ├── emnist-mnist-test-images-idx3-ubyte
    ├── emnist-mnist-test-labels-idx1-ubyte
    ├── emnist-mnist-train-images-idx3-ubyte
    └── emnist-mnist-train-labels-idx1-ubyte
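The rename-and-wrap step can be done in any file manager, but here is a small pathlib sketch in case you want it in a setup script. The function name move_to_torchvision_layout is hypothetical, and the sketch assumes that the unpacked gzip folder sits directly inside data_root and that no EMNIST/raw folder exists there yet.

```python
from pathlib import Path


def move_to_torchvision_layout(data_root):
    """Turn <data_root>/gzip into <data_root>/EMNIST/raw.

    Hypothetical helper: assumes <data_root>/gzip exists and
    <data_root>/EMNIST/raw does not exist yet.
    """
    data_root = Path(data_root)
    raw_dir = data_root / "EMNIST" / "raw"
    # Create the EMNIST parent folder, then rename gzip -> EMNIST/raw.
    raw_dir.parent.mkdir(parents=True, exist_ok=True)
    (data_root / "gzip").rename(raw_dir)
    return raw_dir
```

With the data in the current working directory, move_to_torchvision_layout("./") should produce the tree shown above.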
(2) Point PyTorch to the correct path
Finally, the call to the PyTorch data set loader works as intended because the EMNIST folder is directly below my current working directory ./:
data_root = "./"

train_data = torchvision.datasets.EMNIST(
    root=data_root,
    split="digits",
    train=True,
    download=False,
)
If your EMNIST/ folder lives somewhere else (e.g., in a dedicated data/ folder), simply adjust data_root.
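If the loader still complains with Dataset not found, it can help to check whether the files are where you think they are. The sketch below lists the image and label files that are missing for a given split. It rests on an assumption drawn from the file tree above, not from the torchvision documentation: that the loader needs exactly the emnist-<split>-<train|test>-images-idx3-ubyte and -labels-idx1-ubyte files inside <root>/EMNIST/raw. The function name missing_emnist_files is made up for this example.

```python
from pathlib import Path


def missing_emnist_files(data_root, split="digits", train=True):
    """Return the expected image/label files that do not exist.

    Assumption: the files live in <data_root>/EMNIST/raw and follow
    the naming scheme from the directory tree in this post.
    """
    subset = "train" if train else "test"
    raw_dir = Path(data_root) / "EMNIST" / "raw"
    expected = [
        raw_dir / f"emnist-{split}-{subset}-images-idx3-ubyte",
        raw_dir / f"emnist-{split}-{subset}-labels-idx1-ubyte",
    ]
    return [path for path in expected if not path.is_file()]
```

An empty list from missing_emnist_files(data_root) means both files were found; otherwise the returned paths tell you which names or folders to fix.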
Step 5: Profit!
Now off you go and make some fancy machine learning stuff with EMNIST! ✨
–Marvin