One Humanities

from fastai.vision.all import *

Processing data

Cleaning and processing data is one of the most time-consuming tasks in machine learning.

Preparing raw data for a model is really just a sequence of transformations. For instance, in a classic image classification problem, we start with the filenames. We have to open the corresponding images, resize them, convert them to tensors, and maybe apply some kind of data augmentation, all before we’re ready to batch them. And that’s just for the inputs.

Transform

First, let’s walk through the basic steps using a single MNIST image.

source = untar_data(URLs.MNIST_TINY)/'train'
items = get_image_files(source)
fn = items[0]; fn

Let’s look at each Transform in turn. First, we open the image file:

img = PILImage.create(fn); img

Now we can convert it to a C*H*W tensor (channel x height x width, the standard in PyTorch):

tconv = ToTensor()
img = tconv(img)
img.shape,type(img)
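
To make the C*H*W layout concrete, here is a minimal pure-Python sketch of the reordering on a toy 2x2 RGB image. The helper to_chw and the nested-list "image" are hypothetical stand-ins for illustration only; this is not how fastai or PyTorch implement the conversion.

```python
# Toy 2x2 RGB "image" in height x width x channel (H*W*C) order,
# the layout image libraries commonly use for pixel data.
hwc = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def to_chw(image):
    """Rearrange a nested H*W*C list into C*H*W order (hypothetical helper)."""
    n_channels = len(image[0][0])
    return [[[pixel[c] for pixel in row] for row in image]
            for c in range(n_channels)]

chw = to_chw(hwc)
shape = (len(chw), len(chw[0]), len(chw[0][0]))
print(shape)   # (3, 2, 2): channels first, then height, then width
print(chw[0])  # the red channel as a 2x2 grid: [[255, 0], [0, 255]]
```

Each entry of chw is a full height x width grid for one channel, which is the layout the tensor shape above reports.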

Next, we create the label by extracting the text label from the name of the file’s parent folder:

lbl = parent_label(fn); lbl

Then we convert it to an integer for modeling:

tcat = Categorize(vocab=['3','7'])
lbl = tcat(lbl); lbl

We use decode to reverse transforms for display. Decoding the output of Categorize gives back the class name, which is what we display:

lbld = tcat.decode(lbl)
lbld
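
The encode/decode round trip can be sketched without any library. The class ToyCategorize below is a hypothetical stand-in that only assumes the idea of a vocabulary lookup; it is not fastai’s Categorize and skips everything else that class does.

```python
class ToyCategorize:
    """Hypothetical label transform: maps class names to integers and back."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        # Reverse lookup from class name to its position in the vocab
        self.o2i = {label: i for i, label in enumerate(self.vocab)}

    def encodes(self, label):
        # Class name -> integer index, the form a model trains on
        return self.o2i[label]

    def decodes(self, index):
        # Integer index -> class name, the form we want to display
        return self.vocab[index]

toy_cat = ToyCategorize(vocab=['3', '7'])
idx = toy_cat.encodes('7')
name = toy_cat.decodes(idx)
print(idx, name)  # 1 7
```

Keeping both directions on the same object means the mapping used for display can never drift from the one used for training.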

Pipeline

We can compose our image steps using Pipeline:

pipe = Pipeline([PILImage.create,tconv])
img = pipe(fn)
img.shape

A Pipeline can decode and show an item:

pipe.show(img, figsize=(1,1), cmap='Greys');
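
As a rough illustration of what composing and decoding mean, the sketch below applies steps in order when encoding and in reverse order when decoding. ToyPipeline, Scale, and Shift are hypothetical names invented for this example, and the sketch leaves out most of what the real Pipeline handles (ordering, setup, type dispatch, showing).

```python
class Scale:
    """Hypothetical reversible step: multiply on encode, divide on decode."""
    def __init__(self, factor): self.factor = factor
    def encodes(self, x): return x * self.factor
    def decodes(self, x): return x / self.factor

class Shift:
    """Hypothetical reversible step: add on encode, subtract on decode."""
    def __init__(self, offset): self.offset = offset
    def encodes(self, x): return x + self.offset
    def decodes(self, x): return x - self.offset

class ToyPipeline:
    """Minimal composition of reversible steps (not fastai's Pipeline)."""
    def __init__(self, steps): self.steps = list(steps)

    def __call__(self, x):
        # Encode: apply every step in the order given
        for step in self.steps:
            x = step.encodes(x)
        return x

    def decode(self, x):
        # Decode: undo the steps, last one first
        for step in reversed(self.steps):
            x = step.decodes(x)
        return x

toy_pipe = ToyPipeline([Scale(2), Shift(3)])
encoded = toy_pipe(5)               # 5 * 2 + 3
decoded = toy_pipe.decode(encoded)  # (13 - 3) / 2
print(encoded, decoded)  # 13 5.0
```

Decoding in reverse order matters: undoing Scale before Shift here would return 3.5 rather than the original 5.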

Behind the scenes, the show method relies on types. Transforms make sure the type of an element is preserved, so the result still carries the information needed to display it.

type(img)
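
One way to picture type preservation is a wrapper that casts a function’s result back to the input’s type when the function returned a plain superclass. This is a sketch under that assumption only: TaggedList and keep_type are hypothetical names, and the real mechanism in fastai is more involved.

```python
class TaggedList(list):
    """Hypothetical list subclass standing in for a typed tensor."""

def double(values):
    # A list comprehension returns a plain list, so the subclass is lost
    return [v * 2 for v in values]

def keep_type(func):
    """Wrap func so a plain-superclass result is cast back to the input's type."""
    def wrapper(x):
        result = func(x)
        # Cast only when the input's type is a subclass of the result's type
        if not isinstance(result, type(x)) and isinstance(x, type(result)):
            result = type(x)(result)
        return result
    return wrapper

data = TaggedList([1, 2, 3])
plain = double(data)
typed = keep_type(double)(data)
print(type(plain).__name__, type(typed).__name__)  # list TaggedList
```

With the type intact, code further down the line can still choose behavior, such as how to display the item, based on what kind of object it received.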