This is a series of posts exploring some of the new features in TensorFlow 2.0, which I am currently using in my own projects. These posts are introductory guides and do not cover more advanced uses.
TensorFlow 2.0 introduced the concept of a Dataset. This high-level API allows you to load different data formats such as images, numpy arrays and pandas DataFrames.
Previously in Keras, when we wanted to load a training dataset too big to fit into memory, we would create a custom generator that iterates over the dataset in batches, which are fed into the model during training via method calls such as fit_generator.
The issue with the above approach is that it can be error-prone to set up. For instance, changes to the dataset structure mean changes to the generator, and the generator implementation itself can harbour bugs.
Dataset is a high-level construct in TF 2.0 which represents a collection of data or documents. It supports batching, caching and pre-fetching of data in the background. The dataset is not loaded into memory all at once but streamed into the model as it is iterated over.
A Dataset pipeline generally follows these stages:
Create a dataset from input data
Apply transformations to preprocess the data
Iterate over dataset and process its elements i.e. training loop
Let’s go through each of the above stages in the pipeline.
Creating a dataset
The easiest method to create a dataset is to use the from_tensor_slices method:
```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for ele in dataset:
    print(ele)  # returns tf.Tensor
```
If we try to print each element of a dataset, we get a Tensor object back. In order to inspect the contents, we can call the as_numpy_iterator method, which converts the tensors into numpy values and returns an iterable:
```python
for num in dataset.as_numpy_iterator():
    print(num)
```
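from_tensor_slices also accepts a tuple of numpy arrays, which it slices in lockstep along the first axis — a convenient way to pair features with labels. A minimal sketch, using made-up toy data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 samples with 3 features each, plus labels.
features = np.arange(12, dtype=np.float32).reshape(4, 3)
labels = np.array([0, 1, 0, 1])

# Passing a tuple slices both arrays together, yielding (feature, label) pairs.
ds = tf.data.Dataset.from_tensor_slices((features, labels))

for x, y in ds.as_numpy_iterator():
    print(x, y)
```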
To create a dataset from a directory of files, we can use the list_files method, which accepts a file/glob matching pattern. For example, if we had a directory "/mydir/" containing the Python files "/mydir/a.py" and "/mydir/b.py", it would produce the following:
```python
dataset = tf.data.Dataset.list_files("/mydir/*.py")
files_list = list(dataset.as_numpy_iterator())
print(files_list)  # => [b"/mydir/a.py", b"/mydir/b.py"] (order is shuffled by default)
```
The issue with the above approach is that the globbing happens lazily, each time the dataset is iterated over, so it is more efficient to produce the list of file names first and construct the dataset using from_tensor_slices.
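As a sketch of that approach, we can resolve the glob eagerly with Python's glob module and hand the resulting list to from_tensor_slices. The temporary directory and file names below are illustrative stand-ins for the "/mydir/" example:

```python
import glob
import os
import tempfile

import tensorflow as tf

# Create a throwaway directory with a couple of files, standing in
# for the hypothetical "/mydir/" from the text.
tmpdir = tempfile.mkdtemp()
for name in ("a.py", "b.py"):
    open(os.path.join(tmpdir, name), "w").close()

# Resolve the glob once, up front; sorting keeps the order deterministic.
list_of_files = sorted(glob.glob(os.path.join(tmpdir, "*.py")))

dataset = tf.data.Dataset.from_tensor_slices(list_of_files)
print(list(dataset.as_numpy_iterator()))
```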
There are other creation methods, such as from_tensors, which are outside the scope of this article. We will be using from_tensor_slices in a working example below.
Applying transformations to the dataset
Now that we have a dataset of elements, the next step is to preprocess them. We can call the map method and pass a function to process each element. For instance, we may want to resize each image and scale its pixel values to the [0, 1] range as part of preprocessing.
```python
def process_img(file_path):
    # read and decode the image
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)
    # convert to float32, which also scales pixel values to [0, 1]
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image
    img = tf.image.resize(img, (64, 64))
    return img

# list_of_files is a collection of file paths...
dataset = tf.data.Dataset.from_tensor_slices(list_of_files)

train_ds = dataset.map(process_img)
```

Note that convert_image_dtype already rescales integer pixel values to [0, 1], so no separate division by 255 is required.
After mapping process_img over the dataset as above, train_ds will contain a dataset of preprocessed images.
Since map returns a dataset, we can chain multiple calls together, clarifying the sequence of operations:
```python
def func1(x):
    return x * 2

def func2(x):
    return x ** 2

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

new_ds = ds.map(func1).map(func2)

print(list(new_ds.as_numpy_iterator()))  # => [4, 16, 36]
```
Iteration over dataset
We need to set certain parameters on the dataset object before we can pass it into a model for training. These include the batch size and the caching and pre-fetching options.
Using the image classification example above, we can do the following:
```python
dataset = tf.data.Dataset.from_tensor_slices(list_of_files)
train_ds = dataset.map(process_img)
train_ds = train_ds.shuffle(buffer_size=1024).batch(64)

model.fit(train_ds, epochs=3)
```
The shuffle function randomly shuffles the elements in the dataset, drawing from a buffer of buffer_size elements. The batch function sets the batch size used at each training step. Note that by using batch we don't have to set the batch_size argument in model.fit.
One can also chain further functions, such as cache, to cache the data in memory, or on the filesystem by setting the filename argument of the function. This is extremely useful when training on large datasets.
Note that the first iteration of the training loop creates the cache, after which subsequent runs reuse the cached data in the same sequence. To randomize the data between iterations, call shuffle after cache:
```python
train_ds = train_ds.cache("cache/mycache").shuffle(buffer_size=1024).batch(64)
```
When the training loop is restarted, the cache directory needs to be cleared, otherwise an exception will be raised.
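The pre-fetching mentioned earlier can be added to the same chain. As a minimal sketch on toy data: prefetch lets tf.data prepare upcoming batches in the background while the model consumes the current one, and tf.data.experimental.AUTOTUNE asks the runtime to pick the buffer size dynamically.

```python
import tensorflow as tf

# Toy dataset of 100 integers standing in for real training elements.
ds = tf.data.Dataset.from_tensor_slices(list(range(100)))

# prefetch overlaps preprocessing of the next batches with training on
# the current one; AUTOTUNE tunes the prefetch buffer size at runtime.
train_ds = ds.shuffle(buffer_size=100).batch(16).prefetch(
    tf.data.experimental.AUTOTUNE)
```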
For most training scenarios, passing the dataset into model.fit will be sufficient. However, if you have a custom/manual training process where you iterate the dataset across multiple epochs, you need to call repeat before batching and iterating over the dataset.
```python
# repeat() with no argument repeats the dataset indefinitely,
# so the loop below will not terminate on its own
train_ds = train_ds.repeat().batch(64)

for ele in train_ds.as_numpy_iterator():
    print(ele)
```
To access the next batch of data, you can create an iterator from the dataset by calling as_numpy_iterator, or by wrapping the dataset object in iter() and calling next() to retrieve the next batch of data.
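A minimal sketch of the iter()/next() approach, using a toy dataset of integers:

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(list(range(10))).batch(4)

# Wrap the dataset in iter() and pull batches manually with next().
it = iter(ds)
first = next(it)   # tf.Tensor holding the first batch
second = next(it)
print(first.numpy())   # [0 1 2 3]
print(second.numpy())  # [4 5 6 7]
```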