Write a DataFlow

First, make sure you understand Python’s generators and the yield keyword. If you don’t, read an introduction to them before continuing.
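As a quick refresher, here is a minimal sketch of a generator in plain Python. The name count_up_to is made up for illustration and is not part of any library.

```python
def count_up_to(n):
    """Generator function: calling it returns a generator object."""
    for k in range(n):
        # `yield` hands one value to the caller and pauses here
        # until the next value is requested.
        yield k


gen = count_up_to(3)
first = next(gen)   # runs the body up to the first `yield`
rest = list(gen)    # consumes the values that are left
```

A DataFlow's __iter__ method works the same way: each yield produces one datapoint, and the loop body only runs as far as the consumer asks it to.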

Write a Source DataFlow

There are several existing DataFlows, e.g. ImageFromFile and DataFromList, which you can use if your data format is simple. In general, you will probably need to write a source DataFlow that produces data for your task, and then compose it with other DataFlows (e.g. mapping, batching, prefetching, …).

The easiest way to create a DataFlow that loads custom data is to wrap a custom generator, e.g.:

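import numpy as np
from dataflow import DataFromGenerator  # assumed import path; with tensorpack, import from tensorpack.dataflow

def my_data_loader():
  # load data from somewhere with Python, and yield them
  for k in range(100):
    my_array = np.random.rand(28, 28)  # placeholder for a loaded array
    my_label = k % 10                  # placeholder for its label
    yield [my_array, my_label]

df = DataFromGenerator(my_data_loader)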

To write a more complicated DataFlow, you need to inherit from the base DataFlow class. Usually, you just need to implement the __iter__() method, which yields one datapoint at a time.

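import numpy as np
from dataflow import DataFlow  # assumed import path; with tensorpack, import from tensorpack.dataflow

class MyDataFlow(DataFlow):
  def __iter__(self):
    # load data from somewhere with Python, and yield them
    for k in range(100):
      digit = np.random.rand(28, 28)
      label = np.random.randint(10)
      yield [digit, label]

df = MyDataFlow()
df.reset_state()  # call once before iterating
for datapoint in df:
  print(datapoint[0], datapoint[1])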

Optionally, you can also implement the __len__ and reset_state methods. The detailed semantics of these three methods are explained in the API documentation; make sure to read it if you’re writing a complicated DataFlow.
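The following is a minimal sketch of a source DataFlow that implements all three methods. It makes several assumptions. The DataFlow class below is a hypothetical stand-in for the library's base class, so that the example runs on its own. RandomDigits is an invented name. The standard random module replaces NumPy. The sketch takes reset_state to be the place where per-instance state such as a random number generator is created, and __len__ to report how many datapoints one pass produces. The API documentation remains the authority on the exact semantics.

```python
import random


class DataFlow:
    """Hypothetical stand-in for the library's base class."""

    def reset_state(self):
        pass


class RandomDigits(DataFlow):
    def __init__(self, size=100):
        self._size = size
        self.rng = None  # created in reset_state, not here

    def reset_state(self):
        # Create per-instance state once, before the first iteration.
        self.rng = random.Random()

    def __len__(self):
        # Number of datapoints produced by one full pass of __iter__.
        return self._size

    def __iter__(self):
        for _ in range(self._size):
            digit = [[self.rng.random() for _ in range(28)] for _ in range(28)]
            label = self.rng.randrange(10)
            yield [digit, label]


df = RandomDigits(size=5)
df.reset_state()
points = list(df)
```

In this sketch, iterating before calling reset_state fails because self.rng is still None, which makes the required call order explicit.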

DataFlow implementations for several well-known datasets are provided in the dataflow.dataset module. You can take them as examples.

More Data Processing

You can put any data processing you need in the source DataFlow you write, or you can write a new DataFlow for data processing on top of the source DataFlow, e.g.:

class ProcessingDataFlow(DataFlow):
  def __init__(self, ds):
    self.ds = ds
    
  def reset_state(self):
    self.ds.reset_state()

  def __iter__(self):
    for datapoint in self.ds:
      # do something with datapoint; here it is passed through unchanged
      new_datapoint = datapoint
      yield new_datapoint

Some built-in DataFlows, e.g. MapData and MapDataComponent, can do common types of data processing for you.
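To illustrate the idea behind such mapping DataFlows, here is a plain-Python sketch. SimpleMap and map_component are hypothetical names written for this example, not library classes, and the built-in MapData and MapDataComponent may differ in signature and behavior, so check the API documentation before relying on them. A plain list of datapoints stands in for the source DataFlow.

```python
class SimpleMap:
    """Hypothetical wrapper: applies func to every datapoint of ds."""

    def __init__(self, ds, func):
        self.ds = ds
        self.func = func

    def reset_state(self):
        # Forward the call if the wrapped object defines reset_state.
        reset = getattr(self.ds, "reset_state", None)
        if reset is not None:
            reset()

    def __len__(self):
        return len(self.ds)

    def __iter__(self):
        for datapoint in self.ds:
            yield self.func(datapoint)


def map_component(ds, func, index=0):
    """Hypothetical helper: apply func to one component of each datapoint."""
    def apply(datapoint):
        new_datapoint = list(datapoint)  # copy, leave the input untouched
        new_datapoint[index] = func(new_datapoint[index])
        return new_datapoint
    return SimpleMap(ds, apply)


source = [[1, 0], [2, 1], [3, 0]]  # each datapoint is [value, label]
scaled = map_component(source, lambda v: v * 10, index=0)
scaled.reset_state()
result = list(scaled)
```

Keeping the transformation in a small function and the iteration in a generic wrapper means the same source DataFlow can be reused with different processing steps.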