API Reference


An instance of this class is used to read videos.

class mydia.Videos(target_size=None, to_gray=False, num_frames=None, mode='auto', normalize=False, data_format='channels_last', random_state=17)

Class to read in videos and store them as NumPy arrays.

The videos are stored as a 5-dimensional tensor where the shape of the tensor depends on data_format.
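The two layouts listed under data_format hold the same values and differ only in axis order. The plain-NumPy sketch below (not mydia's own code; the batch size and frame dimensions are arbitrary placeholders) converts a "channels_last" batch to "channels_first" with np.transpose.

```python
import numpy as np

# A dummy batch in "channels_last" layout:
# (videos, frames, height, width, channels) = (2, 8, 48, 64, 3)
batch_last = np.zeros((2, 8, 48, 64, 3), dtype=np.uint8)

# Move the channel axis (4) to position 1 to get "channels_first":
# (videos, channels, frames, height, width)
batch_first = np.transpose(batch_last, (0, 4, 1, 2, 3))

print(batch_first.shape)  # (2, 3, 8, 48, 64)
```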

Parameters:

  • target_size (tuple[int, int]) – A tuple of form (width, height) indicating the dimensions to which the frames of the video are resized, defaults to None. The dimensions of the frames will not be altered if this parameter is not set.
  • to_gray (bool) – Convert video to grayscale, defaults to False.
  • num_frames (int) – The (exact) number of frames to extract from the video, defaults to None. Frames are extracted based on the value of mode. If not set, all the frames of the video are kept.
  • mode (str) –

    The method used for frame extraction if num_frames is set. It must be one of “auto”, “random”, “first”, “last” or “middle”. Below, N denotes the value of num_frames.

    • "auto": N frames will be extracted at equal intervals.
    • "random": N frames will be randomly extracted (no repetition). Use random_state to ensure reproducibility.
    • "first", "last" and "middle" will extract N contiguous frames from the beginning, end and middle of the video respectively.
  • normalize (bool) – Scales each video to the range [0, 1] by subtracting the minimum pixel value and dividing by the difference between the maximum and the minimum pixel value, defaults to False.
  • data_format (str) –

    Video data format, either “channels_last” or “channels_first”.

    • "channels_last": The tensor will have shape (<videos>, <frames>, <height>, <width>, <channels>)
    • "channels_first": The tensor will have shape (<videos>, <channels>, <frames>, <height>, <width>)

    channels will be 3 for videos in RGB format, or 1 for videos in grayscale.

  • random_state (int) – Integer that seeds the (numpy) random number generator, defaults to 17. Used only when mode is set to “random”.
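To make the extraction modes and the normalize option more concrete, here is a minimal NumPy sketch. select_frame_indices and min_max_normalize are hypothetical helpers written for illustration, not part of mydia; the rounding used for "auto", the sorted order of "random" indices and the guard for a constant video are assumptions.

```python
import numpy as np


def select_frame_indices(total_frames, num_frames, mode="auto", random_state=17):
    """Hypothetical helper: return frame indices (in temporal order) for a mode."""
    if num_frames > total_frames:
        # mirrors the documented IndexError for a too-large num_frames
        raise IndexError("num_frames exceeds the number of frames in the video")
    if mode == "auto":
        # N indices at (approximately) equal intervals across the whole video
        return np.linspace(0, total_frames - 1, num_frames).astype(int).tolist()
    if mode == "random":
        # N distinct indices, reproducible for a fixed seed
        rng = np.random.RandomState(random_state)
        return sorted(rng.choice(total_frames, size=num_frames, replace=False).tolist())
    if mode == "first":
        start = 0
    elif mode == "last":
        start = total_frames - num_frames
    elif mode == "middle":
        start = (total_frames - num_frames) // 2
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # a contiguous block of N frames starting at `start`
    return list(range(start, start + num_frames))


def min_max_normalize(video):
    """Scale a video's pixel values to [0, 1] (min-max normalization)."""
    video = video.astype(np.float32)
    lo, hi = video.min(), video.max()
    # assumption: a constant video (hi == lo) maps to all zeros instead of dividing by 0
    return (video - lo) / (hi - lo) if hi > lo else np.zeros_like(video)
```

For a 10-frame video and N = 3, this sketch gives [0, 4, 9] for "auto", [0, 1, 2] for "first", [7, 8, 9] for "last" and [3, 4, 5] for "middle".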


from mydia import Videos

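# Resize every frame to 720x480 (width x height)
reader = Videos(
    target_size=(720, 480),
)

# Returns a 5-dimensional NumPy array holding the video
video = reader.read("./path/to/video")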


You can also pass a callable to mode for custom frame extraction. The callable should return a list of integers denoting the indices of the frames to be extracted. It should take 4 positional (non-keyword) arguments:

  • total_frames: The total number of frames in the video
  • num_frames: The number of frames that you want to extract
  • fps: The frame rate of the video
  • random_state: Integer to seed the random number generator

The callable may or may not use these arguments to generate the required frame indices. Detailed examples are provided in the documentation.
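As an illustration of the four-argument signature, the hypothetical callable below picks one frame at the start of every second of video. It uses fps, accepts but ignores random_state, and is a sketch rather than one of the library's built-in modes.

```python
def one_frame_per_second(total_frames, num_frames, fps, random_state):
    """Hypothetical custom mode: take the first frame of each second.

    random_state is accepted to match the documented 4-argument signature,
    but this strategy is deterministic and ignores it.
    """
    step = max(1, int(round(fps)))  # frames per second, at least 1
    indices = [i * step for i in range(num_frames)]
    if indices and indices[-1] >= total_frames:
        raise IndexError("video is too short for one frame per second")
    return indices
```

Under the documented interface it would be passed as Videos(num_frames=5, mode=one_frame_per_second).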


If you pass a callable to mode, make sure that the number of frame indices it returns is equal to num_frames. Otherwise, different videos could yield different numbers of frames, and they could not be stacked into a single tensor.
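One way to catch such a mismatch early is to wrap the callable so that it checks its own output. The decorator require_exact_count and the example strategy every_other_frame are hypothetical names used only for this sketch.

```python
import functools


def require_exact_count(select_fn):
    """Hypothetical decorator: ensure a custom mode returns exactly num_frames indices."""
    @functools.wraps(select_fn)
    def wrapper(total_frames, num_frames, fps, random_state):
        indices = list(select_fn(total_frames, num_frames, fps, random_state))
        if len(indices) != num_frames:
            raise ValueError(
                f"{select_fn.__name__} returned {len(indices)} indices, "
                f"expected {num_frames}"
            )
        return indices
    return wrapper


@require_exact_count
def every_other_frame(total_frames, num_frames, fps, random_state):
    # Returns fewer than num_frames indices when the video is too short,
    # which the decorator turns into an immediate ValueError.
    return list(range(0, total_frames, 2))[:num_frames]
```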

read(paths, verbose=1, workers=0)

Function to read videos.

Parameters:

  • paths (str or list[str]) – The path of a single video, or a list of paths of the videos to be read.
  • verbose (int) – If set to 0, the progress bar is disabled, defaults to 1.
  • workers (int) –

    The number of processes (CPUs) to use for reading the videos. This uses the multiprocessing module from the Python standard library.

    Its value can range from 0 to the number of CPUs on your machine, which can be determined by calling multiprocessing.cpu_count().

    Defaults to 0, which means that multiprocessing will not be used.
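The sketch below shows how a workers setting like this one can switch between a plain loop and a multiprocessing.Pool. read_all and load_one are hypothetical stand-ins (load_one returns a string instead of decoding a file), and mydia's actual implementation may differ.

```python
import multiprocessing


def load_one(path):
    """Hypothetical stand-in for decoding one video file."""
    return f"decoded:{path}"


def read_all(paths, workers=0):
    """Sketch: read serially when workers == 0, otherwise use a process pool."""
    max_workers = multiprocessing.cpu_count()
    if not 0 <= workers <= max_workers:
        raise ValueError(f"workers must be between 0 and {max_workers}")
    if workers == 0:
        # no multiprocessing: handle the videos one after another
        return [load_one(p) for p in paths]
    # spread the paths over worker processes; Pool.map preserves input order
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(load_one, paths)
```

On platforms that start worker processes with the spawn method (the default on Windows and macOS), call code like this from inside an if __name__ == "__main__": block.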


Returns:

A 5-dimensional tensor, whose shape depends on the value of data_format.

  • For "channels_last": The tensor will have shape (<videos>, <frames>, <height>, <width>, <channels>)
  • For "channels_first": The tensor will have shape (<videos>, <channels>, <frames>, <height>, <width>)

Return type:

numpy.ndarray

Raises:

  • ValueError – If paths is neither a string nor a list of strings.
  • IndexError – If num_frames is set to a value greater than the total number of frames available in the video.


If multiple videos are to be read, then all of them must have the same dimensions (frames, height, width); otherwise they cannot be stacked into a single tensor. Use the parameters target_size and num_frames to ensure this.
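The stacking constraint comes from NumPy itself: np.stack requires all inputs to share one shape. The dummy clips below are arbitrary zero arrays, and slicing clip_a only imitates the effect of setting num_frames when reading.

```python
import numpy as np

# Two dummy "channels_last" videos: (frames, height, width, channels)
clip_a = np.zeros((16, 48, 64, 3), dtype=np.uint8)
clip_b = np.zeros((12, 48, 64, 3), dtype=np.uint8)  # 4 fewer frames

try:
    np.stack([clip_a, clip_b])  # mismatched shapes cannot be stacked
except ValueError as err:
    print("cannot stack:", err)

# Once the frame counts agree, stacking works
batch = np.stack([clip_a[:12], clip_b])
print(batch.shape)  # (2, 12, 48, 64, 3)
```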


This function converts a video into a grid of frames. It is inspired by a similar utility provided in torchvision.

mydia.make_grid(video, num_col=3, padding=5)

Converts a video into a grid of frames.

Parameters:

  • video (numpy.ndarray) – A 4-dimensional video tensor (a single video).
  • num_col (int) – The number of columns in the grid, defaults to 3.
  • padding (int) – Amount of padding (in pixels), defaults to 5.

Returns:

A grid of frames (numpy array) of shape (height, width, 3) if the video is in RGB format, or (height, width) if the video is in grayscale.

Return type:

numpy.ndarray

Raises:

ValueError – If the dimension of the video tensor is invalid.


import matplotlib.pyplot as plt
from mydia import Videos, make_grid

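# Read a single video, resized to 720x480 and converted to grayscale
reader = Videos(target_size=(720, 480), to_gray=True)
video = reader.read("./path/to/video")

# video[0] is the first (and only) video in the batch: a 4-dimensional tensor
grid = make_grid(video[0], num_col=6, padding=8)
plt.imshow(grid, cmap="gray")
plt.show()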


The input to this function should be a single video tensor in either data_format. However, the output grid is always "channels_last".
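For intuition about what make_grid produces, here is a minimal NumPy re-implementation for a single "channels_last" video. grid_sketch is a hypothetical name; the row-major layout, the black (zero) padding around every frame and the blank cells in an incomplete last row are assumptions and may not match mydia's output pixel for pixel. Unlike make_grid, it does not accept "channels_first" input.

```python
import math

import numpy as np


def grid_sketch(video, num_col=3, padding=5):
    """Hypothetical frame grid for a video of shape (frames, height, width, channels).

    Frames are placed row by row, with `padding` zero pixels around every frame.
    """
    if video.ndim != 4:
        raise ValueError("expected a 4-dimensional (frames, H, W, C) tensor")
    frames, h, w, c = video.shape
    num_row = math.ceil(frames / num_col)
    grid = np.zeros(
        (num_row * h + (num_row + 1) * padding,
         num_col * w + (num_col + 1) * padding,
         c),
        dtype=video.dtype,
    )
    for i in range(frames):
        row, col = divmod(i, num_col)
        top = padding + row * (h + padding)
        left = padding + col * (w + padding)
        grid[top:top + h, left:left + w] = video[i]
    # drop the channel axis for grayscale, matching the documented output shapes
    return grid[..., 0] if c == 1 else grid
```

For 7 RGB frames of 10x20 pixels with num_col=3 and padding=2, the grid has 3 rows and shape (38, 68, 3), since 3*10 + 4*2 = 38 and 3*20 + 4*2 = 68.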