ML Engineering 101: A Thorough Explanation of the Error “DataLoader worker (pid(s) xxx) exited unexpectedly”

By Mengliu Zhao, June 2024



torch.multiprocessing Best Practices

However, virtual memory is only one side of the story. What if the issue doesn’t go away after adjusting the swap disk?

The other side of the story is the set of underlying issues in the torch.multiprocessing module. The official documentation lists a number of best-practice recommendations.

Besides those, three more approaches are worth considering, especially regarding memory usage.

The first issue is shared-memory leakage. Leakage means that memory is not released properly after each run of a child worker, and you can observe it by monitoring virtual memory usage at runtime: consumption keeps climbing until the process hits “out of memory.” This is a classic memory leak.
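For illustration, here is a minimal sketch of how you might watch for that pattern, assuming psutil is available (any process monitor would do): log the parent process’s resident memory and the number of live workers once per epoch and check whether the numbers keep climbing.

import psutil

def log_memory(tag: str) -> None:
    proc = psutil.Process()  # the current (parent) process
    rss_gb = proc.memory_info().rss / 1024**3
    n_workers = len(proc.children())  # DataLoader workers are child processes
    print(f"[{tag}] RSS: {rss_gb:.2f} GB, live workers: {n_workers}")

# Call it once per epoch, e.g. log_memory(f"epoch {epoch}");
# steadily growing RSS across epochs is the leak signature described above.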

So what will cause the leakage?

Let’s take a look at the DataLoader class itself:

https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py

Looking under the hood of DataLoader, we’ll see that when num_workers > 0, _MultiProcessingDataLoaderIter is used. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing creates the worker queues. torch.multiprocessing supports two different strategies for memory sharing and caching: file_descriptor and file_system. While file_system requires no file-descriptor caching, it is prone to shared-memory leaks.

To check which sharing strategy your machine is using, simply add this to your script:

torch.multiprocessing.get_sharing_strategy()

To get your system file descriptor limit (Linux), run the following command in the terminal:

ulimit -n

To switch your sharing strategy to file_descriptor:

torch.multiprocessing.set_sharing_strategy('file_descriptor')

To count the number of open file descriptors, run the following command:

ls /proc/self/fd | wc -l

As long as the system allows, the file_descriptor strategy is recommended.
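Putting those pieces together, here is a minimal sketch of how the check and the switch might look in practice; the key point is to switch the strategy before any DataLoader workers are created.

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # strategies supported on this platform
print(mp.get_sharing_strategy())        # strategy currently in effect

# Switch before any DataLoader / worker process is created
mp.set_sharing_strategy("file_descriptor")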

The second is the multiprocessing worker start method. Simply put, it’s the debate over whether to use fork or spawn to start the workers. Fork is the default start method on Linux; it avoids re-initializing the interpreter and copies the parent’s memory lazily (copy-on-write), so it is much faster, but it can have issues handling CUDA tensors and third-party libraries like OpenCV in your DataLoader.

To use the spawn method, you can simply pass the argument multiprocessing_context="spawn" to the DataLoader.
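A minimal sketch of what that looks like; the TensorDataset here is just a stand-in for your own Dataset.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1000, 3), torch.randint(0, 2, (1000,)))
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        multiprocessing_context="spawn",  # workers start in fresh interpreters
    )
    for batch in loader:
        pass  # training step would go here

The __main__ guard matters with spawn: each worker re-imports the main module, so any unguarded module-level code would run again in every worker.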

The third is to make the Dataset objects picklable/serializable.

There is a very nice post further discussing the “copy-on-read” effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

Simply put, it’s no longer a good approach to keep a Python list of filenames and load them in the __getitem__ method. Instead, store the list of filenames in a numpy array or pandas DataFrame so it can be serialized efficiently. And if you’re familiar with HuggingFace, using a CSV/DataFrame is the recommended way to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
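As a minimal sketch of the idea (the class name is hypothetical and the actual file loading is omitted), storing the paths in a single numpy array keeps them in one contiguous buffer instead of thousands of small Python objects, so forked workers don’t gradually copy the parent’s memory through reference-count updates:

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    def __init__(self, filenames):
        # One contiguous fixed-width string array instead of a Python list of str
        self.filenames = np.asarray(filenames)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = str(self.filenames[idx])
        # load and return the sample stored at `path` here
        return path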
