trainer.fit() Object Memory Error in Local Machine #63
Originally created by @satyamnyati on GitHub (Feb 2, 2024).
I run out of memory in trainer.fit(). I have 8 GB of RAM and an 8th-gen i7 with 12 cores, and the object memory error is raised inside trainer.fit(). Is there any way to reduce the load on RAM, or will I need more RAM to be able to run this?
@totovivi commented on GitHub (Feb 9, 2024):
These changes helped me:
But I am surprised that I had to make these changes, given my computer's specs.
@satyamnyati commented on GitHub (Feb 13, 2024):
Yes, I reduced the batch size and got it to run on Ubuntu. On Windows, however, I couldn't get it to run even with these changes; the path manipulations cause some problems.
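For anyone looking for a concrete starting point, here is a minimal sketch of where a smaller batch size takes effect inside the per-worker training loop. It assumes the Ray Train / Ray Data setup used in the course; the function body and the values are illustrative, not the course's actual code.

```python
import ray.train

def train_loop_per_worker(config):
    # Illustrative stand-in for the course's per-worker loop.
    train_shard = ray.train.get_dataset_shard("train")
    # Smaller batches mean fewer samples materialized in RAM at any one time.
    for batch in train_shard.iter_torch_batches(batch_size=config["batch_size"]):
        ...  # forward/backward pass would go here
```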
@capmichal commented on GitHub (Feb 26, 2024):
Same issue here on both of my laptops (a powerful Dell XPS 15 and a weaker HP from work). With num_workers=1 and batch_size=32 it ran just fine; training the model took about 30 minutes.
If I work on a single machine, does num_workers affect this OOM problem, or do I only need to adjust the batch_size variable? Would switching to 2 workers let me use a larger batch size, or can my computer (each core) simply not handle more than batch_size=32?
I would love an explanation of how num_workers, resources_per_worker, and batch_size relate to my RAM.
In addition: using the GPU on my Dell solves the issue and trains the model in a minute, but I am interested in using only the CPU. While the whole setup is running, my RAM usage is about 70%, which leaves roughly 4-5 GB of RAM for training. That is simply not enough, so will no configuration (without the GPU) let me train comfortably?
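As a rough mental model (hedged, with illustrative values rather than the course defaults): num_workers is the number of parallel training processes, and each one holds its own model copy plus its own batches; resources_per_worker only tells Ray how many CPUs to reserve per process, it does not cap RAM; batch_size controls how much data each process materializes per step. A minimal CPU-only sketch with Ray Train's TorchTrainer:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    ...  # stand-in for the course's per-worker loop; it would read config["batch_size"]

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32},    # smaller batches -> less RAM per step
    scaling_config=ScalingConfig(
        num_workers=1,                       # one worker = one model copy in memory
        use_gpu=False,
        resources_per_worker={"CPU": 4},     # CPUs reserved per worker; not a RAM limit
    ),
)
result = trainer.fit()
```

On an 8 GB machine, one worker with a small batch size is usually the safest combination, since adding a second worker duplicates the model in memory before it buys any throughput.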
@Koowah commented on GitHub (May 4, 2024):
@capmichal @totovivi I had the same reactions and questions.
After a bit of research and chatting with GPT, here's what I gathered:
So I guess the main driver of memory use is how much data is passed per iteration (batch_size * num_workers). Using fewer workers should nevertheless be better, since each worker has overhead and must load its own copy of the model.
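To make that concrete, a rough back-of-envelope sketch (all numbers are illustrative assumptions, not measurements of the course's model):

```python
# Illustrative estimate only: real usage also includes gradients, optimizer state,
# activations, Ray object store overhead, and everything else running on the machine.
num_workers = 2
batch_size = 256
bytes_per_sample = 512 * 4          # e.g. 512 float32 features per tokenized sample
model_params = 70_000_000           # assume a small ~70M-parameter model, fp32 weights
model_bytes = model_params * 4

per_worker = model_bytes + batch_size * bytes_per_sample   # model copy + one batch
total = num_workers * per_worker                           # each worker duplicates the model
print(f"~{total / 1e9:.2f} GB minimum, before gradients and optimizer state")
```

Under these assumptions, halving num_workers removes a whole model copy, while halving batch_size only removes half a batch's worth of data, which matches the observation that fewer workers tends to help more.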
@satyamnyati commented on GitHub (Nov 16, 2024):
What worked for me was reducing all of the training parameters, such as the batch size.