Stuck at NaN (Not a Number) while training your model?

Divyanshu Raj
2 min readDec 17, 2019


I am a beginner, and my custom ResNet18 model got stuck with a NaN loss while training on the CIFAR10 dataset. In this article I will discuss how to efficiently find the problematic layer(s) and rectify them, with an example.

First I am going to write a model:
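The original model gist is not reproduced here; below is a minimal sketch of just the tail of such a model, with the two mistakes described later in this article baked in (the class and layer names are my own, not the author's):

```python
import torch
import torch.nn as nn

class BuggyHead(nn.Module):
    """Illustrative sketch of the problematic final layers of the model.

    It concatenates max-pooled and average-pooled features, then applies
    BatchNorm + ReLU twice back to back, with no convolution in between.
    """
    def __init__(self, channels=512, num_classes=10):
        super().__init__()
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # Mistake 1: two BatchNorm + ReLU pairs back to back
        self.bn1 = nn.BatchNorm1d(2 * channels)
        self.bn2 = nn.BatchNorm1d(2 * channels)
        # Mistake 2: no convolution after the pooling concatenation
        self.fc = nn.Linear(2 * channels, num_classes)  # weight shape (1024, 10)

    def forward(self, x):
        x = torch.cat([self.maxpool(x), self.avgpool(x)], dim=1).flatten(1)
        x = torch.relu(self.bn1(x))
        x = torch.relu(self.bn2(x))  # BN-ReLU immediately after BN-ReLU
        return self.fc(x)
```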

It looks like a perfectly fine model, but when we try to train it, the loss starts coming out as NaN:
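The training loop itself was a standard one; here is a minimal sketch of that kind of loop (the tiny stand-in model, optimizer, and hyperparameters are assumptions so the sketch is self-contained):

```python
import torch
import torch.nn as nn

# Stand-in for the custom ResNet18 so this sketch runs on its own.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(3):  # a few fake CIFAR10-shaped batches
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    # With the buggy model, this is where `loss` showed up as nan.
    print(f"step {step} loss: {loss.item():.4f}")
```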

Now, what to do? Other articles about NaN losses say that the gradients may have exploded, or that batch normalization causes the issue, and so on. That is fine in theory, but it doesn't tell us what to change in our model so that it trains properly.

So, we will print the min, max, mean, and standard deviation of each layer's output and see if that helps us here. Thanks to Kinshuk Sarabhai for suggesting this.
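One way to collect these statistics (the original snippet is not shown here, so this is a sketch using PyTorch forward hooks; the function name is my own) is:

```python
import torch
import torch.nn as nn

def register_stat_hooks(model):
    """Print min/max/mean/std of every leaf layer's output on each forward pass."""
    handles = []
    for i, (name, module) in enumerate(model.named_modules()):
        if len(list(module.children())) > 0:
            continue  # skip container modules, keep leaf layers only
        def hook(mod, inp, out, idx=i, name=name):
            if isinstance(out, torch.Tensor):
                print(f"layer {idx:3d} {name:20s} "
                      f"min={out.min():9.3f} max={out.max():9.3f} "
                      f"mean={out.mean():9.3f} std={out.std():9.3f}")
        handles.append(module.register_forward_hook(hook))
    return handles  # call .remove() on each handle when done
```

Run one batch through the model with these hooks attached, and the layer whose range suddenly blows up stands out immediately.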

We observe that layer number 36 suddenly produces a suspiciously wide range of values, from -450 to 60, and everything starts to explode after that. The values gradually reach infinity and then NaN.

Layer 36 has shape (1024, 10), which means it is the last layer: we made some mistakes while declaring the final layers of the model.

Now that we can concentrate on the last layers only, these were the mistakes:

  1. In some layers we applied batch normalization and ReLU activation twice, back to back.
  2. After concatenating the outputs of the final max-pool and average-pool layers, we need convolution, batch normalization, and ReLU activation in sequence, but we left out the convolution.

Corrected Model:
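The corrected gist is likewise not reproduced here; a sketch of the fixed tail, with both mistakes addressed (class and layer names are illustrative, not the author's):

```python
import torch
import torch.nn as nn

class FixedHead(nn.Module):
    """Corrected final layers: a Conv -> BatchNorm -> ReLU sequence after
    concatenating the max-pool and average-pool outputs, applied once."""
    def __init__(self, channels=512, num_classes=10):
        super().__init__()
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # The previously missing convolution after the concatenation
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(2 * channels)
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, x):
        x = torch.cat([self.maxpool(x), self.avgpool(x)], dim=1)  # (N, 1024, 1, 1)
        x = torch.relu(self.bn(self.conv(x)))  # Conv -> BN -> ReLU, exactly once
        return self.fc(x.flatten(1))
```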

Now, when we train, we no longer see NaN.

Conclusion:

  1. A NaN loss is usually caused by layers that have been defined in the wrong way.
  2. Most probably a BN-ReLU pair immediately after another BN-ReLU pair.
  3. Try printing statistics (min, max, mean, standard deviation) of every layer's output to find the faulty layer, and then rectify it.


Divyanshu Raj

divraz.github.io · Connect with me on LinkedIn @divyanshu-raj. Pursuing an MS in Computer Science at ASU. I am interested in ML and design.