ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data


I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).


I start by replacing the NaN values with some huge value corresponding to each column. Then I use LabelEncoder to convert the text data into numerical data. When I then try to apply OneHotEncoder to the categorical columns, I get the error below. I also tried passing the columns one at a time to the OneHotEncoder constructor, but I get the same error for every column.


Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?


I am using Python 3.6 with pandas and scikit-learn for data processing.



import pandas as pd

import matplotlib.pyplot as plt

import numpy as np


test_data = pd.read_csv('test.csv')

train_data = pd.read_csv('train.csv')


# Replacing Nan values here








x_train = train_data.iloc[:, :-1].values

y_train = train_data.iloc[:, 17].values


# =============================================================================

# from sklearn.preprocessing import Imputer

# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)

#[:, 15:17])

# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])

#[:, 12:13])

# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])

# =============================================================================



# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like 

# Country name, Purchased status will give trouble

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()

x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])

x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])

x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])

x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])

x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])

x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])

x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])



# =============================================================================

# import numpy as np

# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)

# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)

# np.isnan(x_train[:, 3]).any()

# =============================================================================



# =============================================================================

# from sklearn.preprocessing import StandardScaler

# sc_X = StandardScaler()

# x_train = sc_X.fit_transform(x_train)

# =============================================================================


onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])

x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.


Traceback (most recent call last):


  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>

    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.


  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/", line 2019, in fit_transform

    self.categorical_features, copy=True)


  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/", line 1809, in _transform_selected

    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)


  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/", line 453, in check_array



  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/", line 44, in _assert_all_finite

    " or a value too large for %r." % X.dtype)


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


Judging by the error, your dataset still contains NaN values. I don't know how you handled the missing values, but the simplest fix is to drop every row that contains a null value:

df = df.dropna(how='any',axis=0)
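
As a rough sketch of what that call does, here is a toy example. The column names are borrowed from the question, but the values are invented, and `df_clean`, `rows_dropped` and `remaining_nulls` are names made up for illustration. It assumes the drop is applied to the DataFrame before `.values` is used to build `x_train`.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only.
df = pd.DataFrame({
    "status": ["open", None, "closed", "open"],
    "hedge_value": [1.0, 2.0, np.nan, 4.0],
    "return": [0.1, 0.2, 0.3, 0.4],
})

rows_before = len(df)

# Drop every row that has at least one missing value in any column.
df_clean = df.dropna(how="any", axis=0)

# Check how much data was lost and that no nulls remain.
rows_dropped = rows_before - len(df_clean)
remaining_nulls = int(df_clean.isnull().sum().sum())
print(rows_dropped, remaining_nulls)
```

Dropping rows throws data away, so it is worth checking how many rows were removed before going further.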


You may have missed some NaN values in your dataset. The following expression returns, for each column, whether it contains any null values:

pd.isnull(train_data).sum() > 0
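
To get just the names of the affected columns, something like the sketch below may help. Since the error message also mentions infinity, it checks the numeric columns for infinite values too, which `isnull` does not flag by default. The toy `train_data` here is invented, and the `sold` column is a hypothetical name, not one taken from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration only.
train_data = pd.DataFrame({
    "portfolio_id": ["PF1", "PF2", None],
    "hedge_value": [1.0, np.nan, 3.0],
    "sold": [10.0, np.inf, 30.0],  # hypothetical numeric column
    "return": [0.1, 0.2, 0.3],
})

# Columns that contain at least one NaN/None.
null_counts = train_data.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

# Numeric columns that contain at least one +/- infinity.
numeric = train_data.select_dtypes(include=[np.number])
inf_mask = np.isinf(numeric.values).any(axis=0)
cols_with_inf = numeric.columns[inf_mask].tolist()

print(cols_with_nulls)
print(cols_with_inf)
```

Running the same two checks on the real `train_data` just before the encoding step should show which columns still need attention.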


