ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data


I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).

 

I start by replacing the NaN values with a placeholder value specific to each column. Then I apply LabelEncoder to convert the text data into numerical data. Now, when I try to apply OneHotEncoder to the categorical columns, I get the error below. I also tried passing the columns one at a time to OneHotEncoder, but I get the same error for every column.

 

Basically, my end goal is to predict the return values, but I am stuck at the data preprocessing step because of this error. How do I solve this issue?

 

I am using Python 3.6 with pandas and scikit-learn for data processing.

 

Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing NaN values here
train_data['status'] = train_data['status'].fillna(2.0)
train_data['hedge_value'] = train_data['hedge_value'].fillna(2.0)
train_data['indicator_code'] = train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id'] = train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id'] = train_data['desk_id'].fillna('DSK99999999')
train_data['office_id'] = train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
#
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================

# Encoding categorical data, i.e. text data, since calculation happens on numbers only,
# so having text like country name or purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])

# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size, 1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================

# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0, 1, 2, 3, 6, 8, 14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error

Traceback (most recent call last):

  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

 



Answers

Looking at the error you got, I think you still have NaN values in your dataset. I don't know exactly how you handled the NaN values, but you can simply drop the rows that contain null values:



df = df.dropna(how='any',axis=0)
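Keep in mind that dropna(how='any') removes every row that has at least one missing value, so you can lose a lot of data if NaNs are widespread; filling them (as you already do for some columns) is the alternative. Whichever approach you take, remember to apply the same treatment to the test set.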

 

You might have missed some NaN values elsewhere in the dataset. The following command shows, for each column, whether it still contains null values:



pd.isnull(train_data).sum() > 0
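If that check shows columns you did not cover with fillna(), one option is to mean-fill the remaining numeric columns before encoding. This is only a minimal sketch of my suggestion (roughly what your commented-out Imputer block was attempting), so adjust the fill strategy to whatever makes sense for your data:

# Columns that still contain NaN after the fillna() calls
missing_cols = train_data.columns[train_data.isnull().any()].tolist()
print(missing_cols)

# Mean-fill any remaining numeric columns that still have NaN
for col in missing_cols:
    if pd.api.types.is_numeric_dtype(train_data[col]):
        train_data[col] = train_data[col].fillna(train_data[col].mean())

# Verify nothing is left before building x_train and encoding
assert not train_data.isnull().any().any()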

 
 
