Using Plotly
For Self_Employed, there are 500 "No" values and 82 "Yes" values. I can either fill the null values with the mode, or inspect the statistics of the rows with missing values and judge which class they resemble more. Given these counts, the missing values are more likely to be "No", so I replace them with "No".
df['Self_Employed'] = df['Self_Employed'].fillna('No')
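The mode-based alternative mentioned above could look roughly like the sketch below. It runs on a small made-up frame (`toy` and its values are illustrative assumptions, not the loan data), so it only shows the mechanics of counting the categories and filling with the most frequent one.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({"Self_Employed": ["No", "No", "Yes", None, "No", None]})

# Count each category, including the missing values.
counts = toy["Self_Employed"].value_counts(dropna=False)
print(counts)

# mode() ignores missing values by default and returns the most frequent category.
most_common = toy["Self_Employed"].mode()[0]

# Fill the missing entries with that category.
toy["Self_Employed"] = toy["Self_Employed"].fillna(most_common)
```

Using the mode instead of a hard-coded "No" keeps the fill value tied to the data, which helps if the class balance changes.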
Only 13 records have a null Gender, and their values are unusually high and quite different from those of the two genders, so I decide to drop them. Alternatively, I could have filled them with the mode here too.
df = df.dropna(subset=['Gender'])
Since the Dependents column contains the string '3+', I replace it with 3 and then convert the column to int:
df['Dependents'] = df['Dependents'].replace('3+', 3)
df['Dependents'] = df['Dependents'].astype('int')
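Note that `astype('int')` raises an error if the column still contains missing values. A hedged sketch of a variant that tolerates them, using pandas' nullable `Int64` dtype, is shown below; the series `dep` is made-up data, and whether Dependents still has nulls at this point in the notebook is an assumption.

```python
import pandas as pd

# Made-up values standing in for the Dependents column, including one missing entry.
dep = pd.Series(["0", "1", "3+", None, "2"], name="Dependents")

# Replace the '3+' label with '3' so every non-null entry is a numeric string.
dep = dep.replace({"3+": "3"})

# to_numeric yields floats (NaN for the missing entry); the nullable Int64 dtype
# then stores integers while keeping the missing value as <NA>.
dep = pd.to_numeric(dep).astype("Int64")
print(dep)
```

If the nulls have already been filled or dropped, the plain `astype('int')` above is enough.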
from sklearn.preprocessing import LabelEncoder  # I'm using the LabelEncoder for my target
enc = LabelEncoder()
df['Loan_Status'] = enc.fit_transform(df['Loan_Status'])
enc_name_mapping = dict(zip(enc.classes_, enc.transform(enc.classes_)))
print(enc_name_mapping)  # dictionary mapping each original target label to its encoded value
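As a small illustration of how the encoder assigns codes, the sketch below fits a LabelEncoder on hypothetical labels (`"Y"` and `"N"` are assumptions; the actual Loan_Status values may differ) and then decodes them again with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels; the real Loan_Status values may differ.
labels = ["Y", "N", "Y", "Y"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, so each label's code is its position in classes_.
mapping = {str(label): int(code) for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
restored = enc.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original labels later.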
categorical_features = df[['Gender', 'Married', 'Education', 'Self_Employed',
                           'Property_Area']]  # categorical features, without the target
for col in categorical_features:
    print(df[col].unique())
['Male' 'Female']
['No' 'Yes']
['Graduate' 'Not Graduate']
['No' 'Yes']
['Urban' 'Rural' 'Semiurban']
I'll use map to convert the categorical values into numerical ones:
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
I'll also save the mappings as dictionaries so I have a legend:
Gender = {'Male': 0, 'Female': 1}
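One way to keep the mapping and the legend in sync is to store every dictionary in one place and apply them in a loop. The sketch below does this on a toy frame; only the Gender codes come from the text above, while the codes for the other columns and the names `toy` and `mappings` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows using the category values printed above (not the real data).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["No", "Yes", "Yes"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Self_Employed": ["No", "Yes", "No"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# One dictionary per column; it doubles as the legend.
mappings = {
    "Gender": {"Male": 0, "Female": 1},                          # from the text
    "Married": {"No": 0, "Yes": 1},                              # assumed codes
    "Education": {"Graduate": 0, "Not Graduate": 1},             # assumed codes
    "Self_Employed": {"No": 0, "Yes": 1},                        # assumed codes
    "Property_Area": {"Urban": 0, "Rural": 1, "Semiurban": 2},   # assumed codes
}

for col, mapping in mappings.items():
    # map() silently turns unknown categories into NaN, so check for them first.
    unmapped = set(toy[col].dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"{col} has unmapped values: {unmapped}")
    toy[col] = toy[col].map(mapping)

print(toy)
```

The explicit check matters because a typo in a dictionary key would otherwise reintroduce null values without any warning.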