Agriculture Dataset
Practice - Cleaning dataset
CleanAgri_with_Pandas
In [ ]:
import numpy as np
import pandas as pd
In [ ]:
df = pd.read_csv('/content/agriculture_ds.csv')
In [ ]:
df.head()
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1
2 Andaman and Nicobar Islands NICOBARS 2000 Kharif Rice 102.0 321
3 Andaman and Nicobar Islands NICOBARS 2000 Whole Year Banana 176.0 641
4 Andaman and Nicobar Islands NICOBARS 2000 Whole Year Cashewnut 720.0 165

display only null values in any column in a dataframe

In [ ]:
df[df.isna().any(axis=1)]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
38263 Bihar NAWADA 1997 Rabi NaN NaN NaN

Display if state_name has any null values

In [ ]:
df[df['State_Name'].isnull()]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
In [ ]:
df[df['District_Name'].isnull()]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
In [ ]:
df[df['Crop_Year'].isnull()]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production

Display, a column for a particular value in that column

In [ ]:
df[df['State_Name'] == "Andhra Pradesh"]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
203 Andhra Pradesh ANANTAPUR 1997 Kharif Arhar/Tur 21400.0 2600
204 Andhra Pradesh ANANTAPUR 1997 Kharif Bajra 1400.0 500
205 Andhra Pradesh ANANTAPUR 1997 Kharif Castor seed 1000.0 100
206 Andhra Pradesh ANANTAPUR 1997 Kharif Cotton(lint) 7300.0 9400
207 Andhra Pradesh ANANTAPUR 1997 Kharif Dry chillies 3700.0 7100
... ... ... ... ... ... ... ...
9826 Andhra Pradesh WEST GODAVARI 2014 Rabi Tobacco 28046.0 89107
9827 Andhra Pradesh WEST GODAVARI 2014 Rabi Tomato 224.0 3035
9828 Andhra Pradesh WEST GODAVARI 2014 Rabi Urad 5418.0 4833
9829 Andhra Pradesh WEST GODAVARI 2014 Whole Year Coconut 21729.0 718991000
9830 Andhra Pradesh WEST GODAVARI 2014 Whole Year Rice 409286.0 1624324

9628 rows × 7 columns

In [ ]:
df[df['State_Name'] == "Andhra Pradesh"].count()
Out[ ]:
State_Name       9628
District_Name    9628
Crop_Year        9628
Season           9628
Crop             9628
Area             9628
Production       9628
dtype: int64

apply condition on two different columns in dataframe

In [ ]:
df[(df['State_Name'] == "Bihar") & (df['Season'] == "Kharif")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Crop'] == 'Bajra')]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
204 Andhra Pradesh ANANTAPUR 1997 Kharif Bajra 1400.0 500
232 Andhra Pradesh ANANTAPUR 1998 Kharif Bajra 2600.0 1900
258 Andhra Pradesh ANANTAPUR 1999 Kharif Bajra 1635.0 889
271 Andhra Pradesh ANANTAPUR 1999 Rabi Bajra 99.0 54
289 Andhra Pradesh ANANTAPUR 2000 Kharif Bajra 2477.0 1449
... ... ... ... ... ... ... ...
9173 Andhra Pradesh VIZIANAGARAM 2013 Rabi Bajra 3.0 3
9199 Andhra Pradesh VIZIANAGARAM 2014 Kharif Bajra 132.0 95
9433 Andhra Pradesh WEST GODAVARI 2003 Kharif Bajra 75.0 102
9480 Andhra Pradesh WEST GODAVARI 2004 Kharif Bajra 65.0 71
9511 Andhra Pradesh WEST GODAVARI 2005 Kharif Bajra 40.0 48

310 rows × 7 columns

In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") | (df['Season'] == 'Kharif')]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
203 Andhra Pradesh ANANTAPUR 1997 Kharif Arhar/Tur 21400.0 2600
204 Andhra Pradesh ANANTAPUR 1997 Kharif Bajra 1400.0 500
205 Andhra Pradesh ANANTAPUR 1997 Kharif Castor seed 1000.0 100
206 Andhra Pradesh ANANTAPUR 1997 Kharif Cotton(lint) 7300.0 9400
207 Andhra Pradesh ANANTAPUR 1997 Kharif Dry chillies 3700.0 7100
... ... ... ... ... ... ... ...
9826 Andhra Pradesh WEST GODAVARI 2014 Rabi Tobacco 28046.0 89107
9827 Andhra Pradesh WEST GODAVARI 2014 Rabi Tomato 224.0 3035
9828 Andhra Pradesh WEST GODAVARI 2014 Rabi Urad 5418.0 4833
9829 Andhra Pradesh WEST GODAVARI 2014 Whole Year Coconut 21729.0 718991000
9830 Andhra Pradesh WEST GODAVARI 2014 Whole Year Rice 409286.0 1624324

9628 rows × 7 columns

though we have Rabi season for the selected state, it doesnt display because the actual lenght of Rabi is 4 chars but in dataframe it is 11 chars

In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Season'] == "Rabi")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Season'] == "Kharif")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Season Len
In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Season'] == "Whole Year")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Season Len

Strip or remove all traling spaces from a string in a particular column

In [ ]:
df['Season'] = df['Season'].str.strip()
In [ ]:
df['Season Len'] = df['Season'].str.len()
In [ ]:
df.head()
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Season Len
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000 6
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1 6
2 Andaman and Nicobar Islands NICOBARS 2000 Kharif Rice 102.0 321 6
3 Andaman and Nicobar Islands NICOBARS 2000 Whole Year Banana 176.0 641 10
4 Andaman and Nicobar Islands NICOBARS 2000 Whole Year Cashewnut 720.0 165 10

now the condition works - after stripping the trailing spaces in the Season Column

In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Season'] == "Kharif")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Season Len
203 Andhra Pradesh ANANTAPUR 1997 Kharif Arhar/Tur 21400.0 2600 6
204 Andhra Pradesh ANANTAPUR 1997 Kharif Bajra 1400.0 500 6
205 Andhra Pradesh ANANTAPUR 1997 Kharif Castor seed 1000.0 100 6
206 Andhra Pradesh ANANTAPUR 1997 Kharif Cotton(lint) 7300.0 9400 6
207 Andhra Pradesh ANANTAPUR 1997 Kharif Dry chillies 3700.0 7100 6
... ... ... ... ... ... ... ... ...
9808 Andhra Pradesh WEST GODAVARI 2014 Kharif Small millets 10.0 5 6
9809 Andhra Pradesh WEST GODAVARI 2014 Kharif Sugarcane 19090.0 1732242 6
9810 Andhra Pradesh WEST GODAVARI 2014 Kharif Tomato 212.0 3048 6
9811 Andhra Pradesh WEST GODAVARI 2014 Kharif Turmeric 197.0 1418 6
9812 Andhra Pradesh WEST GODAVARI 2014 Kharif Urad 3095.0 3172 6

4232 rows × 8 columns

In [ ]:
df[(df['State_Name'] == "Andhra Pradesh") & (df['Season'] == "Rabi")]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Season Len
220 Andhra Pradesh ANANTAPUR 1997 Rabi Dry chillies 100.0 100 4
221 Andhra Pradesh ANANTAPUR 1997 Rabi Gram 28000.0 14800 4
222 Andhra Pradesh ANANTAPUR 1997 Rabi Groundnut 20200.0 21700 4
223 Andhra Pradesh ANANTAPUR 1997 Rabi Horse-gram 600.0 200 4
224 Andhra Pradesh ANANTAPUR 1997 Rabi Jowar 18800.0 9400 4
... ... ... ... ... ... ... ... ...
9824 Andhra Pradesh WEST GODAVARI 2014 Rabi Sesamum 727.0 137 4
9825 Andhra Pradesh WEST GODAVARI 2014 Rabi Sunflower 335.0 323 4
9826 Andhra Pradesh WEST GODAVARI 2014 Rabi Tobacco 28046.0 89107 4
9827 Andhra Pradesh WEST GODAVARI 2014 Rabi Tomato 224.0 3035 4
9828 Andhra Pradesh WEST GODAVARI 2014 Rabi Urad 5418.0 4833 4

3516 rows × 8 columns

In [ ]:
df[['State_Name', 'Season', 'Season Len']]
Out[ ]:
State_Name Season Season Len
0 Andaman and Nicobar Islands Kharif 11
1 Andaman and Nicobar Islands Kharif 11
2 Andaman and Nicobar Islands Kharif 11
3 Andaman and Nicobar Islands Whole Year 11
4 Andaman and Nicobar Islands Whole Year 11
... ... ... ...
38259 Bihar Rabi 11
38260 Bihar Rabi 11
38261 Bihar Rabi 11
38262 Bihar Rabi 11
38263 Bihar Rabi 9

38264 rows × 3 columns

In [ ]:
df.drop('Season Len', axis=1, inplace=True)
In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1
In [ ]:
df['Crop Len'] = df['Crop'].str.len()
In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Crop Len
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000 8.0
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1 19.0
In [ ]:
df['State_Name len'] = df['State_Name'].str.len()
In [ ]:
df.head(3)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Crop Len State_Name len
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000 8.0 27
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1 19.0 27
2 Andaman and Nicobar Islands NICOBARS 2000 Kharif Rice 102.0 321 4.0 27
In [ ]:
df[['State_Name', 'State_Name len']]
Out[ ]:
State_Name State_Name len
0 Andaman and Nicobar Islands 27
1 Andaman and Nicobar Islands 27
2 Andaman and Nicobar Islands 27
3 Andaman and Nicobar Islands 27
4 Andaman and Nicobar Islands 27
... ... ...
38259 Bihar 5
38260 Bihar 5
38261 Bihar 5
38262 Bihar 5
38263 Bihar 5

38264 rows × 2 columns

In [ ]:
df['Dist Len'] = df['District_Name'].str.len()
In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Crop Len State_Name len Dist Len
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000 8.0 27 8
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1 19.0 27 8
In [ ]:
df[['District_Name', 'Dist Len']]
Out[ ]:
District_Name Dist Len
0 NICOBARS 8
1 NICOBARS 8
2 NICOBARS 8
3 NICOBARS 8
4 NICOBARS 8
... ... ...
38259 NAWADA 6
38260 NAWADA 6
38261 NAWADA 6
38262 NAWADA 6
38263 NAWADA 6

38264 rows × 2 columns

In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production Crop Len State_Name len Dist Len
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000 8.0 27 8
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1 19.0 27 8
In [ ]:
df.drop(['State_Name len', 'Dist Len', 'Crop Len'], inplace = True, axis = 1)

now, the length of each data value in each column is equal to the value space - extra traling spaces are removed

In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
0 Andaman and Nicobar Islands NICOBARS 2000 Kharif Arecanut 1254.0 2000
1 Andaman and Nicobar Islands NICOBARS 2000 Kharif Other Kharif pulses 2.0 1
In [ ]:
df[df['Production'] == '=']
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
623 Andhra Pradesh ANANTAPUR 2007 Kharif Moong(Green Gram) 1000.0 =
630 Andhra Pradesh ANANTAPUR 2007 Rabi Horse-gram 1000.0 =
698 Andhra Pradesh ANANTAPUR 2009 Rabi Rapeseed &Mustard 8.0 =
723 Andhra Pradesh ANANTAPUR 2010 Kharif Other Kharif pulses 1.0 =
1153 Andhra Pradesh CHITTOOR 2001 Rabi Wheat 4.0 =
... ... ... ... ... ... ... ...
31799 Bihar GAYA 2010 Kharif Bajra 1.0 =
32109 Bihar GOPALGANJ 2002 Kharif Moong(Green Gram) 5.0 =
32681 Bihar JAMUI 2007 Autumn Rice 7.0 =
33085 Bihar JEHANABAD 2005 Kharif Other Kharif pulses 1.0 =
35648 Bihar LAKHISARAI 2010 Rabi Rapeseed &Mustard 2478.0 =

82 rows × 7 columns

In [ ]:
df.sort_values('Production', inplace=True, ascending=False)
In [ ]:
df.head()
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
3667 Andhra Pradesh KADAPA 2007 Rabi Horse-gram 1000.0 =
2352 Andhra Pradesh EAST GODAVARI 2011 Kharif Niger seed 1.0 =
1423 Andhra Pradesh CHITTOOR 2007 Kharif Small millets 1000.0 =
2043 Andhra Pradesh EAST GODAVARI 2003 Rabi other oilseeds 2070.0 =
4047 Andhra Pradesh KADAPA 2014 Rabi Soyabean 12.0 =
In [ ]:
df[df['Production'] == "="]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
3667 Andhra Pradesh KADAPA 2007 Rabi Horse-gram 1000.0 =
2352 Andhra Pradesh EAST GODAVARI 2011 Kharif Niger seed 1.0 =
1423 Andhra Pradesh CHITTOOR 2007 Kharif Small millets 1000.0 =
2043 Andhra Pradesh EAST GODAVARI 2003 Rabi other oilseeds 2070.0 =
4047 Andhra Pradesh KADAPA 2014 Rabi Soyabean 12.0 =
... ... ... ... ... ... ... ...
32681 Bihar JAMUI 2007 Autumn Rice 7.0 =
28181 Bihar AURANGABAD 2008 Kharif Urad 1.0 =
623 Andhra Pradesh ANANTAPUR 2007 Kharif Moong(Green Gram) 1000.0 =
7044 Andhra Pradesh SRIKAKULAM 1998 Kharif Other Kharif pulses 100.0 =
16241 Assam DHUBRI 2005 Kharif Cotton(lint) 1.0 =

82 rows × 7 columns

In [ ]:
# index_names = df[ df['Production'] == 21 ].index
index_names = df[ df['Production'] == "=" ].index
df.drop(index_names, inplace = True)
In [ ]:
df[df['Production'] == "="]
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
4351 Andhra Pradesh KRISHNA 2005 Kharif Moong(Green Gram) 12334.0 9991
21402 Assam KARBI ANGLONG 2013 Kharif Papaya 571.0 9991
In [ ]:
df
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
4351 Andhra Pradesh KRISHNA 2005 Kharif Moong(Green Gram) 12334.0 9991
21402 Assam KARBI ANGLONG 2013 Kharif Papaya 571.0 9991
2592 Andhra Pradesh GUNTUR 1998 Rabi Rice 29100.0 99900
181 Andaman and Nicobar Islands SOUTH ANDAMANS 2006 Whole Year Sugarcane 53.5 999.5
11244 Arunachal Pradesh PAPUM PARE 1998 Whole Year Sugarcane 107.0 999
... ... ... ... ... ... ... ...
6532 Andhra Pradesh SPSR NELLORE 2002 Whole Year Other Vegetables 525.0 0
7202 Andhra Pradesh SRIKAKULAM 2002 Whole Year Bottle Gourd 45.0 0
7204 Andhra Pradesh SRIKAKULAM 2002 Whole Year Cabbage 242.0 0
5818 Andhra Pradesh PRAKASAM 2003 Whole Year Cucumber 331.0 0
38263 Bihar NAWADA 1997 Rabi NaN NaN NaN

38182 rows × 7 columns

In [ ]:
df.dropna(inplace = True)
In [ ]:
df
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
4351 Andhra Pradesh KRISHNA 2005 Kharif Moong(Green Gram) 12334.0 9991
21402 Assam KARBI ANGLONG 2013 Kharif Papaya 571.0 9991
2592 Andhra Pradesh GUNTUR 1998 Rabi Rice 29100.0 99900
181 Andaman and Nicobar Islands SOUTH ANDAMANS 2006 Whole Year Sugarcane 53.5 999.5
11244 Arunachal Pradesh PAPUM PARE 1998 Whole Year Sugarcane 107.0 999
... ... ... ... ... ... ... ...
6528 Andhra Pradesh SPSR NELLORE 2002 Whole Year Cucumber 85.0 0
6532 Andhra Pradesh SPSR NELLORE 2002 Whole Year Other Vegetables 525.0 0
7202 Andhra Pradesh SRIKAKULAM 2002 Whole Year Bottle Gourd 45.0 0
7204 Andhra Pradesh SRIKAKULAM 2002 Whole Year Cabbage 242.0 0
5818 Andhra Pradesh PRAKASAM 2003 Whole Year Cucumber 331.0 0

38181 rows × 7 columns

convert Crop_year col to string so thta it may not sum up on group functions

In [ ]:
df['Crop_Year'] = df['Crop_Year'].astype(str)
In [ ]:
df.head(2)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
4351 Andhra Pradesh KRISHNA 2005 Kharif Moong(Green Gram) 12334.0 9991
21402 Assam KARBI ANGLONG 2013 Kharif Papaya 571.0 9991
In [ ]:
df.dtypes
Out[ ]:
State_Name        object
District_Name     object
Crop_Year         object
Season            object
Crop              object
Area             float64
Production        object
dtype: object

convert production col to float datatype

In [ ]:
df['Production'] = pd.to_numeric(df['Production'])
In [ ]:
df.dtypes
Out[ ]:
State_Name        object
District_Name     object
Crop_Year         object
Season            object
Crop              object
Area             float64
Production       float64
dtype: object
In [ ]:
df.to_csv('/content/clean_agri_ds.csv')
In [ ]:
df.describe()
Out[ ]:
Area Production
count 38181.000000 3.818100e+04
mean 7285.304866 5.326191e+05
std 27716.183577 1.407250e+07
min 0.200000 0.000000e+00
25% 88.000000 1.070000e+02
50% 498.000000 7.810000e+02
75% 2698.000000 6.515000e+03
max 877029.000000 7.801620e+08
In [ ]:
df.head(3)
Out[ ]:
State_Name District_Name Crop_Year Season Crop Area Production
4351 Andhra Pradesh KRISHNA 2005 Kharif Moong(Green Gram) 12334.0 9991.0
21402 Assam KARBI ANGLONG 2013 Kharif Papaya 571.0 9991.0
2592 Andhra Pradesh GUNTUR 1998 Rabi Rice 29100.0 99900.0

End Cell

In [ ]: