Python - Python Advanced - Pandas Tutorial

Pandas is an open-source Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

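Why pandas?
A NumPy array holds homogeneous data, whereas pandas can hold columns of different data types in one table.
It allows you to analyze large data sets and draw conclusions based on statistical theory. It is mostly used in data science.
It helps you clean messy data and supports basic visualization.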

Data Structures in pandas
1] Series - 1D - A Pandas Series is like a column in a table.
2] DataFrame - 2D - A Pandas DataFrame is a two-dimensional data structure, like a table with rows and columns.

Installation-

pip install pandas

Importing Pandas-

import pandas as pd

as - creates an alias, so pandas can be referred to by the short name pd

1] Series

How to create a series?
i] List
ii] Array
iii] Dictionary

i] List

# list

names = ['sam' , 'raman' , 'bittu', 'Dhanraj']
pd.Series(names)

Output-

0        sam
1      raman
2      bittu
3    Dhanraj
dtype: object

df = pd.Series(names)
print(df.shape)
print(df.ndim)

Output-

(4,)
1

To change the index from numbers to letters-

pd.Series(names, index = ['a','b','c','d'])

Output-

a        sam
b      raman
c      bittu
d    Dhanraj
dtype: object

# the length of index must always equal the length of the data
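The length rule above can be checked directly. This is a minimal sketch: the three-label index is a deliberately wrong input chosen for illustration, and the variable name mismatch_error is not part of the tutorial.

```python
import pandas as pd

names = ['sam', 'raman', 'bittu', 'Dhanraj']

# Four values but only three labels: pandas refuses to build the Series.
try:
    pd.Series(names, index=['a', 'b', 'c'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
```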

ii] Array

import numpy as np

pd.Series(np.arange(10,15))

Output-

0    10
1    11
2    12
3    13
4    14
dtype: int64

pd.Series(np.arange(10,15) , index = [100,200,300,400,500])

Output-

100    10
200    11
300    12
400    13
500    14
dtype: int64

iii] Dictionary

student = {'name1':'sam' , 'name2':'saurabh' , 'name3': 'Dhanraj' , 'name4':'Ramesh'}
pd.Series(student)

Output-

name1        sam
name2    saurabh
name3    Dhanraj
name4     Ramesh
dtype: object

With a dictionary, the keys become the index. Passing index as well does not rename the keys: pandas looks up each index label among the keys, and any label that is not a key gets the value NaN.

pd.Series(student, index = [0,1,2,'name1'])

Output-

0        NaN
1        NaN
2        NaN
name1    sam
dtype: object
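If the goal really is to relabel a Series built from a dictionary, one possible approach is to build it first and change the labels afterwards. This is a sketch only; the names selected, relabeled and renamed are illustrative and not part of the tutorial.

```python
import pandas as pd

student = {'name1': 'sam', 'name2': 'saurabh', 'name3': 'Dhanraj', 'name4': 'Ramesh'}
s = pd.Series(student)

# Passing index with a dict selects (and reorders) existing keys.
selected = pd.Series(student, index=['name3', 'name1'])

# Replace every label at once by assigning a new index of the same length.
relabeled = s.copy()
relabeled.index = [0, 1, 2, 3]

# Rename only some labels with a mapping of old label -> new label.
renamed = s.rename({'name1': 'first'})

print(selected)
print(relabeled)
print(renamed)
```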

Indexing

student = {'name1':'sam' , 'name2':'saurabh' , 'name3': 'Dhanraj' , 'name4':'Ramesh'}
df = pd.Series(student)
df

Output-

name1        sam
name2    saurabh
name3    Dhanraj
name4     Ramesh
dtype: object

- using the custom indexing

df['name2']

Output-

'saurabh'

- using the zero indexing

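# df[1] also works on older pandas, but integer positions inside [] are deprecated for a labelled Series
df.iloc[1]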

Output-

'saurabh'

Slicing
- using custom indexing: start - inclusive, stop - inclusive
- using zero indexing: start - inclusive, stop - exclusive

- Custom Indexing

student = {'name1':'sam' , 'name2':'saurabh' , 'name3': 'Dhanraj' , 'name4':'Ramesh'}
df = pd.Series(student)
df['name1':'name3':2]

Output-

name1        sam
name3    Dhanraj
dtype: object

- Zero Indexing

student = {'name1':'sam' , 'name2':'saurabh' , 'name3': 'Dhanraj' , 'name4':'Ramesh'}
df = pd.Series(student)
df[0:3:2]

Output-

name1        sam
name3    Dhanraj
dtype: object
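The inclusive/exclusive difference is easier to see when the same bounds are written with the explicit accessors loc (labels) and iloc (positions). A small sketch using the same student data; the variable names are illustrative.

```python
import pandas as pd

student = {'name1': 'sam', 'name2': 'saurabh', 'name3': 'Dhanraj', 'name4': 'Ramesh'}
s = pd.Series(student)

# Label slice: the stop label 'name3' is included.
by_label = s.loc['name1':'name3']

# Position slice: the stop position 2 is excluded.
by_position = s.iloc[0:2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```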

Create a Series in which the country is the index and the currency is the value.

brics_country = ['Brazil', 'Russia', 'India', 'China', 'South Africa']
brics_currency = ['Real', 'Ruble', 'Rupee', 'Renminbi', 'Rand' ]

df = pd.Series(brics_currency, index=brics_country)
df

Output-

Brazil              Real
Russia             Ruble
India              Rupee
China           Renminbi
South Africa        Rand
dtype: object

df.index

Output-

Index(['Brazil', 'Russia', 'India', 'China', 'South Africa'], dtype='object')

df.values

Output-

array(['Real', 'Ruble', 'Rupee', 'Renminbi', 'Rand'], dtype=object)
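Since index and values are both available as attributes, the opposite orientation (currency as index, country as value) can be built from them. A minimal sketch; by_country and by_currency are illustrative names.

```python
import pandas as pd

brics_country = ['Brazil', 'Russia', 'India', 'China', 'South Africa']
brics_currency = ['Real', 'Ruble', 'Rupee', 'Renminbi', 'Rand']

by_country = pd.Series(brics_currency, index=brics_country)

# Swap the roles: the old values become the index, the old index becomes the values.
by_currency = pd.Series(by_country.index, index=by_country.values)

print(by_country['India'])   # Rupee
print(by_currency['Rand'])   # South Africa
```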

2] DataFrame
i] Using a file-

#csv

df = pd.read_csv('datasets/ted_data.csv')
df

Output-

#xlsx

df = pd.read_excel('datasets/football_worldcup.xlsx')
df

Output-

ii] Create a DataFrame using a dictionary-

df_dict = {'Year' : [1990, 1994, 1998, 2002],
           'Country' : ['Italy', 'USA', 'France', 'Japan'],
           'Winner' : ['Germany', 'Brazil', 'France', 'Brazil'],
           'GoalScored' : [115, 141, 171, 161]
          }

pd.DataFrame(df_dict)

Output-

In a dictionary of lists, all the lists must be the same length. Using Series as the values avoids this: pandas aligns them on the index and fills the missing entries with NaN.
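A short sketch of what happens when the lengths differ. The lists are cut-down versions of the World Cup data above, and the names length_error and padded are illustrative.

```python
import pandas as pd

winners = ['Germany', 'Brazil', 'France']   # 3 values
goals = [115, 141, 171, 161]                # 4 values

# Plain lists of unequal length are rejected.
try:
    pd.DataFrame({'Winner': winners, 'GoalScored': goals})
    length_error = None
except ValueError as exc:
    length_error = exc

# Wrapped in Series, the shorter column is padded with NaN.
padded = pd.DataFrame({'Winner': pd.Series(winners), 'GoalScored': pd.Series(goals)})

print(type(length_error).__name__)
print(padded)
```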

iii] Create a DataFrame using Series

df_dict_udindex = {'Winner' : pd.Series(['Germany', 'Brazil', 'France']),
                   'GoalScored' : pd.Series([115, 141, 171, 161]),
                  }

pd.DataFrame(df_dict_udindex)

Output-

iv] Create a DataFrame using a list of dictionaries

df_lodict = [
             {'year' : 2002, 'HostCountry' : 'Japan', 'Winner' : 'Brazil'},
             {'year' : 2006, 'HostCountry' : 'Germany', 'Winner' : 'Italy'},
             {'year' : 2010, 'HostCountry' : 'South Africa', 'Winner' : 'Spain'},
             {'year' : 2014, 'HostCountry' : 'Brazil', 'Winner' : 'Germany'},
            ]

pd.DataFrame(df_lodict)

Output-

v] List of tuples

df_lotuples = [(2002, 'Japan', 'Brazil', 161), 
                (2006, 'Germany', 'Italy', 147), 
                (2010, 'South Africa', 'Spain', 145),
                (2014, 'Brazil', 'Germany', 171)
              ]

# To set the column names manually, pass them like this-
pd.DataFrame(df_lotuples , columns = ['a' , 'b' , 'c' , 'd'])

Output-

Basics of Pandas

weather_data = pd.read_csv('datasets/weather_data.csv')
weather_data

Output-

i] shape

weather_data.shape

Output-

(6, 4)

ii] head() - returns the top n rows - the default value of n is 5

weather_data.head(2)

Output-

weather_data.head()

Output-

iii] tail() - returns the bottom n rows - the default value of n is 5

weather_data.tail(2)

Output-

weather_data.tail()

Output-

iv] info()

weather_data.info()

Output-

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          6 non-null      object
 1   temperature  6 non-null      int64 
 2   windspeed    6 non-null      int64 
 3   event        6 non-null      object
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes

v] describe()

The describe() method calculates summary statistics such as count, mean, std and percentiles for the numerical columns of a Series or DataFrame.

df = pd.read_csv('datasets/weather_data.csv')
df.describe()

Output-

       temperature  windspeed
count  6.000000     6.000000
mean   30.333333    4.666667
std    3.829708     2.338090
min    24.000000    2.000000
25%    28.750000    2.500000
50%    31.500000    5.000000
75%    32.000000    6.750000
max    35.000000    7.000000
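Each row of this summary can also be computed on its own, which may help in reading the table. The sketch below uses a small made-up temperature column (not the weather_data.csv file) so that the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data, chosen only to keep the arithmetic simple.
df = pd.DataFrame({'temperature': [10, 20, 30, 40]})

summary = df.describe()

# The same figures, computed one at a time.
mean_value = df['temperature'].mean()          # (10 + 20 + 30 + 40) / 4
std_value = df['temperature'].std()            # sample standard deviation (ddof=1)
q25_value = df['temperature'].quantile(0.25)   # the 25% row, linear interpolation

print(summary)
print(mean_value, std_value, q25_value)
```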

df.describe(include = 'object')

Output-

             day event
count          6     6
unique         6     3
top     1/1/2017  Rain
freq           1     2

df.describe(include = 'all')

Output-

Slicing and indexing

df = pd.read_csv('datasets/citibike_tripdata.csv')
df

Output-

# Select a particular column - it returns a Series

df['starttime']

Output-

# Select 2 or more columns - it returns a DataFrame

df[['starttime', 'bikeid']]

Output-

Operations-

1. normal slicing

2. loc

3. iloc

1] Normal Slicing
Syntax -

df[start:stop:step] / df[start_row_name : stop_row_name : step]

- custom indexing
- zero indexing

- it can slice rows only

df[::2]

Output-

df[2:11]

Output-

df_new = df[0:6]
df_new

Output-

index = ['row1' , 'row2' , 'row3' , 'row4' , 'row5' , 'row6']
df_new.index = index
df_new

Output-

i] zero indexing

df_new[2::2]

Output-

ii] Custom indexing

df_new['row2'::2]

Output-

2] loc

- it can slice on both rows and columns
- syntax df.loc['start_row' : 'stop_row' : step , 'start_col' : 'stop_col' : step]
- it uses label-based (custom) indexing only; both start and stop are inclusive

df_new

Output-

df_new.loc[0::2]

Output-

TypeError: cannot do slice indexing on Index with these indexers [0] of type int

df_new_2 = df[0:6:]
df_new_2

Output-

df_new

Output-

df_new.loc['row1': 'row3' , ::2]

Output-

df_new.loc[::2,'starttime':'bikeid']

Output-

df_new.loc[['row1' , 'row3' , 'row4'], ['tripduration' , 'bikeid' , 'usertype']]

Output-

df

Output-

df.loc[df['tripduration']>400]

Output-

df['tripduration']>400

Output-

# give me all the rows whose usertype is 'Subscriber'

df.loc[df['usertype']=='Subscriber']

Output-

3] iloc
- it can slice on both rows and columns
- zero indexing (integer positions only); stop is exclusive
- syntax df.iloc[start:stop:step , start:stop:step]

df_new.iloc['row1':'row2']

Output-

TypeError: cannot do positional indexing on Index with these indexers [row1] of type str

df_new.iloc[1:3]

Output-

df_new

Output-

df_new.iloc[1:3 , 2:5]

Output-

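Summary-

normal slicing - rows only
- zero / custom indexing

loc - both rows and columns
- custom indexing
- filter with a condition

iloc - both rows and columns
- zero indexing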
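The three approaches can be compared side by side on a small frame. This is a sketch on made-up rows that reuse the citibike column names; the values and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the tutorial's data set.
trips = pd.DataFrame(
    {'tripduration': [300, 450, 500, 120],
     'bikeid': [11, 12, 13, 14],
     'usertype': ['Subscriber', 'Customer', 'Subscriber', 'Customer']},
    index=['row1', 'row2', 'row3', 'row4'],
)

plain = trips[0:2]                                             # rows only, stop exclusive
by_label = trips.loc['row1':'row3', 'tripduration':'bikeid']   # labels, stop inclusive
by_position = trips.iloc[0:3, 0:2]                             # positions, stop exclusive

# loc also accepts a boolean condition for the rows.
long_trips = trips.loc[trips['tripduration'] > 400, ['bikeid', 'usertype']]

print(plain)
print(by_label)
print(by_position)
print(long_trips)
```

Here by_label and by_position select the same cells: 'row3' is included by loc, while position 3 is excluded by iloc.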

Addition and Deletion in the data frame

df = pd.read_csv('datasets/imdb_rating.csv')
df

Output-

Adding a column-

df['actors'] = 'the rock'
df

Output-

df['actors'] = range(0,10)
df

Output-

Insert Column

df.insert(1,'actorss','abc')
df

Output-

df = pd.read_csv('datasets/imdb_rating.csv')
df

Output-

df.loc[df['duration']==152].index[0]

Output-

df.loc[df.loc[df['duration']==152].index[0] , 'duration'] = 200
df

Output-

Adding a row

df.loc[len(df)] = [9.8 , 'abc' , 'drama' , 120]
df

Output-

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, pd.DataFrame([{'star_rating' : 42 , 'title' : 'abc' , 'genre': 140 , 'duration': 200}])] , ignore_index = True)

Output-

Drop Column

df.drop(columns=['duration'] , inplace = True)
df

Output-

Drop Row

df.drop(index=0 , inplace = True)
df

Output-

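With inplace = True the frame itself is changed. Without it, drop() returns a new frame and leaves the original untouched, which is often easier to follow. A minimal sketch on made-up rows; the column names follow the imdb data, while the values and variable names are assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genre': ['Drama', 'Action', 'Drama'],
    'duration': [120, 95, 140],
})

# Without inplace, drop() returns a new frame; movies keeps all its columns.
trimmed = movies.drop(columns=['duration'])

# Dropping a row leaves a gap in the index; reset_index renumbers it from 0.
fewer_rows = trimmed.drop(index=0).reset_index(drop=True)

print(movies.columns.tolist())
print(trimmed.columns.tolist())
print(fewer_rows)
```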
Sorting

df.sort_values(by='star_rating')

Output-

df.sort_values(by='genre')

Output-

df.sort_values(by=['genre' , 'star_rating'])

Output-

df.sort_values(by=['genre' , 'star_rating'] , ascending=False)

Output-

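When sorting by several columns, ascending can also be a list with one value per column, so each column gets its own direction. A small sketch on made-up rows; the titles and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Drama', 'Action', 'Drama', 'Action'],
    'star_rating': [8.1, 7.5, 9.0, 8.8],
})

# genre in A-Z order, and within each genre the highest rating first
ranked = movies.sort_values(by=['genre', 'star_rating'], ascending=[True, False])

print(ranked)
```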