Python is the top-most used programming language in the world today and one use of Python is to analyze data. The Pandas package is one of the most popular data analysis libraries, providing high-performance, easy-to-use data structures, and data analysis tools. In this article, we will discuss the Pandas library with an overview of some of the useful functions that Pandas has to date or you can say this is a Pandas cheat sheet. We will be using some examples to show the reader how basic data analysis can be done within a few lines of Python code.
Brief Intro about Pandas:
- It is a Python library for data manipulation and analysis.
- It is open-source, freely available, and cross-platform.
- provides data structures and operations for manipulating numerical tables and time series to analyze.
- It allows the user to carry out operations on the data stored in the tables like filtering, sorting, grouping, merging, etc.
How to use pandas library:
Prerequisite:
- To use pandas you need to have python version 3.5 and above you can easily install the python in your system by going to https://www.python.org/downloads/, downloading and installing the software.
- Sometimes pandas are also dependent on another library like NumPy and many more you can install NumPyusing the following command in your command prompt or terminal.
pip3 install numpy
- Now you can install pandas in your system by using the following command in your command prompt or terminal.
pip3 install pandas
- Also, there is another software called Jupyter notebook that we need to use with the panda’s library on our Python project it makes the process of using pandas much more efficient.
Jupyter notebook: Jupyter notebook is a web-based interactive computing platform. it is open-source software that is run using the Python programming language. Jupyter notebook is commonly used by data scientists and other technical users who are looking to share their work, as it enables the ability to create and share documents that contain rich text, code, mathematics, plots, and other kinds of commentaries. You can install the Jupyter notebook using the following command in your terminal.
pip install jupyter notebook
For accessing the notebook you can use this command.
jupyter notebook
Setup
Importing the pandas and NumPy library
import pandas as pd
import numpy as p
Pandas Data Structure
Creating Series:
A series in pandas is just like any other series you might have come across in your day-to-day dealings. This is a one-dimensional labeled array that can hold any data type.
series = pd.Series([2, -8, 3, 9], index=['w', 'x', 'y', 'z'])
Creating Dataframes:
A data frame in pandas library is a two-dimensional labeled data structure where each row represents an observation.
df = pd.DataFrame(
{"words" : ['good' ,'better', 'best'],
"number" : [22, 33, 44],
"names" : ['one' ,'two', 'three']}, index = [1, 2, 3])
Read Data in Pandas:
To read the CSV file use the pd.read_csv function:
pd.read_csv('files.csv')
To see the first two lines you can use :
df.head(2)
To see the rows between certain index
df[3:5]
To see the last two lines you can use :
df.tail(2)
To see analysis of numerical columns:
df.describe()
To see only columns:
df.columns
To see specific columns :
df['Name']
In place of the name you can use any name of the columns that the CSV file contains
To see more than two columns at the same time:
df[['Name', 'HP']]
To see a specific number of names:
df['Name'][5:0]
You can add any number in place of 5 to get the specific number of names
To get the specific row:
df.iloc[2]
You can replace 2 with any number of rows.
To get multiple rows of data frames at the same time:
df.iloc[1:4]
Getting the specific value from the CSV file
df.iloc[5,1]
You can use the pandas iloc function to get specific values from a CSV file based on their row and column position.
To get all the info of data:
df.info()
Exporting the data frame in CSV files.
Export data into an excel file
To export the data into an excel file we need a simple function called to_excel.
df.to_excel('file.xlsx')
Export data into a text file
To export the data into a text file we need a simple function called to_csv.
df.to_csv('file.txt')
Export data into a CSV
To export the data into a CSV file we need a simple function called to_csv.
df.to_csv('file.csv')
Once you execute the Python script, the file file.csv will be located within the project’s root directory. If you want to not include the index numbers, just use:
df.to_csv('files.csv' , index=False)
Sorting Data
Sort by alphabetical name ascending:
df.sort_values(['Type 1'], ascending=False)
If you want to show data in descending form use true
Sort by alphabetical name with two values:
df.sort_values(['Type 1', 'HP'], ascending=[1,0])
Making changes to the data
Adding a new column to our data frame
df['total'] = 0
Adding a new column with the combined value of other data
df['Total'] = df['HP'] + df['Defense'] + df['Attack'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
You can select what column you want to add
To move the place of the column :
df = df[cols[0:4] + [cols[-1]]+cols[4:12]]
Add rand to perticuller value
df.rank()
Conditional Changes in data
For changing the specific value:
df.loc[df['Type 1'] == 'Rock', 'Type 1' ] = 'Fire'
Filtering Data
Rows with matching index values:
df.loc[df['Type 1'] =='Grass']
For getting multiple rows with macting index values:
df.loc[(df['Type 2'] == 'Ghost') | (df['Type 2'] == 'Dark')]
For getting column with lessthen or morethan provided number:
df.loc[df['HP'] > 150]
We can also save the new filtered data frame by creating a new variable and exporting it in that you can checkpoint your word and anytime you can access that data frame.
new_df = df.loc[df['HP'] > 150]
new_df.to_csv('new_file.csv')
For resetting index:
#for reseting the index
new_df.reset_index(drop=True, inplace=True)
For filtering the value with specific words:
df.loc[~df['Name'].str.contains('Mega')]
You can see in the output that every name has mega in them if you want to drop all the names with mega use an almost equal sign:
df.loc[~df['Name'].str.contains('Mega')]
Aggregate Statistics
Counting the number of records for each distinct value in a column:
df['count'] = 1
df.groupby('Type 2').count()['count']
Some other useful functions
#get number of rows and columns from a data frame
df.shape
#get all the columns of the data frame
df.columns
#get all index values of the data frame
df.index
#get min max index value
df.min()
df.max()
#get min max index value
df.idxmin()
df.idxmax()
#get sum of vlaues
df.sum()
#substract and add in all the values with a specific number
s.sub(2)
s.add(2)
#Multiply and Divide all the values with a specific number
s.mul(2)
s.div(2)
Here are some useful tutorials that you can read: