As the name suggests, Big Data deals with massive amounts of data, stored and processed in a way that produces meaningful insights. Big Data Analytics is the practice of structuring computation and storage so that such large volumes of data can be processed. This can be accomplished with one or more tools, but it takes some expertise to use them effectively for analytics. There are many Python libraries and packages for Big Data, but one that is especially popular with data scientists who need to process enormous datasets within their applications is PySpark.
Prerequisites:
PySpark:
PySpark is an interface for Apache Spark in Python. Spark is a cluster computing framework built around speed, ease of use, and sophisticated analytics on big data. Recently, Spark has become a major player in the field of big data. Companies large and small are turning to it to analyze their data. They've chosen Spark because it simply works: it's fast, it's easy to use, and it offers a wide range of capabilities.
To install PySpark, run the following command in your terminal or command prompt:
pip install pyspark
Notebook:
To run PySpark we need a clear idea of what code should be written and what environment it should run in. Here, I am using a Google Colab notebook, which is free and encourages collaborative development; it also requires no downloads or installations, since it runs in your web browser. If you already have a Jupyter notebook and would rather use it than Google Colab, that's fine too; the steps for running PySpark are the same in both.
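If you are working inside a notebook rather than a terminal, a minimal sketch of installing PySpark directly from a Colab or Jupyter cell looks like this (the ! prefix runs the command in the underlying shell):
# run inside a Colab or Jupyter cell; the ! prefix executes a shell command
!pip install pyspark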
Step -1: Create a Spark session
The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. In the Scala and Python consoles, the SparkSession is available as the variable spark as soon as the shell starts; in a notebook, we create it ourselves.
Import dependencies
from pyspark.sql import SparkSession
Create a Spark session
# here you can give any appname
spark = SparkSession.builder.appName('Pyspark').getOrCreate()
Print the Spark session
spark
Output:
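The builder also accepts optional settings before getOrCreate is called. As a small sketch (the local master URL and the shuffle-partitions value are illustrative assumptions, not something this tutorial requires), a configured session could look like this:
# run Spark locally on all available cores with a custom shuffle setting
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Pyspark') \
    .config('spark.sql.shuffle.partitions', '8') \
    .getOrCreate()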
Load dataset
df = spark.read.csv('/content/movies.csv')
Now let’s perform different tasks using Pyspark
Step -2: Display Rows
Here we use the show method to display the top 3 rows of the dataset. If you change the number, you will see a different number of rows. Note that we also pass the header parameter when reading the DataFrame so the first row is treated as column names.
# load dataset
df = spark.read.csv('/content/movies.csv', header=True)
df.show(3)
Output:
Step -3: Display Column Names
Here we use the columns attribute to display the column names of the dataset.
# load dataset
df = spark.read.csv('/content/movies.csv', header=True)
df.columns
Output:
Step -4: Display the Datatype of Each Column
Here we use the printSchema method to show the datatypes of the columns, and we also pass the inferSchema parameter when reading the DataFrame so Spark detects the column types.
# load dataset
df = spark.read.csv('/content/movies.csv', header=True, inferSchema=True)
df.printSchema()
Output:
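As a side note, if you only want a quick list of column names and their types without the tree layout printed by printSchema, the dtypes attribute returns them as (name, type) pairs:
# list of (column name, data type) tuples
df.dtypes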
Step -5: Count the Number of Rows and Columns of the dataset
To display the number of rows, we use the count method.
df.count()
To display the number of columns, we can call the len function on the columns list.
len(df.columns)
Output:
The number of rows is 77
The number of columns is 8
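If you prefer to report both numbers in one go, a small sketch that combines the two calls:
# print the shape of the DataFrame as (rows, columns)
print((df.count(), len(df.columns)))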
Step -6: Get Overall Statistics About The Dataset
To display the overall statistics, we use the describe method.
df.describe().show()
Output:
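describe also accepts column names if you only want statistics for particular columns. For example, restricting it to the Year column (any numeric column of the dataset would work the same way):
# summary statistics for a single column
df.describe('Year').show()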
Step -7: Find the Unique Values in a Specific Column
For that, we use the toPandas method to convert the Spark DataFrame into a pandas DataFrame, select the column by name, and then call the unique method.
df.toPandas()['Genre'].unique()
Output:
Step -8: Find the Total Number of Unique Values Available in the Genre Column
We just use the len function to get the total count.
len(df.toPandas()['Genre'].unique())
Output:
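Converting to pandas works here because the dataset is small, but it pulls every row onto a single machine. A sketch of getting the same information natively in Spark, which scales to larger data, uses the distinct method:
# unique genres computed by Spark itself, without converting to pandas
df.select('Genre').distinct().show()
# number of unique genres
df.select('Genre').distinct().count()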
Step -9: How to Select Single and Multiple Columns
We use the select method to show a single column.
df.select('Film').show(5)
Output:
Show multiple columns
df.select('Film', 'Year').show(5)
Output:
Step -10: Create and Update a New Column in an Existing DataFrame
To create a new column in the DataFrame we use the withColumn method; for the score column we take the Year column, add 1 to its value, and assign the result to the new column.
df.withColumn('score',df.Year+1).show()
Output:
As you can see in the output, the new column we created holds Year + 1. Still, the new column does not persist on the DataFrame, because withColumn returns a new DataFrame rather than modifying the existing one. To keep it, we need to assign the result back to df.
df = df.withColumn('score',df.Year+1)
df.show()
Output:
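withColumn can also add a constant column. Plain Python values cannot be passed directly, so PySpark provides the lit function to wrap them; a minimal sketch (the column name flag is just an illustrative choice):
from pyspark.sql.functions import lit
# add a constant column; lit wraps a literal value as a Spark column
df.withColumn('flag', lit(1)).show(3)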
Step -11: Rename the Column
To rename a column, we use the withColumnRenamed method.
df = df.withColumnRenamed('Film', 'Movies')
df.show(5)
Output:
As you can see, we changed the column name from Film to Movies.
Step -12: Display specific values
Example:
We are going to display the names of the movies that have an audience score greater than 83.
For that, we are using the filter method and we are also selecting specific columns using the select method
df.filter(df['Audience score %']>83).select('Movies').show()
Output:
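Because the column name contains spaces and a % sign, the bracket syntax above is the most convenient; an equivalent sketch using PySpark's col helper looks like this:
from pyspark.sql.functions import col
# the same filter expressed with the col() helper
df.filter(col('Audience score %') > 83).select('Movies').show()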
Step -13: Multiple conditions at the same time
Example:
Here we display the names of the movies in the Animation genre that have an audience score greater than 83. For that, we pass multiple conditions to the filter method and separate them with the & operator.
df.filter((df['Audience score %']>83) & (df['Genre']=='Animation')).select('Movies').show()
Output:
As you can see in the output, only two movies in the Animation genre have an audience score greater than 83.
Example:
This example inverts part of the previous one. We display the names of the movies that have an audience score greater than 83 but are not in the Animation genre. For that, we use the ~ operator, which negates a condition and keeps the rows that do not satisfy it.
df.filter((df['Audience score %']>83) & ~(df['Genre']=='Animation')).select('Movies').show(5)
Output:
These movies have an audience score greater than 83 but are not in the Animation genre.
Step -14: Display average values
To display the average values, we first group the rows by the Genre column using the groupBy method and then use the mean method to compute the averages of the numeric columns.
df.groupby('Genre').mean().show()
Output:
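mean() averages every numeric column at once. If you only care about one of them, a sketch using agg with the avg function keeps the output narrower:
from pyspark.sql.functions import avg
# average audience score per genre only
df.groupBy('Genre').agg(avg('Audience score %')).show()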
Step -15: Sort the rows in ascending and descending Order
To sort the rows in ascending order, we use the orderBy method.
df.orderBy(df['Audience score %']).show(10)
Output:
To sort them in descending order, we use the orderBy method together with the desc method on the column.
df.orderBy(df['Audience score %'].desc()).show(5)
Output:
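orderBy also accepts several columns at once. As a small sketch, sorting by Genre in ascending order and then by audience score in descending order within each genre:
# sort by two columns with different directions
df.orderBy(df['Genre'].asc(), df['Audience score %'].desc()).show(5)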
Final Words
We hope you enjoyed this blog about Big Data analytics with Python and PySpark. This post is just an overview of the PySpark library. If you'd like to learn more about data analytics or want to explore some of the other major topics we cover, feel free to check out our blog. We're confident you can find exactly what you're looking for!
Here are some useful tutorials that you can read:
- Concurrency in Python
- Basic Neural Network in Python to Make Predictions
- Monitor Python scripts using Prometheus
- Test Your Typing Speed Using Python
- Instagram Hashtag Generator in Python
- How to create a Word Guessing Game in Python
- Convert an image to 8-bit image
- Programmatically Generate Video or Animated GIF in Python
- Sudoku game in Python using Pygame
- How to Deploy Flask API on Heroku?
- How to Update your Mac Address using Python
- How to create CLI in Python?
- Automate Reddit Posts using Python