As the name suggests, Big Data refers to datasets so massive that they must be stored and processed in ways that can still produce meaningful insights. Big Data Analytics structures the computation and storage so that such large volumes of data can be processed effectively. This can be accomplished with one or more specialized tools, though using them well for analytics takes some expertise. There are many libraries and packages available for Big Data in Python, but one that is especially popular with data scientists who need to process enormous datasets within their applications is PySpark.

Prerequisites:

PySpark:

PySpark is the Python interface for Apache Spark. Spark is a cluster-computing framework built around speed, ease of use, and sophisticated analytics on big data. In recent years, Spark has become a major player in the big data field, and companies large and small are turning to it to analyze their data. They choose Spark because it simply works: it is fast, easy to use, and offers a wide range of capabilities.

To install PySpark, run the following command in your terminal or command prompt:

pip install pyspark
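If you want to confirm that the installation succeeded, a quick sanity check is to import the package and print its version (this assumes the command above completed without errors):

# quick check that PySpark is importable
import pyspark
print(pyspark.__version__)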

Notebook:

To run PySpark, we need a clear idea of what code to write and what environment to run it in. Here, I am using a Google Colab notebook, which is free, encourages collaborative development, and requires no downloads or installations since it runs in your web browser. If you already have a Jupyter notebook and would rather use it than Google Colab, that's fine too; the steps for running PySpark are the same in both!

Step 1: Create a Spark Session

The SparkSession instance is the entry point through which Spark executes user-defined manipulations across the cluster. In Scala and Python, it is available as the variable spark when you start up the console.

Import the dependencies:

from pyspark.sql import SparkSession

Create a Spark session:

# you can pass any app name here
spark = SparkSession.builder.appName('Pyspark').getOrCreate()

Print the Spark session:

spark

Output:

Load the dataset:

df = spark.read.csv('/content/movies.csv')

Now let's perform different tasks using PySpark.

Step 2: Display Rows

Here we use the show method to display the top 3 rows of the dataset. If you change the number, a different number of rows will be displayed. Note that we also pass the header parameter when reading the CSV so the first line is treated as column names.

# load dataset
df = spark.read.csv('/content/movies.csv', header=True)

df.show(3)

Output:
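By default, show truncates long cell values to 20 characters. If your dataset has long titles, one option is to pass truncate=False, a small variation on the same call:

# show 3 rows without truncating long values
df.show(3, truncate=False)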

Step 3: Display Column Names

Here we use the columns attribute to display the columns of the dataset:

# load dataset
df = spark.read.csv('/content/movies.csv', header=True)

df.columns

Output:

Step 4: Display the Datatype of Each Column

Here we use the printSchema method to show the datatypes of the columns; note that we also pass the inferSchema parameter when reading the CSV so Spark detects each column's type.

# load dataset
df = spark.read.csv('/content/movies.csv', header=True, inferSchema=True)
df.printSchema()

Output:
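If you prefer the datatypes as a plain Python list rather than a printed tree, the dtypes attribute returns (column name, type) pairs, for example:

# list of (column, datatype) tuples
df.dtypes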

Step 5: Count the Number of Rows and Columns of the Dataset

To display the number of rows, we use the count method:

df.count()

To display the number of columns, we can use the len function on the columns list:

len(df.columns)

Output:

The number of rows is 77

The number of columns is 8
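If you want both numbers at once, you can print them together, similar to the shape attribute in pandas; this is just a convenience sketch combining the two calls above:

# rows and columns in one line, like pandas' shape
print((df.count(), len(df.columns)))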

Step 6: Get Overall Statistics About the Dataset

To display the overall statistics, we use the describe method together with show:

df.describe().show()

Output:
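The describe method also accepts column names if you only want statistics for specific columns; for example, to summarize just the Year column:

# statistics for a single column
df.describe('Year').show()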

Step 7: Find the Unique Values in a Specific Column

For that, we use the toPandas method to convert the Spark data frame into a pandas data frame, select the column by name, and then call the unique method:

df.toPandas()['Genre'].unique()

Output:
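Note that toPandas pulls the entire dataset onto a single machine, which is fine for 77 rows but defeats the purpose on genuinely big data. The same unique values can be obtained natively in Spark with distinct, a sketch of the Spark-side alternative:

# unique genres without converting to pandas
df.select('Genre').distinct().show()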

Step 8: Find the Total Number of Unique Values in the Genre Column

We simply wrap the previous expression in the len function to get the total number:

len(df.toPandas()['Genre'].unique())

Output:
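Again, the same count can be computed without leaving Spark by chaining count onto distinct:

# number of unique genres, computed in Spark
df.select('Genre').distinct().count()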

Step 9: Select Single and Multiple Columns

We use the select method to show a single column:

df.select('Film').show(5)

Output:

To show multiple columns, pass several column names:

df.select('Film', 'Year').show(5)

Output:
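The select method also accepts column expressions, not just names. For instance, you can rename a column on the fly with alias (Title here is just an illustrative name):

from pyspark.sql.functions import col

# select a column and rename it in the result
df.select(col('Film').alias('Title')).show(5)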

Step 10: Create and Update a New Column in an Existing DataFrame

To create a new column in the data frame, we use the withColumn method; for the new score column we take the Year column and add 1 to its value:

df.withColumn('score',df.Year+1).show()

Output:

As you can see in the output, the new column holds Year + 1. However, DataFrames are immutable, so withColumn returns a new DataFrame rather than modifying the original; to keep the new column, we need to reassign the result:

df = df.withColumn('score',df.Year+1)
df.show()

Output:
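The expression passed to withColumn does not have to reference another column. As a small sketch, the lit function creates a column with a constant value (the column name source is just illustrative):

from pyspark.sql.functions import lit

# add a constant-valued column
df.withColumn('source', lit('movies.csv')).show(5)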

Step 11: Rename a Column

To rename a column, we use the withColumnRenamed method:

df = df.withColumnRenamed('Film', 'Movies')
df.show(5)

Output:

As you can see, we changed the column name Film to Movies.

Step 12: Display Specific Values

Example:

We are going to display the names of the movies that have an audience score greater than 83.

For that, we use the filter method, and we also select specific columns using the select method:

df.filter(df['Audience score %']>83).select('Movies').show()

Output:
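The same filter can also be written as a SQL-style string expression; note the backticks, which are needed because the column name contains spaces and a % sign:

# SQL-style version of the same filter
df.filter("`Audience score %` > 83").select('Movies').show()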

Step 13: Multiple Conditions at the Same Time

Example:

Here we display the names of the movies in the Animation category that have an audience score greater than 83. For that, we pass multiple conditions to the filter method, wrap each one in parentheses, and combine them with the & operator.

df.filter((df['Audience score %']>83) & (df['Genre']=='Animation')).select('Movies').show()

Output:

As you can see in the output, only two movies in the Animation category have an audience score greater than 83.

Example:

This example negates one of the conditions from the previous example: we display movies with an audience score greater than 83 that are not in the Animation category. For that, we use the ~ operator, which inverts the condition it is applied to.

df.filter((df['Audience score %']>83) & ~(df['Genre']=='Animation')).select('Movies').show(5)

Output:

These movies have an audience score greater than 83 but do not belong to the Animation genre.
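Alongside & and ~, conditions can be combined with the | operator for OR; as a sketch, this shows movies that are in either the Animation or the Comedy category (again, each condition wrapped in parentheses):

# OR: movies in either category
df.filter((df['Genre']=='Animation') | (df['Genre']=='Comedy')).select('Movies').show(5)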

Step 14: Display Average Values

To display the average values, we first group the rows by the Genre column using the groupby method and then call the mean method, which computes the average of every numeric column in each group:

df.groupby('Genre').mean().show()

Output:
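If you only want the average of one particular column rather than every numeric column, you can pass an explicit aggregation with agg; here avg is applied just to the audience score:

from pyspark.sql.functions import avg

# average audience score per genre
df.groupby('Genre').agg(avg('Audience score %')).show()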

Step 15: Sort the Rows in Ascending and Descending Order

To sort the rows in ascending order, we use the orderBy method:

df.orderBy(df['Audience score %']).show(10)

Output:

To sort them in descending order, we use the orderBy method together with the desc method on the column:

df.orderBy(df['Audience score %'].desc()).show(5)

Output:
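orderBy also accepts several columns at once, so you can mix directions; as a sketch, this sorts by Genre in ascending order and, within each genre, by audience score in descending order:

# multi-column sort: genre ascending, score descending
df.orderBy(df['Genre'], df['Audience score %'].desc()).show(5)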

Final Words

We hope you enjoyed this blog about big data with Python and PySpark. This post is just an overview of the PySpark library. If you'd like to learn more about data analytics or want to explore the other major topics we cover, feel free to check out our blog. We're confident you'll find exactly what you're looking for!

Here are some useful tutorials that you can read: