DataFrame Operations in PySpark

PySpark is the Python API for Apache Spark, one of the most widely used tools for big data analytics.

In this tutorial we will learn the basics of the PySpark DataFrame. A DataFrame is similar to a relational database table: a structure that contains rows and columns. In PySpark, a DataFrame is a distributed collection of rows organized under named columns.

Read Data

We can read several file formats in PySpark, such as CSV, JSON, Parquet, and ORC. To read a CSV file we use spark.read.csv.

Dataset credit: Analytics Vidhya Loan Prediction dataset

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('train.csv', inferSchema=True, header=True)

View the Data
df.show() 
Note: show() displays the first 20 records by default; to display more or fewer, pass the desired number to the show method.
df.show(10) 
Dimension of Data
Number of columns: len(df.columns)
Number of records: df.count()

print(len(df.columns), df.count())
Output
13 614
Print Schema
df.printSchema()

root
|-- Loan_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Married: string (nullable = true)
|-- Dependents: string (nullable = true)
|-- Education: string (nullable = true)
|-- Self_Employed: string (nullable = true)
|-- ApplicantIncome: integer (nullable = true)
|-- CoapplicantIncome: double (nullable = true)
|-- LoanAmount: integer (nullable = true)
|-- Loan_Amount_Term: integer (nullable = true)
|-- Credit_History: integer (nullable = true)
|-- Property_Area: string (nullable = true)
|-- Loan_Status: string (nullable = true)
Basic Summary of the Data

The describe function computes basic descriptive statistics for the given columns: count, mean, standard deviation, minimum, and maximum.

df.describe('ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term').show()
Calculate Missing Values

Counting and imputing missing values plays a very important role in data analysis. We can count missing values column by column in PySpark.

for col in df.columns:
    missing_value = df.where(df[col].isNull()).count()
    print(col, missing_value)

Output


Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
Select columns from Data Frame
Select distinct values from categorical features
df.select('Gender').distinct().show()

Output
+------+
|Gender|
+------+
|Female|
| Male|
+------+

One more example
df.select('Gender','Property_Area','Loan_Status','Self_Employed','Dependents').distinct().show()
Select Multiple columns:
df.select('Loan_ID','ApplicantIncome','CoapplicantIncome','LoanAmount','Credit_History','Loan_Status').show()
Sorting the DataFrame

We can sort data in either ascending or descending order.

Ascending Order
df.sort('ApplicantIncome').show()
Descending Order
df.sort(F.desc('ApplicantIncome')).show()
Data Filter
Filter on a single column with a condition
df.filter(df.Loan_Status == 'Y').show()
Filter on multiple columns with conditions
df.filter((df.Loan_Status == 'Y') & (df.LoanAmount < 500)).show()
df.filter((df.Loan_Status == 'Y') & (df.LoanAmount < 500) & (df.ApplicantIncome > 2000)).show()
Group By
df.groupBy('Loan_Status').count().show()

df.groupBy('Credit_History', 'Loan_Status').agg(F.sum('ApplicantIncome')).show()
Column Rename
df = df.withColumnRenamed('LoanAmount', 'Loan_Amount')
df.printSchema()
Add new Columns
df = df.withColumn('Total_Income', df['ApplicantIncome'] + df['CoapplicantIncome'])

df.printSchema()
About the author: Mitra N Mishra is a full-stack data scientist.
