

Pandas and SQL are each powerful for data analysis, but what if we could combine their strengths? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us seamlessly mix SQL logic with Python for effective data analysis.
In this article, we will use pandas and SQL together on a data project from Uber. Let's get started!
# What Is pandasql?
pandasql works with any DataFrame through an in-memory SQLite engine, so you can write pure SQL inside a Python environment.
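If the package is not already in your environment, it is typically available via pip (pip install pandasql). As a minimal, self-contained sketch (the demo DataFrame below is made up purely for illustration), any DataFrame in scope can be queried as if it were a SQL table:
import pandas as pd
from pandasql import sqldf
# A tiny made-up DataFrame; sqldf exposes it to SQLite as a table named "demo"
demo = pd.DataFrame({"driver": ["A", "B", "C"], "trips": [12, 8, 15]})
# sqldf looks up DataFrames by name in the namespace you pass in
sqldf("SELECT driver, trips FROM demo WHERE trips >= 10;", locals())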
# Advantages of Using Pandas and SQL Together
SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom computations, as well as set operations that extend beyond SQL's capabilities.
When used together, SQL simplifies data selection, while Python adds analytical flexibility.
# How to Run pandasql Inside a Jupyter Notebook
To run pandasql inside a Jupyter Notebook, start with the following code.
import pandas as pd
from pandasql import sqldf
run = lambda q: sqldf(q, globals())
Next, you can run your SQL code like this:
run("""
SELECT *
FROM df
LIMIT 10;
""")
We'll use the SQL code without showing the run function each time in this article.
Let's see how using SQL and pandas together works in a real-life project from Uber.
# Real-World Project: Analyzing Uber Driver Performance Data


Image by Author
In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.
// Data Exploration and Analytics
Now, let's explore the dataset. First, we will load the data.
// Initial Dataset Loading
Let's load the dataset using just pandas.
import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')
// Exploring the Data
Now let's review the dataset.
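A quick way to preview the first few rows (a minimal sketch using pandas' head() method, assuming the df loaded above) is:
# Show the first few rows of the dataset
df.head()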
The output looks like this:
Now we have a glimpse of the data.
As you can see, the dataset includes each driver's name, the number of trips they completed, their acceptance rate (i.e., the percentage of trip requests accepted), total supply hours (the total hours spent online), and their average rating.
Let's verify the column names before starting the data analysis so we can use them correctly.
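One way to check the column names and the missing-value counts at the same time (a minimal sketch with pandas' info() method) is:
# Column names, dtypes, and non-null counts in one view
df.info()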
Here is the output.
As you can see, our dataset has five different columns, and there are no missing values.
Let's now answer the questions using both SQL and Python.
# Question 1: Who Qualifies for Bonus Option 1?
In the first question, we are asked to determine the total bonus payout for Option 1, which is:
$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better during the time frame.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
In this step, we will start using pandasql.
In the following code, we select all drivers who meet the conditions for the Option 1 bonus, using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to the documentation.
opt1_eligible = run("""
SELECT Name -- keep only the Name column for readability
FROM df
WHERE `Supply Hours` >= 8
AND `Trips Completed` >= 10
AND `Accept Rate` >= 90
AND Rating >= 4.7;
""")
opt1_eligible
Here is the output.
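For comparison, the same filter can be written in plain pandas with boolean masking. This is only a sketch, not part of the original solution, and it assumes Accept Rate is stored as a number rather than a string like "90%":
# Pure-pandas equivalent of the Option 1 filter above (for comparison only)
opt1_eligible_pd = df[
    (df["Supply Hours"] >= 8)
    & (df["Trips Completed"] >= 10)
    & (df["Accept Rate"] >= 90)
    & (df["Rating"] >= 4.7)
][["Name"]]
opt1_eligible_pd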
// Step 2: Finishing in Pandas
After filtering the dataset using SQL with pandasql, we switch to pandas to perform the numerical calculations and finalize the analysis. This hybrid approach, which combines SQL and Python, improves both readability and flexibility.
Next, using the following Python code, we calculate the total payout by multiplying the number of qualified drivers (using len()) by the $50 bonus per driver. Check the documentation to see how you can use the len() function.
payout_opt1 = 50 * len(opt1_eligible)
print(f"Option 1 payout: ${payout_opt1:,}")
Here is the output.
# Question 2: Calculating the Total Payout for Bonus Option 2
In the second question, we are asked to find the total bonus payout using Option 2:
$4/trip for all drivers who complete 12 trips and have a 4.7 or better rating.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.
# Grab only the rows that satisfy the Option 2 thresholds
opt2_drivers = run("""
SELECT Name,
`Trips Completed`
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
""")
opt2_drivers.head()
Here's what we get.
// Step 2: Finishing the Calculation in Pure Pandas
Now let's perform the calculation using pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.
total_trips = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Total trips: {total_trips}, Option 2 payout: ${option2_bonus}")
Here is the result.
# Question 3: Identifying Drivers Who Qualify for Option 1 but Not Option 2
In the third question, we are asked to count the number of drivers who qualify for Option 1 but not for Option 2.
// Step 1: Building Two Eligibility Tables with SQL (pandasql)
In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and another for those who meet the Option 2 criteria.
# All Option 1 drivers
opt1_drivers = run("""
SELECT Name
FROM df
WHERE `Supply Hours` >= 8
AND `Trips Completed` >= 10
AND `Accept Rate` >= 90
AND Rating >= 4.7;
""")
# All Option 2 drivers
opt2_drivers = run("""
SELECT Name
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
""")
// Step 2: Using Python Set Logic to Spot the Difference
Next, we will use Python to identify the drivers who appear in Option 1 but not in Option 2, and we will use set operations for that.
Here is the code:
only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)
print(f"Drivers qualifying for Option 1 but not Option 2: {count_only_opt1}")
Here is the output.
By combining these methods, we leverage SQL for filtering and Python's set logic for comparing the resulting datasets.
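If you would rather stay entirely in SQL, SQLite (which pandasql uses under the hood) supports the EXCEPT operator, so the same set difference could be computed in one query. This is only a sketch of that alternative:
# SQL-only alternative: Option 1 names minus Option 2 names
only_opt1_sql = run("""
SELECT Name
FROM df
WHERE `Supply Hours` >= 8
AND `Trips Completed` >= 10
AND `Accept Rate` >= 90
AND Rating >= 4.7
EXCEPT
SELECT Name
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
""")
len(only_opt1_sql)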
# Question 4: Finding Low-Performance Drivers with High Ratings
In question 4, we are asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.
// Step 1: Pulling the Subset with SQL (pandasql)
In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate of less than 90%, and hold a rating of at least 4.7.
low_kpi_df = run("""
SELECT *
FROM df
WHERE `Trips Completed` < 10
AND `Accept Rate` < 90
AND Rating >= 4.7;
""")
low_kpi_df
Here is the output.
// Step 2: Calculating the Percentage in Plain Pandas
In this step, we will use Python to calculate the percentage of such drivers.
We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.
Here is the code:
num_low_kpi = len(low_kpi_df)
total_drivers = len(df)
percentage = round(100 * num_low_kpi / total_drivers, 2)
print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")
Here is the output.
# Question 5: Calculating Annual Profit Without Partnering with Uber
In the fifth question, we need to calculate the annual profit of a taxi driver who does not partner with Uber, based on the given cost and revenue parameters.
// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)
Using SQL, we first calculate the yearly revenue from daily fares and the yearly expenses for gas, rent, and insurance.
taxi_stats = run("""
SELECT
200*6*(52-3) AS annual_revenue, -- $200/day in fares, 6 days a week, 49 working weeks
((200+500)*(52-3) + 400*12) AS annual_expenses -- weekly gas and rent over 49 weeks, plus monthly insurance
""")
taxi_stats
Here is the output.
// Step 2: Deriving Profit and Margin with Pandas
In the next step, we will use Python to compute the profit and margin the driver earns when not partnering with Uber.
rev = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]
profit = rev - cost
margin = round(100 * profit / rev, 2)
print(f"Revenue : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit : ${profit:,} (margin: {margin}%)")
Here's what we get.
# Question 6: Calculating the Required Fare Increase to Maintain Profitability
In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.
Gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver must cover the $40,000 cost of the car. We are asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and maintain the same annual profit margin.
// Step 1: Building the New One-Year Expense Stack with SQL
In this step, we will use SQL to calculate the new one-year expenses with adjusted gas and insurance, no rental fees, and the cost of the car.
new_exp = run("""
SELECT
40000 AS car,
200*1.05*(52-3) AS gas, -- +5%
400*0.80*12 AS insurance -- -20%
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost
Here is the output.
// Step 2: Calculating the Weekly Fare Increase with Pandas
Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.
# Values carried over from Question 5
old_rev = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks = 49
# new_cost was calculated in the previous step (54130.0)
# We need to find the new revenue (new_rev) such that the profit margin stays the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)
# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev
# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)
print(f"Required weekly gross-fare increase = ${weekly_bump}")
Here's what we get.
# Conclusion
By bringing together the strengths of SQL and Python, primarily through pandasql, we solved six different problems.
SQL helps with quickly filtering and summarizing structured datasets, while Python excels at advanced computation and dynamic manipulation.
Throughout this analysis, we leveraged both tools to simplify the workflow and make each step more interpretable.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.