
Image by Author | Ideogram
You know that feeling when you have data scattered across different formats and sources, and you need to make sense of it all? That's exactly what we're solving today. Let's build an ETL pipeline that takes messy data and turns it into something actually useful.
In this article, I'll walk you through creating a pipeline that processes e-commerce transactions. Nothing fancy, just practical code that gets the job done.
We'll grab data from a CSV file (like you'd download from an e-commerce platform), clean it up, and store it in a proper database for analysis.
🔗 Link to the code on GitHub
What Is an Extract, Transform, Load (ETL) Pipeline?
Every ETL pipeline follows the same pattern. You grab data from somewhere (Extract), clean it up and make it better (Transform), then put it somewhere useful (Load).

ETL Pipeline | Image by Author | diagrams.net (draw.io)
The process begins with the extract phase, where data is retrieved from various source systems such as databases, APIs, files, or streaming platforms. During this phase, the pipeline identifies and pulls the relevant data while maintaining connections to disparate systems that may operate on different schedules and formats.
Next, the transform phase is the core processing stage, where the extracted data undergoes cleaning, validation, and restructuring. This step addresses data quality issues, applies business rules, performs calculations, and converts data into the required format and structure. Common transformations include data type conversions, field mapping, aggregations, and the removal of duplicates or invalid records.
Finally, the load phase transfers the now transformed data into the target system. This step can happen through full loads, where entire datasets are replaced, or incremental loads, where only new or changed data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.
Step 1: Extract
The "extract" step is where we get our hands on data. In the real world, you might be downloading this CSV from your e-commerce platform's reporting dashboard, pulling it from an FTP server, or getting it via an API. Here, we're reading from an available CSV file.
import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)
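The fallback calls a create_sample_csv_data() helper that lives in the full code on GitHub. As a rough idea of what such a helper does, here's a minimal sketch; the columns and values are assumptions based on the transformations coming up, not the repo's actual data:

import pandas as pd

def create_sample_csv_data(csv_file_path="raw_transactions.csv"):
    # Hypothetical stand-in: writes a tiny CSV with the columns the
    # transform step expects, so the pipeline can run end to end
    sample = pd.DataFrame({
        'transaction_id': [1, 2, 3],
        'customer_email': ['alice@example.com', None, 'carol@example.com'],
        'price': [19.99, 5.50, 150.00],
        'quantity': [2, 1, 3],
        'transaction_date': ['2024-01-15', '2024-01-16', '2024-01-20'],
    })
    sample.to_csv(csv_file_path, index=False)
    return csv_file_path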
Now that we have the raw data from its source (raw_transactions.csv), we need to transform it into something usable.
Step 2: Transform
This is where we make the data actually useful.
def transform_data(df):
    print("Transforming data...")
    df_clean = df.copy()

    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")

    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']

    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()

    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'],
                                          bins=[0, 50, 200, float('inf')],
                                          labels=['Low', 'Medium', 'High'])

    return df_clean
First, we're dropping rows with missing emails because incomplete customer data isn't helpful for most analyses.
Then we calculate total_amount by multiplying price and quantity. This seems obvious, but you'd be surprised how often derived fields like this are missing from raw data.
The date extraction is really useful. Instead of just having a timestamp, we now have separate year, month, and day-of-week columns. This makes it easy to analyze patterns like "do we sell more on weekends?"
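For example, a quick weekend check could look like this; df_clean here stands for whatever transform_data() returned, and the column names come from the function above:

# Assuming df_clean is the DataFrame returned by transform_data()
weekday_revenue = df_clean.groupby('day_of_week')['total_amount'].sum()
print(weekday_revenue.sort_values(ascending=False))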
The customer segmentation using pd.cut() can be particularly useful. It automatically buckets customers into spending categories. Now instead of just having transaction amounts, we have meaningful business segments.
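If pd.cut() is new to you, here's a tiny standalone illustration of how it assigns values to the same Low/Medium/High bins used above:

import pandas as pd

# Toy example of how pd.cut() buckets amounts into labeled bins
amounts = pd.Series([25.0, 120.0, 480.0])
segments = pd.cut(amounts, bins=[0, 50, 200, float('inf')],
                  labels=['Low', 'Medium', 'High'])
print(segments.tolist())  # ['Low', 'Medium', 'High']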
Step 3: Load
In a real project, you might be loading into a database, sending to an API, or pushing to cloud storage.
Here, we're loading our clean data into a proper SQLite database.
import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")

    conn = sqlite3.connect(db_name)
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)

        # Verify the load by counting the rows that landed in the table
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        print(f"Successfully loaded {record_count} records to '{table_name}' table")

        return f"Data successfully loaded to {db_name}"
    finally:
        conn.close()
Now analysts can run SQL queries, connect BI tools, and actually use this data for decision-making.
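For instance, here's a hypothetical query against the transactions table the pipeline just created, summarizing revenue by customer segment:

import sqlite3

# Example analyst query against the database created by the pipeline
conn = sqlite3.connect("ecommerce_data.db")
try:
    rows = conn.execute(
        """SELECT customer_segment,
                  COUNT(*) AS orders,
                  ROUND(SUM(total_amount), 2) AS revenue
           FROM transactions
           GROUP BY customer_segment
           ORDER BY revenue DESC"""
    ).fetchall()
    for segment, orders, revenue in rows:
        print(segment, orders, revenue)
finally:
    conn.close()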
SQLite works well for this because it's lightweight, requires no setup, and creates a single file you can easily share or back up. The if_exists="replace" parameter means you can run this pipeline multiple times without worrying about duplicate data.
We've added verification steps so you know the load was successful. There's nothing worse than thinking your data is safely stored only to find an empty table later.
Running the ETL Pipeline
This orchestrates the entire extract, transform, load workflow.
def run_etl_pipeline():
    print("Starting ETL Pipeline...")

    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')

    # Transform
    transformed_data = transform_data(raw_data)

    # Load
    load_result = load_data_to_sqlite(transformed_data)

    print("ETL Pipeline completed successfully!")
    return transformed_data
Notice how this ties everything together. Extract, transform, load, done. You can run this and immediately see your processed data.
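If you put all of these functions into a single script, a minimal entry point (assuming the imports shown earlier sit at the top of the file) could be:

if __name__ == "__main__":
    df = run_etl_pipeline()
    print(df.head())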
You can find the complete code on GitHub.
Wrapping Up
This pipeline takes raw transaction data and turns it into something an analyst or data scientist can actually work with. You've got clean records, calculated fields, and meaningful segments.
Each function does one thing well, and you can easily modify or extend any part without breaking the rest.
Now try running it yourself. Also try adapting it for another use case. Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.