ETL Pipeline: Fake Internet Sales Data Integration

This project generates synthetic daily Internet Sales data using Python and loads it into a SQL Server table (FactInternetSales2) via an SSIS package. The goal is to simulate realistic ETL workflows for testing, transformation, and automation purposes

Create a copy table for SQL Server AdventureWorksDW2022.FactInternetSales table.


image


Create a fake daily Internet Sales report file

Prepare a review of the field to be created

I used the Excel Data Model connected to Adventureworks FactInternetSales table to prepare the parameter to use Python Faker to create some random InternetSales data.


image


We are no trying to recreate an exact structure the SQL table because we want to do some transforations in the SSIS Data Flow. The last five fields will be added in the transformation step.


Python

The Python script connects to the AdventureWorksDW2022 database, retrieves metadata from FactInternetSales2 and DimProduct, and generates daily CSV files with randomized sales data using pandas, random, and Faker.

Import Liberies

    import pandas as pd                                 # to create the Data Frame
    import random                                       # to create the Random Data
    from datetime import datetime, timedelta            # to manade Date Data
    import pyodbc                                       # to connect and query SQL Server DB Data
    from decimal import Decimal                         # to define decimal numbers for tax rate
    import os                                           # to export the final resulting files to folder

Connect to AdventureWorksDW2022

    conn = pyodbc.connect('DRIVER={SQL Server};SERVER="Server_Name";DATABASE=AdventureWorksDW2022;Trusted_Connection=yes;')
    cursor = conn.cursor()

Get the last OrderDate From FactInternetSales2

    cursor.execute("SELECT MAX(OrderDate) FROM FactInternetSales2")
    last_order_date = cursor.fetchone()[0]
    start_date = last_order_date + timedelta(days=1)

# Get the highest SalesOrderNumber with prefix ‘SO’

    cursor.execute("""
    SELECT MAX(CAST(SUBSTRING(SalesOrderNumber, 3, LEN(SalesOrderNumber)) AS INT))
    FROM FactInternetSales2
    WHERE SalesOrderNumber LIKE 'SO%'
    """)
    last_order_number = cursor.fetchone()[0]
    order_counter = (last_order_number or 0) + 1

Get ListPrice and StandardCost from DimProduct

    cursor.execute("""
    SELECT ProductKey, ListPrice, StandardCost
    FROM DimProduct
    WHERE ProductKey IS NOT NULL
    """)

Store price data in a dictionary for quick lookup

    product_prices = {}
    for row in cursor.fetchall():
    product_key = row.ProductKey
    list_price = row.ListPrice if row.ListPrice is not None else 0
    standard_cost = row.StandardCost if row.StandardCost is not None else 0
    product_prices[product_key] = {
    'UnitPrice': list_price,
    'ProductStandardCost': standard_cost
    }

    conn.close()

Define date range for order generation

    current_date = start_date
    end_date = datetime(2014, 2, 3)

This script will read the last order date in FactInternetSales2 and set it up as start_date

end_date will be provided manually to determin how many daily files will this scritp generate

Define lookup values

    product_options = [214, 217, 222, 225, 228, 231, 234, 237, 310, 311, 312, 313, 314, 320, 321, 322, 323, 324, 325,             326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348,            349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 368, 369, 370, 371, 372, 373, 374, 375,            376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 463, 465, 467, 471, 472, 473, 474, 475,            476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 528, 529, 530, 535, 536, 537, 538,            539, 540, 541, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579,            580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 604, 605,            606]
    promotion_options = [1, 2, 13, 14]
    currency_options = [6, 19, 29, 39, 98, 100]

Initialize list to store order records

    orders = []

Loop through each day in the date range

    while current_date <= end_date:

Generate a random number of orders for the current day

    num_orders = random.randint(1, 8)

    for _ in range(num_orders):
    product_key = random.choice(product_options)
    price_info = product_prices.get(product_key, {'UnitPrice': 0, 'ProductStandardCost': 0})

    order_id = f"SO{order_counter}"

    tax_rate = Decimal('0.08')

    order_date = current_date.date()
    order_date_key = int(current_date.strftime('%Y%m%d'))  # Formato YYYYMMDD como entero

Generate DueDate: OrderDate + 4 to 30 days

    due_date = order_date + timedelta(days=random.randint(4, 30))
    due_date_key = int(due_date.strftime('%Y%m%d'))

Generate ShipDate: DueDate + 2 to 15 days

    ship_date = due_date + timedelta(days=random.randint(2, 15))
    ship_date_key = int(ship_date.strftime('%Y%m%d'))

Create Fields Dictionary

    orders.append({
        'ProductKey': product_key,                                # Product identifier
        'OrderDateKey': order_date_key,                           # Order date as integer YYYYMMDD
        'DueDateKey': due_date_key,                               # Due date as integer YYYYMMDD
        'ShipDateKey': ship_date_key,                             # Ship date as integer YYYYMMDD
        'CustomerKey': random.randint(11000, 29483),              # Customer identifier
        'PromotionKey': random.choice(promotion_options),         # Promotion applied
        'CurrencyKey': random.choice(currency_options),           # Currency used
        'SalesTerritory': random.randint(1, 10),                  # Sales territory ID
        'SalesOrderNumber': order_id,                             # Unique order ID
        'SalesOrderLine': random.randint(1, 8),                   # Line number within the order
        'RevisoryNumber': None,                                   # Revisory number null
        'SalesOrderQuantity': 1,                                  # Fixed quantity per order
        'UnitPrice': price_info['UnitPrice'],                     # ListPrice from DimProduct
        'ExtendedAmount': price_info['UnitPrice'],                # Extended amount = UnitPrice
        'UnitPriceDiscount': 0,                                   # Unit price discount = 0
        'ProductStandardCost': price_info['ProductStandardCost'], # StandardCost from DimProduct
        'TotalProductCost': price_info['ProductStandardCost'],    # Total product cost = ProductStandardCost
        'SalesAmount': price_info['UnitPrice'],                   # Sales amount = UnitPrice
        'TaxAmount': price_info['UnitPrice'] * Decimal('0.08'),   # Tax amount = 8% of SalesAmount
        'Freight': None,                                          # Freight requires another DB for practical purposes won't implement it here
        'OrderDate': current_date.date(),                         # Date of the order
        'DueDate': due_date,                                     # Due date
        'ShipDate': ship_date,                                   # Ship date
    })
        
    order_counter += 1  # Increment global order ID

    current_date += timedelta(days=1)  # Move to the next day

Convert list of orders to DataFrame

    df = pd.DataFrame(orders)

Group by OrderDate

    grouped = df.groupby('OrderDate')

Display the resulting DataFrame

    print(df)

Define output folder and ensure it exists

    output_folder = r'C:\Users\aldoa\Documents\Proyectos\Python Faker\FakeInternetSalesOrders\\'
    os.makedirs(output_folder, exist_ok=True)

Iterate for each group (day)

    for order_date, group in grouped:
    # Convert the date to YYYYMMDD format
    date_str = order_date.strftime('%Y%m%d')

    # Define the full path for the file
    filename = f"{output_folder}FakeOrders_{date_str}.csv"

Save the group as CSV in the correct folder

    group.to_csv(filename, index=False, encoding='utf-8')

Resulst

image



image




SQL Server Integration Services (SSIS)

Built in Visual Studio 2022, this SSIS package ingests the generated CSV files, transforms the data, and loads it into FactInternetSales2.

Extract

1. Foreach Loop Container

  • Iterates through all files in the designated SourceFolder.
  • Dynamically assigns each file path to the FileName variable.
  • Constructs full paths for source and backup using FullSourcePath and FullBackupPath.

image


2. Data Flow Task

Processes each file through the following components

Add a Flat File, select one of the FakeOrder

image

Transform

Compare the FactInternetSales2 table and the FakeOrder files headers

The first fields match ok image

3. Derived Column

Create the missing fields and set the RevisoryNumber as RevisionNumber and value set as 1

image

Review the source data types

image

4. Data Convertion

Converts data types to match SQL Server schema

image

5. Conditional Split

Filters out rows with NULLs (precautionary, due to NOT NULL constraints)

image

Load

6. OLE DB Destination

Inserts valid rows into FactInternetSales2

image

The Mapping detects all the fields by defaul but the correct fields to ingest are the copies from Data Convertion

image

7. Flat File Destination

Captures rows that fail during insertion (e.g., due to truncation, null violations).

image

8. File System Task

Moves successfully processed files from to

Create Variables needed

image

Setup System File Task with the Destination and Source Full Paths we set as Variables and Operation to Move File

image