View the Project on GitHub AldoReyes84/Working Experience
Intel Project Analyst Testing Nov 2022 to Mar 2025
This project generates synthetic daily Internet Sales data using Python and loads it into a SQL Server table (FactInternetSales2) via an SSIS package. The goal is to simulate realistic ETL workflows for testing, transformation, and automation purposes
I used the Excel Data Model connected to Adventureworks FactInternetSales table to prepare the parameter to use Python Faker to create some random InternetSales data.
We are no trying to recreate an exact structure the SQL table because we want to do some transforations in the SSIS Data Flow. The last five fields will be added in the transformation step.
The Python script connects to the AdventureWorksDW2022 database, retrieves metadata from FactInternetSales2 and DimProduct, and generates daily CSV files with randomized sales data using pandas, random, and Faker.
import pandas as pd # to create the Data Frame
import random # to create the Random Data
from datetime import datetime, timedelta # to manade Date Data
import pyodbc # to connect and query SQL Server DB Data
from decimal import Decimal # to define decimal numbers for tax rate
import os # to export the final resulting files to folder
conn = pyodbc.connect('DRIVER={SQL Server};SERVER="Server_Name";DATABASE=AdventureWorksDW2022;Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute("SELECT MAX(OrderDate) FROM FactInternetSales2")
last_order_date = cursor.fetchone()[0]
start_date = last_order_date + timedelta(days=1)
cursor.execute("""
SELECT MAX(CAST(SUBSTRING(SalesOrderNumber, 3, LEN(SalesOrderNumber)) AS INT))
FROM FactInternetSales2
WHERE SalesOrderNumber LIKE 'SO%'
""")
last_order_number = cursor.fetchone()[0]
order_counter = (last_order_number or 0) + 1
cursor.execute("""
SELECT ProductKey, ListPrice, StandardCost
FROM DimProduct
WHERE ProductKey IS NOT NULL
""")
product_prices = {}
for row in cursor.fetchall():
product_key = row.ProductKey
list_price = row.ListPrice if row.ListPrice is not None else 0
standard_cost = row.StandardCost if row.StandardCost is not None else 0
product_prices[product_key] = {
'UnitPrice': list_price,
'ProductStandardCost': standard_cost
}
conn.close()
current_date = start_date
end_date = datetime(2014, 2, 3)
This script will read the last order date in FactInternetSales2 and set it up as start_date
end_date will be provided manually to determin how many daily files will this scritp generate
product_options = [214, 217, 222, 225, 228, 231, 234, 237, 310, 311, 312, 313, 314, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 463, 465, 467, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 528, 529, 530, 535, 536, 537, 538, 539, 540, 541, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 604, 605, 606]
promotion_options = [1, 2, 13, 14]
currency_options = [6, 19, 29, 39, 98, 100]
orders = []
while current_date <= end_date:
num_orders = random.randint(1, 8)
for _ in range(num_orders):
product_key = random.choice(product_options)
price_info = product_prices.get(product_key, {'UnitPrice': 0, 'ProductStandardCost': 0})
order_id = f"SO{order_counter}"
tax_rate = Decimal('0.08')
order_date = current_date.date()
order_date_key = int(current_date.strftime('%Y%m%d')) # Formato YYYYMMDD como entero
due_date = order_date + timedelta(days=random.randint(4, 30))
due_date_key = int(due_date.strftime('%Y%m%d'))
ship_date = due_date + timedelta(days=random.randint(2, 15))
ship_date_key = int(ship_date.strftime('%Y%m%d'))
orders.append({
'ProductKey': product_key, # Product identifier
'OrderDateKey': order_date_key, # Order date as integer YYYYMMDD
'DueDateKey': due_date_key, # Due date as integer YYYYMMDD
'ShipDateKey': ship_date_key, # Ship date as integer YYYYMMDD
'CustomerKey': random.randint(11000, 29483), # Customer identifier
'PromotionKey': random.choice(promotion_options), # Promotion applied
'CurrencyKey': random.choice(currency_options), # Currency used
'SalesTerritory': random.randint(1, 10), # Sales territory ID
'SalesOrderNumber': order_id, # Unique order ID
'SalesOrderLine': random.randint(1, 8), # Line number within the order
'RevisoryNumber': None, # Revisory number null
'SalesOrderQuantity': 1, # Fixed quantity per order
'UnitPrice': price_info['UnitPrice'], # ListPrice from DimProduct
'ExtendedAmount': price_info['UnitPrice'], # Extended amount = UnitPrice
'UnitPriceDiscount': 0, # Unit price discount = 0
'ProductStandardCost': price_info['ProductStandardCost'], # StandardCost from DimProduct
'TotalProductCost': price_info['ProductStandardCost'], # Total product cost = ProductStandardCost
'SalesAmount': price_info['UnitPrice'], # Sales amount = UnitPrice
'TaxAmount': price_info['UnitPrice'] * Decimal('0.08'), # Tax amount = 8% of SalesAmount
'Freight': None, # Freight requires another DB for practical purposes won't implement it here
'OrderDate': current_date.date(), # Date of the order
'DueDate': due_date, # Due date
'ShipDate': ship_date, # Ship date
})
order_counter += 1 # Increment global order ID
current_date += timedelta(days=1) # Move to the next day
df = pd.DataFrame(orders)
grouped = df.groupby('OrderDate')
print(df)
output_folder = r'C:\Users\aldoa\Documents\Proyectos\Python Faker\FakeInternetSalesOrders\\'
os.makedirs(output_folder, exist_ok=True)
for order_date, group in grouped:
# Convert the date to YYYYMMDD format
date_str = order_date.strftime('%Y%m%d')
# Define the full path for the file
filename = f"{output_folder}FakeOrders_{date_str}.csv"
group.to_csv(filename, index=False, encoding='utf-8')
Built in Visual Studio 2022, this SSIS package ingests the generated CSV files, transforms the data, and loads it into FactInternetSales2.
Processes each file through the following components
Add a Flat File, select one of the FakeOrder
Compare the FactInternetSales2 table and the FakeOrder files headers
The first fields match ok
Create the missing fields and set the RevisoryNumber as RevisionNumber and value set as 1
Review the source data types
Converts data types to match SQL Server schema
Filters out rows with NULLs (precautionary, due to NOT NULL constraints)
Inserts valid rows into FactInternetSales2
The Mapping detects all the fields by defaul but the correct fields to ingest are the copies from Data Convertion
Captures rows that fail during insertion (e.g., due to truncation, null violations).
Moves successfully processed files from to
Create Variables needed
Setup System File Task with the Destination and Source Full Paths we set as Variables and Operation to Move File