Do not use pandas alone, use it with pyodbc
Yesterday I published a post and shared it on LinkedIn, and it got quite a big reaction after a popular PBI influencer commented on it.
One of the comments, by Jesus Jayaro, caught my attention: he said that pyodbc used with executemany is faster than Data Factory.
So I did some quick research, and it really is fast. I am impressed.
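The whole trick is one attribute on the pyodbc cursor. Here is a minimal sketch of the idea; the connection string, table name and two-column insert are placeholders, not my real setup:

import pyodbc

# placeholder connection details; the full script below reads the real ones from env vars
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=tcp:my-server.database.windows.net,1433;"
    "DATABASE=my_db;UID=my_user;PWD=my_password"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send the whole parameter array in one round trip
rows = [(1, 'a'), (2, 'b')]     # dummy data for a hypothetical two-column table
cursor.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
conn.commit()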
During that quick research I put together a working script. It failed at first, and I did not want to spend too much time on it.
I first tried to do the insert in a single chunk, using a script I found on Medium, and it failed (most likely because fast_executemany buffers the whole parameter array before sending it, so one huge batch can run out of memory). Then I tried small chunks and it worked, taking about 500 seconds. Not bad 👍🏻 compared to the initial 55,000-second estimate: that is roughly 0.9% of the time, so more than 100x faster.
At least in this example, Data Factory was still faster: 5 minutes 35 seconds against 8 minutes 20 seconds for pyodbc.
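A related pattern, if the CSV itself is too big to hold in memory, is to let pandas read it in chunks and feed each chunk straight to the loader. A short sketch, assuming the fast_load function from the script below:

import pandas as pd

# read and load the CSV in 10,000-row chunks instead of holding the full frame in memory
for chunk in pd.read_csv('taxi.csv', chunksize=10_000):
    fast_load(chunk)  # fast_load is defined in the script below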
Here is the script.
import os
import logging

import pandas as pd
import pyodbc

from utils import timing

SQL_SERVER = os.getenv('SQL_SERVER')
SQL_DATABASE = os.getenv('SQL_DATABASE')
SQL_USER_NAME = os.getenv('SQL_AZURE_USERNAME')
SQL_PASSWORD = os.getenv('SQL_AZURE_PASSWORD')
driver = '{ODBC Driver 18 for SQL Server}'


def fast_load(chunk: pd.DataFrame) -> None:
    """Insert one chunk of rows using pyodbc's fast_executemany."""
    connection_string = (
        f'DRIVER={driver};SERVER=tcp:{SQL_SERVER};PORT=1433;'
        f'DATABASE={SQL_DATABASE};UID={SQL_USER_NAME};PWD={SQL_PASSWORD}'
    )
    MY_TABLE = "taxi"
    with pyodbc.connect(connection_string) as conn:
        try:
            insert_stmt = f"INSERT INTO {MY_TABLE} VALUES (?,?,?,?,?,?,?,?,?,?,?)"
            cursor = conn.cursor()
            # pack the whole parameter array and send it in one round trip
            # instead of issuing one INSERT per row
            cursor.fast_executemany = True
            tuples = chunk.values.tolist()
            if tuples:
                cursor.executemany(insert_stmt, tuples)
                cursor.commit()
        except Exception as e:
            # log the failed batch but let the remaining batches continue
            logging.error(e)


def upload_batches(batch_size: int, df: pd.DataFrame) -> None:
    """Slice the DataFrame into batches of batch_size rows and load each one."""
    total_rows = df.shape[0]
    batches = (total_rows + batch_size - 1) // batch_size  # ceiling division
    for i, start in enumerate(range(0, total_rows, batch_size)):
        end = min(start + batch_size, total_rows)
        print(f"uploading batch {i + 1}/{batches}")
        fast_load(df[start:end])


@timing
def run_script() -> None:
    taxi = pd.read_csv('taxi.csv')
    upload_batches(10000, taxi)


run_script()

# func: run_script took: 500.3895 sec
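If you prefer to stay inside pandas, the same fast_executemany switch is also exposed by SQLAlchemy's mssql+pyodbc dialect, so to_sql can use it too. A sketch of that variant, which I have not timed; the server name, credentials and database below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# fast_executemany=True turns on the same pyodbc feature under the hood
engine = create_engine(
    "mssql+pyodbc://my_user:my_password@my-server.database.windows.net:1433/my_db"
    "?driver=ODBC+Driver+18+for+SQL+Server",
    fast_executemany=True,
)

taxi = pd.read_csv("taxi.csv")
taxi.to_sql("taxi", engine, if_exists="append", index=False, chunksize=10000)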
If you can provide a faster script or challenge Data Factory, please contact me on LinkedIn.