abirch 3 days ago

What my work self would love is to easily dump Pandas or Polars data frames to SQL tables in SQL Server as fast as possible. I see bcp here, but I don't see an example of uploading a large Pandas DataFrame to SQL Server.

RaftPeople 3 days ago | parent | next [-]

> What my work self would love is to easily dump Pandas or Polars data frames to SQL tables in SQL Server as fast as possible

We run into this issue too: we want to upload large volumes of data but don't want to assume access to BCP on every DB server.

We wrote a little utility that is actually pretty fast compared to the other methods we've tested (roughly 1,000,000 rows per minute for a table with 10 columns of random data). Here's the approach:

1-Convert rows into fixed-length strings so each row is uploaded as a single varchar column (which makes parsing and execution of the SQL statement during upload much quicker).

2-Repeatedly upload groups of fixed-length rows into a temp table until everything is uploaded.

Details:

Multiple fixed-length rows are combined into one fixed-length varchar column that is uploaded as a single raw buffer row. We found a buffer size of 15,000 to be the sweet spot.

Multiple threads will each process a subset of source data rows. We found 5 threads to be generally pretty good.

At the end of this step, the destination temp table will have X rows of buffers (the buffer column is just a varchar(15000)), and inside each of those buffers are Y source data rows, each with Z columns in fixed format.

3-Once the buffer rows are all uploaded, split out the source data rows and columns using a temp sproc generated for the exact schema (e.g. substring(Buffer_Data, x, y) as Cust_Name).
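
A rough sketch of the idea in Python with pyodbc (not our actual utility; the table, column names, widths, and the inline INSERT ... SELECT in place of the generated temp sproc are just for illustration, and the threading is left out):

    import pyodbc

    # Fixed widths per destination column (illustrative)
    WIDTHS = {"Cust_Id": 10, "Cust_Name": 40, "Balance": 18}
    ROW_LEN = sum(WIDTHS.values())
    BUFFER_LEN = 15000                      # the sweet spot mentioned above
    ROWS_PER_BUFFER = BUFFER_LEN // ROW_LEN

    def pack(rows):
        # Each row becomes one fixed-length string; rows are then glued into buffers.
        flat = ["".join(str(v)[:w].ljust(w) for v, w in zip(row, WIDTHS.values()))
                for row in rows]
        return ["".join(flat[i:i + ROWS_PER_BUFFER])
                for i in range(0, len(flat), ROWS_PER_BUFFER)]

    source_rows = [(1, "Alice", "10.50"), (2, "Bob", "99.00")]   # stand-in data

    conn = pyodbc.connect("DSN=mydb")       # placeholder connection
    cur = conn.cursor()
    # varchar(max) here, since plain varchar tops out at 8000 characters
    cur.execute("CREATE TABLE #buf (Buffer_Data varchar(max) NOT NULL)")
    cur.executemany("INSERT INTO #buf (Buffer_Data) VALUES (?)",
                    [(b,) for b in pack(source_rows)])

    # Server-side split: one SUBSTRING per column, one output row per packed record.
    cols, pos = [], 1
    for name, w in WIDTHS.items():
        cols.append(f"LTRIM(RTRIM(SUBSTRING(Buffer_Data, n.i * {ROW_LEN} + {pos}, {w}))) AS {name}")
        pos += w
    cur.execute(f"""
        INSERT INTO dbo.Customers (Cust_Id, Cust_Name, Balance)
        SELECT {', '.join(cols)}
        FROM #buf
        CROSS APPLY (SELECT TOP ({ROWS_PER_BUFFER})
                            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS i
                     FROM sys.all_objects) AS n
        WHERE (n.i + 1) * {ROW_LEN} <= DATALENGTH(Buffer_Data)
    """)
    conn.commit()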

bob1029 3 days ago | parent | prev | next [-]

On BULK INSERT, the data source doesn't have to live on the actual MSSQL box. It can be on a UNC path or even in Azure:

https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-...

> Beginning with SQL Server 2017 (14.x), the data_file can be in Azure Blob Storage.

You don't need access to the MSSQL box to invoke high-performance batch loads. A UNC path should give you what you need if your IT department is willing to work with it.
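
A minimal sketch of what that can look like from Python, assuming the SQL Server service account can read the share (paths, table name, and connection string are placeholders, and FORMAT = 'CSV' needs SQL Server 2017+):

    import pandas as pd
    import pyodbc

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    unc_path = r"\\fileserver\staging\frame.csv"        # hypothetical share
    df.to_csv(unc_path, index=False)

    conn = pyodbc.connect("DSN=mydb", autocommit=True)  # placeholder DSN
    conn.execute(f"""
        BULK INSERT dbo.MyTable
        FROM '{unc_path}'
        WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK)
    """)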

A4ET8a8uTh0_v2 3 days ago | parent | prev | next [-]

Honestly, what I find myself fighting more often than not lately isn't the actual data/code/schema, but layers of bureaucracy, restrictions, data leakage prevention systems, and the specific file limitations imposed by all of the above...

There are times I miss being a kid and just doing things.

kermatt 3 days ago | parent | prev | next [-]

While bcp lacks some features that would make this as straightforward as it is in PostgreSQL (e.g. piping data into bcp), it is a fast ingest option for MSSQL.

We wound up staging a local tab delimited file, and importing via bcp:

    bcp "$DESTINATION_TABLE" in "$STAGE_FILE.dat" -u -F 2 -c -t'\t'
Not elegant, but it works.
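
Driven from Python it looks roughly like this (server and auth flags are illustrative; swap in whatever your environment needs):

    import subprocess
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    stage = "stage_file.dat"
    df.to_csv(stage, sep="\t", index=False)            # tab-delimited staging file

    subprocess.run(
        ["bcp", "dbo.MyTable", "in", stage,
         "-S", "myserver", "-T",                       # server + trusted connection (illustrative)
         "-c", "-t\\t",                                # character mode, tab field terminator
         "-F", "2"],                                   # skip the header row
        check=True,
    )
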
qsort 3 days ago | parent | prev | next [-]

How large? In many cases dumping to a file and bulk loading is good enough. SQL Server in particular has OPENROWSET with bulk operations, which is especially handy if you're transferring data over the network.
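
Something along these lines, roughly (the share path, format file, and target table are placeholders; the format file describes the staged file's columns):

    import pyodbc

    conn = pyodbc.connect("DSN=mydb", autocommit=True)  # placeholder DSN
    conn.execute(r"""
        INSERT INTO dbo.MyTable (id, name)
        SELECT src.id, src.name
        FROM OPENROWSET(
            BULK '\\fileserver\staging\frame.dat',
            FORMATFILE = '\\fileserver\staging\frame.fmt',
            FIRSTROW = 2
        ) AS src
    """)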

abirch 3 days ago | parent [-]

Millions of rows. I tried OPENROWSET but ran into permission issues with the shared directory. Using fast_executemany with SQLAlchemy has helped, but sometimes it still takes a few minutes. I tried bcp locally as well, but IT hasn't wanted to deploy it to production.
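
Roughly the kind of setup I mean (connection URL and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server",
        fast_executemany=True,   # batch parameter arrays instead of row-by-row round trips
    )

    df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    df.to_sql("MyTable", engine, schema="dbo", if_exists="append",
              index=False, chunksize=10_000)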

sceadu 3 days ago | parent | prev [-]

You might be able to do it with ibis. I don't know about the performance, though.

abirch 3 days ago | parent [-]

Thank you, I'll look into it. Yes, performance is the main driver when some data frames have millions of rows.