Create an account

Very important

  • To access the important data of the forums, you must be active in each forum and especially in the leaks and database leaks section, send data and after sending the data and activity, data and important content will be opened and visible for you.
  • You will only see chat messages from people who are at or below your level.
  • More than 500,000 database leaks and millions of account leaks are waiting for you, so access and view with more activity.
  • Many important data are inactive and inaccessible for you, so open them with activity. (This will be done automatically)


Thread Rating:
  • 751 Vote(s) - 3.49 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to execute large amount of sql queries asynchronous and in threads

#1
**Problem:** I have huge amount of sql queries (around 10k-20k) and I want to run them asynchronous in 50 (or more) threads.

I wrote a powershell script for this job, but it is very slow (It took about 20 hours to execute all). **Desired result is 3-4 hours max.**

**Question:** How can I optimize this powershell script? Should I reconsider and use another technology like `python` or `c#`?

I think it's powershell issue, because when I check with `whoisactive` the queries are executing fast. Creating, exiting and unloading jobs takes a lot of time, because for each thread is created separate PS instances.

**My code:**



$NumberOfParallerThreads = 50;


$Arr_AllQueries = @('Exec [mystoredproc] @param1=1, @param2=2',
'Exec [mystoredproc] @param1=11, @param2=22',
'Exec [mystoredproc] @param1=111, @param2=222')

#Creating the batches
$counter = [pscustomobject] @{ Value = 0 };
$Batches_AllQueries = $Arr_AllQueries | Group-Object -Property {
[math]::Floor($counter.Value++ / $NumberOfParallerThreads)
};

forEach ($item in $Batches_AllQueries) {
$tmpBatch = $item.Group;

$tmpBatch | % {

$ScriptBlock = {
# accept the loop variable across the job-context barrier
param($query)
# Execute a command

Try
{
Write-Host "[processing '$query']"
$objConnection = New-Object System.Data.SqlClient.SqlConnection;
$objConnection.ConnectionString = 'Data Source=...';

$ObjCmd = New-Object System.Data.SqlClient.SqlCommand;
$ObjCmd.CommandText = $query;
$ObjCmd.Connection = $objConnection;
$ObjCmd.CommandTimeout = 0;

$objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter;
$objAdapter.SelectCommand = $ObjCmd;
$objDataTable = New-Object System.Data.DataTable;
$objAdapter.Fill($objDataTable) | Out-Null;

$objConnection.Close();
$objConnection = $null;
}
Catch
{
$ErrorMessage = $_.Exception.Message
$FailedItem = $_.Exception.ItemName
Write-Host "[Error processing: $($query)]" -BackgroundColor Red;
Write-Host $ErrorMessage
}

}

# pass the loop variable across the job-context barrier
Start-Job $ScriptBlock -ArgumentList $_ | Out-Null
}

# Wait for all to complete
While (Get-Job -State "Running") { Start-Sleep 2 }

# Display output from all jobs
Get-Job | Receive-Job | Out-Null

# Cleanup
Remove-Job *

}

***UPDATE*:**

**Resources:** The DB server is on a remote machine with:

- 24GB RAM,
- 8 cores,
- 500GB Storage,
- SQL Server 2016

We want to use the maximum cpu power.

**Framework limitation:** The only limitation is ***not** to use SQL Server* to execute the queries. The requests should come from outside source like: Powershell, C#, Python, etc.
Reply

#2
Try using [SqlCmd][1].

You can use run multiple processes using [Process.Start()][2] and use sqlcmd to run queries in parallel processes.

Of course if you're obligated to do it in threads, this answer will no longer be the solution.

[1]:

[To see links please register here]

[2]:

[To see links please register here]

Reply

#3
You need to reorganize your script so that you keep a database connection open in each worker thread, using it for all queries performed by that thread. Right now you are opening a new database connection for each query, which adds a large amount of overhead. Eliminating that overhead should speed things up to or beyond your target.
Reply

#4
1. Group your queries based on the table and operations on that table.
Using this you can identity how much async sql queries you could run against your different tables.
2. Make sure the size of the each table against which you are going to run.
Because if table contains millions of rows and your doing a join operation with some other table as well will increase the time or if it is a CUD operation then might lock your table as well.
3. And also choose number of threads based on your CPU cores and not based on assumptions. Because CPU core will run one process at a time so better you could create **number of cores * 2** threads are efficient one.

So first study your dataset and then do the above 2 items so that you could easily identity what are all the queries are run parallely and efficiently.

Hope this will give some ideas. Better you could use any python script for that So that you could easily trigger more than one process and also monitor their activites.
Reply

#5
Sadly I don't have the time right this instant to answer this fully, but this should help:

First, you aren't going to use the entire CPU for inserting that many records, almost promised. But!

Since it appears you are using SQL string commands:

1. Split the inserts into groups of say ~100 - ~1000 and manually build bulk inserts:

Something like this as a POC:

$query = "INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES "

for ($alot = 0; $alot -le 10; $alot++){
for ($i = 65; $i -le 85; $i++) {
$query += "('" + [char]$i + "', '" + [char]$i + "')";
if ($i -ne 85 -or $alot -ne 10) {$query += ",";}
}
}

Once a batch is built, then pass it to SQL for the insert, using effectively your existing code.

The buld insert would look something like:

INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES ('A', 'A'),('B', 'B'),('C', 'C'),('D', 'D'),('E', 'E'),('F', 'F'),('G', 'G'),('H', 'H'),('I', 'I'),('J', 'J'),('K', 'K'),('L', 'L'),('M', 'M'),('N', 'N'),('O', 'O'),('P', 'P'),('Q', 'Q'),('R', 'R'),('S', 'S')


This alone should speed up your inserts by a ton!

2. Don't use 50 threads, as previous mentioned unless you have 25+ logical cores. You will spend most of your SQL insert times waiting on the network, and hard drives NOT the CPU. By having that many threads enqueued you will have most of your CPU time reserved on waiting for the slower parts of the stack.

These two things alone I'd imagine can get your inserts down to a matter of minutes (I did 80k+ once using basically this approach in about 90 seconds).

The last part could be refactoring so that each core gets its own Sql connection, and then you leave it open until you are ready to dispose of all threads.
Reply

#6
RunspacePool is the way to go here, try this:


$AllQueries = @( ... )
$MaxThreads = 5

# Each thread keeps its own connection but shares the query queue
$ScriptBlock = {
Param($WorkQueue)

$objConnection = New-Object System.Data.SqlClient.SqlConnection
$objConnection.ConnectionString = 'Data Source=...'

$objCmd = New-Object System.Data.SqlClient.SqlCommand
$objCmd.Connection = $objConnection
$objCmd.CommandTimeout = 0

$query = ""

while ($WorkQueue.TryDequeue([ref]$query)) {
$objCmd.CommandText = $query
$objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter $objCmd
$objDataTable = New-Object System.Data.DataTable
$objAdapter.Fill($objDataTable) | Out-Null
}

$objConnection.Close()

}

# create a pool
$pool = [RunspaceFactory]::CreateRunspacePool(1, $MaxThreads)
$pool.ApartmentState = 'STA'
$pool.Open()

# convert the query array into a concurrent queue
$workQueue = New-Object System.Collections.Concurrent.ConcurrentQueue[object]
$AllQueries | % { $workQueue.Enqueue($_) }

$threads = @()

# Create each powershell thread and add them to the pool
1..$MaxThreads | % {
$ps = [powershell]::Create()
$ps.RunspacePool = $pool
$ps.AddScript($ScriptBlock) | Out-Null
$ps.AddParameter('WorkQueue', $workQueue) | Out-Null
$threads += [pscustomobject]@{
Ps = $ps
Handle = $null
}
}

# Start all the threads
$threads | % { $_.Handle = $_.Ps.BeginInvoke() }

# Wait for all the threads to complete - errors will still set the IsCompleted flag
while ($threads | ? { !$_.Handle.IsCompleted }) {
Start-Sleep -Seconds 1
}

# Get any results and display an errors
$threads | % {
$_.Ps.EndInvoke($_.Handle) | Write-Output
if ($_.Ps.HadErrors) {
$_.Ps.Streams.Error.ReadAll() | Write-Error
}
}


Unlike powershell jobs, a RunspacePools can share resources. So there is one concurrent queue of all the queries, and each thread keeps its own connection to the database.

As others have said though - unless you're stress testing your database, you're probably better off reorganising the queries into bulk inserts.
Reply

#7
I don't know much about powershell, but I do execute SQL in C# all the time at work.

C#'s new async/await keywords make it extremely easy to do what you are talking about.
C# will also make a thread pool for you with the optimal amount of threads for your machine.

async Task<DataTable> ExecuteQueryAsync(query)
{
return await Task.Run(() => ExecuteQuerySync(query));
}

async Task ExecuteAllQueriesAsync()
{
IList<Task<DataTable>> queryTasks = new List<Task<DataTable>>();

foreach query
{
queryTasks.Add(ExecuteQueryAsync(query));
}

foreach task in queryTasks
{
await task;
}
}

The code above will add all the queries to the thread pool's work queue.
Then wait on them all before completing. The result being that the max level of parallelism will be reached for your SQL.

Hope this helps!
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

©0Day  2016 - 2023 | All Rights Reserved.  Made with    for the community. Connected through