2017-07-30

SqlBulkCopy with PowerShell

The challenge

A rather common task is to copy many (or all) rows from one SQL Server database table to another as smoothly and quickly as possible. There is some tuning on the platform that is important, but the copy itself can be done in several very different ways.

A direct table-to-table copy can be done with the .NET SqlBulkCopy class using PowerShell.

Alternatively, T-SQL statements can be used, with fewer possibilities but (maybe) better performance (a sketch follows below):
INSERT INTO ... SELECT FROM, where the target table must be created beforehand but can be placed in any filegroup.
Or SELECT INTO, where the target table is created in the default filegroup (typically primary) if it does not already exist.
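
The INSERT INTO ... SELECT variant could for example be run from PowerShell with Invoke-Sqlcmd. This is a minimal sketch only - not the method measured in this post - assuming the SqlServer (or SQLPS) module, the hypothetical instance name used later in this post, and that the target table already exists:
# A sketch only; identity values are copied explicitly with IDENTITY_INSERT.
Invoke-Sqlcmd -ServerInstance '(local)\SQL2016A' -Database 'target' -Query @'
SET IDENTITY_INSERT [test].[business] ON;
INSERT INTO [test].[business] ([test_id], [test_str], [test_nr])
SELECT [test_id], [test_str], [test_nr]
FROM [source].[test].[business];
SET IDENTITY_INSERT [test].[business] OFF;
'@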

Copying data between a table and a file can also be done with the SQL Server utility bcp.exe.
Copying data from a file to a database table can likewise be done with the T-SQL statement BULK INSERT.
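
A hedged sketch of the file route with bcp.exe, assuming the utility is on the path and that the work folder C:\MSSQL_work exists; -T is integrated security, -n is native format and -E keeps the identity values on import:
# Export from the source table and import to the target table through a native format file.
# Instance name and file path are assumptions for illustration.
& bcp.exe '[source].[test].[business]' out 'C:\MSSQL_work\business.dat' -S '(local)\SQL2016A' -T -n
& bcp.exe '[target].[test].[business]' in 'C:\MSSQL_work\business.dat' -S '(local)\SQL2016A' -T -n -E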

But for now I will focus on SqlBulkCopy with PowerShell.
Yan Pan wrote the great post „Use PowerShell to Copy a Table Between Two SQL Server Instances“ at the Hey, Scripting Guy! blog, but I will go through the details myself to get a deeper understanding.

Using SqlBulkCopy

Identity insert is a special case, as key constraints and NOT NULL constraints are not checked by default. This can be changed using the SqlBulkCopyOptions enumeration value CheckConstraints.
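
A small sketch of combining options, assuming a target connection string in a variable $CnnStrTarget like the one defined in the function below; the values are flags and can be combined with a bitwise OR:
# KeepIdentity preserves the source identity values; CheckConstraints turns constraint checking back on.
$BulkCopyOptions = [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity -bor [System.Data.SqlClient.SqlBulkCopyOptions]::CheckConstraints
$SqlBulkCopy = New-Object -TypeName System.Data.SqlClient.SqlBulkCopy($CnnStrTarget, $BulkCopyOptions)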

With large amounts of data it is important to stream the rows with a SqlDataReader object instead of using a static DataSet object, as the DataSet object will hold all data in memory. This can really stress a server and might bring it down.

There are some interesting articles in the MSDN Library and a good discussion on Stack Overflow about SqlBulkCopy specifically and SQL Server bulk operations in general.

I made the PowerShell function Copy-SqlTable to handle the copy of the data:
function Copy-SqlTable {
<#
.DESCRIPTION
  Copy single table from source database to target database on same SQL Server Database Engine instance.
#>
[CmdletBinding()]
[OutputType([void])]
Param()

Begin {
  $mywatch = [System.Diagnostics.Stopwatch]::StartNew()
  "{0:s}Z :: Copy-SqlTable()" -f [System.DateTime]::UtcNow | Write-Verbose

  [string]$ApplicationName = 'SqlBulkCopy.ps1'

  #Candidates for function parameters:
  [string]$SourceInstanceName = '(local)\SQL2016A'
  [string]$SourceDatabaseName = 'source'
  [string]$SourceTableName = '[test].[business]'

  [string]$TargetInstanceName = $SourceInstanceName
  [string]$TargetDatabaseName = 'target'
  [string]$TargetTableName = $SourceTableName
}

Process {
  'Connect to source...' | Write-Verbose
  [string]$CnnStrSource = "Data Source=$SourceInstanceName;Integrated Security=SSPI;Initial Catalog=$SourceDatabaseName;Application Name=$ApplicationName"
  "Source connection string: '$CnnStrSource'" | Write-Debug
  $SqlCnnSource = New-Object -TypeName System.Data.SqlClient.SqlConnection $CnnStrSource
  $SqlCommand = New-Object -TypeName System.Data.SqlClient.SqlCommand("SELECT * FROM $SourceTableName;", $SqlCnnSource)
  $SqlCnnSource.Open()
  [System.Data.SqlClient.SqlDataReader]$SqlReader = $SqlCommand.ExecuteReader()

  'Copy to target...' | Write-Verbose
  [string]$CnnStrTarget = "Data Source=$TargetInstanceName;Integrated Security=SSPI;Initial Catalog=$TargetDatabaseName;Application Name=$ApplicationName"
  "Target connection string: '$CnnStrTarget'" | Write-Debug
  try {
    $SqlBulkCopy = New-Object -TypeName System.Data.SqlClient.SqlBulkCopy($CnnStrTarget, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
    $SqlBulkCopy.EnableStreaming = $true
    $SqlBulkCopy.DestinationTableName = $TargetTableName
    $SqlBulkCopy.BatchSize = 1000000 # Another candidate for function parameter
    $SqlBulkCopy.BulkCopyTimeout = 0 # seconds, 0 (zero) = no timeout limit
    $SqlBulkCopy.WriteToServer($SqlReader)
  }
  catch [System.Exception] {
    $_.Exception | Write-Output
  }
  finally {
    'Copy complete. Closing...' | Write-Verbose
    $SqlReader.Close()
    $SqlCnnSource.Close()
    $SqlCnnSource.Dispose()
    $SqlBulkCopy.Close()
  }
}

End {
  $mywatch.Stop()
  [string]$Message = "Copy-SqlTable finished with success. Duration = $($mywatch.Elapsed.ToString()). [hh:mm:ss.ddd]"
  "{0:s}Z $Message" -f [System.DateTime]::UtcNow | Write-Output
}
} # Copy-SqlTable
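
Calling the function is then straightforward; a hypothetical invocation, assuming the function is kept in a script file like SqlBulkCopy.ps1 (source and target are still the hard-coded values in the Begin block):
# Dot-source the script and run the copy with verbose output.
. .\SqlBulkCopy.ps1
Copy-SqlTable -Verbose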


Measure

Execution time is measured on each run. For the PowerShell function I used a .NET Stopwatch object. The T-SQL statements are timed by Management Studio by default.

I also kept an eye on memory and CPU usage in Windows Performance Monitor, using the Process object with all counters (*) for the processes sqlservr.exe and powershell/powershell_ise during each run.
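
The same counters can be sampled from PowerShell with Get-Counter. A hedged sketch, assuming an English Windows installation where the counter paths are not localized:
# Sample CPU and working set of the SQL Server process every 5 seconds until stopped.
Get-Counter -Counter '\Process(sqlservr)\% Processor Time', '\Process(sqlservr)\Working Set' -SampleInterval 5 -Continuous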

Finally I captured the actual execution plan of the T-SQL statements and kept an eye on the SQL Server Activity Monitor, e.g. the wait statistics. I also enabled the SQL Server Query Store on both databases at creation to have some extra figures to look at.
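
A hedged sketch of pulling figures back out of the Query Store afterwards, using the catalog views and the same hypothetical instance name as above:
# Top statements in the target database by average duration, read from the Query Store views.
Invoke-Sqlcmd -ServerInstance '(local)\SQL2016A' -Database 'target' -Query @'
SELECT TOP (10) qt.query_sql_text, rs.count_executions, rs.avg_duration
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
ORDER BY rs.avg_duration DESC;
'@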

Create test data

The test data is generated using the batch delimiter "GO" in Management Studio or SQLCMD, with the number of rows as the count argument, to create many rows of source data:
USE [source];
GO
SET NOCOUNT ON;
GO
INSERT INTO [test].[business] ([test_str],[test_nr])
VALUES (NEWID(), CONVERT(int, RAND()*2147483647));

GO 1000000000


Define SQL Server objects

The source database and table are created simply, but still with parallel I/O in mind:
CREATE DATABASE [source]
ON PRIMARY
  (NAME = N'source_data', FILENAME = N'C:\MSSQL_data\source.primary.mdf',
  SIZE = 8MB, FILEGROWTH = 10MB ),
FILEGROUP [user_data]
  ( NAME = N'user_data00',
   FILENAME = N'C:\MSSQL_data\source.user_data00.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data01',
   FILENAME = N'C:\MSSQL_data\source.user_data01.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data02',
   FILENAME = N'C:\MSSQL_data\source.user_data02.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data03',
   FILENAME = N'C:\MSSQL_data\source.user_data03.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data04',
   FILENAME = N'C:\MSSQL_data\source.user_data04.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data05',
   FILENAME = N'C:\MSSQL_data\source.user_data05.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data06',
   FILENAME = N'C:\MSSQL_data\source.user_data06.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB),
  ( NAME = N'user_data07',
   FILENAME = N'C:\MSSQL_data\source.user_data07.ndf',
   SIZE = 128MB, FILEGROWTH = 10MB)
LOG ON
  ( NAME = N'source_log',
   FILENAME = N'C:\MSSQL_translog\source_log.ldf',
   SIZE = 56MB, FILEGROWTH = 10MB);
GO

ALTER DATABASE [source] SET QUERY_STORE = ON;
ALTER DATABASE [source] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO

USE [source];
GO
CREATE SCHEMA [test];
GO
CREATE TABLE [test].[business] (
  [test_id] bigint NOT NULL IDENTITY (1, 1),
  [test_str] nvarchar(256) NOT NULL,
  [test_nr] int NOT NULL
  ) ON [user_data];
GO

USE [master];
GO


The target database and table are created in a similar way, but on another drive to further optimize I/O:
CREATE DATABASE [target]
ON PRIMARY
  (NAME = N'target_data', FILENAME = N'M:\MSSQL_data\target.primary.mdf',
  SIZE = 8MB, FILEGROWTH = 8MB ) ,
FILEGROUP [user_data]
  ( NAME = N'user_data00',
   FILENAME = N'M:\MSSQL_data\target.user_data00.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data01',
   FILENAME = N'M:\MSSQL_data\target.user_data01.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data02',
   FILENAME = N'M:\MSSQL_data\target.user_data02.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data03',
   FILENAME = N'M:\MSSQL_data\target.user_data03.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data04',
   FILENAME = N'M:\MSSQL_data\target.user_data04.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data05',
   FILENAME = N'M:\MSSQL_data\target.user_data05.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data06',
   FILENAME = N'M:\MSSQL_data\target.user_data06.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB),
  ( NAME = N'user_data07',
   FILENAME = N'M:\MSSQL_data\target.user_data07.ndf',
   SIZE = 1792MB, FILEGROWTH = 32MB)
LOG ON
  ( NAME = N'target_log',
   FILENAME = N'C:\MSSQL_translog\target_log.ldf',
   SIZE = 56MB, FILEGROWTH = 16MB);
GO

ALTER DATABASE [target] SET QUERY_STORE = ON;
ALTER DATABASE [target] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO

USE [target];
GO
CREATE SCHEMA [test];
GO
CREATE TABLE [test].[business] (
  [test_id] bigint NOT NULL IDENTITY (1, 1),
  [test_str] nvarchar(256) NOT NULL,
  [test_nr] int NOT NULL
) ON [user_data];
GO

USE [master];
GO


Evaluation

The first measurement is a basic run with the default batch size; the following runs vary the BatchSize property:

Copy-SqlTable finished with success. Duration = 00:13:28.0940294. [hh:mm:ss.ddd]; 134 308 637 rows; 12.3 GB data
Copy-SqlTable finished with success. Duration = 00:16:12.9162091. [hh:mm:ss.ddd]; BatchSize = 1 000
Copy-SqlTable finished with success. Duration = 00:11:34.3647701. [hh:mm:ss.ddd]; BatchSize = 10 000
Copy-SqlTable finished with success. Duration = 00:10:15.7085043. [hh:mm:ss.ddd]; BatchSize = 100 000
Copy-SqlTable finished with success. Duration = 00:10:00.1098163. [hh:mm:ss.ddd]; BatchSize = 1 000 000

2017-07-06

COM objects with PowerShell

COM (Component Object Model) is a rather old technology from Microsoft, but it is still relevant in practice and recognized by Microsoft. There is actually a new COM implementation in Windows Management Framework (WMF) 5.0 with significant performance improvements, as described in „What's New in Windows PowerShell“. Also some issues are fixed with WMF 5.1 (Bug Fixes).

To get a list of the COM components available on the computer you can use this PowerShell statement:
Get-ChildItem HKLM:\Software\Classes -ErrorAction SilentlyContinue |
  Where-Object {
    $_.PSChildName -match '^\w+\.\w+$' -and (Test-Path -Path "$($_.PSPath)\CLSID")
  } |
  Select-Object -ExpandProperty PSChildName |
  Out-GridView

On the computer I am writing this on there are 1219 COM components…
The statement above is part of the short and fine article „Get a list of all Com objects available“ by Jaap Brasser.

On MSDN there are some nice introductions:
MSDN Library: „Creating .NET and COM Objects (New-Object)“.

A COM object is created with the PowerShell cmdlet New-Object, using the parameter -Strict to make the script more robust when the COM component uses an Interop Assembly.

The integration between COM and .NET is based on COM Interop.

Microsoft Office

Each product in Microsoft Office has at least one COM component. Usually they are used inside an Office product with Visual Basic for Applications (VBA), but VBA code can quite easily be refactored to VBScript.

Excel

The central COM component when working with Excel is the „Application Object“. This is a very short and incomplete example:
$Excel = New-Object -ComObject Excel.Application -Property @{Visible = $true} -Strict -ErrorAction SilentlyContinue

$Workbook = $Excel.Workbooks.Add()
$Workbook.Author = 'Niels Grove-Rasmussen'
$Workbook.Title = 'PowerShell COM sandbox'
$Workbook.Subject = 'Sandbox on using COM objects in PowerShell'

$Workbook.Sheets.Item(3).Delete()
$Workbook.Sheets.Item(2).Delete()

$Sheet1 = $Workbook.Sheets.Item(1)
$Sheet1.Name = 'COM'

$Sheet1.Range('A1:A1').Cells = 'Cell A1'
$Sheet1.Range('B1:B1').Cells = 2
$Sheet1.Range('B2:B2').Cells = 3.1
$Sheet1.Range('B3:B3').Cells.Formula = '=SUM(B1:B2)'

$Excel.Quit()

Using -Strict on the cmdlet New-Object also generates 5+ lines of output about IDispatch. This is suppressed with -ErrorAction SilentlyContinue.

Working with Excel through COM will often - if not always - end up using Application.Sheets. In the example above it is used to give the sheet a custom name and to add values or a formula to cells.

The Excel object does not close cleanly, so some COM cleanup is required to reset the COM environment after an Excel object is closed.
while( [System.Runtime.Interopservices.Marshal]::ReleaseComObject($Excel) ) { 'Released one Excel COM count.' | Write-Verbose }
Remove-Variable Excel
[System.GC]::Collect()
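
An alternative sketch of the same cleanup, assuming the more drastic FinalReleaseComObject call is acceptable; it sets the reference count of the runtime callable wrapper to zero in one step:
# Release the COM wrapper in one call and let the garbage collector finish the job.
[System.Runtime.InteropServices.Marshal]::FinalReleaseComObject($Excel) | Out-Null
Remove-Variable -Name Excel
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()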

As this is an old challenge there is a Microsoft article on the more general subject: „Getting Rid of a COM Object (Once and For All)“. One interesting point about Excel in the article is that when Excel is closed manually an EXCEL.EXE process can still be active.

Word

This is another example of a COM component in Microsoft Office. The naming is similar to Excel, so the central class is the Application Class.

$Word = New-Object -ComObject Word.Application -Property @{Visible = $true} -Strict -ErrorAction SilentlyContinue

$Word.Quit()
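
A slightly longer, hedged sketch under the same naming assumption: add a document, write a line of text and close without saving:
$Word = New-Object -ComObject Word.Application -Property @{Visible = $true} -Strict -ErrorAction SilentlyContinue

$Document = $Word.Documents.Add()
$Word.Selection.TypeText('Hello from PowerShell through COM.')
$Document.Close(0)  # 0 = wdDoNotSaveChanges

$Word.Quit()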


Internet Explorer

The old Microsoft web browser has a COM component that used to be popular for GUI automation. Some of the naming is aligned with the Microsoft Office COM components - or the other way around. The central class is the Application Class.

$IE = New-Object -ComObject InternetExplorer.Application -Strict
$IE.Visible = $true

$IE = $null
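
A hedged sketch that goes a small step further: navigate to a page, wait for it to load and read the title before quitting:
$IE = New-Object -ComObject InternetExplorer.Application -Strict
$IE.Visible = $true

$IE.Navigate('https://www.microsoft.com')
while ($IE.Busy -or $IE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }  # 4 = READYSTATE_COMPLETE
$IE.Document.title | Write-Output

$IE.Quit()
$IE = $null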


Windows Script Host

Windows Script Host (WSH) is the „old“ Windows automation technology from before PowerShell. By default there are script engines for Visual Basic Scripting Edition (VBScript) and JScript. Other languages have been implemented to integrate with WSH, e.g. Perl and REXX. There are many old but good books, articles and examples around. Personally I prefer JScript, as it has more robust error handling than VBScript - and other subjective benefits.

WScript

In this part of WSH the central COM class is WshShell.

$vbInformation = 64  # Information icon
$vbOKCancel = 1      # OK and Cancel buttons

$Timeout = 0  # 0 = wait for a button click
$Wsh = New-Object -ComObject WScript.Shell -Strict
[int]$Button = $Wsh.Popup('Hello from PowerShell!', $Timeout, 'PowerShell', $vbInformation + $vbOKCancel)
$Wsh = $null
''
switch($Button) {
  -1 { '???' }  # No button was clicked before the timeout
  1 { ':-)' }   # OK
  2 { ':-(' }   # Cancel
}

The Popup method used above is not the nicest way to put a popup window in a PowerShell script. I have elsewhere written a nicer solution for a PowerShell MessageBox.
I think that if you need GUI elements you should look at .NET Windows Forms before WSH.
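
As a hedged sketch of that alternative - not the MessageBox solution referred to above - a message box through Windows Forms could look like this:
# Load Windows Forms and show an OK/Cancel message box with an information icon.
Add-Type -AssemblyName System.Windows.Forms
$Buttons = [System.Windows.Forms.MessageBoxButtons]::OKCancel
$Icon = [System.Windows.Forms.MessageBoxIcon]::Information
$Answer = [System.Windows.Forms.MessageBox]::Show('Hello from PowerShell!', 'PowerShell', $Buttons, $Icon)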

Controller

WshController

Network

WshNetwork

Windows Shell

The central COM component to Windows Shell is the Shell object.

Actually the name „Windows Shell“ is somewhat misleading, as the term is usually used for the Windows command line shell cmd.exe. So if you are talking with others about the COM class Shell.Application you should make sure to distinguish it from the cmd.exe Windows Shell.

$Shell = New-Object -ComObject Shell.Application -Strict
$Shell.Open('C:\')

$Shell = $null
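
A hedged sketch going a little further with the same object: open a folder as a Namespace and list a few of its items:
$Shell = New-Object -ComObject Shell.Application -Strict

$Folder = $Shell.NameSpace('C:\Windows')
$Folder.Items() | Select-Object -First 5 -Property Name, Type

$Shell = $null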


ScriptControl

The ScriptControl COM component can host and run script code, e.g. VBScript or JScript, from another application.

$Script = New-Object -ComObject MSScriptControl.ScriptControl -Strict
$Script.Language = 'vbscript'
$Script.AddCode('function getInput() getInput = inputbox(...')
$Result = $Script.Eval('getInput')  # $Input is an automatic variable in PowerShell, so another name is used

This is a rather poisonous way to call a piece of VBScript code, as the Eval method is used. The method opens the script up to injection attacks.

Disclaimer: The example above should not be used in production!

When I have a more suitable example on using ScriptControl this example will be replaced.

History

2017-06-06 Blog entry started as placeholder for ideas and old notes. Not released.
2017-07-06 Notes rewritten, restructured and released.

IT infrastructure principles

These are some thoughts on good general principles for IT infrastructure. The principles are my personal thoughts, but in content and structure they are inspired by the TOGAF Architecture Principles
(pubs.opengroup.org/architecture/togaf9-doc/arch/chap23.html) from TOGAF 9.1.

Principle: Comply or Explain

Statement: Each solution or architectural element should comply with these principles. If that is not possible, the deviation must be documented with its motivation.

Rationale: This principle enforces the other principles and makes sure they are continuously evaluated.

Implications: All architectural elements are documented thoroughly, in a way that can actually be used in future development and governance of the system.


Principle: Business Support

Statement: Each architectural element must support the business - not only the current business but also the future business.

Rationale: The architecture is sponsored by the business and should therefore support the business.

Implications: The architecture, the processes and the technology must have the business as their primary subject.
Data must be available and systems running (continuity). Easy and direct access to information.


Principle: Compliance

Statement: Each component, each element and the combined solution must comply with law (national and international), regulations and audit (internal and external).

Rationale: This is to protect the organisation, senior management and the board from legal, economic or reputational consequences.

Implications: Business, audit, management, development and operations must cooperate to ensure compliance in all details. This will usually involve several different areas like organisation, procedures and technology. Each person must take ownership.

Principle: Secure by Design

(collection point of security principles?)
Attack Surface Reduction
Threat Modelling
"Microsoft Security Development Lifecycle" (https://www.microsoft.com/en-us/sdl/)

The Microsoft presentation „Basics of Secure Design Development Test“ has many interesting points, e.g. a rather well formulated description of an asymmetric problem:
We must be 100 % correct, 100 % of the time, on a schedule, with limited resources, only knowing what we know today.
- And the product has to be reliable, supportable, compatible, manageable, affordable, accessible, usable, global, doable, deployable...
They can spend as long as they like to find one bug, with the benefit of future research.

Cost for attacker to build attack is very low
Cost to your users is very high
Cost of reacting is higher than cost of defending

Principle: Protect data

Statement: Protect both data at rest and data in transit.

Rationale: Data is vital to the organisation and its business, and protection of data must have a very high priority.

Implications: Classify data, classify risk, use encryption.

Principle: Reduce Complexity

Statement: Assign and enable only required components to the system. Enable only the required feature level.

Rationale: Complexity overhead generates extra administration and governance, and makes operations and problem analysis more complex.
Acquiring or assigning unused resources adds unnecessary cost.

Implications: Allocate only required resources to each service, e.g. processor, storage or memory. Install and enable only required services, e.g. Full Text features or Integration Services only when directly required. This is also an economic issue, e.g. on licensing.
This principle is often described in other, similar but partial, principles like "Least Service" or "Least Resource":
  • Least Service: install only services that are actually needed on each component, e.g. only the RDBMS on a database server and management tools on other computers. There might also be an economic issue, e.g. on licensing.
  • Least Resource: allocate only required resources to each service, e.g. processor, storage or memory. This is also an economic issue, e.g. on licensing.


Principle: Least Privilege

Statement: A security principle to minimize the risk of data exposure.

Rationale: Protect the data and reduce risk.

Implications: Do not use default security groups. Grant only minimum access and rights. Use a fine-grained security model for all roles.


Principle: Segregation of duties

Statement: Security, Audit

Rationale:

Implications: Role Based Access Control (RBAC) on the system - not the resource.


Principle: Defense in Depth

Statement: Security, Compliance

Rationale:

Implications: Segmentation, precisely defined layers, and hardening of each component.


Principle: Automation

Statement: Automate as much as possible - that is, everything but the decision itself.

Rationale: With automation, the infrastructure and the architectural components above it, like platform or application, will be handled with both stability and quick response. If there is an enterprise principle on agility, automation is a requirement.

Implications: Standards, Root Cause Analysis (RCA) and a real fix.
Consider striving for a production environment without human administrators. Later this can be extended to other environments.

Principle: Scalability

Statement: Architect for scale - in all aspects, such as capacity, response time, geography and integration.

Rationale:

Implications:

Principle: Easy Exit

Statement: Choose the solution that is the easiest to leave.

Rationale: When a solution is tightly coupled to a single technology or product, it will be very difficult and costly to migrate the system to another solution.

Implications: You might look more at open formats if you rate this principle high in a given challenge.