SQL Server geeks: October 2007

Tuesday, October 30, 2007

Checkpointing in SSIS

By : Kasim Wirama, MCDBA, MVP SQL Server

If you ever developed sophisticated DTS package, and there is another error in one of the task, usually you have to reexecute the package from beginning. For complex package and big data that get involved in ETL process, it will consume considerably amount of time wasted. In SSIS, you have option not to start package from beginning in case the package gets error. The feature in SSIS to address this issue is checkpoint.
Checkpoint will note execution information to a file, when the task in a package gets failure to execute. When you correct the task, and re-execute the package, the beginning of execution will start at the failed task not the first task of a package. The package will read execution in the file, and resume executing the task and the rest. Once it is successfully executed, the package will remove the file. Nice feature that help you a lot when you develop a sophisticated package.
To enable checkpointing of a package, you define package’s properties in Checkpoints section, you specify Checkpointfilename, set CheckPointUsage to IfExists or always, and set to true for SaveCheckpoint. Then you set FailPackageOnFailure property of a task to true, otherwise the package will not create checkpoint file if it fails.
To simulate the problem, create 3 execute SQL Task package, set second package property name ForceExecutionResult to failure, and run the package. If you notice the output window, package status will have failure status, then you will find checkpoint file that you specify on Checkpointfilename property. Let’s say you have corrected the task, for demonstration purpose, set ForceExecutionResult property of second task from Failure to success. Run the package again, this time you notice that package will be executed from the second task (failed task) rather than first task. Once it is successfully executed, the checkpoint file will no longer exist, because the package deletes the checkpoint file that no longer useful.
Checkpoint feature in SSIS is brand new feature that worth to be implemented, to make your SSIS package is more reliable and manageable.

Sunday, October 28, 2007

Managing Transaction in SSIS

By : Kasim Wirama, MCDBA, MVP SQL Server

SSIS supports transactions as you usually use in TSQL, there are 2 kinds of transactions in SSIS :
1. Distributed Transaction
This transaction uses 2 phase commit transaction with Distributed Transaction Coordinator (DTC) service. Make sure this service runs before you run your package that uses the service. You don’t have to explicitly state begin tran…commit/rollback tran, because they are handled automatically by DTC service

2. Native SQL Server engine transaction
This transaction uses explicit BEGIN TRAN/COMMIT/ROLLBACK TRAN, explicitly in Execute SQL Task component. You must setting up the involved connection manager’s RetainSameConnection property from false to true.

For first alternative, there are 3 options for TransactionOption properties :
1. NotSupported
The component will not join to transaction scope.
2. Supported
The component will join parent’s transaction scope if exists
3. Required
The component will join parent’s transaction scope if exists, otherwise it will start new component.

Let’s start by create 3 Execute SQL Tasks that refers to same connection manager. Point to AdventureWorks, and you can create dummy table called test with 1 character column. Put there 3 each of these 3 statements into each of Execute SQL Tasks.
Create table dbo.dummy (cola varchar(10) not null);
Insert into dbo.dummy values (‘a’);
Drop table dbo.dummy;

Set up property TransactionOption from Supported (default) to Required for package level. Make sure DTC service already runs, otherwise, the package will fail.

Run your package, and there will be a table created and populated in AdventureWorks database.

For second alternative, you don’t have dependency to DTC service. DTC service is used when you want span transaction over more than 1 connection, tasks even more than 1 package, otherwise, it is simple to use native transaction TSQL statement.

Managing One Temporary Table in SSIS

By : Kasim Wirama, MCDBA, MVP SQL Server

In T-SQL, you must have used temporary table to keep temporary resultset. Temporary table is unique per connection, so if there is 2 users creating connection to create temporary table, even though the name is same, but two of them doesn’t interact each other. How it becomes possible? In engine level, SQL Server create temporary table with suffix with connection id.

If you create temporary table in SSIS, you must refers it to the same connection manager, otherwise SSIS will consider the temporary table is created and manipulated under different connection id.

You can create connection manager, pointing to tempdb database. Now I would like to do temporary table creation , insert new record, and drop the table, each of them is put on separate Execute SQL Task.

If you run the package, it will display red background in second task, the error message is : "Invalid object name '’.". It seems that SSIS consider the connection of the second task is different from the connection of the first task, even though both tasks refers to the same connection manager.

You should see property of the connection manager called RetainSameConnection. The default value is false, meaning that SSIS will create another connection session. To just make current connection of one task is shared to other connection session of previous task, you can change the default value from false to true.

Run you package again. This time, SSIS recognises connection session among three packages refers to one connection session.

Setting Up dynamic property value at SSIS

By : Kasim Wirama, MCDBA, MVP SQL Server

SSIS offers more flexibility and debugging feature that doesn’t exists in previous ETL version (DTS). This article I will let you know one of property of each component or task that makes your SSIS package more flexible through Expression Editor and more configurable ever than before. Even you can change your parameter value from XML file.

For example, I would like to iterate each files in particular folder, and I make this particular folder is configurable, so I have change input parameter from outside SSIS package, input parameter will be got from outside SSIS environment for example, from XML file or through command line.

Let’s jump into new SSIS package for making it clear for you. First, you define package level type SSIS variable. Name it DirectoryName and FullFileName, and their types are string.

Then you drag For Loop Each container component, and configure its property through Foreach Loop Editor. At Collection section, you define whatever string inside Folder textbox, and make sure Fully Qualified radio button at Retrieve File Name section is selected. Now the dynamic part is located in Expressions of the component. Click on it to display Property Expression Editor. On Property column select Directory and on Expression click ellipsis button to display Expression Builder, and drag variable DirectoryName from Variables list box to Expression textbox, you can click Evaluate Expression button to make sure before you go back to Expression Editor window by clicking OK. Now you can change DirectoryName not directly into Foreach Loop component, but from variable.

The task has output of full file name (with folder path), and you need to provide variable that will receive return string value from the task. Let’s go to Variable Mappings, and set Variable column to variable FileName, and click OK to go back to Control Flow designer. Drag Script Task into Foreach Loop Container task, and open its editor, on Script section specify FileName in ReadOnlyVariables because I just want to pop up its value through message box. Click on design script button to open another new Visual Studio IDE, and type MsgBox(Dts.Variables(“User::FileName”). Value.ToString()), close it.

Now you fill in DirectoryName variable, that you want to pop up all files inside the directory. Run the package. And it will pop up file names whose files reside in the directory.

Now you need to expand your DirectoryName variable to outside SSIS package, so each time you don’t have to edit your SSIS when directory name changes. Choose SSIS menu and select Package Configurations. Click Add button. Choose XML configuration file type and specify its name, on next screen you choose Value property of DirectoryName.

Now you can specify command line that calls the package as below:

DTEXEC /File “your dts file path” /CONFIGFILE “your xml config file path” /MAXCONCURRENT “-1” /CHECKPOINTING OFF /REPORTING N

With command line specification above, you can achieve same behaviour as it is in Visual Studio IDE. Your package parameter doesn’t tightly coupled with its package by extracting out to XML file, so you can change the value in XML file without having to change to package.

Thursday, October 25, 2007

Slowly Changing Dimension in SSIS

Slowly Changing Dimension in SSIS
By: Kasim Wirama, MCDBA, MVP SQL Server

Another SSIS component that is interesting to examine is Slowly Changing Dimension. Term of SCD (slowly changing dimension) is used to describe changing row attribute in dimension or lookup table, such as change of product price. Changing of these row attributes has some implications of table design, depending whether business user want to keep its history or just override the destination the cell value.

In this case, there are 3 solutions depending on how scenario business users would like to choose:
1. do not keep historical data
it’s simple to be implemented, but users couldn’t see historical value, because old value is overridden by new value
2. keep historical data
users can see its historical value, but table will grow bigger, and transaction table that refers to lookup table should be updated its foreign key reference to lookup table (SCD table)
3. keep historical data in one column
users can see last history value, but you will implement by your own code because the component do not provide for this kind of solution.

Let’s take a look below sample. I create simple table called dbo.EmployeeSalary that keeps salaries of each employee. You can take scripts below :

create table dbo.EmployeeSalary
(
EmployeeID int not null identity(1,1) primary key,
EmployeeNumber char(3) not null ,
EmployeeName varchar(10) not null,
SalaryPerMonth int not null,
[Address] varchar(15) not null,
SalaryStatus varchar(10) not null default ('current')
);

Now, there are some changes, first change is to raise salary of John from 3500 to 4000, second one is to change smiley address from garden road to Sudirman road. Users want to keep history of salary changes but override address changes. So for this case, you will examine how to use this component. But here is the text file that contains data changes, named it list.txt
empnumber empname salary permonth address
001 john 4000 orchad road
002 smiley 3000 sudirman road

Create SSIS package, add Flat file source and SCD component to Data Flow designer, connect two of them, point the Flat File source component to the text file (list.txt), change column type of salary permonth from string to integer. On SCD wizard, you create destination OLEDB point to table dbo.EmployeeSalary, and setting up address (Input column) to Address (Dimension column) as Not a key column, similar to empname, salary per month, and business key for empnumber because it uniquely identifies each instance of employee from business user perspective.

Next, you specify SalaryPerMonth as HistoricalAttribute and Address as ChangingAttribute.

On HistoricalAttributeOptions, select SalaryStatus on Columns to indicate current record, Current at Value When Current, and Expired at Expiration Value.

When you complete the wizard, run the package. If the package runs successfully, you will get 3 rows at dbo.EmployeeSalary. With expired row on salary changes belonged to John and Smiley’s updated address.

If you see final result of the package, SCD component will expand to several component, so it will make implementation of SCD much more easier in SSIS world.

Wednesday, October 24, 2007

Unpivot columns in SSIS

Unpivot columns in SSIS
By : Kasim Wirama, MCDBA, MVP SQL Server

If you see my previous article about pivoting data using SSIS component called Pivot, you will agree with me that it is easy to use, all you need to do is to define relational query that in turns will be converted to the pivot format. Now here is the vice versa condition, you are given table in pivot format, and there is a requirement that need to convert from pivot format to unpivot format. SSIS provides the component to make it possible and easy to use, the component name is Unpivot component.

Some properties of the component that you must configure are :
1. input columns
columns that are needed as input column stream
2. destination column
a column name that you will unpivot from pivoted columns to unpivoted rows
3. pivot key value
Unique name for each pivot columns, usually same with input columns.
4. pivot key value column name
New column name in unpivoted form that contains caption name of pivoted columns.

Let’s look pivoted form at AdventureWorks database, and you query select statement from view called [Sales].[vSalespersonsalesbyfiscalyears]. You notice that it displays sales amount for each sales person from 2002 until 2004.
And final results of unpivoted as follows :
SalesYear AmountSales SalesPersonID FullName Title SalesTerritory
2002 1951086.8256 275 Michael G Blythe Sales Rep Northeast
2003 4743906.8935 275 Michael G Blythe Sales Rep Northeast
2004 4557045.0459 275 Michael G Blythe Sales Rep Northeast
2002 2800029.1538 276 Linda C Mitchell Sales Rep Southwest
2003 4647225.4431 276 Linda C Mitchell Sales Rep Southwest
2004 5200475.2311 276 Linda C Mitchell Sales Rep Southwest
…..so forth

SalesYear and AmountSales are new columns in unpivoted format. Pivot key values are pivoted column (2002-2004). Pivot key value column name is caption name of pivot column name that is mapped to SalesYear unpivot column name. Destination columns are AmountSales. Input columns are all columns including pivoted columns.

You can see how it works by starting new packages, and dragging OLEDB source component, OLEDB destination component, and Unpivot component. Connect each of them from Source – Unpivot – Destination, and specify query for the view at source component. If you open transformation editor of unpivot component, you will see 4 properties as I mention above, configure it. Configure columns of salespersonid, fullname, title, salesterritory as pass through column, the rest (pivoted columns) as input columns.

Run your package, and the final results similar to what I display above.

With fewer steps of configurations, you will do unpivot data extremely easily.

Thursday, October 18, 2007

Implementing Conditional Task Execution with SSIS

By: Kasim Wirama, MCDBA, MVP SQL Server

If you have ever developed DTS package, you must be familiar with Precedence Constraint (Success, Failure, Completion) move from one task to another task. Some enhancements for these Precedence Constraint is added with conditional that you can apply with expression such as @var1 > 5, that should be evaluated before move execution from one task to another task.

In SSIS, you can define execution condition not only by constraint (success or failure) but also by conditional expression (<, >, <= , >= , <>) and mixed (constraint and expression) with AND and OR operator.

Let say, you have 2 tasks at Control Flow Designer, first task is Data Flow Tasks, second task is Execute SQL Tasks, you want second task is executed when first tasks contains records and it is executed without error.

You can achieve this scenario by using RowCount Component that counts number of rows returned from previous tasks. This component requires a variable that saves value of number of rows.

You can try it out, by make new package. Make OLEDB connection at connection manager area. Specify OLEDB source as SQL Server , database AdventureWorks.

Next you create SSIS variables, name it Var1 with Int32 type. And drag Data Flow Tasks into Control Flow designer area, then double click to switch to Data Flow designer. Drag OLE DB source, specify connection manager from OLEDB connection that you have just created, and point to table AWBuildVersion.

Drag RowCount component, connect OLEDB Source’s green arrow to it, and then double click RowCount. At Component Properties tab, assign Var1 variable to VariableName property, close the window by clicking OK button.

Switch back to Control Flow designer, drag Execute SQL task from toolbox to below Data Flow tasks, and connect Data Flow tasks to Execute SQL task. Edit Execute SQL task, and assign ConnectionType property as OLEDB, Connection property as OLEDB connection, and SQLStatement as select ‘dummy’. Edit precedence constraint (green arrow between 2 tasks), and in Precedence Constraint Editor choose Expression and Constraint, fill out Expression entry with @Var1 > 1 (assume number of rows in AWBuildVersion table is 1, so the expected behaviour is Execute SQL Task has been never got executed because it doesn’t pass from conditional expression @Var1 > 1).

Close Precedence Constraint Editor by click OK button. And you execute package now, for first time second task doesn’t get executed because number of rows is 1

End package execution, edit the precedence constraint, this time you change Expression to @Var1 <=1, execute the package again. This time you see that second task gets executed. The usage of RowCount component comes into play for this scenario.

SSIS gives you more choice by providing Expression criteria, besides constraint (success, failure, completion) that must be fulfilled by next task before allowing the task gets executed and combination between constraint and expression.

Tuesday, October 16, 2007

Pivoting data with SSIS

By : Kasim Wirama, MCDBA, MVP SQL Server

SSIS provides other interesting component for data pivoting. It is called Pivot component. It will automate pivot requirement. There are some mandatory properties that you should set up to make it display your data correctly.

1. Pivot Column

Is a column that you would like to change it from rows to columns.

2. Row Column

Is a column that comes from rows in a column.

3. Value column

Is an aggregated value based on row column.

For example, you will display total quantity for first 3 months of one year sales quantity for each product. In this case you will specify month as pivot column, content of month as Row column, and total quantity as Value column, as query below :

Select
p.[name],
p.productnumber,
datepart(mm,th.transactiondate) as month,
sum(th.quantity) as total
from production.product as p
inner join production.transactionhistory as th
on p.productid = th.productid
where th.transactiondate between '20040101' and '20040301'
group by p.[name],p.productnumber,datepart(mm,th.transactiondate)
order by p.[name],p.productnumber,datepart(mm,th.transactiondate)

You can try to use it by dragging OLEDB destination and Pivot component to Data Flow designer and connect OLEDB to Pivot component.

In Input tab of Pivot component, you check all columns, and in Input and Output Properties tab, you can see all input columns that you specify in second tab. Now you specify PivotUsage property for Name column of input columns to 0 (row), productNumber column to 1 (key row), month column to 2 (pivot column), total column to 3 (pivot value).

Expand Pivot Default tab, and add output column by clicking Add Column button. Name it ProductName, set Source Column property to LineageID of Name column (input column). Add new column, name it ProductNumber, set its Source Column property to LineageID of ProductNumber column (input column).

Add new output column, name it January, set Pivot Key Value to 1, and Source Column to LineageID of total column (input column), similar to new columns for February and March, the only difference is setting up their PivotKeyValue properties.

Last step is add RecordSet destination, and connect Pivot component to it, add variable with object type and assign It to VariableName property of first tab, check all columns at the second tab.

You run the package. You will see month columns at row level becomes column names.

Referring To Your Reference Table with Fuzzy Lookup

Referring To Your Reference Table with Fuzzy Lookup
By: Kasim Wirama, MCDBA, MVP SQL Server

One of problem of ETL world is referencing non existent data. For example when one column of fact table that is foreign key to a dimension. It is usually caused by bad data in the fact table. So how to handle it more appropriately in cleansing data with less effort. The answer is that using Fuzzy Lookup. This component will help most of painful process in dealing with data cleansing.

This component provides 2 important informations besides cleansed data:

1. Similarity

Is a number between 0 and 1 representing similarity. You can define how close the similarity is. The close the number to 1, the closer similarity should be met.

2. Confidence

Is a number between 0 and 1 representing confidence of match relative to set of matched results.

I will show you how to use this component.

First you make a flat file named employeelist.txt, its content is as below, separated with tab among columns.

EMPNAME EMPTITLE
HOWARD GEN MANAGER
MATTHEW GENERAL MANAGER
LINDA GENERAL MGR
SUSAN JUNIOR MANAGER
ERIC JUNIOR MGR
JOHN JUN MANAGER

Next you created below script in tempdb database, a lookup table :

create table dbo.OccupationTitle
(
OccupationTitleID int not null,
OccupationTitleDesc varchar(30) not null
);

go

insert into dbo.OccupationTitle values (1,'GENERAL MANAGER');
insert into dbo.OccupationTitle values (2,'JUNIOR MANAGER');

Now, you create SSIS package with Visual Studio 2005, more details how to create this, just refer to early links about SSIS.

Drag Data Flow task into Control Flow designer, double click on the task to switch to Data Flow designer. Drag Flat File Source, double click on it, create new Flat File Connection Manager by clicking New button. And create your flat file connection manager as previous article I ever write before, with Column Names in the first data row checked.

Drag Fuzzy Lookup and connect to it. Double click on it, specify OLEDB connection manager and select reference table that you have just created on previous step

On columns tab connect arrow between EMPTITLE and Occupation TitleDesc, specify input column to EMPTITLE, Lookup column to OccupationTitleDesc, Mapping Type to Fuzzy, and check all comparison Flag check box. And check all column on Available Lookup Column and Available Input Column.

On Advanced tab, you can define your own Similarity threshold, if you have no idea, specify it to 0.5.

Drag Recordset Destination, connect Fuzzy Lookup component to it. Create variable with type Object. Double click on Recordset Destination, specify VariableName property to the variable. On Available Input Columns, check all columns. And click OK.

If you want to see contenct of Recordset Destination, you can add Data Viewer component by right clicking arrow between Recordset Destination and Fuzzy Lookup. Now you can run the package.

You can compare original data and cleansed data between EMPTITLE column and OccupationTitleDesc column, and Fuzzy Lookup will find its OccupationTitleID.

What a useful component for data cleansing.

SQL Server geeks