Last week I was asked to mask or obfuscate a large volume of data residing in a production database for our software quality assurance team to use in testing some functionality. The data resided in various tables and in several columns per table. Yet, this procedure would need to be done over and over again as the data gets depleted. Yet, the masking supposed to be hidden from the QA people and centrally controlled. I proposed to use SSIS. The next day I did a web scan to just find out that noting similar exists and even no good sample code can be found to drive the masking, even in the Microsoft’s All-In-One Code Framework library I could not find anything useful and ended up submitting a request to provide such a useful (at least IMHO) sample. So if you have spare 30 sec. please bother to vote here for my item.
- The preamble and design considerations
To make the story short, I opted for creating my own custom Data Flow Component with my own data masking code. After it was successfully used by our QA for a week I decided it is time to share how I crated it. So here it goes:
To start off, I would say that there are no may examples on the web demonstrating how to create custom SSIS components. The best recourse (arguably perhaps) remains the MSDN. I was to the most part driven by two articles: the general MSDN How to on custom SSIS components development and an example by Benny Austin because I wanted a Data Flow Component similar to what he has created. I would like to note that looking back at what was done the SSIS Data Flow component mostly rotates around employing functions from these two areas: the design time and run-time. The design-time for a component starts from dragging and dropping it onto the DFT canvas (designer), and the run-time, as you have already probably guessed is when the component is at work. To implement your logic you would need to override each applicable function. To make your component addable to a Data Flow Transformation (later DFT), you simply start a new Class Library project (if you plan to use the component in SSIS 2008 then base it on .Net 3.5 and give it a cool default namespace name). The code provided here is in C#, but if my dear reader prefers to work in VB, feel free to convert it using this site: http://www.developerfusion.com/tools/convert/csharp-to-vb/
As an aside, my design goal was to automate the data masking as much as possible so another SSIS developer could set it up very quickly.
- The (hard :-) coding part
In almost all implementations the ProvideComponentProperties needs to be set. For example in my case I ended up with:
1: public override void ProvideComponentProperties()
3: // Set component information
4: ComponentMetaData.Name = "Data Masker";
5: ComponentMetaData.Description = "A SSIS Data Flow Transformation Component To Provide Basic Data Masking";
6: ComponentMetaData.ContactInfo = "Arthur Zubarev";
8: // Reset the component
11: // Add input objects
12: IDTSInput100 input = ComponentMetaData.InputCollection.New();
13: input.Name = "Input";
14: input.Description = "Contains un-masked columns.";
16: // Add output objects
17: IDTSOutput100 output = ComponentMetaData.OutputCollection.New();
18: output.Name = "Output";
19: output.Description = "Contains masked columns. Gets set automatically.";
20: output.SynchronousInputID = input.ID; //Synchronous transformation
22: //Add error objects
23: IDTSOutput100 errorOutput = ComponentMetaData.OutputCollection.New();
24: errorOutput.Name = "Error";
25: errorOutput.IsErrorOut = true;
Then the second most important place where much code can end up would be the DTSValidationStatus Validate procedure. This method fires each time component properties pertaining to its metadata get changed. Because I said I wanted to automate the data masking component usage I added some code to cover the output creation on the fly based on the [virtual] data columns chosen by a developer. The complete code follows:
1: public override DTSValidationStatus Validate()
3: // Determine whether the metdada needs refresh
4: IDTSInput100 input = ComponentMetaData.InputCollection;
5: IDTSVirtualInput100 vInput = input.GetVirtualInput();
7: bool cancel = false;
9: foreach (IDTSInputColumn100 column in input.InputColumnCollection)
13: IDTSVirtualInputColumn100 vColumn = vInput.VirtualInputColumnCollection.GetVirtualInputColumnByLineageID(column.LineageID);
17: ComponentMetaData.FireError(0, ComponentMetaData.Name, "The input column " + column.IdentificationString + " does not match a column in the upstream component.", "", 0, out cancel);
19: return DTSValidationStatus.VS_NEEDSNEWMETADATA;
23: // Validate input to be of type string/numeric only
24: bool pbCancel = false;
25: for (int x = 0; x < input.InputColumnCollection.Count; x++)
27: if (!(input.InputColumnCollection[x].DataType == DataType.DT_STR ||
28: input.InputColumnCollection[x].DataType == DataType.DT_WSTR ||
29: input.InputColumnCollection[x].DataType == DataType.DT_DECIMAL ||
30: input.InputColumnCollection[x].DataType == DataType.DT_NUMERIC ||
31: input.InputColumnCollection[x].DataType == DataType.DT_TEXT))
33: ComponentMetaData.FireError(0, ComponentMetaData.Name,
34: "Column " + input.InputColumnCollection[x].Name + " cannot be used for data masking."
35: + Environment.NewLine + "The data type supplied was " + input.InputColumnCollection[x].DataType.ToString() + "."
36: + Environment.NewLine + "The supported data types are DT_STR, DT_WSTR, DT_DECIMAL, DT_NUMERIC and DT_TEXT."
37: + Environment.NewLine + "Unmark the offending column(s) to correct.", "", 0, out pbCancel);
39: return DTSValidationStatus.VS_ISBROKEN;
43: // Create corresponding output columns dynamically
44: IDTSOutput100 output = ComponentMetaData.OutputCollection;
46: foreach (IDTSInputColumn100 inputcolumn in input.InputColumnCollection)
48: bool IsExist = false;
49: foreach (IDTSOutputColumn100 OutputColumn in output.OutputColumnCollection)
51: if (OutputColumn.Name == "Masked_" + inputcolumn.Name)
53: IsExist = true;
57: if (!IsExist)
59: IDTSOutputColumn100 outputcol = output.OutputColumnCollection.New();
60: outputcol.Name = "Masked_" + inputcolumn.Name;
61: outputcol.Description = "Masked " + inputcolumn.Name;
62: outputcol.SetDataTypeProperties(inputcolumn.DataType, inputcolumn.Length, inputcolumn.Precision, inputcolumn.Scale, inputcolumn.CodePage);
66: //Remove redundant output columns that don't match the input columns
67: if (output.OutputColumnCollection.Count > input.InputColumnCollection.Count)
69: foreach (IDTSOutputColumn100 OutputColumn in output.OutputColumnCollection)
71: Boolean IsRedundant = true;
72: foreach (IDTSInputColumn100 InputCoulmn in input.InputColumnCollection)
74: IsRedundant = OutputColumn.Name.Contains("Masked_" + InputCoulmn.Name) ? false : true;
75: if (!IsRedundant)
79: if (IsRedundant)
86: return DTSValidationStatus.VS_ISVALID;
Do not forget to clean out any invalid input (this code is provided in the complete source code and omitted in the article for brevity).
Another piece that I chose to omit is the prevention from adding user output to support the complete automation.
Congrads! If all is good by now, we are at the run-time methods part! Almost done, bear with me… well almost, if you want, even at this stage you can already use the component, yes, indeed, but it will not mask the data, you can however test its design time functionality, so if you are willing to do so, the skip to the Testing the component in a SSIS Project part below.
- Run-time Methods Implementation
To implement the component’s data processing it is simply enough to override its PreExecute and ProcessInput methods as depicted below:
1: public override void PreExecute()
3: IDTSInput100 input = ComponentMetaData.InputCollection;
4: inputBufferColumnIndex = new int[input.InputColumnCollection.Count];
6: for (int x = 0; x < input.InputColumnCollection.Count; x++)
8: IDTSInputColumn100 column = input.InputColumnCollection[x];
9: inputBufferColumnIndex[x] = BufferManager.FindColumnByLineageID(input.Buffer, column.LineageID);
12: IDTSOutput100 output = ComponentMetaData.OutputCollection;
13: outputBufferColumnIndex = new int[output.OutputColumnCollection.Count];
15: for (int x = 0; x < output.OutputColumnCollection.Count; x++)
17: IDTSOutputColumn100 outcol = output.OutputColumnCollection[x];
19: // A synchronous output does not appear in output buffer, but in input buffer
20: outputBufferColumnIndex[x] = BufferManager.FindColumnByLineageID(input.Buffer, outcol.LineageID);
1: // The actual data masking
2: public override void ProcessInput(int inputID, PipelineBuffer buffer)
4: if (!buffer.EndOfRowset)
6: while (buffer.NextRow())
8: for (int x = 0; x < inputBufferColumnIndex.Length; x++)
10: DataType BufferColDataType;
12: BufferColDataType = buffer.GetColumnInfo(inputBufferColumnIndex[x]).DataType;
14: if (!buffer.IsNull(x))
16: buffer.SetString(outputBufferColumnIndex[x], MaskData(buffer.GetString(inputBufferColumnIndex[x])));
The data masking implementation is in the MaskData routine that can be changed to suite a different logic (data scrambling is implemented) or improved and/or enhanced:
1: // Provides a basic data masking with scrambling column content
2: public string MaskData(string InputData)
4: string MaskedData = InputData;
6: if (MaskedData.Length > 0)
8: // The technique used to mask the data is to replace numbers with random numbers and letters with letters
9: char chars = new char[InputData.Length];
11: Random rand = new Random(DateTime.Now.Millisecond);
13: int index = 0;
15: while (InputData.Length > 0)
17: // Get a random number between 0 and the length of the word.
18: int next = rand.Next(0, InputData.Length - 1);
20: // Take the character from the random position and add to our char array.
21: chars[index] = InputData[next];
23: // Remove the character from the word.
24: InputData = InputData.Substring(0, next) + InputData.Substring(next + 1);
29: MaskedData = new String(chars);
32: // Scrambled or empty
33: return MaskedData;
Guess what? Whoa, you are done!
Well… almost, but the fun continues, the next step is to try the component out!
- Testing the component in a SSIS Project
To test the project up we ought to sign the DLL because it has to go to GAC. To do so just create a new SNK file and opt for password protection. If you do not know how then right-click on your project (not solution) and choose Properties, navigate to the Signing property page and enable the Sign the assembly option, the follow prompts.
End result would be something like here:
Build your project at this stage and copy the resulted DLL (typically found in Debug/Bin folder) to the <DRIVE LETTER>:\Program Files\Microsoft SQL Server\100\DTS\PipelineComponents\ folder where you are likely to see other DLLs that were created by Microsoft. As the last step you have to add this DLL to GAC, too. You may want to drag and drop it to c:\WINDOWS\assembly. Just a quick note: in real life development cycle you will end up registering and unregistering (right click in GAC folder to unreg) DLLs often, so it makes sense to create a post-build step to execute that all for you, but then you will use GACUtil.exe in it as described here: http://bit.ly/z05Ab3
We are almost there. The next step is to bring a new instance up of BIDS or Visual Studio 2008 and create a new SSIS project. Now add this new component as follows:
Right click anywhere inside the Toolbox Data Transformation items area and select Choose Items in the context menu, go to the SSIS Data Flow Items and select the Data Masker component.
You should be able to see it like this:
The next step is to actually employ our component.
Drag and drop it to a DFT between a data source and destination, then configure it as in this figure:
I used the AdventureWorks database from http://msftdbprodsamples.codeplex.com/ table HumanResources.Employee to mask the NationalIDNumber and LoginID columns. The output columns named “Masked_” were created at design time by the component itself. The end result is seen here:
The entire project was contributed to the CodePlex site and can be downloaded from there entirely or just the binaries: