home
Donate to support site


System Interfaces, data transfer and process control
Author Nigel Rivett
The detail of this architecture is specific to SQL Server but the principles are generic to any system. 
Most of the processes involved here are included on this site.

A difficulty in most corporate infrastructure is the system interfaces.
If there are two systems that need to transfer data, how is this accomplished.
Often this is by the destination system that needs the data interogating the source system.
The source system is usually supported by a database which holds the data that needs to be transferred
The destination system is often supported by a database.
This often starts off as a reporting system but soon is changed to work off it's own database system for availability

This causes several problems
	Is the data is available in thje source system
	Can the destination system extract the data without undue impact on the source system (especially for reporting systems)
	Is the data required consistent in the source system
	Can the data be transferred in a timely fashion
	Has the source system changed it's structures to impact the extract

Common data transfer methods that cause issues
	Query access to source system from destination system
	Replication from source system to area available to destination system
	

A common scenario and system evolution is:
	Destination systems are built by extracting data from source systems.
	Soon it is deiscovered that the source system changes so breaking the interface.
	A contract is put in place ensuring that the source system informs the destination system (gets sign-off for changes)
	The source system doesn't have time or a personell change breakes the contract
	Tests are put in place to ensure the release tests encompass the interface
	It is discovered that updates to the destination system makes synchronisation of the testing arduous.
	Changes to the destination system extract causes performance issues and blocking on the source system.
	The destination system is downgraded or removed

This can cause convoluted and fragile architecture
	Semaphores/tokens created which indicate that the systems are ready
		These proliferate as it is discovered more and more are needed
		Personel changes often cause issues as the methods and locking mechanisms are not well understood

These issues can be alleviated by a simple rule
	The source system is responsible for delivering the required data in the required format as defined by the destination system
	
This means source system is responsible for:
	Any performance impact
	The delivery is available after any updates to the system

There is a downside
	Any issues with the source system mean that the data may not be available
But, the upside
	Any issues with the source system mean that bad data will not be delivered or it is clear where the fault lies
	
A simple method of system interface is via file transfer.
The source system produces files of data
Optionally the files are moved to a place accessible by the destination system
The destination system imports the files

This has a number of advantages
	The Contract between the systems is via the file definition.
	Either system can be changed as long as the file definition is supported
	Requests of changes between systems are via the file definition
	The destination system does not impact the source system
	The source system can controlwhen the file is extracted
	There is an automatic historic record of the interaction via the files
	Any negotiation between the systems is limited to the file definition and how that data should be processed.
	Crucially 
		Either system can be tested without access to the other system
		Tests are repeatable by recourse to the files	

This makes system imploementation simpler as there is not system interaction during development or testing.
The system contract is well defined

There are a few things to put in place with the interface contract
	The file name format
	The file contents definition
	The file availablitlity - a contract between the two systems as to when the files will be available.
	A means for the destination system to detect and import the files
	What should happen with an interface failure - late, missing or incorrect files

I would suggest the file name format should be something along the lines of
	_yyyymmdd_hhmmss.txt
		With yyyymmdd_hhmmss representing the datetime the file is produced
	e.g.
		Tr_Sales_20150712_102011.txt
	This allows for many files per day
	If the source system can only produce a file with a file name then put in place a system to rename it so it does not get overwritten.
	In this way it is easy to view the history of the files and order them by sequence
	It allows many files to be produced in a day and records the production datetime.
It may be useful to include a start, end datetime in the filename representing the parameters used in the file production.
	
	
And that FileName should be unique for all the file types to be transferred.
For the file structure it depends on the daat being transferred.
I prefer to include headers, have ascii files.
	The headers make the interface a bit more flexibile as colkumns can be added or removed.
	The source system can add columns and the destination system changed later
	The destination system can be changed to remoce a column and the source system remove the column later
	The interface will be resistent to column order changes
	The file will be readable without decoding
	The file can be zipped for archive

A few things need to be put in place which can be considered infrastructure
	The file needs to be accessible to the destination system
		Usually corporations have a standard for this.
		It can be a shared file area, the source system local area synchronised with it
		It can be an area to which the files are delivered via FTP or some other means
		Warning: 	The source system and transfer method must prevent the files from being accessed before they are complete
					Easiest method rename them - e.g. produce as <>.tmp and rename to <>.txt
					Could also produce in one folder then move to anoth folder on the same drive.
					If this is not done the transfer or import will detect the files before complete and fail.
					Or worse transfer incomplete files (common issue with ftp)
	The destination system needs facilities to
		Detecting the delivery of the files - both to control the import and to monitor the source system
		Import the files when available and when the system is ready to process them
		Process the data after processing

I would suggest the following processes in the destinations system (note: this can process files from any number of systems)
	Make the processing single threaded - do not attempt to parallel process as this will make the system difficult to optimise and maintain
		Often a system can process more efficiently if all resources are available and not blocking
		I have seen systems take hours to process when they could process in minutes if the processes where sequential
	Log all processes
		The appearance of the files eneds to be logged for monitoring
		The processing of the files along with file counts needs to be logged for performance monitoring and failure diagnosis.
			I usually log start and end of processes and any major tasks within the process.
		Create a central log
			I log everything to a table called trace. This is for ad hoc logging by he developed processes and for failure diagnosis.
			Separate are tables used for monitoring and control, arrival of file, which processes need to be run etc.
	Import the files to staging tables in a staging database then process
		This separates the file structure from the processing
		The staging database will generally be unlogged and have volatile data and can be configured with this in mind.
		A failure of the processing means the file does not need to be re-imported.
		The data is available in the staging table for error diagnosis

Destination system tasks (SQL Server specific)
	There are two scheduled tasks
		File detection - scheduled frequently - every 5-10 secs
		System process - scheduled less frequently - every 5 minutes - this depends on the system
			This has multiple tasks including file import followed by system processing
	File detection
		Have a task that repeatedly checks the folders for any new files
		Do not be tempted to allow an outside system to populate a table with file names.
		The process populates a table with the file and datetime
		This can also be used to rename the file if necessary
	File import
		I now generally prefer SSIS. Build a package for a single file import to a staging table
		Load the package with the file name as a parameter to perform the import
		The package calls a stored procedure to indicate the import is complete and the row count.
		The process I use: (this allows for changes to the file column order and only imports the columns needed)
			A control package shich receives the file path from a stored procedure
			Reads the first (header) row from the file
			Calls a stored procedure with the file type and header
			The SP returns the xml to create a package to import the data to the staging table
			The control package writes the XML to a package and executes it.
		If the system is prone to a backlog of files to import then, to stop the staging tables increasing in size and impacting performance
			Import a file
			Merge the data into the production database
			delete the staging data
	File process
		The data is picked up from the staging table and processed into the production database
	Process control
		This is via stored procedures
		Tables hold the processes that need to be run to import each file type, the dependancies between file data and final processes
		There is one stored procedure called to executed processes that are available - in a loop one at a time
		Another stored procedure is called when the process is complete - indicating success or failure
		Note: this method allows for the processes to be distributed to remote systems if necessary
		The control stored procedure is called and executes in turn any process that is available.
		The process table is updated on completion of any task and any tasks now available are executed on the next schedule.


home