System Interfaces, data transfer and process control
Author: Nigel Rivett
The detail of this architecture is specific to SQL Server but the principles are generic to any system.
Most of the processes involved here are included on this site.
A difficulty in most corporate infrastructures is the system interfaces.
If two systems need to transfer data, how is this accomplished?
Often it is by the destination system that needs the data interrogating the source system.
The source system is usually supported by a database which holds the data that needs to be transferred.
The destination system is often supported by a database too.
This often starts off as a reporting system but is soon changed to work off its own database for availability.
This causes several problems
Is the data available in the source system?
Can the destination system extract the data without undue impact on the source system (especially for reporting systems)?
Is the required data consistent in the source system?
Can the data be transferred in a timely fashion?
Has the source system changed its structures, impacting the extract?
Common data transfer methods that cause issues
Query access to source system from destination system
Replication from source system to area available to destination system
A common scenario and system evolution is:
Destination systems are built by extracting data from source systems.
Soon it is discovered that the source system changes, breaking the interface.
A contract is put in place ensuring that the source system informs the destination system (gets sign-off for changes)
The source system doesn't have time, or a personnel change breaks the contract
Tests are put in place to ensure the release tests encompass the interface
It is discovered that updates to the destination system make synchronisation of the testing arduous.
Changes to the destination system extract cause performance issues and blocking on the source system.
The destination system is downgraded or removed
This can cause convoluted and fragile architecture
Semaphores/tokens created which indicate that the systems are ready
These proliferate as it is discovered more and more are needed
Personnel changes often cause issues as the methods and locking mechanisms are not well understood
These issues can be alleviated by a simple rule:
The source system is responsible for delivering the required data in the required format as defined by the destination system
This means the source system is responsible for:
Any performance impact
The delivery is available after any updates to the system
There is a downside
Any issues with the source system mean that the data may not be available
But, the upside
Any issues with the source system mean that bad data will not be delivered or it is clear where the fault lies
A simple method of system interface is via file transfer.
The source system produces files of data
Optionally the files are moved to a place accessible by the destination system
The destination system imports the files
This has a number of advantages
The contract between the systems is via the file definition.
Either system can be changed as long as the file definition is supported
Requests for changes between systems are via the file definition
The destination system does not impact the source system
The source system can control when the file is extracted
There is an automatic historic record of the interaction via the files
Any negotiation between the systems is limited to the file definition and how that data should be processed.
Either system can be tested without access to the other system
Tests are repeatable by recourse to the files
This makes system implementation simpler as there is no system interaction during development or testing.
The system contract is well defined
There are a few things to put in place with the interface contract
The file name format
The file contents definition
The file availability - a contract between the two systems as to when the files will be available.
A means for the destination system to detect and import the files
What should happen with an interface failure - late, missing or incorrect files
I would suggest the file name format should be something along the lines of FileType_yyyymmdd_hhmmss.txt
With yyyymmdd_hhmmss representing the datetime the file is produced
This allows for many files per day
If the source system can only produce a file with a fixed file name then put in place a process to rename it so it does not get overwritten.
In this way it is easy to view the history of the files and order them by sequence
It allows many files to be produced in a day and records the production datetime.
It may be useful to include start and end datetimes in the filename representing the parameters used in the file production.
And the FileName should be unique across all the file types to be transferred.
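As a sketch of such a naming scheme (the `SalesExtract` type name below is purely illustrative), a helper can build the name from the production datetime and parse it back, so files sort chronologically with a plain string sort:

```python
from datetime import datetime

def make_file_name(file_type: str, produced: datetime) -> str:
    # Unique, sortable name: file type plus production datetime.
    return f"{file_type}_{produced.strftime('%Y%m%d_%H%M%S')}.txt"

def parse_file_name(name: str) -> tuple[str, datetime]:
    # Recover the file type and production datetime from the name.
    stem = name.rsplit(".", 1)[0]
    file_type, date_part, time_part = stem.rsplit("_", 2)
    return file_type, datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
```

Because the datetime is zero-padded and ordered year-first, ordering the file names lexically orders them by production time.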
For the file structure it depends on the data being transferred.
I prefer to include headers and to use ASCII files.
The headers make the interface a bit more flexible as columns can be added or removed.
The source system can add columns and the destination system changed later
The destination system can be changed to remove a column and the source system can remove the column later
The interface will be resistant to column order changes
The file will be readable without decoding
The file can be zipped for archive
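A minimal sketch of why headers give this flexibility: if the importer picks columns by header name rather than position, added or reordered columns cost nothing. The column names here are illustrative:

```python
import csv
import io

# A headed ASCII file; column order does not matter to the reader.
file_text = "Quantity,ProductId,Price\n3,A100,9.99\n1,B200,4.50\n"

needed = ("ProductId", "Quantity")  # the columns the destination imports

rows = []
for record in csv.DictReader(io.StringIO(file_text)):
    # Fields are picked by header name, so extra or reordered
    # columns in the file are simply ignored.
    rows.append(tuple(record[col] for col in needed))
```

If the source later drops the `Price` column or moves `ProductId` first, this reader is unaffected.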
A few things need to be put in place which can be considered infrastructure
The file needs to be accessible to the destination system
Usually corporations have a standard for this.
It can be a shared file area, or the source system's local area synchronised with it
It can be an area to which the files are delivered via FTP or some other means
Warning: The source system and transfer method must prevent the files from being accessed before they are complete
Easiest method: rename them - e.g. produce as <>.tmp and rename to <>.txt
Could also produce in one folder then move to another folder on the same drive.
If this is not done the transfer or import will detect the files before complete and fail.
Or worse transfer incomplete files (common issue with ftp)
The destination system needs facilities to
Detect the delivery of the files - both to control the import and to monitor the source system
Import the files when available and when the system is ready to process them
Process the data after import
I would suggest the following processes in the destination system (note: this can process files from any number of systems)
Make the processing single threaded - do not attempt to parallel process as this will make the system difficult to optimise and maintain
Often a system can process more efficiently if all resources are available and not blocking
I have seen systems take hours to process when they could process in minutes if the processes were sequential
Log all processes
The appearance of the files needs to be logged for monitoring
The processing of the files along with file counts needs to be logged for performance monitoring and failure diagnosis.
I usually log start and end of processes and any major tasks within the process.
Create a central log
I log everything to a table called trace. This is for ad hoc logging by the developed processes and for failure diagnosis.
Separate tables are used for monitoring and control - arrival of files, which processes need to be run, etc.
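A sketch of the central trace idea, with SQLite standing in for the destination database; the trace table's columns here are an assumption, not the author's exact schema:

```python
import sqlite3
from datetime import datetime

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trace (logged_at TEXT, process TEXT, message TEXT)")

def trace(process: str, message: str) -> None:
    # Ad hoc logging used by every process, for monitoring
    # and failure diagnosis.
    db.execute("INSERT INTO trace VALUES (?, ?, ?)",
               (datetime.now().isoformat(), process, message))
    db.commit()

# Log the start and end of a process, as the article suggests.
trace("ImportSales", "start")
trace("ImportSales", "end - 1200 rows")
```

Because everything goes to one table in time order, a failure can be diagnosed by reading the trace rows around the failing process.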
Import the files to staging tables in a staging database then process
This separates the file structure from the processing
The staging database will generally be unlogged and have volatile data and can be configured with this in mind.
A failure of the processing means the file does not need to be re-imported.
The data is available in the staging table for error diagnosis
Destination system tasks (SQL Server specific)
There are two scheduled tasks
File detection - scheduled frequently - every 5-10 secs
System process - scheduled less frequently - every 5 minutes - this depends on the system
This has multiple tasks including file import followed by system processing
Have a task that repeatedly checks the folders for any new files
Do not be tempted to allow an outside system to populate a table with file names.
The process populates a table with the file and datetime
This can also be used to rename the file if necessary
I now generally prefer SSIS. Build a package for a single file import to a staging table
Load the package with the file name as a parameter to perform the import
The package calls a stored procedure to indicate the import is complete and the row count.
The process I use: (this allows for changes to the file column order and only imports the columns needed)
A control package which receives the file path from a stored procedure
Reads the first (header) row from the file
Calls a stored procedure with the file type and header
The SP returns the xml to create a package to import the data to the staging table
The control package writes the XML to a package and executes it.
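The header-driven part of that process can be sketched outside SSIS: read the header row, map only the columns the staging table needs by name, and load the remaining rows. The file type and column names are illustrative:

```python
import csv
import io

# Columns each staging table expects, per file type (illustrative).
STAGING_COLUMNS = {"Sales": ["ProductId", "Quantity"]}

def import_to_staging(file_type: str, file_text: str) -> list:
    reader = csv.reader(io.StringIO(file_text))
    header = next(reader)                    # first row is the header
    wanted = STAGING_COLUMNS[file_type]
    # Locate the wanted columns by name, so the import survives
    # column reordering and ignores columns it does not need.
    positions = [header.index(col) for col in wanted]
    return [tuple(row[i] for i in positions) for row in reader]
```

This is the same contract the generated SSIS package enforces: only the named columns are imported, wherever they appear in the file.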
If the system is prone to a backlog of files to import then, to stop the staging tables increasing in size and impacting performance:
Import a file
Merge the data into the production database
delete the staging data
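The import-merge-delete cycle can be sketched with SQLite standing in for the staging and production databases; table and column names are illustrative. One file is handled at a time, and staging is emptied after each merge so it never grows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging (product_id TEXT, quantity INTEGER)")
db.execute("CREATE TABLE production (product_id TEXT PRIMARY KEY, quantity INTEGER)")

def process_file(rows) -> None:
    # Import one file's rows, merge them, then clear staging
    # before the next file is imported.
    db.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    db.execute("""
        INSERT INTO production
        SELECT product_id, quantity FROM staging WHERE true
        ON CONFLICT(product_id) DO UPDATE SET quantity = excluded.quantity
    """)
    db.execute("DELETE FROM staging")
    db.commit()
```

A failure in the merge still leaves the file's data in staging for diagnosis; only on success is staging cleared.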
The data is picked up from the staging table and processed into the production database
This is via stored procedures
Tables hold the processes that need to be run to import each file type, and the dependencies between file data and final processes
There is one stored procedure called to execute the processes that are available - in a loop, one at a time
Another stored procedure is called when the process is complete - indicating success or failure
Note: this method allows for the processes to be distributed to remote systems if necessary
The control stored procedure is called and executes in turn any process that is available.
The process table is updated on completion of any task and any tasks now available are executed on the next schedule.
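The control loop can be sketched as follows: a process table holds each process with its dependencies, and each scheduled pass executes, single threaded, whatever has become available. The process names are illustrative:

```python
# Process table: name -> (dependencies, done flag).
processes = {
    "ImportSales":  {"deps": [], "done": False},
    "ImportStock":  {"deps": [], "done": False},
    "BuildSummary": {"deps": ["ImportSales", "ImportStock"], "done": False},
}
run_order = []

def run_available() -> None:
    # Execute, one at a time, any process whose dependencies are all
    # complete; completing a process may release further processes.
    progressed = True
    while progressed:
        progressed = False
        for name, p in processes.items():
            if not p["done"] and all(processes[d]["done"] for d in p["deps"]):
                run_order.append(name)   # stands in for calling the process
                p["done"] = True         # completion update releases dependents
                progressed = True

run_available()
```

In the article's scheme the dictionary is a real table and the loop is a stored procedure, which also means the process bodies could be executed on remote systems without changing the control logic.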