Woodbury, NY - November 15, 2010
What is ETL?
ETL, or Extract, Transform and Load, eases the combination of heterogeneous sources into a unified central repository. Usually this repository is a data warehouse or mart which will support enterprise business intelligence.
Extract - read data from multiple source systems into a single format. This process extracts the data from each native system and saves it to one target location. That source data may be any number of database formats, flat files, or document repositories. Usually, the goal is to extract the entire unmodified source system data, though certain checks and filters may be performed here to ensure the data meets an expected layout or to selectively remove data (e.g. potentially confidential information).
Transform - in this step, the data from the various systems is made consistent and linked. Some of the key operations here are:
Load - the transformed data is now written out to a warehouse/mart. The load process will usually preserve prior data. In some instances existing warehouse data is never removed, just marked as inactive. This provides full auditing and supports historical reporting.
There are a number of commercial and open source ETL tools available to assist in any ETL process. Some of the prominent ones are:
These tools provide a number of functions to facilitate the ETL workflow. The variety of source data types are handled automatically. A transformation engine makes it easy to create reusable scripts to handle the data mapping. Scheduling and error handling are also built in.
It is particularly advantageous to use an ETL tool in the following situations:
There are also times where the overhead and cost of setting up an ETL tool might not make sense. In these situations some combination of stored procedures, custom coding and off the shelf packages may make more sense. Scenarios of this type include:
As illustrated here, a typical ETL workflow will move the data through a few distinct phases. This allows each phase to be better defined and eases troubleshooting.
Source > Extract > Stage - this phase extracts all the appropriate data from each source system. The extract copies only data that has changed in the source system since its last run. The stage library contains all source information in a similar structure to how it appears in the source systems. All extracted information will remain in stage until it is successfully processed by the transform.
Stage > Transform > Warehouse - the data from stage is transformed into a warehouse. In this example this step includes some of the base transformations as well as the load of data into a single warehouse. In this phase, surrogate keys are added where needed, lookup value mappings are applied and related information from multiple source systems is combined into a single structure. Any errors encountered here are reported and the problem data remains in stage until corrected. No information is removed from the warehouse and all data there is tagged with effective, update and end timestamps.
Warehouse > Load > Mart - the current effective date from the warehouse is loaded to the mart to support analysis. While this is the final load of the process, this step also includes a transform of the data to an optimized dimensional form for reporting and analysis.
Business intelligence in the enterprise is greatly enhanced by unified data. ETL can be an important tool when combining heterogeneous sources into one cohesive central repository.