ETL



                                     ETL (Extract Transform and Load)


 This section of the project goes a step further in elaborating the Data Warehouse concept which was described in the previous section. Here, we will see how the tool INFORMATICA is used to extract data from Source(s), transform it and then load it into the Target. Data transformation is done to eliminate any erroneous or redundant data. This ensures that only the correct data is loaded into the Target (OLAP), which will be used for analysis / reporting.
Informatica PowerCenter architecture is used to achieve the extract, transform and load of data. PowerCenter provides an environment that allows you to load data into a centralized location, such as a datamart, data warehouse, or operational data store (ODS). You can extract data from multiple sources, transform the data according to business logic you build in the client application, and load the transformed data into file and relational targets. PowerCenter provides the following integrated components:
  • PowerCenter repository. The PowerCenter repository is at the center of the PowerCenter suite. You create a set of metadata tables within the repository database that the PowerCenter applications and tools access. The PowerCenter Client and Server access the repository to save and retrieve metadata.
  • PowerCenter Repository Server. The PowerCenter Repository Server manages connections to the  repository from client applications. It inserts, updates, and fetches objects from the repository database tables. It also maintains object consistency.
  • PowerCenter Client. Use the PowerCenter Client to manage users, define sources and targets, build mappings and mapplets with the transformation logic, and create workflows to run the mapping logic. The PowerCenter Client has the following client applications: Repository Manager, Repository Server Administration Console, Designer, Workflow Manager, and Workflow Monitor.
  • PowerCenter Server. The PowerCenter Server extracts the source data, performs the data transformation, and loads the transformed data into the targets.
Sources
PowerCenter accesses the following sources:
  • Relational. Oracle, Sybase, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
  • File. Fixed and delimited flat file, COBOL file, and XML.
  • Application. You can purchase additional PowerCenter Connect products to access business sources, such as PeopleSoft, SAP R/3, Siebel, IBM MQSeries, and TIBCO.
  • Mainframe. You can purchase PowerExchange for faster access to IBM DB2 on MVS.
  • Other. Microsoft Excel and Access.
Targets
PowerCenter can load data into the following targets:
  • Relational. Oracle, Sybase, Sybase IQ, Informix, IBM DB2, Microsoft SQL Server, and Teradata. 
  • File. Fixed and delimited flat file and XML.
  • Application. You can purchase additional PowerCenter Connect products to load data into SAP BW. You can also load data into IBM MQSeries message queues and TIBCO.
  • Other. Microsoft Access.
You can load data into targets using ODBC or native drivers, FTP, or external loaders.
Repository
The PowerCenter repository resides on a relational database. The repository database tables contain the instructions required to extract, transform, and load data. PowerCenter Client applications access the repository database tables through the Repository Server.
You add metadata to the repository tables when you perform tasks in the PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows. The PowerCenter Server reads metadata created in the Client application when you run a workflow. The PowerCenter Server also creates metadata, such as start and finish times of a session or session status.
You can develop global and local repositories to share metadata:
  • Global repository. The global repository is the hub of the domain. Use the global repository to store common objects that multiple developers can use through shortcuts. These objects may include operational or Application source definitions, reusable transformations, mapplets, and mappings.
  • Local repositories. A local repository is within a domain that is not the global repository. Use local repositories for development. From a local repository, you can create shortcuts to objects in shared folders in the global repository. These objects typically include source definitions, common dimensions and lookups, and enterprise standard transformations. You can also create copies of objects in non-shared folders.
  • Version control. A versioned repository can store multiple copies, or versions, of an object. Each version is a separate object with unique properties. PowerCenter version control features allow you to efficiently develop, test, and deploy metadata into production.
You can connect to a repository, back up, delete, or restore repositories using pmrep, a command line program.
You can view much of the metadata in the Repository Manager. The Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the Informatica metadata repository.
Repository Server
The Repository Server manages repository connection requests from client applications. For each repository database registered with the Repository Server, it configures and manages a Repository Agent process. The Repository Server also monitors the status of running Repository Agents, and sends repository object notification messages to client applications.
The Repository Agent is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. The Repository Agent ensures the consistency of metadata in the repository by employing object locking.
PowerCenter Client
The PowerCenter Client consists of the following applications that you use to manage the repository, design mappings, mapplets, and create sessions to load the data:
  • Repository Server Administration Console. Use the Repository Server Administration console to administer the Repository Servers and repositories.
  • Repository Manager. Use the Repository Manager to administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.
  • Designer. Use the Designer to create mappings that contain transformation instructions for the PowerCenter Server. Before you can create mappings, you must add source and target definitions to the repository. The Designer has five tools that you use to analyze sources, design target schemas, and build source-to-target mappings:
    • Source Analyzer. Import or create source definitions.
    • Warehouse Designer. Import or create target definitions.
    • Transformation Developer. Develop reusable transformations to use in mappings.
    • Mapplet Designer. Create sets of transformations to use in mappings.
    • Mapping Designer. Create mappings that the PowerCenter Server uses to extract, transform, and load data.
  • Workflow Manager. Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. The PowerCenter Server runs workflow tasks according to the links connecting the tasks. You can run a task by placing it in a workflow.
  • Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running workflows for each PowerCenter Server. You can choose a Gantt Chart or Task view. You can also access details about those workflow runs.
PowerCenter Server : 
The PowerCenter Server reads mapping and session information from the repository. It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules that you configure in the mapping. The PowerCenter Server loads the transformed data into the mapping targets.
The PowerCenter Server can achieve high performance using symmetric multi-processing systems. The PowerCenter Server can start and run multiple workflows concurrently. It can also concurrently process partitions within a single session. When you create multiple partitions within a session, the PowerCenter Server creates multiple database connections to a single source and extracts a separate range of data for each connection, according to the properties you configure.
Connectivity
PowerCenter uses the following types of connectivity:
  • Network protocol
  • Native drivers
  • ODBC
The PowerCenter Client uses ODBC and native drivers to connect to source and target databases. It uses TCP/IP to connect to the Repository Server. The Repository Server uses native drivers to connect to the repository database. The Workflow Manager and the PowerCenter Server use TCP/IP to communicate with each other.
The PowerCenter Server uses native drivers to connect to the databases to move data. You can optionally use ODBC to connect the PowerCenter Server to the source and target databases. It uses TCP/IP to connect to the PowerCenter Client.
Database Connections
The Repository Server maintains a pool of reusable database connections for serving client applications. The server generates a Repository Agent process for each database. The Repository Agent creates new database connections only if all the current connections are in use.
For example, if 10 clients send requests to the Repository Agent one at a time, the agent requires only one connection. It reuses the same database connection for all the requests. If the 10 clients send requests simultaneously, the Repository Agent opens 10 connections. You can set the maximum number of open connections using the DatabasePoolSize parameter in the repository configuration file.
For a session, a reader object holds the connection for as long as it needs to read the data from the source tables. A writer object holds a connection for as long as it needs to write data to the target tables.
The PowerCenter Server maintains a database connection pool for stored procedure or lookup databases in a workflow. You can optionally set the MaxLookupSPDBConnections parameter to limit connections when you configure the PowerCenter service. The PowerCenter Server allows an unlimited number of connections to lookup or stored procedure databases. If a database user does not have permission for the number of connections a session requires, the session fails.
For pre-session, post-session, and load stored procedures, consecutive stored procedures reuse a connection if they have identical connection attributes. Otherwise, the connection for one stored procedure closes and a new connection begins for the next stored procedure.
PowerCenter Metadata Reporter
You can use PowerCenter Metadata Reporter, a web-based application, to run prepackaged dashboards and reports against PowerCenter repository metadata. These reports help give you insight into your repository, which enhances your ability to analyze and manage your repository efficiently.
You can run PowerCenter Metadata Reporter from a browser on any workstation, even a workstation that does not have PowerCenter tools installed.


Repository Server Administration Console:

Use the Repository Server Administration Console to administer Repository Servers and repositories. A Repository Server can manage multiple repositories. You use the Repository Server Administration Console to create and administer the repository through the Repository Server.
You can use the Administration Console to perform the following tasks:
  • Add, edit, and remove repository configurations.
  • Export and import repository configurations.
  • Create a repository.
  • Promote a local repository to a global repository.
  • Copy a repository.
  • Delete a repository from the database.
  • Back up and restore a repository.
  • Start, stop, enable, and disable repositories.
  • Send repository notification messages.
  • Register and unregister a repository.
  • Propagate domain connection information for a repository.
  • View repository connections and locks.
  • Close repository connections.
  • Register and remove repository plug-ins.
  • Upgrade a repository.
Administration Console Windows : 
The Administration Console can display the following windows:
  • Console Tree. Repository Servers and managed repositories. The Administration Console displays a       different set of Action menu items depending on which node you select in the Console Tree. You can also right-click a node to access the Action menu items.
The Console Tree contains the following nodes:
    • PowerCenter Repository Servers
    • Repository Server name
    • Repositories
    • Repository name
    • Connections
    • Locks
    • Activity Log
    • Backups
    • Packages

  • Main. The Main window displays details of the node you select in the Console Tree. For example, if you select a repository in the Console Tree, the Main window displays the properties of the repository, such as the status and start time.
The Main window displays results in the following views:
    • List view. Displays a collection of items that includes an icon and a label.
    • HTML view. Displays repository information as a dynamic HTML page. The Administration Console only displays repositories in HTML view.


Repository Manager :
Use the Repository Manager to administer your repositories. The Repository Manager allows you to navigate through multiple folders and repositories, and perform the following tasks:
  • Manage the repository. You can perform repository management functions, such as copying, creating, starting, and shutting down repositories. You launch the Repository Server Administration Console to perform these functions.
  • Implement repository security. You can create, edit, and delete repository users and user groups. You can assign and revoke repository privileges and folder permissions.
  • Perform folder functions. You can create, edit, copy, and delete folders. Work you perform in the Designer and Workflow Manager is stored in folders. If you want to share metadata, you can configure a folder to be shared.
  • View metadata. You can analyze sources, targets, mappings, and shortcut dependencies, search by keyword, and view the properties of repository objects.
Repository Manager Windows :
The Repository Manager can display the following windows:
  • Navigator. Displays all objects that you create in the Repository Manager, the Designer, and the Workflow Manager. It is organized first by repository, then by folder and folder version. Viewable objects include sources, targets, dimensions, cubes, mappings, mapplets, transformations, sessions, and workflows. You can also view folder versions and business components.
  • Main. Provides properties of the object selected in the Navigator window. The columns in this window change depending on the object selected in the Navigator window.
  • Dependency. Shows dependencies on sources, targets, mappings, and shortcuts for objects selected  in either the Navigator or Main window.
  • Output. Provides the output of tasks executed within the Repository Manager, such as creating a repository.
Repository Objects : 
You create repository objects using the Repository Manager, Designer, and Workflow Manager client tools. You can view the following objects in the Navigator window of the Repository Manager:
  • Source definitions. Definitions of database objects (tables, views, synonyms) or files that provide source data.
  • Target definitions. Definitions of database objects or files that contain the target data.
  • Multi-dimensional metadata. Target definitions that are configured as cubes and dimensions.
  • Mappings. A set of source and target definitions along with transformations containing business logic that you build into the transformation. These are the instructions that the PowerCenter Server uses to transform and move data.
  • Reusable transformations. Transformations that you can use in multiple mappings.
  • Mapplets. A set of transformations that you can use in multiple mappings.
  • Sessions and workflows. Sessions and workflows store information about how and when the PowerCenter Server moves data. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. A session is a type of task that you can put in a workflow. Each session corresponds to a single mapping.
WORKFLOW MONITOR:

The goal of the design process is to create mappings that depict the flow of data between sources and targets, including changes made to the data before it reaches the targets. However, before you can create a mapping, you must first create or import source and target definitions. You might also want to create reusable objects, such as reusable transformations or mapplets.
Perform the following design tasks in the Designer:
1.     Import source definitions. Use the Source Analyzer to connect to the sources and import the source definitions.
2.     Create or import target definitions. Use the Warehouse Designer to define relational, flat file, or XML targets to receive data from sources. You can import target definitions from a relational database or a flat file, or you can manually create a target definition.
3.     Create the target tables. If you add a target definition to the repository that does not exist in a relational database, you need to create target tables in your target database. You do this by generating and executing the necessary SQL code within the Warehouse Designer.
4.     Design mappings. Once you have source and target definitions in the repository, you can create mappings in the Mapping Designer. A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. A transformation is an object that performs a specific function in a mapping, such as looking up data or performing aggregation.
5.     Create mapping objects. Optionally, you can create reusable objects for use in multiple mappings. Use the Transformation Developer to create reusable transformations. Use the Mapplet Designer to create mapplets. A mapplet is a set of transformations that may contain sources and transformations.
6.     Debug mappings. Use the Mapping Designer to debug a valid mapping to gain troubleshooting information about data and error conditions.
7.     Import and export repository objects. You can import and export repository objects, such as sources, targets, transformations, mapplets, and mappings to archive or share metadata.

Designer Windows : 
You can display the following windows in the Designer:
  • Navigator. Connect to repositories, and open folders within the Navigator. You can also copy objects and create shortcuts within the Navigator.
  • Workspace. Open different tools in this window to create and edit repository objects, such as sources, targets, mapplets, transformations, and mappings.
  • Output. View details about tasks you perform, such as saving your work or validating a mapping.
  • Status bar. Displays the status of the operation you perform.
  • Overview. An optional window to simplify viewing a workspace that contains a large mapping or multiple objects. Outlines the visible area in the workspace and highlights selected objects in color.
  • Instance data. View transformation data while you run the Debugger to debug a mapping.
  • Target data. View target data while you run the Debugger to debug a mapping.
Loading Data:
In the Workflow Manager, you define a set of instructions to execute tasks, such as sessions, emails, and shell commands. This set of instructions is called a workflow.
After you create a workflow in the Workflow Designer, the next step is to add tasks to the workflow. The Workflow Manager includes tasks, such as the Session task, the Command task, and the Email task so you can design your workflow. The Session task is based on a mapping you build in the Designer.
You then connect tasks with links to specify the order of execution for the tasks you created. Use conditional links and workflow variables to create branches in the workflow.
When the workflow start time arrives, the PowerCenter Server retrieves the metadata from the repository to execute the tasks in the workflow.
You can monitor the workflow status in the Workflow Monitor.
Workflow Manager : 
The Workflow Manager consists of three tools to help you develop a workflow:
  • Task Developer. Create tasks you want to accomplish in the workflow in the Task Developer.
  • Workflow Designer. Create a workflow by connecting tasks with links in the Workflow Designer. You can also create tasks in the Workflow Designer as you develop the workflow.
  • Worklet Designer. Create a worklet in the Worklet Designer. A worklet is an object that groups a set of tasks. A worklet is similar to a workflow, but without scheduling information. You can nest multiple worklets inside a workflow.
Before you create a workflow, you must configure the following connection information:
  • PowerCenter Server connection. Register the PowerCenter Server with the repository before you can start it or create a session to run against it.
  • Database connections. Create connections to source and target systems.
  • Other connections. If you want to use external loaders or FTP, you configure these connections in the Workflow Manager.
Workflow Manager Windows
The Workflow Manager displays the following windows to help you create and organize workflows:
  • Navigator. Allows you to connect to and work in multiple repositories and folders.
  • Workspace. Allows you to create, edit, and view tasks, workflows, and worklets.
  • Output. Displays messages from the PowerCenter Server and the Repository Server. The Output window also displays messages when you save or validate tasks and workflows.
  • Overview. An optional window that makes it easier to view workbooks containing large workflows. Outlines the visible area in the workspace and highlights selected objects in color. Choose View-Overview Window to display this window.
Workflow Monitor :
After you create a workflow, you run the workflow in the Workflow Manager and monitor it in the Workflow Monitor. The Workflow Monitor is a tool that displays details about workflow runs in two views, Gantt Chart view and Task view. You can monitor workflows in online and offline modes.
The Workflow Monitor consists of the following windows:
  • Navigator window. Displays monitored repositories, servers, and repositories objects.
  • Output window. Displays messages from the PowerCenter Server.
  • Time window. Displays progress of workflow runs.
  • Gantt Chart view. Displays details about workflow runs in chronological format.
  • Task view. Displays details about workflow runs in a report format.


                      Transformations Overview

A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data.
Transformations in a mapping represent the operations the Integration Service performs on the data. Data passes through transformation ports that you link in a mapping or mapplet.
Transformations can be active or passive. Transformations can be connected to the data flow, or they can be unconnected.

Active Transformations :
An active transformation can perform any of the following actions:
*
Change the number of rows that pass through the transformation. For example, the Filter transformation is active because it removes rows that do not meet the filter condition. All multi-group transformations are active because they might change the number of rows that pass through the transformation.
*
Change the transaction boundary. For example, the Transaction Control transformation is active because it defines a commit or roll back transaction based on an expression evaluated for each row.

*
Change the row type. For example, the Update Strategy transformation is active because it flags rows for insert, delete, update, or reject.
The Designer does not allow you to connect multiple active transformations or an active and a passive transformation to the same downstream transformation or transformation input group because the Integration Service may not be able to concatenate the rows passed by active transformations. For example, one branch in a mapping contains an Update Strategy transformation that flags a row for delete. Another branch contains an Update Strategy transformation that flags a row for insert. If you connect these transformations to a single transformation input group, the Integration Service cannot combine the delete and insert operations for the row.

The Sequence Generator transformation is an exception to the rule listed above. The Designer does allow you to connect a Sequence Generator transformation and an active transformation to the same downstream transformation or transformation input group. A Sequence Generator transformation does not receive data. It generates unique numeric values. As a result, the Integration Service does not encounter problems concatenating rows passed by a Sequence Generator transformation and an active transformation.
The following figure shows how you can connect an active transformation and a passive Sequence Generator transformation to the same downstream transformation input group:
Passive Transformations :
A passive transformation does not change the number of rows that pass through the transformation, maintains the transaction boundary, and maintains the row type.
The Designer allows you to connect multiple transformations to the same downstream transformation or transformation input group only if all transformations in the upstream branches are passive. The transformation that originates the branch can be active or passive.
The following figure shows how you can connect passive transformations to the same downstream transformation input group:
Unconnected Transformations :
Transformations can be connected to the data flow, or they can be unconnected. An unconnected transformation is not connected to other transformations in the mapping. An unconnected transformation is called within another transformation, and returns a value to that transformation.
Transformation Descriptions :
The following table provides a brief description of each transformation:
Transformation
Type
Description
Aggregator
Active/
Connected
Performs aggregate calculations.
Application Source Qualifier
Active/
Connected
Represents the rows that the Integration Service reads from an application, such as an ERP source, when it runs a session.
Custom
Active or Passive/
Connected
Calls a procedure in a shared library or DLL.
Data Masking
Passive
Connected
Replaces sensitive production data with realistic test data for non-production environments.
Expression
Passive/
Connected
Calculates a value.
External Procedure
Passive/
Connected or Unconnected
Calls a procedure in a shared library or in the COM layer of Windows.
Filter
Active/
Connected
Filters data.
HTTP
Passive/
Connected
Connects to an HTTP server to read or update data.
Input
Passive/
Connected
Defines mapplet input rows. Available in the Mapplet Designer.
Java
Active or Passive/
Connected
Executes user logic coded in Java. The byte code for the user logic is stored in the repository.
Joiner
Active/
Connected
Joins data from different databases or flat file systems.
Lookup
Passive/
Connected or Unconnected
Looks up values.
Normalizer
Active/
Connected
Source qualifier for COBOL sources. Can also use in the pipeline to normalize data from relational or flat file sources.
Output
Passive/
Connected
Defines mapplet output rows. Available in the Mapplet Designer.
Rank
Active/
Connected
Limits records to a top or bottom range.
Router
Active/
Connected
Routes data into multiple transformations based on group conditions.
Sequence Generator
Passive/
Connected
Generates primary keys.
Sorter
Active/
Connected
Sorts data based on a sort key.
Source Qualifier
Active/
Connected
Represents the rows that the Integration Service reads from a relational or flat file source when it runs a session.
SQL
Active or Passive/
Connected
Executes SQL queries against a database.
Stored Procedure
Passive/
Connected or Unconnected
Calls a stored procedure.
Transaction Control
Active/
Connected
Defines commit and rollback transactions.
Union
Active/
Connected
Merges data from different databases or flat file systems.
Unstructured Data
Active or Passive/
Connected
Transforms data in unstructured and semi-structured formats.
Update Strategy
Active/
Connected
Determines whether to insert, delete, update, or reject rows.
XML Generator
Active/
Connected
Reads data from one or more input ports and outputs XML through a single output port.
XML Parser
Active/
Connected
Reads XML from one input port and outputs data to one or more output ports.
XML Source Qualifier
Active/
Connected
Represents the rows that the Integration Service reads from an XML source when it runs a session.
When you build a mapping, you add transformations and configure them to handle data according to a business purpose. Complete the following tasks to incorporate a transformation into a mapping:
1.
Create the transformation. Create it in the Mapping Designer as part of a mapping, in the Mapplet Designer as part of a mapplet, or in the Transformation Developer as a reusable transformation.
2.
Configure the transformation. Each type of transformation has a unique set of options that you can configure.

3.
Link the transformation to other transformations and target definitions. Drag one port to another to link them in the mapping or mapplet.


No comments:

Post a Comment