ETL (Extract Transform and Load)
This section of the project goes a step further in elaborating the Data Warehouse concept which was described in the previous section. Here, we will see how the tool INFORMATICA is used to extract data from Source(s), transform it and then load it into the Target. Data transformation is done to eliminate any erroneous or redundant data. This ensures that only the correct data is loaded into the Target (OLAP), which will be used for analysis / reporting.
Informatica PowerCenter architecture is used to achieve the extract, transform and load of data. PowerCenter provides an environment that allows you to load data into a centralized location, such as a datamart, data warehouse, or operational data store (ODS). You can extract data from multiple sources, transform the data according to business logic you build in the client application, and load the transformed data into file and relational targets. PowerCenter provides the following integrated components:
- PowerCenter repository. The PowerCenter repository is at the center of the PowerCenter suite. You create a set of metadata tables within the repository database that the PowerCenter applications and tools access. The PowerCenter Client and Server access the repository to save and retrieve metadata.
- PowerCenter Repository Server. The PowerCenter Repository Server manages connections to the repository from client applications. It inserts, updates, and fetches objects from the repository database tables. It also maintains object consistency.
- PowerCenter Client. Use the PowerCenter Client to manage users, define sources and targets, build mappings and mapplets with the transformation logic, and create workflows to run the mapping logic. The PowerCenter Client has the following client applications: Repository Manager, Repository Server Administration Console, Designer, Workflow Manager, and Workflow Monitor.
- PowerCenter Server. The PowerCenter Server extracts the source data, performs the data transformation, and loads the transformed data into the targets.
- Relational. Oracle, Sybase, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
- File. Fixed and delimited flat file, COBOL file, and XML.
- Application. You can purchase additional PowerCenter Connect products to access business sources, such as PeopleSoft, SAP R/3, Siebel, IBM MQSeries, and TIBCO.
- Mainframe. You can purchase PowerExchange for faster access to IBM DB2 on MVS.
- Other. Microsoft Excel and Access.
- Relational. Oracle, Sybase, Sybase IQ, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
- File. Fixed and delimited flat file and XML.
- Application. You can purchase additional PowerCenter Connect products to load data into SAP BW. You can also load data into IBM MQSeries message queues and TIBCO.
- Other. Microsoft Access.
The PowerCenter repository resides on a relational database. The repository database tables contain the instructions required to extract, transform, and load data. PowerCenter Client applications access the repository database tables through the Repository Server.
You add metadata to the repository tables when you perform tasks in the PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows. The PowerCenter Server reads metadata created in the Client application when you run a workflow. The PowerCenter Server also creates metadata, such as start and finish times of a session or session status.
- Global repository. The global repository is the hub of the domain. Use the global repository to store common objects that multiple developers can use through shortcuts. These objects may include operational or Application source definitions, reusable transformations, mapplets, and mappings.
- Local repositories. A local repository is within a domain that is not the global repository. Use local repositories for development. From a local repository, you can create shortcuts to objects in shared folders in the global repository. These objects typically include source definitions, common dimensions and lookups, and enterprise standard transformations. You can also create copies of objects in non-shared folders.
- Version control. A versioned repository can store multiple copies, or versions, of an object. Each version is a separate object with unique properties. PowerCenter version control features allow you to efficiently develop, test, and deploy metadata into production.
You can connect to a repository, back up, delete, or restore repositories using pmrep, a command line program.
You can view much of the metadata in the Repository Manager. The Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the Informatica metadata repository.
The Repository Server manages repository connection requests from client applications. For each repository database registered with the Repository Server, it configures and manages a Repository Agent process. The Repository Server also monitors the status of running Repository Agents, and sends repository object notification messages to client applications.
The Repository Agent is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. The Repository Agent ensures the consistency of metadata in the repository by employing object locking.
The PowerCenter Client consists of the following applications that you use to manage the repository, design mappings, mapplets, and create sessions to load the data:
- Repository Server Administration Console. Use the Repository Server Administration console to administer the Repository Servers and repositories.
- Repository Manager. Use the Repository Manager to administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.
- Designer. Use the Designer to create mappings that contain transformation instructions for the PowerCenter Server. Before you can create mappings, you must add source and target definitions to the repository. The Designer has five tools that you use to analyze sources, design target schemas, and build source-to-target mappings:
- Source Analyzer. Import or create source definitions.
- Warehouse Designer. Import or create target definitions.
- Transformation Developer. Develop reusable transformations to use in mappings.
- Mapplet Designer. Create sets of transformations to use in mappings.
- Mapping Designer. Create mappings that the PowerCenter Server uses to extract, transform, and load data.
- Workflow Manager. Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. The PowerCenter Server runs workflow tasks according to the links connecting the tasks. You can run a task by placing it in a workflow.
- Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running workflows for each PowerCenter Server. You can choose a Gantt Chart or Task view. You can also access details about those workflow runs.
The PowerCenter Server reads mapping and session information from the repository. It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules that you configure in the mapping. The PowerCenter Server loads the transformed data into the mapping targets.
The PowerCenter Server can achieve high performance using symmetric multi-processing systems. The PowerCenter Server can start and run multiple workflows concurrently. It can also concurrently process partitions within a single session. When you create multiple partitions within a session, the PowerCenter Server creates multiple database connections to a single source and extracts a separate range of data for each connection, according to the properties you configure.
The PowerCenter Client uses ODBC and native drivers to connect to source and target databases. It uses TCP/IP to connect to the Repository Server. The Repository Server uses native drivers to connect to the repository database. The Workflow Manager and the PowerCenter Server use TCP/IP to communicate with each other.
The PowerCenter Server uses native drivers to connect to the databases to move data. You can optionally use ODBC to connect the PowerCenter Server to the source and target databases. It uses TCP/IP to connect to the PowerCenter Client.
The Repository Server maintains a pool of reusable database connections for serving client applications. The server generates a Repository Agent process for each database. The Repository Agent creates new database connections only if all the current connections are in use.
For example, if 10 clients send requests to the Repository Agent one at a time, the agent requires only one connection. It reuses the same database connection for all the requests. If the 10 clients send requests simultaneously, the Repository Agent opens 10 connections. You can set the maximum number of open connections using the DatabasePoolSize parameter in the repository configuration file.
For a session, a reader object holds the connection for as long as it needs to read the data from the source tables. A writer object holds a connection for as long as it needs to write data to the target tables.
The PowerCenter Server maintains a database connection pool for stored procedure or lookup databases in a workflow. You can optionally set the MaxLookupSPDBConnections parameter to limit connections when you configure the PowerCenter service. The PowerCenter Server allows an unlimited number of connections to lookup or stored procedure databases. If a database user does not have permission for the number of connections a session requires, the session fails.
For pre-session, post-session, and load stored procedures, consecutive stored procedures reuse a connection if they have identical connection attributes. Otherwise, the connection for one stored procedure closes and a new connection begins for the next stored procedure.
You can use PowerCenter Metadata Reporter, a web-based application, to run prepackaged dashboards and reports against PowerCenter repository metadata. These reports help give you insight into your repository, which enhances your ability to analyze and manage your repository efficiently.
You can run PowerCenter Metadata Reporter from a browser on any workstation, even a workstation that does not have PowerCenter tools installed.
Repository Server Administration Console:
Use the Repository Server Administration Console to administer Repository Servers and repositories. A Repository Server can manage multiple repositories. You use the Repository Server Administration Console to create and administer the repository through the Repository Server.
- Add, edit, and remove repository configurations.
- Export and import repository configurations.
- Create a repository.
- Promote a local repository to a global repository.
- Copy a repository.
- Delete a repository from the database.
- Back up and restore a repository.
- Start, stop, enable, and disable repositories.
- Send repository notification messages.
- Register and unregister a repository.
- Propagate domain connection information for a repository.
- View repository connections and locks.
- Close repository connections.
- Register and remove repository plug-ins.
- Upgrade a repository.
- Console Tree. Repository Servers and managed repositories. The Administration Console displays a different set of Action menu items depending on which node you select in the Console Tree. You can also right-click a node to access the Action menu items.
- Main. The Main window displays details of the node you select in the Console Tree. For example, if you select a repository in the Console Tree, the Main window displays the properties of the repository, such as the status and start time.
Use the Repository Manager to administer your repositories. The Repository Manager allows you to navigate through multiple folders and repositories, and perform the following tasks:
- Manage the repository. You can perform repository management functions, such as copying, creating, starting, and shutting down repositories. You launch the Repository Server Administration Console to perform these functions.
- Implement repository security. You can create, edit, and delete repository users and user groups. You can assign and revoke repository privileges and folder permissions.
- Perform folder functions. You can create, edit, copy, and delete folders. Work you perform in the Designer and Workflow Manager is stored in folders. If you want to share metadata, you can configure a folder to be shared.
- View metadata. You can analyze sources, targets, mappings, and shortcut dependencies, search by keyword, and view the properties of repository objects.
- Navigator. Displays all objects that you create in the Repository Manager, the Designer, and the Workflow Manager. It is organized first by repository, then by folder and folder version. Viewable objects include sources, targets, dimensions, cubes, mappings, mapplets, transformations, sessions, and workflows. You can also view folder versions and business components.
- Main. Provides properties of the object selected in the Navigator window. The columns in this window change depending on the object selected in the Navigator window.
- Dependency. Shows dependencies on sources, targets, mappings, and shortcuts for objects selected in either the Navigator or Main window.
- Output. Provides the output of tasks executed within the Repository Manager, such as creating a repository.
You create repository objects using the Repository Manager, Designer, and Workflow Manager client tools. You can view the following objects in the Navigator window of the Repository Manager:
- Source definitions. Definitions of database objects (tables, views, synonyms) or files that provide source data.
- Target definitions. Definitions of database objects or files that contain the target data.
- Multi-dimensional metadata. Target definitions that are configured as cubes and dimensions.
- Mappings. A set of source and target definitions along with transformations containing business logic that you build into the transformation. These are the instructions that the PowerCenter Server uses to transform and move data.
- Reusable transformations. Transformations that you can use in multiple mappings.
- Mapplets. A set of transformations that you can use in multiple mappings.
- Sessions and workflows. Sessions and workflows store information about how and when the PowerCenter Server moves data. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. A session is a type of task that you can put in a workflow. Each session corresponds to a single mapping.
WORKFLOW MONITOR:
The goal of the design process is to create mappings that depict the flow of data between sources and targets, including changes made to the data before it reaches the targets. However, before you can create a mapping, you must first create or import source and target definitions. You might also want to create reusable objects, such as reusable transformations or mapplets.
Perform the following design tasks in the Designer:
1. Import source definitions. Use the Source Analyzer to connect to the sources and import the source definitions.
2. Create or import target definitions. Use the Warehouse Designer to define relational, flat file, or XML targets to receive data from sources. You can import target definitions from a relational database or a flat file, or you can manually create a target definition.
3. Create the target tables. If you add a target definition to the repository that does not exist in a relational database, you need to create target tables in your target database. You do this by generating and executing the necessary SQL code within the Warehouse Designer.
4. Design mappings. Once you have source and target definitions in the repository, you can create mappings in the Mapping Designer. A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. A transformation is an object that performs a specific function in a mapping, such as looking up data or performing aggregation.
5. Create mapping objects. Optionally, you can create reusable objects for use in multiple mappings. Use the Transformation Developer to create reusable transformations. Use the Mapplet Designer to create mapplets. A mapplet is a set of transformations that may contain sources and transformations.
6. Debug mappings. Use the Mapping Designer to debug a valid mapping to gain troubleshooting information about data and error conditions.
7. Import and export repository objects. You can import and export repository objects, such as sources, targets, transformations, mapplets, and mappings to archive or share metadata.
- Navigator. Connect to repositories, and open folders within the Navigator. You can also copy objects and create shortcuts within the Navigator.
- Workspace. Open different tools in this window to create and edit repository objects, such as sources, targets, mapplets, transformations, and mappings.
- Output. View details about tasks you perform, such as saving your work or validating a mapping.
- Status bar. Displays the status of the operation you perform.
- Overview. An optional window to simplify viewing a workspace that contains a large mapping or multiple objects. Outlines the visible area in the workspace and highlights selected objects in color.
- Instance data. View transformation data while you run the Debugger to debug a mapping.
- Target data. View target data while you run the Debugger to debug a mapping.
Loading Data:
In the Workflow Manager, you define a set of instructions to execute tasks, such as sessions, emails, and shell commands. This set of instructions is called a workflow.
After you create a workflow in the Workflow Designer, the next step is to add tasks to the workflow. The Workflow Manager includes tasks, such as the Session task, the Command task, and the Email task so you can design your workflow. The Session task is based on a mapping you build in the Designer.
You then connect tasks with links to specify the order of execution for the tasks you created. Use conditional links and workflow variables to create branches in the workflow.
When the workflow start time arrives, the PowerCenter Server retrieves the metadata from the repository to execute the tasks in the workflow.
- Task Developer. Create tasks you want to accomplish in the workflow in the Task Developer.
- Workflow Designer. Create a workflow by connecting tasks with links in the Workflow Designer. You can also create tasks in the Workflow Designer as you develop the workflow.
- Worklet Designer. Create a worklet in the Worklet Designer. A worklet is an object that groups a set of tasks. A worklet is similar to a workflow, but without scheduling information. You can nest multiple worklets inside a workflow.
- PowerCenter Server connection. Register the PowerCenter Server with the repository before you can start it or create a session to run against it.
- Database connections. Create connections to source and target systems.
- Other connections. If you want to use external loaders or FTP, you configure these connections in the Workflow Manager.
- Navigator. Allows you to connect to and work in multiple repositories and folders.
- Workspace. Allows you to create, edit, and view tasks, workflows, and worklets.
- Output. Displays messages from the PowerCenter Server and the Repository Server. The Output window also displays messages when you save or validate tasks and workflows.
- Overview. An optional window that makes it easier to view workbooks containing large workflows. Outlines the visible area in the workspace and highlights selected objects in color. Choose View-Overview Window to display this window.
Workflow Monitor :
After you create a workflow, you run the workflow in the Workflow Manager and monitor it in the Workflow Monitor. The Workflow Monitor is a tool that displays details about workflow runs in two views, Gantt Chart view and Task view. You can monitor workflows in online and offline modes.
- Navigator window. Displays monitored repositories, servers, and repositories objects.
- Output window. Displays messages from the PowerCenter Server.
- Time window. Displays progress of workflow runs.
- Gantt Chart view. Displays details about workflow runs in chronological format.
- Task view. Displays details about workflow runs in a report format.
A transformation is a repository
object that generates, modifies, or passes data. The Designer provides a set of
transformations that perform specific functions. For example, an Aggregator
transformation performs calculations on groups of data.
Transformations in a mapping
represent the operations the Integration Service performs on the data. Data
passes through transformation ports that you link in a mapping or mapplet.
Transformations can be active or
passive. Transformations can be connected to the data flow, or they can be
unconnected.
Active
Transformations :
An active transformation can perform
any of the following actions:
![]() |
Change the number of rows that
pass through the transformation.
For example, the Filter transformation is active because it removes rows that
do not meet the filter condition. All multi-group transformations are active
because they might change the number of rows that pass through the
transformation.
|
![]() |
Change the transaction boundary. For example, the Transaction Control transformation is
active because it defines a commit or roll back transaction based on an
expression evaluated for each row.
|
![]() |
Change the row type. For example, the Update Strategy transformation is active
because it flags rows for insert, delete, update, or reject.
|
The Designer does not allow you to
connect multiple active transformations or an active and a passive
transformation to the same downstream transformation or transformation input
group because the Integration Service may not be able to concatenate the rows
passed by active transformations. For example, one branch in a mapping contains
an Update Strategy transformation that flags a row for delete. Another branch
contains an Update Strategy transformation that flags a row for insert. If you
connect these transformations to a single transformation input group, the
Integration Service cannot combine the delete and insert operations for the row.
The Sequence Generator
transformation is an exception to the rule listed above. The Designer does
allow you to connect a Sequence Generator transformation and an active
transformation to the same downstream transformation or transformation input
group. A Sequence Generator transformation does not receive data. It generates
unique numeric values. As a result, the Integration Service does not encounter
problems concatenating rows passed by a Sequence Generator transformation and
an active transformation.
The following figure shows how you
can connect an active transformation and a passive Sequence Generator
transformation to the same downstream transformation input group:
Passive
Transformations :
A passive transformation does not
change the number of rows that pass through the transformation, maintains the
transaction boundary, and maintains the row type.
The Designer allows you to connect
multiple transformations to the same downstream transformation or
transformation input group only if all transformations in the upstream branches
are passive. The transformation that originates the branch can be active or
passive.
The following figure shows how you
can connect passive transformations to the same downstream transformation input
group:
Unconnected
Transformations :
Transformations can be connected to
the data flow, or they can be unconnected. An unconnected transformation is not
connected to other transformations in the mapping. An unconnected
transformation is called within another transformation, and returns a value to
that transformation.
Transformation
Descriptions :
The following table provides a brief
description of each transformation:
Transformation
|
Type
|
Description
|
Aggregator
|
Active/
Connected
|
Performs aggregate calculations.
|
Application Source Qualifier
|
Active/
Connected
|
Represents the rows that the
Integration Service reads from an application, such as an ERP source, when it
runs a session.
|
Custom
|
Active or Passive/
Connected
|
Calls a procedure in a shared
library or DLL.
|
Data Masking
|
Passive
Connected
|
Replaces sensitive production data
with realistic test data for non-production environments.
|
Expression
|
Passive/
Connected
|
Calculates a value.
|
External Procedure
|
Passive/
Connected or Unconnected
|
Calls a procedure in a shared
library or in the COM layer of Windows.
|
Filter
|
Active/
Connected
|
Filters data.
|
HTTP
|
Passive/
Connected
|
Connects to an HTTP server to read
or update data.
|
Input
|
Passive/
Connected
|
Defines mapplet input rows.
Available in the Mapplet Designer.
|
Java
|
Active or Passive/
Connected
|
Executes user logic coded in Java.
The byte code for the user logic is stored in the repository.
|
Joiner
|
Active/
Connected
|
Joins data from different
databases or flat file systems.
|
Lookup
|
Passive/
Connected or Unconnected
|
Looks up values.
|
Normalizer
|
Active/
Connected
|
Source qualifier for COBOL
sources. Can also use in the pipeline to normalize data from relational or
flat file sources.
|
Output
|
Passive/
Connected
|
Defines mapplet output rows.
Available in the Mapplet Designer.
|
Rank
|
Active/
Connected
|
Limits records to a top or bottom
range.
|
Router
|
Active/
Connected
|
Routes data into multiple
transformations based on group conditions.
|
Sequence Generator
|
Passive/
Connected
|
Generates primary keys.
|
Sorter
|
Active/
Connected
|
Sorts data based on a sort key.
|
Source Qualifier
|
Active/
Connected
|
Represents the rows that the
Integration Service reads from a relational or flat file source when it runs
a session.
|
SQL
|
Active or Passive/
Connected
|
Executes SQL queries against a
database.
|
Stored Procedure
|
Passive/
Connected or Unconnected
|
Calls a stored procedure.
|
Transaction Control
|
Active/
Connected
|
Defines commit and rollback transactions.
|
Union
|
Active/
Connected
|
Merges data from different
databases or flat file systems.
|
Unstructured Data
|
Active or Passive/
Connected
|
Transforms data in unstructured
and semi-structured formats.
|
Update Strategy
|
Active/
Connected
|
Determines whether to insert,
delete, update, or reject rows.
|
XML Generator
|
Active/
Connected
|
Reads data from one or more input
ports and outputs XML through a single output port.
|
XML Parser
|
Active/
Connected
|
Reads XML from one input port and
outputs data to one or more output ports.
|
XML Source Qualifier
|
Active/
Connected
|
Represents the rows that the
Integration Service reads from an XML source when it runs a session.
|
When you build a mapping, you add
transformations and configure them to handle data according to a business
purpose. Complete the following tasks to incorporate a transformation into a
mapping:
1.
|
Create the transformation. Create it in the Mapping Designer as part of a mapping,
in the Mapplet Designer as part of a mapplet, or in the Transformation Developer
as a reusable transformation.
|
2.
|
Configure the transformation. Each type of transformation has a unique set of options
that you can configure.
|
3.
|
Link the transformation to other
transformations and target definitions. Drag
one port to another to link them in the mapping or mapplet.
|
No comments:
Post a Comment