Building a Dynamic Oracle ETL Procedure

vjain's picture
articles: 

If you are on Oracle 9i or higher, you have external tables and pipelined table functions available to meet your ETL needs. But in order to utilize these tools in a dynamic ETL environment, you need to design PL/SQL procedures that can support the loading of files dynamically. This article will provide you with one design that has proven to be very robust and scalable.

In order to understand the procedure design, you should have a basic understanding of how external tables and pipelined table functions are used. You can find more information and detailed examples at http://www.oracle-developer.com/oracle_etl.html. The result of both tools is the ability to select data from a file and insert it into a relational table.

Capturing File Information


The ETL procedure should accept the file name or path as an input parameter. If you need to capture any information from the file path into the relational table, then it is a good idea to use the file path as an input. For example if I had sales data that was uploaded from 3 different stores in 3 different incoming directories and I wanted to load all stores into the same table, I would probably want to include a column to identify the store in my table. However, if the file format and file name did not include the store information, I would need to capture the store information from the file path.

STORE A: /incoming/sales_data/storeA/20070720.dat
STORE B: /incoming/sales_data/storeB/20070720.dat
STORE C: /incoming/sales_data/storeC/20070720.dat

We would pass the entire path into the ETL procedure and the procedure would extract the store using a SQL query in the PL/SQL such as

SQL> with s as (select '/incoming/sales_data/storeA/20070720.dat' p_file_path from dual)
  2  select substr(p_file_path, instr(p_file_path, '/', 1,3)+1, 
  3  instr(p_file_path, '/', 1,4)-instr(p_file_path, '/', 1,3)-1) STORE_NAME from s
  4  /

STORE_
------
storeA

We could then extract the file name using a query such as
SQL> with s as (select '/incoming/sales_data/storeA/20070720.dat' p_file_path from dual)
  2  select substr(p_file_path, instr(p_file_path,'/', -1)+1, 
  3  length(p_file_path)-instr(p_file_path,'/', -1)) FILE_NAME from s
  4  /

FILE_NAME
------------
20070720.dat

If we want, we can get the file size as well. First create the following function:
CREATE OR REPLACE FUNCTION flength (
   location_in   IN   VARCHAR2,
   file_in       IN   VARCHAR2
)
   RETURN PLS_INTEGER

IS 
   TYPE fgetattr_t  IS RECORD (
      fexists       BOOLEAN,
      file_length   PLS_INTEGER,
      block_size    PLS_INTEGER
   );


   fgetattr_rec   fgetattr_t;
BEGIN
   UTL_FILE.fgetattr (
      location         => location_in,
      filename         => file_in,
      fexists          => fgetattr_rec.fexists,

      file_length      => fgetattr_rec.file_length,
      block_size       => fgetattr_rec.block_size
   );
   RETURN fgetattr_rec.file_length;
END flength;

And now, inside the PL/SQL procedure, we can capture the file size:

declare
l_file_size number;
begin
l_file_size := flength ('DAT_DIR’,l_filename);
end;
/

Now that we have the file name, we are ready to dynamically redefine the external table’s data source.

Redefine the External Table


From our original definition, the external table SAMPLE_EXT is pointing to a file called “sample.csv”. In order to load the different files from our incoming directories based on the input parameter, we will need to alter the table appropriately. After we have successfully stored the file name from the input parameter, we are ready to define our external table to point to the current data file.
 
execute immediate 'alter table SAMPLE_EXT location(''' || p_file_name || ''')'; 

Now that our external table is pointing at the file from our input parameter, we are ready to begin our insert from the external table to our relational table.

Insert the data


Since the external table is pointing at our new file, all we need to do is create the insert statement. We can pass any additional variables from the statement.

Insert into sample_tab
Select * from sample_ext
/

If there were some additional values defined in the destination table, we could include the variables in our SELECT. For example if we added a column, LOAD_DATE to the end of the relational table, SAMPLE_TAB, then we would just add SYSDATE at the end of the select fields as such…

Insert into sample_tab
Select A.*, sysdate from sample_ext A
/

If we need to capture the number of rows inserted, we can do so using (prior to the commit)
  
ln_rowcount := SQL%rowcount;

Setup for the demonstration


Here are the set up steps for the demonstration
Create directory DAT_DIR as ‘/usr/tmp’;
File: july19.csv saved to /usr/tmp
ID, BEGIN_DATE, FIRST_NAME, LAST_NAME, STATUS
1, 19-JUL-07, John, Adams, Active
2, 19-JUL-07, Tyler, Howell, Active
3, 19-JUL-07, Jim, Lopez, Active
4, 19-JUL-07, Carlos, White, Inactive
5, 19-JUL-07, Scott, Tiger, Active

File: july20.csv saved to /usr/tmp
ID, BEGIN_DATE, FIRST_NAME, LAST_NAME, STATUS
1, 20-JUL-07, John, Adams, Active
2, 20-JUL-07, Tyler, Howell, Active
3, 20-JUL-07, Jim, Lopez, Active
4, 20-JUL-07, Carlos, White, Inactive
5, 20-JUL-07, Scott, Tiger, Active

External and matching relational table definition
  
CREATE TABLE SAMPLE_EXT 
( 
ID NUMBER, 
BEGIN_DATE VARCHAR2(10),   
FIRST_NAME VARCHAR2(30), 
LAST_NAME VARCHAR2(30), 
STATUS VARCHAR2(10) 
)                       
 ORGANIZATION EXTERNAL
 (
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY DAT_DIR
   ACCESS PARAMETERS
   (
            records delimited by newline LOAD WHEN (ID != ‘ID’)
            badfile BAD_DIR:'SAMPLE_EXT%a_%p.bad'
            logfile LOG_DIR:'SAMPLE_EXT%a_%p.log'
fields terminated by ',' optionally enclosed by '"' LRTRIM
            MISSING FIELD VALUES ARE NULL 
   )
   LOCATION (‘sample.csv')
 )
 PARALLEL 4
 REJECT LIMIT 1000;

CREATE TABLE SAMPLE_TAB 
( 
ID NUMBER, 
BEGIN_DATE VARCHAR2(10),   
FIRST_NAME VARCHAR2(30), 
LAST_NAME VARCHAR2(30), 
STATUS VARCHAR2(10) 
);

Now that the objects are in place, we are ready to define our procedure.

The Final Procedure
Including each of the steps above, we can create a PL/SQL procedure that loads any file from our incoming directory (DAT_DIR) based on the input parameter. This gives us a powerful tool for loading files in a large scale, dynamic environment. Based on the information above, the PL/SQL code might be…

SQL> CREATE OR REPLACE FUNCTION SAMPLE_ETL_FNC( P_FILE_PATH VARCHAR2 ) RETURN NUMBER AS
  2  l_filename varchar2(100);
  3  l_rowcount number;
  4  l_sqlerrm varchar(2000);
  5  l_filesize number;
  6  BEGIN
  7  
  8  DBMS_OUTPUT.PUT_LINE(to_char(sysdate, 'YYYY-MON-DD HH:MI:SS') || ' - BEGIN ETL PROCEDURE - Path
: ' || p_file_path);
  9  
 10  --Flexible to allow for file name or file path
 11  if instr(p_file_path, '/') > 0 then
 12   select substr(p_file_path, instr(p_file_path,'/', -1)+1, 
 13   length(p_file_path)-instr(p_file_path,'/', -1))
 14   INTO l_filename
 15   from dual;
 16  else
 17   l_filename := p_file_path;
 18  end if;
 19  
 20  --Get file size if required
 21  --l_filesize := flength ('DAT_DIR',l_filename);
 22  DBMS_OUTPUT.PUT_LINE('Filename: ' || l_filename);
 23  
 24  BEGIN
 25   --Alter the Log and Bad file names for debugging
 26   --Can modify the bad and log file names if required
 27   DBMS_OUTPUT.PUT_LINE('ALTER EXTERNAL TABLE DATA SOURCE');
 28    execute immediate 'alter table SAMPLE_EXT location('''||l_filename||''')';
 29   --Prepare Insert SQL
 30   DBMS_OUTPUT.PUT_LINE('BEGIN INSERT FROM EXTERNAL TO DATA TABLE');
 31   execute immediate 'Insert into sample_tab select * from sample_ext A';
 32    l_rowcount := SQL%rowcount;
 33    DBMS_OUTPUT.PUT_LINE(to_char(sysdate,'DD-MON-YYYY HH:MI:SS AM') || ': INSERT SUCCESSFUL... IN
SERTED ' || l_rowcount || ' ROWS'); 
 34    COMMIT;
 35    
 36    RETURN 0;
 37   exception      
 38    when others then
 39       DBMS_OUTPUT.PUT_LINE('INSERT FAILED');
 40       l_sqlerrm:=sqlerrm;
 41       DBMS_OUTPUT.PUT_LINE(l_sqlerrm);
 42      ROLLBACK; 
 43     commit;
 44      RETURN 1;
 45   END; 
 46   
 47  exception      
 48    when others then
 49     DBMS_OUTPUT.PUT_LINE(SQLERRM);
 50      RETURN 1;
 51   END;
 52  /

Function created.

Test the Oracle ETL function by placing the sample files described above in the DAT_DIR folder.
SQL> select * from sample_tab
  2  /

no rows selected

SQL> set serveroutput on
SQL> declare
  2  l_return number;
  3  begin
  4  l_return := SAMPLE_ETL_FNC('july19.csv');
  5  dbms_output.put_line('RETURN STATUS: ' || l_return);
  6  end;
  7  /
2007-JUL-24 04:49:11 - BEGIN ETL PROCEDURE - Path: july19.csv
Filename: july19.csv
ALTER EXTERNAL TABLE DATA SOURCE
BEGIN INSERT FROM EXTERNAL TO DATA TABLE
24-JUL-2007 04:49:11 PM: INSERT SUCCESSFUL... INSERTED 5 ROWS
RETURN STATUS: 0

PL/SQL procedure successfully completed.


SQL> select * from sample_tab
  2  /

        ID BEGIN_DATE FIRST_NAME                     LAST_NAME                      STATUS
---------- ---------- ------------------------------ ------------------------------ --------
         1 19-JUL-07  John                           Adams                          Active
         2 19-JUL-07  Tyler                          Howell                         Active
         3 19-JUL-07  Jim                            Lopez                          Active
         4 19-JUL-07  Carlos                         White                          Inactive
         5 19-JUL-07  Scott                          Tiger                          Active

5 rows selected.

SQL> declare
  2  l_return number;
  3  begin
  4  l_return := SAMPLE_ETL_FNC('july20.csv');
  5  dbms_output.put_line('RETURN STATUS: ' || l_return);
  6  end;
  7  /
2007-JUL-24 04:50:16 - BEGIN ETL PROCEDURE - Path: july20.csv
Filename: july20.csv
ALTER EXTERNAL TABLE DATA SOURCE
BEGIN INSERT FROM EXTERNAL TO DATA TABLE
24-JUL-2007 04:50:16 PM: INSERT SUCCESSFUL... INSERTED 5 ROWS
RETURN STATUS: 0

PL/SQL procedure successfully completed.

SQL> select * from sample_tab
  2  /

        ID BEGIN_DATE FIRST_NAME           LAST_NAME                      STATUS
---------- ---------- -------------------- ------------------------------ --------
         1 19-JUL-07  John                 Adams                          Active
         2 19-JUL-07  Tyler                Howell                         Active
         3 19-JUL-07  Jim                  Lopez                          Active
         4 19-JUL-07  Carlos               White                          Inactive
         5 19-JUL-07  Scott                Tiger                          Active
         1 20-JUL-07  John                 Adams                          Active
         2 20-JUL-07  Tyler                Howell                         Active
         3 20-JUL-07  Jim                  Lopez                          Active
         4 20-JUL-07  Carlos               White                          Inactive
         5 20-JUL-07  Scott                Tiger                          Active

10 rows selected.

About the author

V.J. Jain is an Oracle Database and Applications Consultant and owner of Varun Jain, Inc. - Oracle Consulting. With over 12 years of experience with database systems, he specializes in database performance, development, interfaces, and high performance solutions. Based in Orange County, California, he actively explores Oracle's newest technologies and is a member of the Oracle Partner Network and Beta program. Additional material by him can be found at http://www.oracle-developer.com.

Comments

It is an excellent tutorial to find as sample ETL process. May I request some help in how to loop through a folder contents and load file names with specific ID(tenentID) and load to specific tables. I will be thankfull for ever.

vjain's picture

You need to create a procedure that creates a temporary table to list the files in the directory and iterate through them. Tom Kyte has some good examples for how to do this and I always believe there is no reason to re-write a good book. Check out the bottom of this page http://asktom.oracle.com/tkyte/omag/00-nov/o60tom.html

--
Varun Jain, Inc.
www.varunjaininc.com
Oracle Partner, Oracle 11g Beta Program