8 Refreshing a Data Warehouse

You must update your data warehouse on a regular basis to ensure that the information derived from it is current. The process of updating the data is called the refresh process, and this chapter describes the following topics:

About Refreshing Your Data Warehouse

Extraction, transformation and loading (ETL) is done on a schedule to reflect changes made to the original source system. During this step, you physically insert the new, updated data into the production data warehouse schema and take all the other steps necessary (such as building indexes, validating constraints, making backup copies) to make this new data available to the users. After this data has been loaded into the data warehouse, the materialized views must be updated to reflect the latest data.

The partitioning scheme of the data warehouse is often crucial in determining the efficiency of refresh operations in the data warehouse loading process. The loading process is often considered when choosing the partitioning scheme of data warehouse tables.

Most data warehouses are loaded with new data on a regular schedule. For example, every night, week, or month, new data is brought into the data warehouse. The data being loaded at the end of the week or month typically corresponds to the transactions for the week or month. In this common scenario, the data warehouse is being loaded by time. This suggests that the data warehouse tables be partitioned on a date column. In the data warehouse example, suppose the new data is loaded into the sales table every month. Furthermore, the sales table has been partitioned by month. These steps show how the load process will proceed to add the data for a new month (Q1 2006) to the table sales.

Example: Refreshing Your Data Warehouse

Many queries request few columns from the products, customers, and sales tables, restricting the query by date. A materialized view will speed up the majority of the queries against the three tables. Use a prebuilt table on top of which the materialized view will be created. Choose the partitioning strategy of the materialized view in synchronization with the sales table's partitioning strategy.

The following example shows the refreshing of a materialized view. It uses a partition exchange loading operation. The example is based on the sales table in the sh schema.

To refresh a materialized view:

  1. Create a table that will be the basis for the materialized view.

    CREATE TABLE sales_prod_cust_mv
    ( time_id DATE
    , prod_id NUMBER
    , prod_name VARCHAR2(50)
    , cust_id NUMBER
    , cust_first_name VARCHAR2(20)
    , cust_last_name VARCHAR2(40)
    , amount_sold NUMBER
    , quantity_sold NUMBER
    )
    PARTITION BY RANGE (time_id)
    ( PARTITION p1999 VALUES LESS THAN (TO_DATE('01-JAN-2000','DD-MON-YYYY'))
    , PARTITION p2000 VALUES LESS THAN (TO_DATE('01-JAN-2001','DD-MON-YYYY'))
    , PARTITION p2001h1 VALUES LESS THAN (TO_DATE('01-JUL-2001','DD-MON-YYYY'))
    , PARTITION p2001h2 VALUES LESS THAN (TO_DATE('01-JAN-2002','DD-MON-YYYY'))
    , PARTITION p2001q1 VALUES LESS THAN (TO_DATE('01-APR-2002','DD-MON-YYYY'))
    , PARTITION p2002q2 VALUES LESS THAN (TO_DATE('01-JUL-2002','DD-MON-YYYY'))
    , PARTITION p2002q3 VALUES LESS THAN (TO_DATE('01-OCT-2002','DD-MON-YYYY'))
    , PARTITION p2002q4 VALUES LESS THAN (TO_DATE('01-JAN-2003','DD-MON-YYYY'))
    , PARTITION p2003q1 VALUES LESS THAN (TO_DATE('01-APR-2003','DD-MON-YYYY'))
    , PARTITION p2003q2 VALUES LESS THAN (TO_DATE('01-JUL-2003','DD-MON-YYYY'))
    , PARTITION p2003q3 VALUES LESS THAN (TO_DATE('01-OCT-2003','DD-MON-YYYY'))
    , PARTITION p2003q4 VALUES LESS THAN (TO_DATE('01-JAN-2004','DD-MON-YYYY'))
    , PARTITION p2004q1 VALUES LESS THAN (TO_DATE('01-APR-2004','DD-MON-YYYY'))
    , PARTITION p2004q2 VALUES LESS THAN (TO_DATE('01-JUL-2004','DD-MON-YYYY'))
    , PARTITION p2004q3 VALUES LESS THAN (TO_DATE('01-OCT-2004','DD-MON-YYYY'))
    , PARTITION p2004q4 VALUES LESS THAN (TO_DATE('01-JAN-2005','DD-MON-YYYY'))
    , PARTITION p2005q1 VALUES LESS THAN (TO_DATE('01-APR-2005','DD-MON-YYYY'))
    , PARTITION p2005q2 VALUES LESS THAN (TO_DATE('01-JUL-2005','DD-MON-YYYY'))
    , PARTITION p2005q3 VALUES LESS THAN (TO_DATE('01-OCT-2005','DD-MON-YYYY'))
    , PARTITION p2005q4 VALUES LESS THAN (TO_DATE('01-JAN-2006','DD-MON-YYYY'))
    , PARTITION p2006q1 VALUES LESS THAN (TO_DATE('01-APR-2006','DD-MON-YYYY'))
    ) PARALLEL COMPRESS;
    
  2. Load the initial table from the sales table.

    ALTER SESSION ENABLE PARALLEL DML;
    INSERT /*+ PARALLEL smv */ INTO sales_prod_cust_mv smv
    SELECT /*+ PARALLEL s PARALLEL c */ s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id
    , cust_first_name
    , c.cust_last_name
    , SUM(s.amount_sold)
    , SUM(s.quantity_sold)
    FROM sales s
    , products p
    , customers c
    WHERE s.cust_id = c.cust_id
    AND s.prod_id = p.prod_id
    GROUP BY s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id
    , c.cust_first_name
    , c.cust_last_name;
    COMMIT;
    
  3. Create a materialized view.

    CREATE MATERIALIZED VIEW sales_prod_cust_mv
    ON PREBUILT TABLE
    ENABLE QUERY REWRITE
    AS SELECT s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id
    , c.cust_first_name
    , c.cust_last_name
    , SUM(s.amount_sold) amount_sold
    , SUM(s.quantity_sold) quantity_sold
    FROM sales s
    , products p
    , customers c
    WHERE s.cust_id = c.cust_id
    AND s.prod_id = p.prod_id
    GROUP BY s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id
    , c.cust_first_name
    , c.cust_last_name;
    
  4. Load a separate table to be exchanged with the new partition.

    CREATE TABLE sales_q1_2006 PARALLEL COMPRESS
    AS SELECT * FROM sales
    WHERE 0 = 1;
     
    /* This would be the regular ETL job */
     
    ALTER SESSION ENABLE PARALLEL DML;
     
    INSERT /* PARALLEL qs */ INTO sales_q1_2006 qs
    SELECT /* PARALLEL s */ prod_id
    , cust_id
    , add_months(time_id,3)
    , channel_id
    , promo_id
    , quantity_sold
    , amount_sold
    FROM sales PARTITION(sales_q4_2005) s;
     
    COMMIT;
     
    CREATE BITMAP INDEX bmp_indx_prod_id ON sales_q1_2006 (prod_id);
    CREATE BITMAP INDEX bmp_indx_cust_id ON sales_q1_2006 (cust_id);
    CREATE BITMAP INDEX bmp_indx_time_id ON sales_q1_2006 (time_id);
    CREATE BITMAP INDEX bmp_indx_channel_id ON sales_q1_2006 (channel_id);
    CREATE BITMAP INDEX bmp_indx_promo_id ON sales_q1_2006 (promo_id);
     
    ALTER TABLE sales_q1_2006 ADD CONSTRAINT sales_q_prod_fk
    FOREIGN KEY (prod_id) REFERENCES products(prod_id) ENABLE NOVALIDATE;
     
    ALTER TABLE sales_q1_2006 ADD CONSTRAINT sales_q_cust_fk
    FOREIGN KEY (cust_id) REFERENCES customers(cust_id) ENABLE NOVALIDATE;
     
    ALTER TABLE sales_q1_2006 ADD CONSTRAINT sales_q_time_fk
    FOREIGN KEY (time_id) REFERENCES times(time_id) ENABLE NOVALIDATE;
     
    ALTER table sales_q1_2006 ADD CONSTRAINT sales_q_channel_fk
    FOREIGN KEY (channel_id) REFERENCES channels(channel_id) ENABLE NOVALIDATE;
     
    ALTER table sales_q1_2006 ADD CONSTRAINT sales_q_promo_fk
    FOREIGN KEY (promo_id) REFERENCES promotions(promo_id) ENABLE NOVALIDATE;
     
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS('SH','SALES_Q1_2006');
    END;
    /
    
  5. Create and load a separate table to be exchanged with a partition in the materialized view.

    CREATE TABLE sales_mv_q1_2006 PARALLEL COMPRESS
    AS SELECT * FROM sales_prod_cust_mv
    WHERE 1 = 0;
    
    ALTER SESSION ENABLE PARALLEL DML;
    
    INSERT /*+ PARALLEL smv */ INTO sales_mv_q1_2006 smv
    SELECT /*+ PARALLEL s PARALLEL c */ s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id
    , cust_first_name
    , c.cust_last_name
    , SUM(s.amount_sold)
    , SUM(s.quantity_sold)
    FROM sales_q1_2006 s
    , products p
    , customers c 
    WHERE s.cust_id = c.cust_id
    AND s.prod_id = p.prod_id
    GROUP BY s.time_id
    , s.prod_id
    , p.prod_name
    , s.cust_id 
    , c.cust_first_name
    , c.cust_last_name;
    
    COMMIT;
    
  6. Gather statistics.

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS('SH','SALES_MV_Q1_2006');
    END;
    
  7. Exchange the partitions.

    ALTER TABLE sales
    EXCHANGE PARTITION sales_q1_2006
    WITH TABLE sales_q1_2006
    INCLUDING INDEXES WITHOUT VALIDATION;
     
    ALTER TABLE sales_prod_cust_mv
    EXCHANGE PARTITION p2006q1
    WITH TABLE sales_mv_q1_2006
    INCLUDING INDEXES WITHOUT VALIDATION;
    
  8. Inform the database that the materialized view is fresh again.

    ALTER MATERIALIZED VIEW sales_prod_cust_mv CONSIDER FRESH;
    

Note that because this scenario uses a prebuilt table and, because the constraints are not RELY constraints, the query rewrite feature will work only with the query_rewrite_integrity parameter set to STALE_TOLERATED.

Using Rolling Windows to Offload Data

A particularly effective way of removing and archiving your data is through the use of a rolling window. An example of using a rolling window is when the data warehouse stores the most recent 36 months of sales data. A new partition can be added to the sales table for each new month, and an old partition can be removed from the sales table. This way, you will always maintain 36 months of data in the warehouse.

Example: Using a Rolling Window

The following example shows a rolling window for the sales table in the sh schema.

To use a rolling window:

  1. Add the sales for December 2005.

    ALTER TABLE sales
    ADD PARTITION sales_12_2005 VALUES LESS THAN ('01-JAN-2006');
    

    Note that you must rebuild any existing indexes.

  2. Drop the partition for 1999.

    ALTER TABLE sales
    DROP PARTITION sales_1999;