13 October 2012

Optimising Data Pump Export and Import Performance

Background

Recently I was tasked to come up with ways to significantly improve expdp & impdp performance. The client had a 300GB database (total of 76GB dump files) and it took 2.5 hours or longer to export the database - 5 hours or longer to import it. The existing job was already using parallelism of 8.

The customer was concerned over timings because they were considering data pump as a means of doing a database upgrade (fresh database) on a separate system. Furthermore they wanted to be able to use data pump for refresh of test data in some of their test databases. They wanted the option to exclude some of the data objects too.

The system was AIX-based and database was running in an LPAR with 5 CPU cores assigned to it. It was also connected to a SAN, which could sustain 1.5GB/s read and writes from this system with an average of 2-3ms latencies; which in fairness was one of the most capable I/O sub systems that I have encountered.

Throughout I used the two parameters metrics=y and trace=1FF0300 in order to generate trace information and some (rather rudimentary) metrics.

The problem

I started out simple and chose to unload the largest table in the database with different levels of parallelism, which produced the following results:

Processes Time Estimated MB/s unload of 19.40GB (write only)
8 2m39.70s 169.12
16 1m27.95s 451.68
32 1m11.63s 728.23
64 1m12.79s 698.09

This confirmed a couple things:
  1. The problem with unload performance did not appear to necessarily be data size related; this test unloaded 25% of the raw data set in terms of size
  2. More parallelism is better on this system – up to at least 32 processes; I didn't bother with 48, 40 etc. as the unload performance issues appeared not to be hampered by I/O or even substantially be parallelism given the findings above
Also, at 32 and 64 processes we saw substantial amounts of library cache load lock and enq: TQ - DDL contention while the 5 cores were on average 20% busy.

Then I started off a new unload of the full database in order to determine the source of the waits. I found that things went very well until the dump came to a particular table. It turns out that the table had a BLOB. The BLOB data was as follows:
 
select 
  max(dbms_lob.GETLENGTH(lob_content)) max_len
, avg(dbms_lob.getlength(lob_content)) avg_len
, min(dbms_lob.getlength(lob_content)) min_len
, sum(CASE WHEN dbms_lob.GETLENGTH(lob_content) <= 4000 THEN 1
ELSE 0 END) lt4k
, sum(CASE WHEN dbms_lob.GETLENGTH(lob_content) > 4000 THEN 1
ELSE 0 END) gt4k
, sum(CASE WHEN dbms_lob.GETLENGTH(lob_content) IS NULL THEN 1 ELSE
0 END) isnull
from "TEST"."TESTTABLE";

   MAX_LEN    AVG_LEN    MIN_LEN       LT4K       GT4K     ISNULL  
---------- ---------- ---------- ---------- ---------- ----------  
      8000 7389.40765          1      31459     327232     641309  
With an 8KB block size and ENABLE STORAGE IN ROW CHUNK 8192 it meant that as each LOB locator was read it resulted in "db file sequential read" wait events (single-block reads) for approximately 91% of non-null LOBs, which was observed in v$session/v$session_wait (p1/p2/p3) for the data pump worker - and by correlating with dba_extents using file_id and block_id.

Worse, as product enhancement 5943346 (PARALLELISM OF DATAPUMP JOB ON TABLE WITH LOB COLUMN) indicates, parallelism for the whole dump and even at LOB table level went out the window because data pump serializes the dump when it comes to a LOB table. This was also observed as all sessions bar one data pump worker were sitting idle.

I also ran tests on the LOB table whereby I forced the access method by using ACCESS_METHOD with EXTERNAL_TABLE and DIRECT_PATH and I made the following findings:

  • One session still exported the table regardless of parallelism
  • The "db file sequential read" wait event was evident in both cases for the LOB segment read itself

In short, it didn't make a difference - but had to be tried. In actual fact I also tried both forced access methods on the full database dump minus the LOB table. In both cases timings were somewhat worse than just leaving data pump to choose access method on a per-table basis (DIRECT_PATH also failed for a queue table: ORA-31696: unable to export/import TABLE_DATA:"SYSMAN"."MGMT_NOTIFY_QTABLE" using client specified DIRECT_PATH method).

At this point I had split the problem into two - 1) the whole database minus LOB table and 2) the LOB table.

The whole database minus LOB table

After further tests I concluded that the following changes helped with export data pump:
  • parallel=32, compared to 8, reduced unload times from approximately 40 minutes to 12 minutes
  • parallel_execution_message_size (data pump workers are fed by PX slaves) was the default of 2152 bytes and increasing it to 16384 reduced the unload by another approximately 5% percent
Furthermore, setting both db_block_checking and db_block_checksum (not really recommended for obvious reasons) to false also yielded another few percent.

The LOB table

The key to improving the LOB table unload performance would be to in some way parallelise the load because only one session is unloading the data even when using a parallel parameter of more than 1. After spending some time considering this I tried a few different bucketing schemes - one was using the PK (using just one of the two columns) and NTILE for 32 processes but offered significant upfront overhead in terms of running a query to identify the buckets.

Instead, I came up with a more scalable scheme of identifying the bucket by modulo on dbms_rowid.rowid_block_number on the table. This led to a very simple perl script (see below). This worked out very well and I managed to unload the whole table with 32 processes in approximately 11 minutes, which was a considerable improvement on the more than 1.5 hours spent on the LOB table in a normal unload.

The whole database minus LOB table import

The target system for import was unfortunately a test system with only 2 CPU cores and it was felt on performance - and therefore the statistics below are not a like-for-like comparison between export and import on the system.

I ran an initial test import without the troublesome LOB table and the import took 2.5 hours. I was running out of time on my assignment and therefore decided to make a number of changes in one go (never really recommended, but needs must) as follows:

alter database flashback off;
alter database noarchivelog;

(from 2152)
alter system set parallel_execution_message_size=16384 scope=spfile;

(from 1G)
alter system set pga_aggregate_target=6G scope=both;

(from 32M)
alter system set streams_pool_size=128m scope=both;

The flashback and noarchivelog, which for full database imports would be fair options to turn off during import given that your rollback in case of error is quite simple, certainly reduced the import times considerably. However, the pga_aggregate_target proved to be the most important change in the overall scheme of things because indexes were built towards the end of the job and took 3 times longer than actually creating the tables and importing the data in this test.

For lack of better index build timings I used DBA_OBJECTS and DBA_INDEXES as follows:

 select avg(nvl((i.last_analyzed - o.created) * 86400, 0)) avg_ctime, max(nvl((i.last_analyzed - o.created) * 86400, 0)) max_ctime, min(nvl((i.last_a  
 nalyzed - o.created) * 86400, 0)) min_ctime, sum(nvl((i.last_analyzed - o.created) * 86400, 0)) rec_build_s, (max(i.last_analyzed) - min(o.created))  
 *86400 total_dur_s, min(o.created) start_t, max(i.last_analyzed) end_t, count(*) indexes  
  from dba_objects o, dba_indexes i  
 where o.owner IN ('TEST')  
   and o.object_type='INDEX'  
   and o.object_name = i.index_name  
 order by created;  
AVG_CTIME   MAX_CTIME  MIN_CTIME REC_BUILD_S TOTAL_DUR_S START_T             END_T               INDEXES  
---------- ---------- ---------- ----------- ----------- ------------------- ------------------- ----------  
2.10726644        512          0        3654        4191 2012-06-22 12:26:42 2012-06-22 13:36:33       1734

What I also found was that there was no overlap on index creation timings so the indexes were created sequentially. During import indexes are created/built one at a time but parallelism (PX slaves) is used during the rebuild for larger indexes. So there is a degree of parallelism involved in building one index at a time in order to maximise CPU usage and I/O on a multi-core system (aren't they all these days..?)

Furthermore by increasing pga_aggregate_target as much as I could on the system (without causing swapping) I managed to reduce the amount of disk sorts from 171 to 37, which cut a further couple of hundred seconds off the import.

 Statistic                   Total   per Second   per Trans  
 -------------------------------- ------------------ -------------- -------------  
 physical read total bytes      330,878,612,480  52,208,283.7  5,497,966.4  
 physical write total bytes     405,243,809,792  63,942,131.7  6,733,638.1  
 redo size                      127,460,689,524  20,111,616.7  2,117,920.5  
 sorts (disk)                                37           0.0          0.0  
 sorts (memory)                       1,754,519         276.8         29.2  
 sorts (rows)                     1,400,989,947     221,057.8     23,279.2  
 log switches (derived)                     129         73.28  

In the end, the total import time was reduced from 2.5 hours to 1 hour and 12 minutes.

The LOB table import

After various tests it quickly became apparent to me that creating the table up front without any constraints (and that included the implicit not null constraints) and then importing each chunk one at a time appeared very fast indeed. The total time spent on importing all 32 buckets was approximately 11 minutes.

time impdp \'/ as sysdba\' tables=TEST.TESTTABLE dumpfile=TEST_DPUMP:TESTTABLE001 logfile=TEST_DPUMP:imp_TESTTABLE001 TABLE_EXISTS_ACTION=append
time impdp \'/ as sysdba\' tables=TEST.TESTTABLE dumpfile=TEST_DPUMP:TESTTABLE101 logfile=TEST_DPUMP:imp_TESTTABLE101 TABLE_EXISTS_ACTION=append
...

It was then a matter of adding the constraints and indexes afterwards (dump DDL using DBMS_METADATA), which took only a few minutes - and could use index rebuild parallelism as well, as needed.

Conclusion

There is certainly some performance improvements to be made on the standard data pump jobs. They do require a fair bit of testing and experimentation in a given environment but may be well worth it if data pump performance is a concern. Hopefully, the article will also have provided some troubleshooting and ideas for solutions and workarounds to some of the most common data pump performance problems.

I should also mention that the following Oracle Support notes are good starting points for understanding performance related issues with Data Pump and they are well worth reading. There are a number of both 10g and 11g related bugs that can seriously affect performance and it's well worth having a look around rather than just accepting slow performance.
  • Master Note for Data Pump [ID 1264715.1]
  • Checklist For Slow Performance Of DataPump Export (expdp) And Import (impdp) [ID 453895.1]
  • Parallel Capabilities of Oracle Data Pump [ID 365459.1]

Export script: parallel_dump.pl

 #!/usr/bin/perl  
 use strict;  
 my $PARALLEL=32;  
 my $PFILE="parallel_table_exp.par";  
 my @row;  
 my $sp_flash_time=`sqlplus -S / as sysdba <<EOF  
 set pages 9999 lines 172  
 col x format a172  
 set trimout on  
 set trimspool on  
 select 'row:' || TO_CHAR(SYSDATE, 'dd/mon/yyyy hh24:mi:ss') x from dual  
 /  
 EOF`;  
 my $rc = $?;  
 if ($rc) {  
  print "sqlplus failed - exiting";  
  exit $?;  
 }  
 my $flash_time;  
 foreach ( split (/\n/, $sp_flash_time) ) {  
  chomp;  
  if ( /^row:/ ) {  
   s/^row://g;  
   $flash_time = $_;  
  }  
 }  
 if ( !defined ($flash_time) ) {  
  print "Unable to get database time - exiting";  
 }  
 print "Using flash time of: " . $flash_time . "\n";  
 open(PARFILE, ">", $PFILE);  
 print PARFILE "flashback_time=\"to_timestamp('${flash_time}', 'dd/mon/yyyy hh24:mi:ss')\"\n";  
 print PARFILE "parallel=1\n";  
 print PARFILE "exclude=statistics\n";  
 close(PARFILE);  
 $SIG{CHLD} = 'IGNORE';  
 my $shard = 0;  
 my $cmd;  
 foreach ($shard = 0 ; $shard < $PARALLEL ; $shard++) {  
   $cmd = "expdp \\\'/ as sysdba\\\' tables=TEST.TESTTABLE dumpfile=TEST_DPUMP:RT${shard}%U logfile=TEST_DPUMP:RT${shard}";  
   $cmd .= " parfile=${PFILE}";  
   $cmd .= " query=TEST.TESTTABLE:\'\"where mod(dbms_rowid.rowid_block_number(rowid), ${PARALLEL}) = " . $shard . "\"\'";  
   $cmd .= " &";  
   print "Starting: $cmd\n";  
   my $cpid = system($cmd);  
 }  

15 comments:

pafke said...

I ran into this BLOB very slow dump and realized that is was neccesary to split the dump into smaller pieces.
The modulo approach solved the problem. Next thing I ran into was that we had multiple tables with blobs.
I used this statement to as a base to scatter the jobs over multiple tables:
select segment_name, bytes, owner,
(select table_name from dba_lobs a where d.segment_name = a.segment_name) as tablename, d.tablespace_name
from dba_segments d
where segment_type = 'LOBSEGMENT'
order by bytes desc;

Geno Pais said...

Tak Morten!! Very usefull..

Unknown said...

Great post morten.
Curious... What was your MBRC set to?

Chink said...

Thank you for article, I tested the export of LOB using the perl script and it worked like a charm.

rockingh0rse said...

Very good write-up. Very informative and nicely presented.
Top job!

DaDBm com said...

Hi Morten
Thanks, very useful post.
As an addition, converting LOBs to SecureFile LOBs can considerably improve export performance.

-- Kirill Loifman, www.dadbm.com

palheart said...

dear dublin and all

first.. thank u for the interesting article..

i realy need a help here.. i start working with a new banking solution called temenos t24.. in the database i have :

- XMLTYPE fields: 86054
- CLOB fields: 267369
- VARCHAR2: 663984

the expdp takes about 6 hours
the impdp takes about 4 hours

we r in the implementation phase so we dont have data yet, the dump file's size is 5GB.

we r using oracle 11.2.0.4 on ibm AIX server.

can anyone help in the expdp and impdp time...

Ravi Kosaraju said...

Hi Morten
Nice use of DIY parallelism. Works well for our situation. We had a datapump job with a single parallel thread taking upwards of 24 hours. Using 32 parallel threads, we can now export that table in under 1 hour.

Thank you
- Ravi

Unknown said...

What a great solution. I'm dealing with a 1.1TB database where one single table is nearly 600GB and most of the data was in a single BLOB field. It was killing the expdp. First time we tried it the export took 45 hours. Leave the table out and it took 1 hour (ok, with some additional streams too). So now I'm working on this parallel method to get the massive table out in segments. I'm hopeful now we can get the whole thing done in pieces in just a few hours.

Thanks for the good work.

sathis said...

Thank you Morten..your solution helped to reduce the export time tremendously for our database with 125GB of CLOB. Actually datapump session estimation says 30 to 40 hrs for this CLOB and progressed very slowly. Then found your solution and followed, it completed in 1 hr !!!!.

Thanks again for the great post.

B. 'Aqila Mujahid said...
This comment has been removed by the author.
B. 'Aqila Mujahid said...

Thank you, Morten, for this great solution! It works like a charm. I'm wondering if the script could be modified to use a network_link option for impdp, or do you discourage streaming the data rather than writing it out to a physical file? Either way your process is super fast (probably faster than the time it takes to copy the dump files between servers).

Cheers :)

Andreas Lindholm said...

Thanks alot for this! Worked like a charm on my 2,2tb lobtable, Decreased the export time from 48 hours to around 5!!!

For some reasons i run into a few of these:

ORA-31626: job does not exist
ORA-31633: unable to create master table "SYS.SYS_EXPORT_TABLE_26"
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 95
ORA-06512: at "SYS.KUPV$FT", line 1020
ORA-00955: name is already used by an existing object

Solved it by finding the logfiles missing and re-exported manually that part.
"where mod(dbms_rowid.rowid_block_number(rowid), 32) = 12"

/Andy

Tukaram Lonar said...

Thanks a lot , this blog helped me to export 480 GB LOB object in 3.5 hour, import took just 3 hour , if i just start export without this script which was taking 24 hour just for export, inovative script which helped me to do parallelism on LOB object.

Maxime Dumont said...

When I try the impdp on my lob table without any constraint and index, I realized that there was a locking issue (tm: contention / exclusive lock) hold by one of the impdp session. So the others sessions were waiting after the session holding the lock. I found on MOS (1467662.1) that I have to use the parameter "DATA_OPTIONS=DISABLE_APPEND_HINT" to solve this locking issue.

And now I can import with 8 impdp jobs without any problem.