How to Compare Two Tables – Keep Learning Keep Living

Hi,

Few days ago, I joined Chris Saxon’s webinar about “how to compare two tables”. Chris is one of the amazing guy in Oracle community and he shared different methods about how to compare two tables and also, strengths and weakness of the methods. before reading my post, I strongly suggest you to watch it on youtube:

I won’t go into details of the methods because Chris did this already 🙂

Until today, I always used double minus method and I must confess that never thought about “group by” method. I mostly don’t need to find duplicate rows in one table that’s because I use double minus. biggest downside is LOB objects. I must remove lob columns before comparing the data sets. Same thing applies to “group by” method too. I must admit that it was an unnecessary workload to write all the columns and remove LOB columns etc (with SQL Macro, we can save so much time while working).

So, I get a screen shot that main methods that Chris discussed:

As you would guess, group by method is the fastest method because it reads table once and generate data we want. I don’t count “full outer join” because all those null column break it and it would be even harder to write a NVL for every each of them.

So, I want to use group by method anymore to find differences between two tables but as you can see it does not support LOBs because we cannot group data on lob columns. what can we do? A saver is “Hash” functions. we can generate a hash data for our LOB data an use hash data to compare lobs.

Here is an example:

create table tmp1 as select* from v$sql;
exec dbms_Session.sleep(10); -- wait a little bit to change some data in v$sql
create table tmp2 as select* from v$sql;

with t1 as (select 0 c1, 1 c2, sql_Text, sql_id, dbms_crypto.hash(sql_fulltext, 2) sql_fulltext from tmp1),
  t2 as (select 1 c1, 0 c2, sql_Text, sql_id, dbms_crypto.hash(sql_fulltext, 2) sql_fulltext from tmp2)
select sql_Text, sql_id,sql_fulltext
from   (
         select * from t1
         union all
         select * from t2
       )
group by sql_Text, sql_id,sql_fulltext
having sum(c1) <> sum(c2) ;

create table tmp1 as select* from v$sql;

exec dbms_Session.sleep(10); -- wait a little bit to change some data in v$sql

create table tmp2 as select* from v$sql;

with t1 as (select 0 c1, 1 c2, sql_Text, sql_id, dbms_crypto.hash(sql_fulltext, 2) sql_fulltext from tmp1),

t2 as (select 1 c1, 0 c2, sql_Text, sql_id, dbms_crypto.hash(sql_fulltext, 2) sql_fulltext from tmp2)

select sql_Text, sql_id,sql_fulltext

from (

select * from t1

union all

select * from t2

)

group by sql_Text, sql_id,sql_fulltext

having sum(c1) <> sum(c2) ;

In my example, I copied V$SQL into two tables with some time delay so that I will have both similar and different rows. I use dbms_crypto.hash function to generate hash value of a LOB column. second parameter 2 means use MD5 to generate hash. if you have more clob data you can use bigger hash algorithms but of course it will take more time:

HASH_MD4 CONSTANT PLS_INTEGER := 1;
HASH_MD5 CONSTANT PLS_INTEGER := 2;
HASH_SH1 CONSTANT PLS_INTEGER := 3;
HASH_SH256 CONSTANT PLS_INTEGER := 4;
HASH_SH384 CONSTANT PLS_INTEGER := 5;
HASH_SH512 CONSTANT PLS_INTEGER := 6;

you can use any hash algorithms.

Of course, again, it is not easy to find all lob columns then add hash function around them and then compare data but thanks to SQL Macro, we can automate this:

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t) return clob sql_macro as
  v_col clob;
  v_group_col clob;
begin

  for i in 1..p_table.column.count loop
    if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob) then
      v_col := v_col  || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;
    else
      v_col := v_col  || ','||p_table.column(i).description.name;
    end if;

    v_group_col := v_group_col || ','||p_table.column(i).description.name;
  end loop;
  v_col := ltrim(v_col , ',');
  v_group_col := ltrim(v_group_col, ',');

  return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),
               t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)
          select ' || v_group_col ||'
          from   (
                    select * from t1
                    union all
                    select * from t2
                 )
          group by ' || v_group_col ||'
          having sum(c1) <> sum(c2)';
end;
/

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t) return clob sql_macro as

v_col clob;

v_group_col clob;

begin

for i in 1..p_table.column.count loop

if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob) then

v_col := v_col || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;

else

v_col := v_col || ','||p_table.column(i).description.name;

end if;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end loop;

v_col := ltrim(v_col , ',');

v_group_col := ltrim(v_group_col, ',');

return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),

t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)

select ' || v_group_col ||'

from (

select * from t1

union all

select * from t2

)

group by ' || v_group_col ||'

having sum(c1) <> sum(c2)';

end;

This sql macro function will generate columns automatically and also add hash function for LOB columns. All we need to do is give two tables / common table expressions (like with clause aliases). on this sample code, I assume that both data sets / tables has same columns with same data type. before generating SQL statemen, I loop through the columns and if there are any CLOB or BLOB then I add “dbms_crypto.hash” function around them. I also use actual lob column name as alias to it.

While generating group by columns, I cannot use dbms_crypto statement because I use it in a subselect. so I also have a second column list (v_group_col) which only stores column names and I use them in group by and select list.

-- output suppressed
select * from sf_get_row_diff(tmp1, tmp2);
...

-- output suppressed

select * from sf_get_row_diff(tmp1, tmp2);

...

Now I have all the different rows between two tables. As I said you can use common table expression too:

-- output suppressed
with t1 as (select a,b,c from mytable),
     t1 as (select a,b,c from yourtable),
select * from sf_get_row_diff(t1,t2);
...

-- output suppressed

with t1 as (select a,b,c from mytable),

t1 as (select a,b,c from yourtable),

select * from sf_get_row_diff(t1,t2);

...

even if it is fast, comparing lob columns could be very slow because big lobs means more time to generate hash and many rows means much much more time. so we might want to exclude LOB columns from comparison. A simple parameter could help:

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t, p_compare_lobs number := 1) return clob sql_macro as
  v_col clob;
  v_group_col clob;
begin

  for i in 1..p_table.column.count loop
    if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob)  then
      if p_compare_lobs = 1 then
        v_col := v_col  || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;
        v_group_col := v_group_col || ','||p_table.column(i).description.name;
      end if;
    else
      v_col := v_col  || ','||p_table.column(i).description.name;
      v_group_col := v_group_col || ','||p_table.column(i).description.name;
    end if;

  end loop;
  v_col := ltrim(v_col , ',');
  v_group_col := ltrim(v_group_col, ',');

  return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),
               t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)
          select ' || v_group_col ||'
          from   (
                    select * from t1
                    union all
                    select * from t2
                 )
          group by ' || v_group_col ||'
          having sum(c1) <> sum(c2)';
end;
/

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t, p_compare_lobs number := 1) return clob sql_macro as

v_col clob;

v_group_col clob;

begin

for i in 1..p_table.column.count loop

if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob) then

if p_compare_lobs = 1 then

v_col := v_col || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

else

v_col := v_col || ','||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

end loop;

v_col := ltrim(v_col , ',');

v_group_col := ltrim(v_group_col, ',');

return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),

t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)

select ' || v_group_col ||'

from (

select * from t1

union all

select * from t2

)

group by ' || v_group_col ||'

having sum(c1) <> sum(c2)';

end;

if p_compare_lobs columns is 1 (which is default) then it will add lob columns into compare and if anything else, it will not add lob columns into compare. you can even understand this from the timing:

select * from sf_get_row_diff(tmp1, tmp2, 1);
...
Elapsed: 00:00:06.663

select * from sf_get_row_diff(tmp1, tmp2, 0);
...
Elapsed: 00:00:02.145

select * from sf_get_row_diff(tmp1, tmp2, 1);

...

Elapsed: 00:00:06.663

select * from sf_get_row_diff(tmp1, tmp2, 0);

...

Elapsed: 00:00:02.145

So, from now on, this function is more than enough to compare two tables for me. We can also add an excluded column list too as a forth parameter, so we can remove those columns from compare like update_date or create_user columns which mostly contain metadata and not actual business data. I won’t create an example for it too but you can see an example on one of my previous posts: https://mustafakalayci.me/2020/11/02/which-column-has-changed-in-every-row/

Many thanks to Chris for bringing up the topic and provide effective solutions. I learned a lot and added something new to me.

Wish you all healthy and happy days.

EDIT:

Ok, let’s not take the easy way out 🙂 here is another example based on my first design:

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual 
            union all 
            select 2 a, 20 b, sysdate -2000 create_date from dual ),
     m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual 
            union all 
            select 2 a, 21 b, sysdate +2000 create_date from dual 
            union all 
            select 3 a, 30 b, sysdate +2000 create_date from dual )
select *
from sf_get_row_diff(m1, m2, p_compare_lobs => 1);

         A          B CREATE_DATE        
---------- ---------- -------------------
         1         10 2029-04-18 11:27:58
         1         10 2018-05-06 11:27:58
         2         20 2018-05-06 11:27:58
         2         21 2029-04-18 11:27:58
         3         30 2029-04-18 11:27:58

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual

union all

select 2 a, 20 b, sysdate -2000 create_date from dual ),

m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual

union all

select 2 a, 21 b, sysdate +2000 create_date from dual

union all

select 3 a, 30 b, sysdate +2000 create_date from dual )

select *

from sf_get_row_diff(m1, m2, p_compare_lobs => 1);

A B CREATE_DATE

---------- ---------- -------------------

1 10 2029-04-18 11:27:58

1 10 2018-05-06 11:27:58

2 20 2018-05-06 11:27:58

2 21 2029-04-18 11:27:58

3 30 2029-04-18 11:27:58

so, I want to compare two tables (I used CTE (common table expression) here) m1 and m2 and all rows are different because they have different create_date values. Sometimes we may not want to compare “all” columns for example create_Date might be different for the rows and we might want to exclude it. so how can I adapt this into my sql macro? Simply we can pass the columns that we don’t want to use. First we need to add a new parameter for excluded columns and then remove those columns from select/group by etc list.

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t, 
                                           p_compare_lobs number := 1, 
                                           p_excluded_columns dbms_tf.columns_t := null) return clob sql_macro as
  v_col clob;
  v_group_col clob;
begin

  for i in 1..p_table.column.count loop
    continue when p_table.column(i).description.name member of p_excluded_columns;
    
    if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob)  then
      if p_compare_lobs = 1 then
        v_col := v_col  || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;
        v_group_col := v_group_col || ','||p_table.column(i).description.name;
      end if;
    else
      v_col := v_col  || ','||p_table.column(i).description.name;
      v_group_col := v_group_col || ','||p_table.column(i).description.name;
    end if;

  end loop;
  v_col := ltrim(v_col , ',');
  v_group_col := ltrim(v_group_col, ',');

  return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),
               t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)
          select ' || v_group_col ||'
          from   (
                    select * from t1
                    union all
                    select * from t2
                 )
          group by ' || v_group_col ||'
          having sum(c1) <> sum(c2)';
end;
/

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t,

p_compare_lobs number := 1,

p_excluded_columns dbms_tf.columns_t := null) return clob sql_macro as

v_col clob;

v_group_col clob;

begin

for i in 1..p_table.column.count loop

continue when p_table.column(i).description.name member of p_excluded_columns;

if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob) then

if p_compare_lobs = 1 then

v_col := v_col || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

else

v_col := v_col || ','||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

end loop;

v_col := ltrim(v_col , ',');

v_group_col := ltrim(v_group_col, ',');

return 'with t1 as (select 0 c1, 1 c2, ' || v_col ||' from p_Table),

t2 as (select 1 c1, 0 c2, ' || v_col ||' from p_Table2)

select ' || v_group_col ||'

from (

select * from t1

union all

select * from t2

)

group by ' || v_group_col ||'

having sum(c1) <> sum(c2)';

end;

I wrote whole code again but I only added 2 new lines, 3th and 10th lines. p_excluded_columns parameter will be the list of columns that we want to exclude from comparison. In the for loop which we generate the group by columns, I continue the loop if the current column is a member of columns array. let’s run the query again with excluded columns:

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual 
            union all 
            select 2 a, 20 b, sysdate -2000 create_date from dual ),
     m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual 
            union all 
            select 2 a, 21 b, sysdate +2000 create_date from dual 
            union all 
            select 3 a, 30 b, sysdate +2000 create_date from dual )
select *
from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date));

         A          B
---------- ----------
         2         20
         3         30
         2         21

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual 
            union all 
            select 2 a, 20 b, sysdate -2000 create_date from dual ),
     m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual 
            union all 
            select 2 a, 21 b, sysdate +2000 create_date from dual 
            union all 
            select 3 a, 30 b, sysdate +2000 create_date from dual )
select *
from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date, b));

         A
----------
         3

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual

union all

select 2 a, 20 b, sysdate -2000 create_date from dual ),

m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual

union all

select 2 a, 21 b, sysdate +2000 create_date from dual

union all

select 3 a, 30 b, sysdate +2000 create_date from dual )

select *

from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date));

A B

---------- ----------

2 20

3 30

2 21

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual

union all

select 2 a, 20 b, sysdate -2000 create_date from dual ),

m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual

union all

select 2 a, 21 b, sysdate +2000 create_date from dual

union all

select 3 a, 30 b, sysdate +2000 create_date from dual )

select *

from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date, b));

----------

as you see, create_Date column is not included into comparison and row with a value 1 is not in the result because it is same on both tables when create_Date is excluded.

quite nice but for the rows with a value 2, which one is on which table? for few rows it is easy to identify but if there are over a million rows in the tables and thousands are different, then it will be hard to understand which table they are in. Can we add a new column to show table? of course, we just need to change our sql template in return command.

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t, 
                                           p_compare_lobs number := 1, 
                                           p_excluded_columns dbms_tf.columns_t := null) return clob sql_macro as
  v_col clob;
  v_group_col clob;
begin

  for i in 1..p_table.column.count loop
    continue when p_table.column(i).description.name member of p_excluded_columns;
    
    if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob)  then
      if p_compare_lobs = 1 then
        v_col := v_col  || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;
        v_group_col := v_group_col || ','||p_table.column(i).description.name;
      end if;
    else
      v_col := v_col  || ','||p_table.column(i).description.name;
      v_group_col := v_group_col || ','||p_table.column(i).description.name;
    end if;

  end loop;
  v_col := ltrim(v_col , ',');
  v_group_col := ltrim(v_group_col, ',');

  return 'with t1 as (select 0 c1, 1 c2, ''First Table'' as tbl,' || v_col ||' from p_Table),
               t2 as (select 1 c1, 0 c2, ''Second Table'' as tbl,' || v_col ||' from p_Table2)
          select Min(Tbl) Which_Table, ' || v_group_col ||'
          from   (
                    select * from t1
                    union all
                    select * from t2
                 )
          group by ' || v_group_col ||'
          having sum(c1) <> sum(c2)';
end;
/

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual 
            union all 
            select 2 a, 20 b, sysdate -2000 create_date from dual ),
     m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual 
            union all 
            select 2 a, 21 b, sysdate +2000 create_date from dual 
            union all 
            select 3 a, 30 b, sysdate +2000 create_date from dual )
select *
from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date));

WHICH_TABLE           A          B
------------ ---------- ----------
First Table           2         20
Second Table          3         30
Second Table          2         21

create or replace function sf_get_row_diff(p_table dbms_tf.table_t, p_table2 dbms_tf.table_t,

p_compare_lobs number := 1,

p_excluded_columns dbms_tf.columns_t := null) return clob sql_macro as

v_col clob;

v_group_col clob;

begin

for i in 1..p_table.column.count loop

continue when p_table.column(i).description.name member of p_excluded_columns;

if p_table.column(i).description.type in (dbms_tf.TYPE_clob, dbms_tf.TYPE_blob) then

if p_compare_lobs = 1 then

v_col := v_col || ',dbms_crypto.hash('||p_table.column(i).description.name || ',2) as ' ||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

else

v_col := v_col || ','||p_table.column(i).description.name;

v_group_col := v_group_col || ','||p_table.column(i).description.name;

end if;

end loop;

v_col := ltrim(v_col , ',');

v_group_col := ltrim(v_group_col, ',');

return 'with t1 as (select 0 c1, 1 c2, ''First Table'' as tbl,' || v_col ||' from p_Table),

t2 as (select 1 c1, 0 c2, ''Second Table'' as tbl,' || v_col ||' from p_Table2)

select Min(Tbl) Which_Table, ' || v_group_col ||'

from (

select * from t1

union all

select * from t2

)

group by ' || v_group_col ||'

having sum(c1) <> sum(c2)';

end;

with m1 as (select 1 a, 10 b, sysdate -2000 create_date from dual

union all

select 2 a, 20 b, sysdate -2000 create_date from dual ),

m2 as (select 1 a, 10 b, sysdate +2000 create_date from dual

union all

select 2 a, 21 b, sysdate +2000 create_date from dual

union all

select 3 a, 30 b, sysdate +2000 create_date from dual )

select *

from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns=>columns(create_date));

WHICH_TABLE A B

------------ ---------- ----------

First Table 2 20

Second Table 3 30

Second Table 2 21

I added a new column, in with clauses to be able to identify the tables. Now, we can see the which table that row is in on the first column. You may wonder why I didn’t write the actual table name on the column. because I couldn’t 🙂 varchar2 data is restricted in sql macro so I couldn’t read the table name from “p_table” parameter, so I decided to go with pre defined value.

one more thing to add, what would happen if we exclude all the columns? we will get an error about wrong sql statement of course because select list, group by list will be empty. we can add a small check after the loop:

    ...
    end loop;

    if v_group_col is null then
      return 'select ''there are no columns left. choose at least one column'' from dual';
    end if;
    
    v_col := ...
    ...

...

end loop;

if v_group_col is null then

return 'select ''there are no columns left. choose at least one column'' from dual';

end if;

v_col := ...

...

here is the result if we exclude all columns from now on:

with m1 as (select 1 a, 10 b, 9999 c from dual 
            union all 
            select 2 a, 20 b, 9999 c from dual ),
     m2 as (select 1 a, 10 b, 5555 c from dual 
            union all 
            select 2 a, 21 b, 5555 c from dual 
            union all 
            select 3 a, 30 b, 5555 c from dual )
select *
from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns => columns(a,c,b));

'THEREARENOCOLUMNSLEFT.CHOOSEATLEASTONECOLUMN'
-----------------------------------------------------
there are no columns left. choose at least one column

with m1 as (select 1 a, 10 b, 9999 c from dual

union all

select 2 a, 20 b, 9999 c from dual ),

m2 as (select 1 a, 10 b, 5555 c from dual

union all

select 2 a, 21 b, 5555 c from dual

union all

select 3 a, 30 b, 5555 c from dual )

select *

from sf_get_row_diff(m1, m2, p_compare_lobs => 1, p_excluded_columns => columns(a,c,b));

'THEREARENOCOLUMNSLEFT.CHOOSEATLEASTONECOLUMN'

-----------------------------------------------------

there are no columns left. choose at least one column

thanks for reading, any comments and questions are welcomed.

Leave a Reply Cancel reply