Monday 24 March 2014

ORACLE: sql - JOIN vs IN vs EXISTS and JOIN vs NOT IN vs NOT EXISTS the logical difference

ORACLE: sql - JOIN vs IN vs EXISTS  and JOIN vs NOT IN vs NOT EXISTS the logical difference

There is a common misconception that IN behaves equaliy to EXISTS or JOIN in terms of returned results.

This is simply not true. To see why not, let's review what each statement does.



IN:

Returns true if a specified value matches any value in a subquery or a list.

Exists:

Returns true if a subquery contains any rows.

Join:

Joins 2 resultsets on the joining column.



If not read carefully it looks pretty same so far, doesn't it.

The difference comes when you take into account the "dreaded" THREE-VALUED LOGIC.

Let's review that also:


-- this is true
SELECT 1 from dual  WHERE 1 = 1;

-- this is false
SELECT 1 from dual  WHERE 1 = 0;

-- this is unknown - it is usually expected to be false, but that only shows
-- misunderstanding of nulls. It's not false it's only treated as false in the filter
SELECT 1 from dual  WHERE 1 = NULL;


-- this is also unknown - but logicaly it would seem it would be true
-- but unknown compared to unknown equals uknown and it is treated as false in the filter
SELECT 1 from dual  WHERE NULL = NULL;

The where will return a row only if the condition evaluates to true which UNKNOWN isn't.



So let's demonstrate this with some code. I've commented it heavily so it can be self explaining.

I've also shown a parser bug in the IN Query. Well... it's a bug if you ask me. It shouldn't behave like that.



------------------------------------------------------------------
-- Prepare tables and data
------------------------------------------------------------------
CREATE TABLE t1 (id number, title VARCHAR(20), somenumCol number);

CREATE TABLE t2 (id number, t1Id number, someData VARCHAR(20));


INSERT INTO t1
SELECT 1, 'title 1', 5 from dual UNION ALL
SELECT 2, 'title 2', 5 from dual UNION ALL
SELECT 3, 'title 3', 5 from dual UNION ALL
SELECT 4, 'title 4', 5 from dual UNION ALL
SELECT null, 'title 5', 5 from dual UNION ALL
SELECT null, 'title 6', 5 from dual;

INSERT INTO t2
SELECT 1, 1, 'data 1' from dual UNION ALL
SELECT 2, 1, 'data 2' from dual UNION ALL
SELECT 3, 2, 'data 3' from dual UNION ALL
SELECT 4, 3, 'data 4' from dual UNION ALL
SELECT 5, 3, 'data 5' from dual UNION ALL
SELECT 6, 3, 'data 6' from dual UNION ALL
SELECT 7, 4, 'data 7' from dual UNION ALL
SELECT 8, null, 'data 8' from dual UNION ALL
SELECT 9, 6, 'data 9' from dual UNION ALL
SELECT 10, 6, 'data 10' from dual UNION ALL
SELECT 11, 8, 'data 11' from dual ;

------------------------------------------------------------------
-- we want to get all data in t1 that has a child row in t2
------------------------------------------------------------------

-- join gives us more rows than we need, because it joins to every child row

SELECT    t1.*
FROM    t1
        JOIN t2 ON t1.id = t2.t1Id  ;


1    title 1    5
1    title 1    5
2    title 2    5
3    title 3    5
3    title 3    5
3    title 3    5
4    title 4    5  

 
-- distinct would solve that but it's not pretty nor efficient
SELECT    DISTINCT t1.*
FROM    t1
        JOIN t2 ON t1.id = t2.t1Id  ;


4    title 4    5
2    title 2    5
3    title 3    5
1    title 1    5


-- now this is a weird part where somenumCol is a column in t1
-- but the parser doesn't seem to mind that

 
SELECT    t1.*
FROM    t1
WHERE    t1.id IN (SELECT somenumCol FROM t2) ;


NO Rows Returned

-- here in and exists both get correct results
SELECT    t1.*
FROM    t1
WHERE    t1.id IN (SELECT t1id FROM t2);


1    title 1    5
2    title 2    5
3    title 3    5
4    title 4    5


SELECT    t1.*
FROM    t1
WHERE    exists (SELECT * FROM t2 WHERE t1.id = t2.t1id) ;

1    title 1    5
2    title 2    5
3    title 3    5
4    title 4    5
------------------------------------------------------------------
-- we want to get all data in t1 that doesn't have a child row in t2
------------------------------------------------------------------

-- join gives us the correct result

SELECT    t1.*
FROM    t1
        LEFT JOIN t2 ON t1.id = t2.t1Id
WHERE    t2.id IS NULL  ;


    title 6    5
    title 5    5


-- IN doesn't get correct results.
-- That's because of how IN treats NULLs and the Three-valued logic
-- NULL is treated as an unknown, so if there's a null in the t2.t1id
-- NOT IN will return either NOT TRUE or NOT UNKNOWN. And neither can be TRUE.
-- when there's a NULL in the t1id column of the t2 table the NOT IN query will always return an empty set. 


SELECT    t1.*
FROM    t1
WHERE    t1.id NOT IN (SELECT t1id FROM t2)  ;


NO Rows Returned 
 
-- NOT EXISTS gets correct results

SELECT    t1.*
FROM    t1
WHERE    NOT EXISTS (SELECT * FROM t2 WHERE t1.id = t2.t1id);


    title 6    5
    title 5    5

We can see that it's best to use EXISTS because it always behaves as the user would think it does.