java - Query two related tables (Joins) -
this first table in hive- contains information item purchasing.
create external table if not exists table1 (this main table through comparisons need made) ( item_id bigint, created_time string, buyer_id bigint )
and data in above first table
**item_id** **created_time** **buyer_id** 220003038067 2012-06-21 1015826235 300003861266 2012-06-21 1015826235 140002997245 2012-06-14 1015826235 200002448035 2012-06-08 1015826235 260003553381 2012-06-07 1015826235
this second table in hive- contains information items purchasing.
create external table if not exists table2 ( user_id bigint, purchased_item array<struct<product_id: bigint,timestamps:string>> )
and data in above table-
**user_id** **purchased_item** 1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"}, {"product_id":300003861266,"timestamps":"1340271857000"}, {"product_id":140002997245,"timestamps":"1339694926000"}, {"product_id":200002448035,"timestamps":"1339172659000"}, {"product_id":260003553381,"timestamps":"1339072514000"}]
i have reduced data 1 buyer_id(user_id) make problem simple understand.
problem statement-
i need compare table2
table1
, means need see if user_id
table2
, buyer_id
table1
(as both same thing) gets matched, purchased_item
in table2 array of product_id(same item_id) , timestamps(same created_time) should same item_id
, created_time
in table1
particular user_id(buyer_id) , possible (means purchased_item
, item_id
, created_time
) not same or product_id , timestamps missing table2
after comparing table1
.
by mean count of product_id
, timestamps
in table2
should same count of item_id
, created_time
in table1 particular buyer_id(user_id) , content should same. if not same or entry missing table2
, need print result, particular item_id
, created_time
missing table2
or product_id
, timestamps
not same after comparing table1
.
so example in table1 buyer_id 1015826235
have 5 item_id
, 5 created_time
, in table2 should have 5 product_id
, 5 timestamps
same table1 same user_id(buyer_id)
in 1 row. if not same or entry missing need print result showing missing or data wrong.
so make more clear-
purchased_item
array of struct in table2
, contains 2 things product_id
, timestamps
.
if user_id
, buyer_id
gets matched product_id
in table2
should matched item_id
in table1
, timestamps
in table2
should matched created_time
in table1
.
updated
hiveql sql query question:-
q 1) find user_id table2 product_id or timestamp not same item_id or created_time after comparing table1 on buyer_id.
query wrote first question. query right?
a 1) select table2.user_id table2 table1.user_id = table2.buyer_id , (table1.item_id <> table2.product_id or unix_timestamp(table1.created_time) <> table2.timestamps) q 2) find `buyer_id(user_id)` , `item_id` , `created_time` missing `table2` after comparing `table1` on `buyer_id`. 2) not sure.
i suggest not use "string" data type created_time , timestamp because makes comparisons harder. instead of use date or timestamp.
and question: think big problem here using strings alone!
i'm oracle user there should in hive:
to_date({string},{format})
as used
unix_timestamp({string})
another thing: when have strucs, shouls address fields this: table2.purchased_item[{address}].product_id , not table2.product_id unknown.
and 1 more suggestion:
trunc({date},{format ex: 'ss' sseconds})
when created_time , time_stamp not in same time ticks(may 0.001 seconds difference because of difference insert time if insert or sysdate each of them) better truncate date seconds or milli-seconds or whatever think better.
one more thing: use nvl() or convert null values here well, becuase if have such problems possible have null values in table causes problems in queries, nvl() function convert null like.
hope helps.
Comments
Post a Comment