sql - Aggregating rows when selecting distinct rows with "bad" data -
the title bit vague, let me tell want , show what's wrong data.
i need sum cell values (sizes) of unique rows in table. data sort-of "bad". working mapi , message size returned mapi can differ non-deterministically same message content.
i have 2 similar tables/tasks: messages , attachments. it's easier deal attachments, since sha1 hash correlates attachment size. attachment data looks (tab-delimited, excel-ready):
hash size 0x0015a93fffb1726e6647f94f47f4998ac699a455 97 0x0015a93fffb1726e6647f94f47f4998ac699a455 97 0x0020ce1810b56cc17f6cfd0a0b6121592825f7e5 85 0x0086077dd1e4af6a38014ab505105b05f8f5311e 62 0x0086077dd1e4af6a38014ab505105b05f8f5311e 62 0x00a6510eedb20a27b00b23416cf755715647a351 85 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777 0x011c9258a0e16e25662e8f3bee8a1462c175c117 87777
this query use total size of duplicate attachments on table:
select sum(at.sz) size ( select distinct hash, sz = size * ((sum(1) on (partition hash)) - 1) attachment ) @
(i subtracting 1 count of attachments obtain true duplicate count: total attachments – 1 = # of dups attachment)
now, real trouble messages, since mapi reports message size hidden crap , message size identical content + attachments can different (notice messages same hash, size differs):
hash size 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3248 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3248 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3260 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3260 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3264 0x001b05f4d041eb0e61addea45a1f7afad5c088ab 3266 0x00510d4b3a9d42c8d8e236950b7fbbd5fb716698 12040 0x00d92c8eb7e718ef418bc9d843d760fd1f4b208b 3657 0x00d92c8eb7e718ef418bc9d843d760fd1f4b208b 3659 0x00d92c8eb7e718ef418bc9d843d760fd1f4b208b 3671 0x011683250cc8d779ec3f33c23209c26e27e25e73 5143 0x016d424b6b4b7ddb4856d1660ecbfebdd3ecacca 3531 0x016d424b6b4b7ddb4856d1660ecbfebdd3ecacca 3543 0x016d424b6b4b7ddb4856d1660ecbfebdd3ecacca 3549 0x016d424b6b4b7ddb4856d1660ecbfebdd3ecacca 3559
with data this, query above fails return sensible, since count of distinct pairs not = 1.
i fine choosing 1 of sizes same hash , use in summation.
with hashdups ( select hash, (count(*)-1)*max(size) size messages group hash ) select sum(size) size hashdups
the same construct of course work attachments.
i don't understand varying size issue. perhaps following more accurate. assumes reported size each "duplicate" real, (how can duplicates have different size?). can figure out min , max duplicate size each hash value subtracting out min , max sizes total sum. can compute min , max total dup size. final true value should lie within 2 extremes.
with hashdups ( select hash, sum(size)-max(size) mindupsize, sum(size)-min(size) maxdupsize mapi group hash ) select sum(minsupsize) mindupsize, sum(maxdupsize) maxdupsize hashdups
Comments
Post a Comment