[Dovecot] Duplicate Attachments....

Charles Marcus CMarcus at Media-Brokers.com
Thu Jun 1 16:18:16 EEST 2006


Timo Sirainen wrote:
> On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
>> I have been looking for a good, open source imap server that
>> doesn't store multiple copies of the same attachment - but instead,
>> stores a checksum, and whenever a message is stored with a
>> duplicate attachment, the attachment is stored only once, and
>> simply referenced by some kind of link to other emails.

> This is planned for dbox format in maybe a couple of months. I think the
> plan was to do this in deliver agent so that the delivered mail's
> attachment is shared between the mail's recipients.

Very good to hear! Were you planning to support this with both dbox 
storage options ('one mail per file' and 'multiple mails per file')?

> I'm not sure if you're suggesting that checksum should be taken from the
> attachment and it be used to see if it already happens to exist, and if
> so use it. Actually I'm not sure if that was also what I was supposed to
> do anyway. :)

That is the way I had imagined it working - but of course, what is 
possible in my imagination and what is possible in reality almost always 
collide head on with a resulting explosion on a par with a supernova... ;)

> I think that could anyway be a good idea, but how about hash collisions?
> I could just ignore that since they would practically never happen. Hash
> + attachment size would be even safer.

Sounds great to me. I cannot 'imagine' the odds of both a hash collision 
AND an exact duplicate size at the same time, but there goes my 
imagination again...
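
To make the hash + size idea concrete, here is a minimal sketch. It assumes SHA-256 as the checksum and a simple in-memory index; the names `attachment_key`, `AttachmentStore` and `verify_bytes` are hypothetical and are not part of Dovecot or of anything Timo has described. The optional byte-by-byte check is included as a switch, since that was suggested as a possible option.

```python
import hashlib


def attachment_key(data: bytes) -> tuple:
    """Return (hex digest, size) for an attachment body."""
    return hashlib.sha256(data).hexdigest(), len(data)


class AttachmentStore:
    """Hypothetical in-memory store keeping one copy per (hash, size) key."""

    def __init__(self, verify_bytes: bool = False):
        self.blobs = {}      # key -> attachment bytes, stored once
        self.refcount = {}   # key -> number of mails referencing it
        self.verify_bytes = verify_bytes

    def add(self, data: bytes) -> tuple:
        key = attachment_key(data)
        existing = self.blobs.get(key)
        if existing is not None:
            # Optional paranoid check: compare the full contents.
            if self.verify_bytes and existing != data:
                raise ValueError("hash and size match but contents differ")
            self.refcount[key] += 1
            return key
        self.blobs[key] = data
        self.refcount[key] = 1
        return key
```

Each mail would then keep only the returned key as its link to the attachment, and a copy could be deleted once its reference count drops to zero.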

> The only truly safe way would be to read the whole attachment from
> disk and compare it byte-by-byte, but that'd just slow it down
> unnecessarily. Perhaps it should be an option.

As one who likes options: if this isn't that hard to do, then yes. Maybe 
the byte-by-byte comparison could even run as a background process, or as 
a nightly 'clean-up' job.

For example - store the attachments individually when they first come 
in, then every night at 3:00am, do a precise comparison on all of the 
attachments that came in that day and delete_duplicate->add_link on all 
duplicates found.
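
A rough sketch of what such a nightly pass might look like follows. It assumes attachments are stored as individual files under one directory tree on a filesystem that supports hard links, and that stored attachments are never modified in place. The function names are hypothetical, and scheduling (cron or otherwise) is left out.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_directory(root):
    """Replace byte-identical files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    # Pass 1: group candidate files by (size, digest).
    candidates = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), file_digest(path))
            candidates[key].append(path)

    # Pass 2: the precise comparison, then delete_duplicate -> add_link.
    replaced = 0
    for paths in candidates.values():
        keepers = []  # distinct contents seen so far for this key
        for path in paths:
            for keeper in keepers:
                if os.path.samefile(keeper, path):
                    break  # already linked by an earlier run
                if filecmp.cmp(keeper, path, shallow=False):
                    tmp = path + ".dedupe-tmp"
                    os.link(keeper, tmp)    # add_link
                    os.replace(tmp, path)   # delete_duplicate
                    replaced += 1
                    break
            else:
                keepers.append(path)
    return replaced
```

Linking to a temporary name and then renaming it over the duplicate means the attachment path never disappears, even if the job is interrupted partway through.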

This tool could also be extended and used as a 'conversion' tool, to run 
on an existing mailstore.

Wow, now I'm getting excited, imagining our current 150GB+ storage being 
reduced to 1GB or less... !!!

-- 

Best regards,

Charles

