[Dovecot] (Single instance) attachment storage

Ed W lists at wildgooses.com
Tue Aug 24 15:42:34 EEST 2010


  Hi

> The idea is to have dbox and mdbox support saving attachments (or MIME
> parts in general) to separate files, which with some magic gives a
> possibility to do single instance attachment storage. Comments welcome.

This is a really interesting idea.  I have previously given it some 
thought.  My 2p

1) Being able to ask "the server" if it has an attachment matching a 
specific hash would be useful for a bunch of other reasons. This result 
needs to be (cryptographically) unique and hence the hash needs to be a 
good hash (MD5/SHA or better) of the complete attachment, ideally after 
decoding.

2) It might be useful to be able to find attachments with a specific 
hash regardless of whether the attachment has been spat out separately 
(think of a use case where we want to be able to spot a 2KB footer gif 
which on its own isn't worth worrying about, but some offline scan 
later discovers 90% of emails contain this gif and we wish to split it 
off as a policy decision).

3) Storing attachments by hash may be interesting for use with 
specialist filesystems, e.g. an interesting direction that dbox could take 
might be to store the headers and message text in some (compressed?) 
format with high linear read rates and most attachments in some 
key/value storage system?

4) Many modern IMAP clients are starting to download attachments on 
demand. Need to be able to supply only parts of the email efficiently 
without needing to pull in the blobs.  Stated another way, it's 
desirable not to peek inside the blobs to be able to fetch arbitrary 
MIME parts.

5) It's going to be easy to break signed emails...  Need to be careful

6) In many cases this isn't a performance win... It's still a *great* 
feature, but two disk seeks outweigh a lot of linear read speed.

7) When something gets corrupted... It's worth pondering how we 
can audit and find unreferenced "blobs" later?
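
As a rough sketch of what I mean in 1) and 2), hashing each MIME part after 
decoding its transfer encoding could look something like the following. 
This is Python using only the standard library, not anything from the 
proposal; the function name attachment_hashes is made up for illustration 
and SHA-256 is just my assumption of a "good hash".

```python
import hashlib
from email import message_from_bytes
from typing import List, Tuple


def attachment_hashes(raw_message: bytes) -> List[Tuple[str, str]]:
    """Return (content type, SHA-256 hex digest) for every leaf MIME part.

    The digest is taken over the decoded payload, so the same file sent
    as base64 in one mail and quoted-printable in another hashes the same.
    """
    msg = message_from_bytes(raw_message)
    results = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        # decode=True undoes the Content-Transfer-Encoding (base64, QP, ...)
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        digest = hashlib.sha256(payload).hexdigest()
        results.append((part.get_content_type(), digest))
    return results
```

Note this hashes every leaf part, including the message text, so a real 
policy would presumably filter by size or content type first.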


Some of the use cases I have for these features (just in case you 
care...).  We have a feature which is a bit like the opposite of one of 
these services for sending big attachments.  When a user's email arrives we 
remove all attachments that meet our criteria and replace them with 
links to the files.  This requires being able to give users a coded link 
which can later be decoded to refer to a specific attachment.  If this 
change offered us additional ways to find attachments by hash or 
whatever then it would be extremely useful.
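
To illustrate the "coded link" idea if attachments were addressable by 
hash, here is a minimal sketch: the token carries the attachment's SHA-256 
digest plus an HMAC so it can't be guessed or altered. This is not how our 
current system works; SECRET_KEY, make_token and resolve_token are 
hypothetical names, and the token layout is purely an assumption.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical server-side secret


def make_token(attachment_sha256_hex: str) -> str:
    """Encode an attachment digest plus a MAC as a URL-safe token."""
    digest = bytes.fromhex(attachment_sha256_hex)
    mac = hmac.new(SECRET_KEY, digest, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest + mac).decode("ascii").rstrip("=")


def resolve_token(token: str) -> Optional[str]:
    """Return the attachment digest (hex) if the token is valid, else None."""
    padded = token + "=" * (-len(token) % 4)  # restore stripped padding
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    if len(raw) != 64:  # 32-byte digest + 32-byte MAC
        return None
    digest, mac = raw[:32], raw[32:]
    expected = hmac.new(SECRET_KEY, digest, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None
    return digest.hex()
```

The decoded digest could then be used directly as the lookup key, provided 
the server can find an attachment by hash.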

Another feature we offer is a client application which compresses and 
reduces bandwidth when sending/receiving emails.  We currently don't try 
to hash bits of email, but it's an idea I have been mulling over for 
IMAP users where we typically see the data sent via SMTP, then uploaded 
to the IMAP "sent items", then often downloaded again when the client 
polls the sent items for new messages (durr).  Being able to see if we 
have binary content which matches a specific hash could be extremely 
interesting.


I'm not sure if with your current proposal I can do 100% of the above?  
For example it's not clear if 4) is still possible?  Also without a 
"guaranteed" hash we can't use the hash as a lookup key in a key/value 
storage system (which implies another mapping of keys to keys is 
required). Can we do an (efficient) offline scan of messages looking for 
duplicated hash keys (i.e. can the server calculate hashes for all 
attachment parts ahead of time)?
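
For what it's worth, the kind of offline scan I have in mind is roughly the 
following. It is only a sketch in Python over raw message bytes, not 
anything dbox-specific; find_duplicate_parts is a made-up name, and a real 
implementation would presumably read precomputed hashes rather than 
re-parse every message.

```python
import hashlib
from collections import defaultdict
from email import message_from_bytes
from typing import Dict, Iterable, List, Tuple


def find_duplicate_parts(
    messages: Iterable[Tuple[str, bytes]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each decoded-part hash seen more than once to its locations.

    `messages` yields (message id, raw message bytes). Each location is
    (message id, index of the part in walk order).
    """
    seen = defaultdict(list)
    for msg_id, raw in messages:
        msg = message_from_bytes(raw)
        for index, part in enumerate(msg.walk()):
            if part.is_multipart():
                continue
            payload = part.get_payload(decode=True)
            if not payload:
                continue  # skip empty parts
            key = hashlib.sha256(payload).hexdigest()
            seen[key].append((msg_id, index))
    # Keep only hashes that occur in more than one place
    return {h: refs for h, refs in seen.items() if len(refs) > 1}
```

The result tells you which blobs (like the footer gif in 2) would be worth 
splitting off, and where they currently live.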

Sounds extremely interesting.  Look forward to seeing this develop!

Cheers

Ed W

