[Dovecot] (Single instance) attachment storage

Timo Sirainen tss at iki.fi
Tue Aug 24 17:14:44 EEST 2010


On Tue, 2010-08-24 at 13:42 +0100, Ed W wrote:
> Hi
> 
> > The idea is to have dbox and mdbox support saving attachments (or MIME
> > parts in general) to separate files, which with some magic gives a
> > possibility to do single instance attachment storage. Comments welcome.
> 
> This is a really interesting idea.  I have previously given it some 
> thought.  My 2p
> 
> 1) Being able to ask "the server" if it has an attachment matching a 
> specific hash would be useful for a bunch of other reasons. 

If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see
if it exists with:

ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
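
The same lookup can be sketched in Python. This assumes the layout is exactly what the example path suggests (the first two pairs of hex digits become two directory levels, followed by a "hashes" directory); the function names and the "/attachments" default root are only illustrative.

```python
import os


def hash_lookup_path(hex_hash, root="/attachments"):
    """Build the path where an attachment with this hash would be linked.

    Assumed layout (inferred from the example path only):
    <root>/<hex digits 1-2>/<hex digits 3-4>/hashes/<full hash>
    """
    hex_hash = hex_hash.lower()
    return os.path.join(root, hex_hash[0:2], hex_hash[2:4], "hashes", hex_hash)


def attachment_exists(hex_hash, root="/attachments"):
    """Return True if a file for this hash exists under root."""
    return os.path.exists(hash_lookup_path(hex_hash, root))
```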

> This result 
> needs to be (cryptographically) unique and hence the hash needs to be a 
> good hash (MD5/SHA or better) of the complete attachment, 

Currently it uses SHA1, but this can be changed anytime. I didn't bother
to make it configurable. The hash's security isn't a huge issue since it
does byte-by-byte comparison anyway.
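
A rough sketch of why the hash strength matters less here: the hash only selects a candidate, and the byte-by-byte comparison makes the final decision. This is not Dovecot's code; the helper names are made up for illustration.

```python
import filecmp
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_attachment(new_path, stored_path):
    """Treat two files as the same attachment only if the bytes match.

    The hash is just a cheap first filter; a collision is caught by the
    full comparison (shallow=False forces reading the contents).
    """
    if sha1_of_file(new_path) != sha1_of_file(stored_path):
        return False
    return filecmp.cmp(new_path, stored_path, shallow=False)
```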

> ideally after decoding

The hash is calculated after decoding base64, if the attachment is saved
decoded, and that happens only if it can be re-encoded exactly as it was.
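
The "can be re-encoded exactly" condition could be checked roughly like this. It is only a sketch: it assumes a fixed line length and newline style are enough to reproduce the original encoding, which may not match the checks Dovecot really does. The function name and parameters are hypothetical.

```python
import base64


def decode_if_reversible(encoded, line_len=76, newline=b"\r\n"):
    """Return the decoded bytes if re-encoding reproduces `encoded` exactly.

    Returns None when the input cannot be rebuilt byte for byte, in which
    case the part would have to be stored in its original encoded form.
    Assumption: the original used fixed-length lines, each ending in newline.
    """
    try:
        raw = base64.b64decode(encoded)
    except ValueError:
        return None
    b64 = base64.b64encode(raw)
    lines = [b64[i:i + line_len] for i in range(0, len(b64), line_len)]
    rebuilt = newline.join(lines) + (newline if lines else b"")
    return raw if rebuilt == encoded else None
```

Anything with irregular line lengths or stray whitespace fails the round trip and comes back as None.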

> 2) It might be useful to be able to find attachments with a specific 
> hash regardless of whether the attachment has been spat out separately 
> (think of a use case where we want to be able to spot a 2KB footer gif 
> which on its own isn't worth worrying about, but some offline scan 
> later discovers 90% of emails contain this gif and we wish to split it 
> off as a policy decision).

I guess that would be possible, but it would require reading and parsing
all of the mail files. That could take a while. The finding part
wouldn't be all that much work, but separating attachments out of
already saved mails is kind of annoying.

> 3) Storing attachments by hash may be interesting for use with 
> specialist filesystems, e.g. an interesting direction that dbox could take 
> might be to store the headers and message text in some (compressed?) 
> format with high linear read rates and most attachments in some 
> key/value storage system?

The attachment I/O is done via the filesystem API, so this would be easily
possible by just writing an FS API backend for a key-value database.

> 4) Many modern IMAP clients are starting to download attachments on 
> demand. Need to be able to supply only parts of the email efficiently 
> without needing to pull in the blobs.  Stated another way, it's 
> desirable not to peek inside the blobs to be able to fetch arbitrary 
> mime parts

This is already done .. in theory anyway. I'm not sure yet if some
prefetching code causes the attachments to be read unnecessarily. Should
test it.

> 5) It's going to be easy to break signed emails...  Need to be careful

Yeah, I wasn't planning on breaking them.

> 6) In many cases this isn't a performance win... It's still a *great* 
> feature, but two disk seeks outweigh a lot of linear read speed.

Sure, not a performance win. But that's not what it was meant for. :)
But if only >1MB (or so) attachments were stored separately that should
get rid of the worst offenders without impacting performance much.

> 7) When something gets corrupted... It's worth pondering about how we 
> can audit and find unreferenced "blobs" later?

Dovecot logs an error when it finds something unexpected. But there's
not a whole lot it can do then. And finding such broken attachments ..
well, I guess this'll already do it:

doveadm fetch -A body all > /dev/null

> Some of the use cases I have for these features (just in case you 
> care...).  We have a feature which is a bit like the opposite of one of 
> these services for sending big attachments.  When a user's email arrives we 
> remove all attachments that meet our criteria and replace them with 
> links to the files.  This requires being able to give users a coded link 
> which can later be decoded to refer to a specific attachment.  If this 
> change offered us additional ways to find attachments by hash or 
> whatever then it would be extremely useful

I'm not sure if this change will help much. If the attachment changes
(especially in size) there will be problems..

> Another feature we offer is a client application which compresses and 
> reduces bandwidth when sending/receiving emails.  We currently don't try 
> and hash bits of email, but it's an idea I have been mulling over for 
> IMAP users where we typically see the data sent via SMTP, then uploaded 
> to the imap "sent items", then often downloaded again when the client 
> polls the sent items for new messages (durr).  Being able to see if we 
> have binary content which matches a specific hash could be extremely 
> interesting

Related to that, I've been thinking of a transparent caching Dovecot
proxy.

> I'm not sure if with your current proposal I can do 100% of the above?  
> For example it's not clear if 4) is still possible?  Also without a 
> "guaranteed" hash we can't use the hash as a lookup key in a key/value 
> storage system (which implies another mapping of keys to keys is 
> required). 

Yeah, attachment-instance-key -> attachment-key -> attachment data
lookup would be the only safe way to do this.
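
As an illustration of that two-step lookup, here is a toy in-memory version. The class, the key format and the counter suffix are all invented for the example; a real key-value backend would look different.

```python
class AttachmentStore:
    """Toy two-level mapping:
    attachment-instance-key -> attachment-key -> attachment data.
    """

    def __init__(self):
        self.instance_to_key = {}
        self.key_to_data = {}

    def store(self, instance_key, hex_hash, data):
        """Save data for one instance, sharing storage with identical data.

        The attachment-key is "<hash>-<n>" (hypothetical format). n only
        grows past 0 if two different contents share the same hash.
        """
        n = 0
        while True:
            key = f"{hex_hash}-{n}"
            existing = self.key_to_data.get(key)
            if existing is None:
                self.key_to_data[key] = data
                break
            if existing == data:
                break
            n += 1
        self.instance_to_key[instance_key] = key
        return key

    def load(self, instance_key):
        """Follow both mappings to get the attachment data."""
        return self.key_to_data[self.instance_to_key[instance_key]]
```

Because the hash alone is never used as the final key, a hash collision costs an extra entry instead of returning the wrong data.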

> Can we do an (efficient) offline scan of messages looking for 
> duplicated hash keys (ie can the server calculate hashes for all 
> attachment parts ahead of time)

Well .. the way it works is that you have files:

hash-guid
hash2-guid2
hashes/hash
hashes/hash2

If two attachments have the same hash but different content, you'll end
up with:

hash-guid1
hash-guid2
hashes/hash

Where hash-guid1 and hash-guid2 are different files, and only one of
them is hard linked to hashes/hash. To find duplicates, you can stat()
all files and find which have identical hash but different inode.
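
The stat() scan could look something like the sketch below. It assumes the attachment files sit directly in one directory, are named "<hash>-<guid>", and that the hash itself contains no "-"; the function name is made up.

```python
import os
from collections import defaultdict


def find_hash_collisions(directory):
    """Return {hash: [file names]} for hashes that map to several inodes.

    Files with the same hash and the same inode are hard links to one
    stored attachment; the same hash with different inodes means the
    contents differed when they were saved.
    """
    names_by_hash = defaultdict(lambda: defaultdict(list))
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip the "hashes" subdirectory and anything that is not a file.
            if not entry.is_file(follow_symlinks=False):
                continue
            hex_hash, sep, _guid = entry.name.partition("-")
            if not sep:
                continue
            inode = os.stat(entry.path).st_ino
            names_by_hash[hex_hash][inode].append(entry.name)

    collisions = {}
    for hex_hash, by_inode in names_by_hash.items():
        if len(by_inode) > 1:
            names = [n for group in by_inode.values() for n in group]
            collisions[hex_hash] = sorted(names)
    return collisions
```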


