Zimbra Next Generation Modules/Zimbra NG HSM/Item Deduplication

Revision as of 13:10, 30 August 2017 by Jorge de la Cruz (talk | contribs) (1 revision imported: Zimbra NG)



Zimbra NG HSM - Item Deduplication

What is Item Deduplication

Item Deduplication is a technique that saves disk space by storing a single copy of an item and referencing it multiple times, instead of storing multiple copies of the same item and referencing each copy only once.

This might seem a minor improvement in theory, but in practice it can make a huge difference. Think of the user who sends an unnecessary 15 MB "motivational" or "funny" presentation to a hundred-odd recipients, all listed in the "To:" field: without deduplication, every recipient's copy of that message is stored separately.

Item Deduplication in Zimbra

Item Deduplication is performed by Zimbra at the moment of storing a new item in the Primary Volume.

When a new item is created, its Message-ID is compared against a list of cached items. On a match, a hard link to the cached message's BLOB is created instead of a whole new BLOB for the message.
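The hard-link mechanism described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Zimbra's implementation: the `store_item` function, the plain dictionary used as a cache, and the file naming scheme are all assumptions made for the example.

```python
import os
import tempfile


def store_item(volume_dir, cache, message_id, mailbox_id, data):
    """Store a BLOB for mailbox_id (hypothetical helper, not a Zimbra API).

    If the Message-ID is already cached, create a hard link to the cached
    BLOB instead of writing the data a second time.
    """
    index = len(os.listdir(volume_dir))
    target = os.path.join(volume_dir, f"{mailbox_id}-{index}.msg")
    cached_path = cache.get(message_id)
    if cached_path is not None and os.path.exists(cached_path):
        # New directory entry pointing at the same inode: no extra data blocks.
        os.link(cached_path, target)
    else:
        with open(target, "wb") as fh:
            fh.write(data)
        cache[message_id] = target
    return target


with tempfile.TemporaryDirectory() as volume:
    cache = {}
    body = b"x" * 1024
    first = store_item(volume, cache, "<abc@example.com>", "mbox1", body)
    second = store_item(volume, cache, "<abc@example.com>", "mbox2", body)
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink
    print(same_inode, link_count)
```

Both paths end up pointing at the same inode, so the message body occupies disk space only once no matter how many mailboxes reference it.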

The dedupe cache is managed in Zimbra 8 through the following config attributes:

zimbraPrefDedupeMessagesSentToSelf

Used to set the deduplication behaviour for sent-to-self messages.

<attr id="144" name="zimbraPrefDedupeMessagesSentToSelf" type="enum" value="dedupeNone,secondCopyifOnToOrCC,dedupeAll" cardinality="single" 
optionalIn="account,cos" flags="accountInherited,domainAdminModifiable">
  <defaultCOSValue>dedupeNone</defaultCOSValue>
  <desc>dedupeNone|secondCopyIfOnToOrCC|moveSentMessageToInbox|dedupeAll</desc>
</attr>

zimbraMessageIdDedupeCacheSize

Number of cached Message IDs.

<attr id="334" name="zimbraMessageIdDedupeCacheSize" type="integer" cardinality="single" optionalIn="globalConfig" min="0">
  <globalConfigValue>3000</globalConfigValue>
  <desc>
    Number of Message-Id header values to keep in the LMTP dedupe cache.
    Subsequent attempts to deliver a message with a matching Message-Id
    to the same mailbox will be ignored.  A value of 0 disables deduping.
  </desc>
</attr>

zimbraPrefMessageIdDedupingEnabled

Enables or disables deduplication at the account or COS level.

<attr id="1198" name="zimbraPrefMessageIdDedupingEnabled" type="boolean" cardinality="single" optionalIn="account,cos" flags="accountInherited"
 since="8.0.0">
  <defaultCOSValue>TRUE</defaultCOSValue>
  <desc>
    Account-level switch that enables message deduping.  See zimbraMessageIdDedupeCacheSize for more details.
  </desc>
</attr>

zimbraMessageIdDedupeCacheTimeout

Timeout for each entry in the dedupe cache.

<attr id="1340" name="zimbraMessageIdDedupeCacheTimeout" type="duration" cardinality="single" optionalIn="globalConfig" since="7.1.4">
  <globalConfigValue>0</globalConfigValue>
  <desc>
    Timeout for a Message-Id entry in the LMTP dedupe cache. A value of 0 indicates no timeout.
    zimbraMessageIdDedupeCacheSize limit is ignored when this is set to a non-zero value.
  </desc>
</attr>

Older Zimbra versions might use different attributes or lack some of them.
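To make the interaction between zimbraMessageIdDedupeCacheSize and zimbraMessageIdDedupeCacheTimeout more concrete, here is a minimal Python model of the behaviour the attribute descriptions state: a size of 0 disables deduping, and the size limit is ignored when a non-zero timeout is set. The class name, its methods, and the eviction order are assumptions for illustration and do not reflect Zimbra's actual code.

```python
import time
from collections import OrderedDict


class MessageIdDedupeCache:
    """Illustrative model of an LMTP dedupe cache (hypothetical, not Zimbra code)."""

    def __init__(self, max_size=3000, timeout_seconds=0, clock=time.monotonic):
        self.max_size = max_size        # models zimbraMessageIdDedupeCacheSize
        self.timeout = timeout_seconds  # models zimbraMessageIdDedupeCacheTimeout
        self.clock = clock
        self._entries = OrderedDict()   # (mailbox, message_id) -> time first seen

    def is_duplicate(self, mailbox, message_id):
        """Return True if this delivery was seen before; otherwise record it."""
        if self.max_size == 0:
            # Assumption: a size of 0 disables deduping even if a timeout is set.
            return False
        now = self.clock()
        if self.timeout > 0:
            # Entries are kept in insertion order, so the oldest is always first.
            while self._entries:
                _key, seen = next(iter(self._entries.items()))
                if now - seen < self.timeout:
                    break
                self._entries.popitem(last=False)
        key = (mailbox, message_id)
        if key in self._entries:
            return True
        self._entries[key] = now
        # The size limit only applies when no timeout is configured.
        if self.timeout == 0 and len(self._entries) > self.max_size:
            self._entries.popitem(last=False)
        return False


cache = MessageIdDedupeCache(max_size=2)
print(cache.is_duplicate("mbox1", "<abc@example.com>"))  # first delivery
print(cache.is_duplicate("mbox1", "<abc@example.com>"))  # repeated delivery
```

Note that the key includes the mailbox: as the attribute description says, only a repeated delivery to the same mailbox is treated as a duplicate.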

Item Deduplication and Zimbra NG HSM

The Zimbra NG HSM module features a "doDeduplicate" operation that parses a target volume to find and deduplicate any duplicated item.

This saves even more disk space: Zimbra's automatic deduplication is bound to a limited cache, whereas Zimbra NG HSM deduplication finds and takes care of multiple copies of the same email regardless of any cache or timing.

Running the "doDeduplicate" operation is also strongly recommended after a migration or a large data import, in order to optimize your storage usage.
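As a rough picture of what a cache-independent, volume-wide pass involves, the following Python sketch walks a directory, groups files by content digest, and replaces later copies with hard links to the first one. It is a simplified stand-in, not the module's algorithm: the function names are hypothetical, SHA-256 is an assumed digest (this page does not say which digest Zimbra uses), and the `dry_run` flag only mimics the idea of the CLI parameter of the same name.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file (digest choice is an assumption)."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate_volume(root, dry_run=False):
    """Hard-link files with identical content; return (deduplicated, bytes_saved)."""
    first_seen = {}  # digest -> path of the copy that is kept
    deduplicated = 0
    bytes_saved = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already a hard link to it
            deduplicated += 1
            bytes_saved += os.path.getsize(path)
            if not dry_run:
                tmp = path + ".dedupe-tmp"
                os.link(keeper, tmp)   # link the kept copy under a temporary name
                os.replace(tmp, path)  # then swap it in place of the duplicate
    return deduplicated, bytes_saved


with tempfile.TemporaryDirectory() as root:
    for name, body in [("a.msg", b"same"), ("b.msg", b"same"), ("c.msg", b"other")]:
        with open(os.path.join(root, name), "wb") as fh:
            fh.write(body)
    preview = deduplicate_volume(root, dry_run=True)
    first_run = deduplicate_volume(root)
    second_run = deduplicate_volume(root)
    print(preview, first_run, second_run)
```

A second run finds nothing left to do, because the duplicates are already hard links to the kept copy. This is the same distinction the operation's statistics draw between newly deduplicated and already deduplicated BLOBs.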

Running a Volume Deduplication

Via the Zimbra Next Generation Modules Administration Zimlet

To run a volume deduplication via the Zimbra Next Generation Modules Administration Zimlet, click on the "Zimbra NG HSM" tab, select the volume you wish to deduplicate, and press the "Deduplicate" button.


Via the Zimbra Next Generation Modules CLI

zimbra@mailserver:~$ zxsuite powerstore doDeduplicate

command doDeduplicate requires more parameters

Syntax:
   zxsuite powerstore doDeduplicate {volume_name} [attr1 value1 [attr2 value2...]]

PARAMETER LIST

NAME              TYPE           EXPECTED VALUES    DEFAULT
volume_name(M)    String[,..]                       
dry_run(O)        Boolean        true|false         false

(M) == mandatory parameter, (O) == optional parameter

Usage example:

zxsuite powerstore dodeduplicate secondvolume
Starts a deduplication on volume secondvolume

To list all available volumes, you can use the `zxsuite powerstore getAllVolumes` command.


"doDeduplicate" stats

The "doDeduplicate" operation is a valid target for the "monitor" command, meaning that you can watch the command's statistics while it's running through the `zxsuite powerstore monitor [operationID]` command.

Sample Output

 Current Pass (Digest Prefix):  63/64
 Checked Mailboxes:             148/148
 Deduplicated/duplicated Blobs: 64868/137089
 Already Deduplicated Blobs:    71178
 Skipped Blobs:                 0
 Invalid Digests:               0
 Total Space Saved:             21.88 GB
  • "Current Pass (Digest Prefix)" - The "doDeduplicate" command analyzes the BLOBs in groups based on the first character of their digest (name).
  • "Checked Mailboxes" - The number of mailboxes analyzed for the current pass.
  • "Deduplicated/duplicated Blobs" - Number of BLOBs deduplicated by the current operation / total number of duplicated items on the volume.
  • "Already Deduplicated Blobs" - Number of already deduplicated BLOBs on the volume (duplicated BLOBs that were deduplicated by a previous run).
  • "Skipped Blobs" - BLOBs that have not been analyzed, usually because of a read error or a missing file.
  • "Invalid Digests" - BLOBs with a bad digest (name different from the actual digest of the file).
  • "Total Space Saved" - Amount of disk space freed by the doDeduplicate operation.


Looking at the sample output above we can see that:

  • The operation is running the second-to-last pass, on the last mailbox.
  • 137089 duplicated BLOBs have been found, 71178 of which had already been deduplicated by a previous run.
  • The current operation has deduplicated 64868 BLOBs, for a total disk space saving of 21.88 GB.
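The same reading of the sample output can be automated. The Python sketch below parses the text shown above and derives how many duplicated BLOBs are still waiting to be processed. The helper names are hypothetical, and the arithmetic assumes that the "deduplicated" and "already deduplicated" counts are disjoint parts of the total, as the explanation above suggests.

```python
SAMPLE = """\
Current Pass (Digest Prefix):  63/64
 Checked Mailboxes:             148/148
 Deduplicated/duplicated Blobs: 64868/137089
 Already Deduplicated Blobs:    71178
 Skipped Blobs:                 0
 Invalid Digests:               0
 Total Space Saved:             21.88 GB
"""


def parse_stats(text):
    """Split each 'Label: value' line of the monitor output into a dict entry."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.strip().rpartition(":")
        if sep:
            stats[label.strip()] = value.strip()
    return stats


def summarize(stats):
    """Derive remaining work from the parsed counters (assumed to be disjoint)."""
    done, total = (int(n) for n in stats["Deduplicated/duplicated Blobs"].split("/"))
    already = int(stats["Already Deduplicated Blobs"])
    current, passes = (int(n) for n in stats["Current Pass (Digest Prefix)"].split("/"))
    return {
        "remaining_duplicates": total - done - already,
        "passes_left": passes - current,
    }


summary = summarize(parse_stats(SAMPLE))
print(summary)  # {'remaining_duplicates': 1043, 'passes_left': 1}
```

In this sample, 1043 duplicated BLOBs have not yet been handled, and one more pass follows the current one.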

