One chunk or two? TSM 6.2 data deduplication
Recap!
Back in TSM 6.1, server-side deduplication became available. The data had to reside on disk to gain this benefit (obviously), and the dedupe was post-process (meaning it wasn’t performed at the time of ingest). This functionality works very well in the right circumstances and can deliver real storage savings. In version 6.2, IBM has added client-side dedupe, which extends the functionality of the server-side dedupe.
Dedupe on the client!?
If you are technical, reading the IBM explanation will make you sit up and then mutter “gosh, that’s clever”. If you are not technical, you will still recognize the feature’s usefulness.
Firstly, what is client-side dedupe? It is the ability of the TSM client to identify chunks of data that have already been sent to the TSM server, and so not send them again. The result is greatly reduced backup traffic, which in turn shortens backup windows.
To use client-side dedupe you will need to be running version 6.2 of both the server and the client, because client dedupe is an extension of the server dedupe code. The TSM client API has also been updated to handle the new exchange of data hash queries.
So here is a simplified explanation of how it works:
1. The client inspects the file to be backed up and splits it into chunks.
2. The client queries the TSM server to see whether any of the chunks already exist in the server hash table.
3. If a chunk does exist, its hash is added to a local cache on the client to speed up future queries.
4. The new chunks, plus the hashes for the old chunks, are sent to the server.
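For the curious, the four steps can be sketched in a few lines of Python. This is a toy model only, not IBM's implementation: the fixed 4-byte chunks, the SHA-1 hashing, and the names `ToyServer` and `backup` are all assumptions made up for illustration, and the real client talks to the server over the network rather than to an in-memory dictionary.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, purely for illustration


class ToyServer:
    """Hypothetical stand-in for the server's hash table and chunk store."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes

    def has_chunk(self, digest):
        return digest in self.chunks

    def store(self, digest, chunk):
        self.chunks[digest] = chunk


def backup(data, server, cache):
    """Back up `data`; return (chunk bytes sent, ordered list of chunk hashes)."""
    sent = 0
    recipe = []  # the hashes the server needs to rebuild the file
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]           # step 1: chunk the file
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in cache:                      # already known locally
            pass
        elif server.has_chunk(digest):           # step 2: ask the server
            cache.add(digest)                    # step 3: remember the hit
        else:
            server.store(digest, chunk)          # step 4: send the new chunk
            cache.add(digest)
            sent += len(chunk)
        recipe.append(digest)                    # step 4: the hash is always sent
    return sent, recipe


server, cache = ToyServer(), set()
first, _ = backup(b"AAAABBBBCCCC", server, cache)
second, recipe = backup(b"AAAABBBBDDDD", server, cache)
print(first, second)  # 12 4
```

The second backup shares two of its three chunks with the first, so only the four bytes of the new chunk travel; the rest is represented by hashes.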
As the process above shows, client-side dedupe is in-line, which reduces the landing space needed in the disk pool at the server end. Another funky feature is that client-side and server-side deduplication share the same hash table, or pool of data chunks. The benefit is that nodes on slow network links can reuse chunks that have already been populated by nodes local to the TSM server. The explanation above mentions a local cache of hashes; this is another way of reducing network traffic by stopping the process from being too chatty. Hash queries go to the local cache first, and only if they are not satisfied there are they forwarded to the TSM server.
What about my API clients, I hear you cry!?! Well, agents that use the TSM API can get the same advantages; however, there are a few caveats when using TDP agents that haven’t been updated to understand what the API is doing to their data. It is best to test it on a development system or, if all else fails, read the manual!
I will leave you with a table that summarises the server-side and client-side deduplication methods:
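To see why the shared pool and the local cache matter, here is a rough Python sketch of two nodes backing up the same data. Everything in it is hypothetical (the `SharedPool` and `Node` classes, the SHA-1 hashing, counting whole chunks rather than bytes); it only illustrates the lookup order described above, not how TSM actually stores or queries its hashes.

```python
import hashlib


class SharedPool:
    """Toy stand-in for the server hash table shared by every node."""

    def __init__(self):
        self.hashes = set()
        self.queries = 0  # hash queries that actually reached the server

    def query(self, digest):
        self.queries += 1
        return digest in self.hashes


class Node:
    """Hypothetical client with its own local hash cache."""

    def __init__(self, pool):
        self.pool = pool
        self.cache = set()  # hashes this node knows are on the server

    def is_known(self, digest):
        if digest in self.cache:       # local cache first: no network chatter
            return True
        if self.pool.query(digest):    # otherwise forward the query to the server
            self.cache.add(digest)
            return True
        return False

    def backup(self, chunks):
        """Return how many chunks had to be sent over the network."""
        sent = 0
        for chunk in chunks:
            digest = hashlib.sha1(chunk).hexdigest()
            if not self.is_known(digest):
                self.pool.hashes.add(digest)  # the server stores the new chunk
                self.cache.add(digest)
                sent += 1
        return sent


pool = SharedPool()
lan_node, wan_node = Node(pool), Node(pool)
os_image = [b"kernel", b"libc", b"shell"]

print(lan_node.backup(os_image), pool.queries)  # 3 3
print(wan_node.backup(os_image), pool.queries)  # 0 6
print(wan_node.backup(os_image), pool.queries)  # 0 6
```

The node on the slow link sends nothing because the local node has already populated the pool, and on its second run it does not even bother the server with queries.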
| | Server-side Deduplication | Client-side Deduplication |
| Method used | Server removes redundant data chunks | Client queries server to remove redundant data chunks before sending |
| Conserves Network Bandwidth? | No | Yes |
| Data supported | Backup, Archive, API, HSM | Backup, Archive, API |
| Scope | Data in the same storage pool | Data in the same storage pool |
…mutter, mutter, mutter, those clever IBMers, mutter, mutter…






