Identifying Bottlenecks

We all know too well that nothing in the world of IT works 100% of the time. Issues are inevitable, and troubleshooting is a key part of any hands-on role in the industry. One of the more common tasks when fault finding is identifying performance issues in the form of a bottleneck, especially when it comes to backups. This can often be a tedious and time-consuming task when multiple systems are involved. Thankfully, Veeam has simplified things by building the capability right into the product, making the whole process much less of a headache.

While a backup job is running, Veeam monitors the data flow across all involved components in the background, end to end, to detect any inefficiencies that may slow things down. This is broken down into the following categories:

Source – Source disk reader component responsible for retrieving data from the source storage.
Proxy – VMware/offhost backup proxy component responsible for processing VM data.
Source WAN accelerator – WAN accelerator deployed on the source site. Used for backup copy and replication jobs working through WAN accelerators.
Network – Network queue writer component responsible for getting processed VM data from the VMware/offhost backup proxy and sending it over the network to the backup repository or another VMware/offhost backup proxy.
Target WAN accelerator – WAN accelerator deployed on the target site. Used for backup copy and replication jobs working through WAN accelerators.
Target – Target disk writer component (backup storage or replica datastore).

The process of detecting inefficiencies is dynamic, meaning that the identified bottleneck can change at different stages of the backup job. This is easy to see by looking at the SUMMARY section of any job.

It is important, however, to understand that a bottleneck does not necessarily mean there is a problem. In reality there will always be some form of bottleneck along the data path; it just depends on whether that bottleneck is severe enough to have an impact on performance. The bottleneck displayed in the job simply shows the weakest point in the path.

So how do you know whether the reported bottleneck is impacting performance? This is where resource usage comes in. Each of the categories listed above is measured as a percentage based on workload – 0 being idle and 100 being fully utilised. Any component that is at or close to 100% utilisation is typically deemed a legitimate bottleneck. To check the current utilisation while a job is running, hover the mouse over the current bottleneck.
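
As a rough illustration, here is a minimal Python sketch of that sanity check. The 90% threshold and the utilisation figures are assumptions of mine for the example, not Veeam-defined values – adjust the cut-off to whatever you consider "close to fully utilised" in your environment.

# Hypothetical threshold: treat anything at or above ~90% utilisation as a
# bottleneck worth investigating. This is an assumed value, not a Veeam default.
SATURATION_THRESHOLD = 90

def is_significant_bottleneck(utilisation_percent):
    """Return True when a component is busy enough to likely limit throughput."""
    return utilisation_percent >= SATURATION_THRESHOLD

# Example utilisation figures, made up for illustration
usage = {"Source": 14, "Proxy": 38, "Network": 79, "Target": 97}
suspects = {name: pct for name, pct in usage.items() if is_significant_bottleneck(pct)}
print(suspects)  # {'Target': 97}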

It’s worth noting that these values don’t pinpoint the root cause of the problem, but rather help narrow your search, allowing you to focus your time on a specific area.

So we’ve covered how to locate bottlenecks while jobs are running, but what about jobs that run after hours? After all, it is very common for backups to be scheduled off-peak in the early hours of the morning. To check the resource usage of a job that has already finished, we can check the logs.

Navigate to the log directory, which by default is C:\ProgramData\Veeam\Backup, then open the folder matching the name of the backup job. Each virtual machine included in the job has its own set of log files; the one we are after is named Task.<VM Name>.<GUID>.log. Within the file, search for the word pex. This should produce output similar to the example below:

[AP] (2bb27f23) output: --pex:10;11333009408;4045406208;7237271552;4045406208;1965451568;14;38;79;29;38;76;133403763513980000
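
If you would rather not dig through the logs by hand, a small Python sketch along these lines can pull every pex line out of each task log for you. The job folder name used here ("Daily VM Backup") is just a placeholder – everything else follows the default path and file naming described above.

from pathlib import Path

# Default Veeam log location; the job name below is a hypothetical placeholder.
LOG_ROOT = Path(r"C:\ProgramData\Veeam\Backup")
JOB_NAME = "Daily VM Backup"

# Each VM in the job gets its own Task.<VM Name>.<GUID>.log file.
for task_log in sorted((LOG_ROOT / JOB_NAME).glob("Task.*.log")):
    print(f"--- {task_log.name} ---")
    with task_log.open(errors="ignore") as fh:
        for line in fh:
            if "pex" in line:
                print(line.rstrip())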

This may look a little difficult to decipher at first, as each set of figures relates back to the overall job progress and performance. The first value after “pex” is the percentage of the task that has completed. The next 5 values relate to the processed data in bytes – we won’t be covering those here since we are focusing on the values relating to bottlenecks. Those can be found between the 6th and 12th semicolons, which in this example gives us 14;38;79;29;38;76. A short parsing sketch follows the list below.

Number after 6th semicolon – Source read utilisation % at the source storage
Number after 7th semicolon – Source processing utilisation % at the proxy
Number after 8th semicolon – Source write utilisation % at the source network
Number after 9th semicolon – Target read utilisation % at the target network
Number after 10th semicolon – Target processing utilisation % at the target repository
Number after 11th semicolon – Target write utilisation % at the target storage
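
To save counting semicolons by eye, here is a minimal Python sketch that pulls the six utilisation figures out of a pex line using the mapping above. The field names are my own labels rather than anything Veeam uses.

import re

# My own labels for the six counters, in the order they appear in the log line
BOTTLENECK_FIELDS = [
    "source_read",        # utilisation % at the source storage
    "source_processing",  # utilisation % at the proxy
    "source_write",       # utilisation % at the source network
    "target_read",        # utilisation % at the target network
    "target_processing",  # utilisation % at the target repository
    "target_write",       # utilisation % at the target storage
]

def parse_pex_line(line):
    """Return the six bottleneck percentages from a '--pex:' log line, or None."""
    match = re.search(r"--pex:([\d;]+)", line)
    if not match:
        return None
    values = match.group(1).split(";")
    if len(values) < 12:
        return None
    # Value 0 is percent complete, values 1-5 are byte counters,
    # values 6-11 are the bottleneck utilisation percentages.
    return dict(zip(BOTTLENECK_FIELDS, (int(v) for v in values[6:12])))

example = ("[AP] (2bb27f23) output: --pex:10;11333009408;4045406208;7237271552;"
           "4045406208;1965451568;14;38;79;29;38;76;133403763513980000")
print(parse_pex_line(example))
# {'source_read': 14, 'source_processing': 38, 'source_write': 79,
#  'target_read': 29, 'target_processing': 38, 'target_write': 76}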

As the job progresses, new values will be added to the log in the same format. Depending on your environment, these values may change over time, so it is worth looking for a pattern across multiple readings rather than putting too much focus on any particular one.
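
Building on the parse_pex_line() sketch above, one simple way to look for that pattern is to average each counter across every pex reading in a task log rather than judging a single line in isolation. The example path in the comment is hypothetical.

from statistics import mean

# Assumes parse_pex_line() from the previous sketch is defined in the same script.
def average_bottlenecks(log_path):
    """Average each bottleneck counter across all pex readings in a task log."""
    readings = []
    with open(log_path, errors="ignore") as fh:
        for line in fh:
            stats = parse_pex_line(line)
            if stats:
                readings.append(stats)
    if not readings:
        return {}
    return {field: round(mean(r[field] for r in readings), 1)
            for field in readings[0]}

# Hypothetical example path, following the naming convention described earlier:
# average_bottlenecks(r"C:\ProgramData\Veeam\Backup\Daily VM Backup\Task.VM01.1a2b3c4d.log")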

The great thing here is that the same process works for other job types too, such as replication and backup copy jobs!
