Projects


Tivoli Storage Manager Daily Health Check Script

The TSM Daily Health Check Script is a Perl-based script designed to check the health of the TSM server on a daily basis. Its purpose is not to alert on issues that need immediate action but to identify potential issues and allow proactive prevention. The current version checks 37 different functional areas of the TSM server. It is configured according to the role the TSM server provides. For example, if the TSM server is a Library Manager, the script will check for offline tape drives or paths, inventory discrepancies, and so on.

Figure 1.

Figure 1 shows the summary of the items that are currently checked. This is displayed when the health check script completes if it is run in the foreground. If it is possible to send email from the server, this summary could also become the body of the email. An item preceded by an exclamation point had at least one warning associated with it. As shown at the bottom of the summary, there is a more detailed report that can be reviewed for each item. This detailed report is the one that should receive the most attention when checking the daily health of the TSM server. If the health check script is installed in a shared library environment, it is installed on the TSM server acting as the library manager. This centralizes all the reports in one location. By default, the health check script retains 30 days of archived health check reports. This can be changed in the script configuration file on a per-TSM-instance basis. Although the report output will vary greatly based on the purpose of the TSM server, below is a sample run of the report, shown as a series of screen images.
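The exclamation-point convention in the summary can be sketched in a few lines. This is only an illustration in Python, not the script's actual Perl code; the function name `render_summary`, the tuple layout, and the item titles are assumptions made for the example.

```python
def render_summary(items):
    """Build summary lines from (item_number, title, warning_count) tuples.

    Hypothetical layout: an item with at least one warning is prefixed
    with '!', any other item with a space, so the numbers stay aligned.
    """
    lines = []
    for number, title, warnings in items:
        marker = "!" if warnings > 0 else " "
        lines.append(f"{marker} {number:>2}. {title}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative item titles, not the script's real wording.
    print(render_summary([
        (1, "TSM DB utilization", 0),
        (4, "DB backup frequency", 2),
    ]))
```

A reader scanning the summary only needs to look at the first column to see which items deserve a closer look in the detailed report.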

Figure 2.

Figure 2 shows the beginning of the report. It starts with a banner that identifies the TSM instance name, hostname, and the date/time the report was generated. It shows the first six items, which relate primarily to the TSM DB, active log, archive log, and whether backups of DR-related items are occurring. It is possible to monitor the mirror log and failover log if present. For the environment shown, these are N/A. Item four checks how frequently the TSM database is being backed up. It should have at least one full or incremental backup daily. Item five checks that DB snapshots are also occurring daily if the config file specifies a value for the frequency. And finally, item six checks whether the volume history and device configuration are being backed up daily to at least two locations.
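The item four rule (at least one full or incremental database backup per day) can be expressed as a small check. The sketch below is in Python rather than the script's Perl, and it assumes the backup history has already been fetched as `(type, completion_time)` pairs; the type labels `FULL`, `INCREMENTAL`, and `SNAPSHOT` and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def db_backup_is_current(backups, now, max_age_hours=24):
    """Return True if a full or incremental backup finished recently enough.

    backups: iterable of (backup_type, completed_at) pairs, where
    backup_type uses the illustrative labels 'FULL', 'INCREMENTAL',
    or 'SNAPSHOT'. Snapshots do not count toward this check.
    """
    cutoff = now - timedelta(hours=max_age_hours)
    return any(
        kind in ("FULL", "INCREMENTAL") and completed >= cutoff
        for kind, completed in backups
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 8, 0)
    history = [("FULL", datetime(2024, 1, 1, 20, 0))]
    print(db_backup_is_current(history, now))
```

The same shape would work for the snapshot check in item five, with the accepted type and the maximum age taken from the config file.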

Figure 3.

Figure 3 shows items seven through ten. Item seven looks through the summary table for any failed processes in the prior twenty-four-hour period. Since there may be numerous failed processes, a configurable value limits how many are reported. The health check takes the extra step of taking each failed process number and querying the activity log for it. This information is available in the detailed query output files, which are named with the number of the item. Item eight looks for untimely migrations based on a start hour and duration value set in the config file. Item nine checks for storage pools whose utilization is above a percentage specified in the config file. And item ten checks the backup stgpool processes, which take care of the DR copy storage pools. A value in gigabytes can be set to alert on. Multiple groupings of primary storage pool and copy storage pool can be specified.
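A start hour plus a duration defines a time window that may wrap past midnight, which is the one subtle part of a check like item eight. The Python sketch below assumes that "untimely" means a migration that started outside that window; this reading, the function names, and the pool names are assumptions, not details taken from the script.

```python
from datetime import datetime


def in_window(hour, start_hour, duration_hours):
    """True if `hour` falls in a window that may wrap past midnight."""
    return (hour - start_hour) % 24 < duration_hours


def untimely_migrations(migrations, start_hour, duration_hours):
    """Return the (pool, started_at) pairs that began outside the window.

    migrations: iterable of (pool_name, started_at datetime) pairs.
    """
    return [
        (pool, started)
        for pool, started in migrations
        if not in_window(started.hour, start_hour, duration_hours)
    ]


if __name__ == "__main__":
    runs = [
        ("DISKPOOL", datetime(2024, 1, 1, 23, 15)),  # inside 22:00-02:00
        ("DISKPOOL", datetime(2024, 1, 2, 10, 5)),   # outside the window
    ]
    for pool, started in untimely_migrations(runs, start_hour=22, duration_hours=4):
        print(pool, started)
```

The modulo arithmetic avoids a special case for windows such as 22:00 to 02:00.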

Figure 4.

Figure 4 shows items eleven through seventeen. Item eleven gives a summary of how much data was moved through the server in the prior twenty-four-hour period. This is useful for getting a high-level view of anything unusually high or low. Items twelve, thirteen, and fourteen report on the status of the automated tape library (ATL). For this TSM instance they are not applicable, but they would otherwise give very valuable information, including tapes present in the I/O door and discrepancies found when comparing the ATL inventory against the libvolumes table. Item fifteen looks for storage pool volumes that do not have readwrite access. Item sixteen identifies any libvolumes that are private and have no owner. And finally, item seventeen provides counts of storage pool volumes that exceed the reclaim value specified in the config file.
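Comparing the ATL inventory against the libvolumes table amounts to a two-way set difference. The following Python sketch assumes both sides are already available as lists of volume labels; the function name, the result keys, and the labels are hypothetical.

```python
def inventory_discrepancies(library_inventory, libvolumes):
    """Compare volume labels seen by the library with those known to TSM.

    Both arguments are iterables of volume label strings. The result
    lists the labels that appear on only one side.
    """
    in_library = set(library_inventory)
    in_tsm = set(libvolumes)
    return {
        "only_in_library": sorted(in_library - in_tsm),
        "only_in_libvolumes": sorted(in_tsm - in_library),
    }


if __name__ == "__main__":
    result = inventory_discrepancies(
        ["A00001", "A00002", "A00003"],
        ["A00002", "A00003", "A00004"],
    )
    print(result)
```

Reporting both directions matters: a label only in the library may be a tape that was never checked in, while a label only in libvolumes may be a tape that was removed without a checkout.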

Figure 5.

Figure 5 shows items eighteen through twenty-two. Item eighteen looks for storage pool volumes that have encountered more read/write errors than the threshold set in the config file. Item nineteen evaluates the scheduled client backup and archive events in the previous 24 hours and reports on any exceptions. Item twenty does essentially the same as nineteen, except that it looks at the administrative scheduled events for exceptions. Item twenty-one is designed for environments that leverage the event logging functionality in TSM and the USEREXIT. It identifies events that should be monitored. If the USEREXIT functionality generates a syslog entry on the TSM server that gets picked up by Netcool/rsyslog, the events it flags are the ones that are not being monitored. Item twenty-two allows monitoring for the activity of specific messages in the TSM activity log. It is not meant to be used to monitor for issues that would require a reactive response. It is instead intended to be used in an ad hoc manner to look for any activity on a small set of messages, from one to around ten, that might be of special interest.

Figure 6.

Figure 6 shows items twenty-three through twenty-six. Item twenty-three looks at TSM server options that are set in dsmserv.opt and in the status table and reports any that are not set as specified by the config file. It is useful for making sure that a server conforms to build and security standards. Item twenty-four compares the volumes of a FILE-type device class against what is present in the directory of the underlying file system. It identifies orphaned or misplaced files. Item twenty-five identifies backups from the previous cycle that exceeded a specified threshold. This may be indicative of improperly set exclude options for the backup client. Item twenty-six shows volumes that are below a specified utilization value. This is often caused by reclamation issues or collocation not being set up properly.
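The options comparison in item twenty-three can be illustrated with a small parser and a dictionary comparison. This Python sketch assumes the options file uses a simple `NAME value` layout with comment lines starting with `*`; the function names and the expected values are illustrative only and do not come from the script.

```python
def parse_options(text):
    """Parse 'NAME value' lines into a dict, skipping blanks and '*' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        options[parts[0].upper()] = value
    return options


def nonconforming(actual, expected):
    """Return {option: (actual_value, expected_value)} for every mismatch."""
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


if __name__ == "__main__":
    sample = "* server options\nTCPPORT 1500\nexpinterval 0\n"
    standard = {"TCPPORT": "1500", "EXPINTERVAL": "24"}  # illustrative standard
    print(nonconforming(parse_options(sample), standard))
```

An option that is missing entirely shows up with an actual value of `None`, which keeps "not set" distinct from "set to the wrong value".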

Figure 7.

Figure 7 shows items twenty-seven through thirty-two. Item twenty-seven identifies TSM clients with an occupancy exceeding a threshold set in the config file. This is useful for identifying clients that either have an issue with exclude statements or are not expiring their data. Often when the issue is related to expiration, it is with DP for Oracle clients, where all backup objects are active copies and the RMAN component needs to delete these objects based on client-side controlled policies. Item twenty-eight allows a more proactive way of identifying a storage pool that is about to exceed its maximum scratch value. This is related to sequential access storage pools and FILE device class storage pools. Setting the MAXSCRATCH value in TSM is important to be able to accurately reflect storage pool utilization. So, set it correctly and monitor it rather than "set it and forget it" by making it a value of 9999. Item twenty-nine identifies administrative schedules that are not active. This is valuable to make sure that daily housekeeping scripts are running as needed. Item thirty identifies locked client nodes and administrative accounts. This is valuable for admin accounts since they may be running admin schedules or executing server-to-server type functionality. Items thirty-one and thirty-two deal with administrative accounts with expired passwords or that have not accessed the server within a period specified in the config file.
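The MAXSCRATCH advice is easier to see with numbers. The Python sketch below flags pools whose scratch usage is close to their limit; the function name, the tuple layout, the 90 percent default, and the pool names are assumptions made for illustration.

```python
def pools_near_maxscratch(pools, warn_pct=90):
    """Flag pools whose scratch volume usage is at or above warn_pct of MAXSCRATCH.

    pools: iterable of (pool_name, scratch_volumes_used, maxscratch) tuples.
    Pools with a MAXSCRATCH of zero or less are skipped.
    """
    flagged = []
    for name, used, maxscratch in pools:
        if maxscratch <= 0:
            continue
        pct = 100 * used / maxscratch
        if pct >= warn_pct:
            flagged.append((name, round(pct, 1)))
    return flagged


if __name__ == "__main__":
    print(pools_near_maxscratch([
        ("TAPEPOOL", 95, 100),    # realistic limit: close to full
        ("FILEPOOL", 95, 9999),   # inflated limit hides the same usage
    ]))
```

With a limit of 9999 the second pool reports under one percent used, so a percentage-based check never fires for it, which is the risk of the "set it and forget it" approach.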

Figure 8.

Figure 8 shows items thirty-three through thirty-seven. Item thirty-three identifies clients that have no backup schedule associated with them. Item thirty-four checks the status of node-replicated environments. It does this by comparing the occupancy values on the source and target at the node level and warning if the sum exceeds a threshold specified in the config file. Item thirty-five checks that the license information is valid. Item thirty-six checks that all DISK device class volumes are varied on. And finally, item thirty-seven checks whether any storage pool volumes are unavailable.
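A node-level occupancy comparison for replication can be sketched as follows. The text above describes warning when a sum exceeds a threshold; this Python example takes one possible reading, the per-node gap between source and target occupancy, and that reading, the function name, the units, and the node names are all assumptions.

```python
def replication_gaps(source_occupancy, target_occupancy, threshold_mb):
    """Return {node: gap_mb} for nodes whose source/target gap exceeds the threshold.

    Both occupancy arguments map node name -> occupancy in MB. A node
    missing on the target is treated as having zero occupancy there.
    """
    gaps = {}
    for node, source_mb in source_occupancy.items():
        gap = abs(source_mb - target_occupancy.get(node, 0))
        if gap > threshold_mb:
            gaps[node] = gap
    return gaps


if __name__ == "__main__":
    source = {"NODE_A": 10_000, "NODE_B": 5_000, "NODE_C": 2_000}
    target = {"NODE_A": 9_950, "NODE_B": 1_000}
    print(replication_gaps(source, target, threshold_mb=500))
```

Treating a missing target node as zero occupancy makes a node that has never replicated stand out immediately.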