Details

      Description

      We should support some configurable alerts for HA. At a minimum, if the free space remaining on the volume for the service directory, log files, journal, HALog files, or snapshots falls below a minimum threshold, then the service should be forced into an "OPERATOR" state. Those alerts might be simple stubs in the Configuration file, or they might be provided automatically with minimum thresholds for key disk locations. Other kinds of alerts could also make sense: we already monitor a lot of things and could report on those that are critical.

      However, this sort of monitoring is best performed using a dedicated monitoring package such as Nagios.

        Activity

        bryanthompson added a comment -

        I have added performance counters for the "Volumes" for the HAJournal. These are reported through the web interface on the /counters page, as illustrated below. Although all storage for the HAJournalServer is on the same volume in this example, the counters are reported for each relevant directory. Using this RESTful API, it is trivial to create a Nagios or similar integration that monitors the Volumes under /counters (or that directly accesses http://localhost:8090/counters?path=%2FVolumes).

        / Volumes	 ...
        / Volumes / Data Volume Bytes Available	59,861,090,304
        / Volumes / HALog Volume Bytes Available	59,861,090,304
        / Volumes / Service Volume Bytes Available	59,861,090,304
        / Volumes / Snapshot Volume Bytes Available	59,861,090,304
        / Volumes / Temp Volume Bytes Available	59,861,090,304
        

        Exposing these performance counters does not directly force the service into an OPERATOR state. However, it does allow flexible and configurable frameworks (such as nagios) to alert an operator in time to take corrective action.

        Committed revision r7649.
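As an illustration of the kind of integration described above, here is a minimal sketch of a Nagios-style check in Python. It assumes the counters can be obtained as text lines in the same layout as the listing above (counter path, whitespace, comma-grouped byte count); the actual response format of the /counters page may differ, in which case the parsing step would need to be adapted. The function names and the thresholds are hypothetical. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re
from urllib.request import urlopen

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches lines such as "/ Volumes / Data Volume Bytes Available   59,861,090,304".
# ASSUMPTION: the counters are available as text in the layout shown in the comment.
COUNTER_LINE = re.compile(r"/\s*Volumes\s*/\s*(.+?)\s+([\d,]+)\s*$")


def parse_volume_counters(text):
    """Return {counter name: bytes available} for each Volumes counter line."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2).replace(",", ""))
    return counters


def check_volumes(counters, warn_bytes, crit_bytes):
    """Compare each counter with the thresholds; return (exit code, message)."""
    if not counters:
        return UNKNOWN, "no volume counters found"
    status = OK
    problems = []
    for name, available in sorted(counters.items()):
        if available < crit_bytes:
            status = CRITICAL
            problems.append("%s = %d (critical)" % (name, available))
        elif available < warn_bytes:
            status = max(status, WARNING)
            problems.append("%s = %d (warning)" % (name, available))
    message = "; ".join(problems) if problems else "all volumes above thresholds"
    return status, message


def run_check(url, warn_bytes, crit_bytes):
    """Fetch the counters page, evaluate it, print a status line, return the code."""
    with urlopen(url, timeout=10) as response:
        text = response.read().decode("utf-8", errors="replace")
    status, message = check_volumes(parse_volume_counters(text), warn_bytes, crit_bytes)
    print(message)
    return status
```

A wrapper script could call `run_check("http://localhost:8090/counters?path=%2FVolumes", warn_bytes, crit_bytes)` with thresholds chosen for the deployment and pass the returned code to `sys.exit`, so that the monitoring system picks up the state.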

        bryanthompson added a comment -

        I have updated the HAJournalServer wiki page to indicate that deployments must establish monitoring for these volumes. See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=HAJournalServer#Monitoring

        bryanthompson added a comment -

        This ticket is closed. The approach for operator alerts is to use an external monitoring system (such as Nagios) to monitor critical values and report them to an operator. For now, I have deferred adding explicit tests of those values that would force the service into the OPERATOR state.


          People

          • Assignee: bryanthompson
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 1
