Manages the caching, persistence life-cycle and retrieval of results from task execution.

What follows is a class-by-class description of the role of each class used by the Job Manager.

MxTargetState_t

This class implements an enumeration type that describes the state of a job on a particular target. It contains two additional pieces of information about the state: is it a terminal state, and is it an abnormal end state. Abnormal end states are a proper subset of the terminal states. These are used by the MxJob class to determine when a Job has completed, and whether it completed successfully or failed. The fromValue and getValue methods are used to persist the enumeration in the repository as an integer.
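The pattern described above can be sketched as follows. This is only an illustration: the class name TargetStateSketch, the state names and the integer values are assumptions, not the actual MxTargetState_t definitions. Only the terminal and abnormal-end flags and the getValue/fromValue pair come from the description.

```java
// Hypothetical sketch: the state names and integer values are assumptions,
// not the real MxTargetState_t definitions.
enum TargetStateSketch {
    PENDING(0, false, false),
    RUNNING(1, false, false),
    COMPLETE(2, true, false),
    FAILED(3, true, true),
    CANCELED(4, true, true);

    private final int value;
    private final boolean terminal;
    private final boolean abnormalEnd;

    TargetStateSketch(int value, boolean terminal, boolean abnormalEnd) {
        // Abnormal end states are a subset of the terminal states.
        if (abnormalEnd && !terminal) {
            throw new IllegalArgumentException("abnormal end state must be terminal");
        }
        this.value = value;
        this.terminal = terminal;
        this.abnormalEnd = abnormalEnd;
    }

    // Integer form written to the repository.
    int getValue() { return value; }
    boolean isTerminal() { return terminal; }
    boolean isAbnormalEnd() { return abnormalEnd; }

    // Inverse of getValue, used when restoring from the repository.
    static TargetStateSketch fromValue(int value) {
        for (TargetStateSketch s : values()) {
            if (s.value == value) {
                return s;
            }
        }
        throw new IllegalArgumentException("unknown state value: " + value);
    }
}
```

A caller such as MxJob would then test isTerminal to decide whether a target has finished, and isAbnormalEnd to decide whether it failed.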

MxTargetOutput

This class stores the output and some status for a job on a particular target system. It is not persisted in the repository. The attributes of this class which must be persisted are shared with and initialized by the values in the MxTargetStatus object it is a part of (hostname, exit code, job ID). The stdOut and stdErr attributes are saved in a file when the parent MxTargetStatus object is persisted.

The methods used by the parent object for persistence are fileInFrom (initialize the object from a file), fileOutTo (write the output and error attribute values to files) and removeFiles (remove the files used to persist the object). Note that the directory containing the files is passed in by the caller. On HP-UX and Linux systems the files are in /var/opt/mx/output/task-name for scheduled tasks and in /var/opt/mx/output/runnow for "run now" tasks. On Windows, the corresponding directories are \Program Files\HP\SIM\output\task-name and \Program Files\HP\SIM\output\runnow. The file names in these directories are of the form job-id.cms-name.target-name.out and job-id.cms-name.target-name.err for the standard output and standard error respectively.
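The file naming convention can be illustrated with a small helper. The class OutputFileNamesSketch and its method names are hypothetical. Only the name pattern (job ID, CMS name, target name, then .out or .err) and the fact that the caller supplies the directory are taken from the description above.

```java
import java.nio.file.Path;

// Hypothetical helper; not part of the real MxTargetOutput class.
final class OutputFileNamesSketch {
    private OutputFileNamesSketch() { }

    // Common prefix: job-id.cms-name.target-name
    static String baseName(String jobId, String cmsName, String targetName) {
        return jobId + "." + cmsName + "." + targetName;
    }

    // File holding the standard output, inside the caller-supplied directory.
    static Path stdOutFile(Path dir, String jobId, String cmsName, String targetName) {
        return dir.resolve(baseName(jobId, cmsName, targetName) + ".out");
    }

    // File holding the standard error, inside the caller-supplied directory.
    static Path stdErrFile(Path dir, String jobId, String cmsName, String targetName) {
        return dir.resolve(baseName(jobId, cmsName, targetName) + ".err");
    }
}
```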

This object is passed via RMI from the DTF to the Domain manager as part of the target status when the job completes on a target system.

MxTargetStatus

This class stores the complete status for a job on a target system. It contains a MxTargetOutput object.

One peculiarity of this class is the way the exception information (if present) is persisted. When passed via RMI from the DTF, the MxTargetStatus object contains an actual exception object (myException). When the MxTargetStatus object is persisted, the exception class and message text (localized for the locale of the CMS) are saved. When the object is restored from the repository, the values for these attributes are restored, but no exception object is constructed, so the myException attribute is still null. Because of this, the get methods (getExceptionClass, getExceptionText) return either information from the exception itself (if it exists) or from the corresponding attributes (myExceptionText, myExceptionClass). This way the users of this class that only use these get methods don’t need to know if there is an exception object associated with the class.
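A minimal sketch of this getter fallback follows. The attribute and getter names are those mentioned above. The class name, the captureForPersistence method and the restored factory are assumptions standing in for the real persistence code.

```java
// Hypothetical stand-in for the exception handling in MxTargetStatus.
final class TargetStatusExceptionSketch {
    private Exception myException;      // set only when received live (e.g. via RMI)
    private String myExceptionClass;    // persisted attribute
    private String myExceptionText;     // persisted attribute (localized message)

    void setException(Exception e) {
        myException = e;
    }

    // Assumed hook: copy the exception details into the persisted attributes.
    void captureForPersistence() {
        if (myException != null) {
            myExceptionClass = myException.getClass().getName();
            myExceptionText = myException.getLocalizedMessage();
        }
    }

    // Assumed factory: mimics a restore, where no exception object is rebuilt.
    static TargetStatusExceptionSketch restored(String exceptionClass, String exceptionText) {
        TargetStatusExceptionSketch s = new TargetStatusExceptionSketch();
        s.myExceptionClass = exceptionClass;
        s.myExceptionText = exceptionText;
        return s;
    }

    // Prefer the live exception; otherwise fall back to the persisted values.
    String getExceptionClass() {
        return myException != null ? myException.getClass().getName() : myExceptionClass;
    }

    String getExceptionText() {
        return myException != null ? myException.getLocalizedMessage() : myExceptionText;
    }
}
```

Callers that use only the two getters see the same values whether the object arrived from the DTF or was restored from the repository.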

The recover method is provided to restore the object to a consistent (non-running) state after it is restored from the repository. This could occur if the object were persisted in a non-terminal state (while the job was running) and the Domain Manager daemon (containing the Job Manager’s cache) were stopped and restarted. When the Job Manager starts it reads all the jobs and recovers them (updates and rewrites them) to a consistent state using this method.

The restoreOutput and saveOutput methods are called by the containing MxJob object when it is restored/saved from the repository to operate on the associated MxTargetOutput object. The removeOutput method is called when the containing Job is removed.

MxJobState_t

This class implements an enumeration type that describes the state of the job as a whole (for all targets). It also contains an attribute specifying if the state is terminal (ending).

The getValue and fromValue methods are used to persist the object in integer form.

MxJobLifetime_t

This class implements an enumeration type that describes the lifetime of a job (at what point in its lifecycle it gets removed). It also contains an attribute specifying whether the Job must be written to the repository when it completes. Jobs with shorter lifetimes are never written to the repository, to avoid database updates for Jobs of only short-term interest to the user.

MxJob

This class contains all the status for a Job. It may contain one or more MxTargetStatus objects (via the myTargetStatusesBy attributes). It also has a reference to the Runnable Task and the Source Task associated with the Job. The Job may have a longer lifetime than the Runnable Task (that is why the constructor saves many attributes from the Runnable Task), but the Source Task will always exist while the Job does. When the Source Task is removed, any Jobs associated with it are removed (MxJobManager removeJobsForTask method).

The constructor for this class sets the target count for the Job, which is used in determining the “percent complete” value (getPercentComplete, myCalculatedPercentComplete) and in determining the state of the Job based on the state of the added Target Statuses. For tools that run on the CMS (Application Launch and MSA), this count does not correspond to the actual number of targets from the user's point of view, but it does describe the number of Target Status objects expected. Most other attributes initialized in the constructor were added because they were needed by the UI, and are not used within the class. The Job lifetime is also determined at this point, which in turn helps determine whether the Job is written to the repository (isPersisted method).

The use of tables of MxTargetStatus objects indexed by hostname and node ID predates the addition of object IDs to Gryphon. One or the other of these is probably not needed anymore.

The Job object also may contain an exception if the Job did not complete successfully. See the discussion of the exception attributes in MxTargetStatus.

The MxJob object is responsible for maintaining its state to reflect the state of the Job considering the MxTargetStatus objects that have been added from completed targets. The methods involved in this are the addTargetStatus method (called by MxDTFJobImpl and MxAutomationJobImpl when reporting a new or changed Target Status), the recover method (called by MxJobManager when reading Jobs from the repository on startup), and the updateJobTargetState method, which is called by the other two methods.

Note that since the status for a target may be added more than once (this currently happens only for Automation jobs), the addTargetStatus method must be able to handle this.

Despite their similar names, the addTargetStatus and updateStatus methods have very different effects. The addTargetStatus method adds a MxTargetStatus object for a new target system associated with the job, or updates the status for an existing target system. The method indirectly affects the state of the Job and its "percent complete" (unless a specific value for this attribute has been set) but does not change any of its other attributes. The updateStatus method affects only the Job itself: its state, start and end times, percent complete, and exception attributes. It is used by the DTF, whose MxDTFJobImpl class maintains the consistency of the Job state and the target states. The corresponding class for Automation Tools, MxAutomationJobImpl, does not attempt to maintain this consistency, so for Automation tasks this is maintained through addTargetStatus.

The recover method is called after reading a Job and its associated target statuses from the repository when the Job Manager starts. Since the Job may have been running when the Domain Manager was last terminated, the job may be in an inconsistent state. This method makes the Job state consistent with the Target Statuses. Note that the Job Manager initializeCache method must update the Job in the repository after calling this method.

The “percent complete” mechanism for Jobs is intended to allow some Jobs (currently only Automation jobs) to set the “percent complete” value (setPercentComplete) but also to allow other Jobs (currently only DTF jobs) to maintain their own “percent complete” based on the Target Statuses added if the “percent complete” value has never been explicitly set (see the updateJobTargetState method).
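The two ways of arriving at the “percent complete” value can be sketched as below. The class name and the targetReachedTerminalState method are assumptions, and the calculation shown (terminal targets divided by the target count) is one plausible reading of the description rather than the actual updateJobTargetState logic.

```java
// Hypothetical sketch of the "percent complete" rules; not the real MxJob code.
final class PercentCompleteSketch {
    private final int targetCount;      // set once, as in the MxJob constructor
    private int terminalTargets;        // targets that have reached a terminal state
    private Integer explicitPercent;    // null until setPercentComplete is called

    PercentCompleteSketch(int targetCount) {
        this.targetCount = targetCount;
    }

    // Used by jobs that report their own progress (Automation jobs in the text).
    void setPercentComplete(int percent) {
        explicitPercent = percent;
    }

    // Assumed hook: called when an added Target Status is in a terminal state.
    void targetReachedTerminalState() {
        if (terminalTargets < targetCount) {
            terminalTargets++;
        }
    }

    int getPercentComplete() {
        if (explicitPercent != null) {
            return explicitPercent;     // an explicit value always wins
        }
        if (targetCount == 0) {
            return 0;
        }
        return (100 * terminalTargets) / targetCount;
    }
}
```

Once a value has been set explicitly, later target completions no longer change the reported percentage in this sketch.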

The MxJob class, like the classes it contains, has methods to create, restore from and remove the files associated with the Job output (saveOutput, restoreAllOutput, restoreOutput and removeFiles). Since the Job object knows part of the path (see the getDirName method) to the directory where the output files are stored, all requests related to the output file must go through the Job object. It is passed the top-level directory, appends to it the path for this specific job and passes the resulting directory to the Target Status object(s). The Job object manages creating and removing the directories it knows about.

MxDTFJobImpl

This class provides the interface between the Domain Manager and the DTF to manage the execution of a particular DTF Job. It is created with a reference to the DTF which it uses to forward requests to execute, cancel or kill the Job. While the Job is active this object maintains a thread that receives remote events from the DTF containing updates to the status of the Job or its Targets (processEventPayload method). When the Job completes the internal thread is terminated, since no more updates will be received. The events are also forwarded to the Job Manager. The number of threads used in the Domain Manager for this purpose is limited by the mx.properties configuration variable MX_DTF_MAX_NUM_RUNNING_JOBS, which by default is 10.

MxAutomationJobImpl

Since automation jobs are executed within the Domain Manager, the operation of this class is much simpler than that of MxDTFJobImpl. No events are involved in updating this class, so this class must generate events to send to the Job Manager (the updateJobStatus and updateTargetStatus methods).

MxJobManager

The states a Job goes through during its lifecycle are described in the diagram below. The labels on the lines are the methods involved in changing the state of the Job. The note text boxes describe some of the Job Manager attributes affected. These states are not explicitly encoded in the Job object, but are useful in understanding what the Job Manager is doing. All Jobs go through the Created, Updated, Saved and Deleted states (although not all types of Jobs actually get written to the repository in the Saved state), but use of the remaining states depends on the Job lifetime.

Jobs with a lifetime of NEXT_JOB_COMPLETION, UNCACHE_OR_NTH_JOB_COMPLETION or AGED_OUT_OR_NTH_JOB_COMPLETION go through the Superceded state when a sufficient number of new Jobs for the same task complete. Jobs with a lifetime of AGED_OUT_OR_NTH_JOB_COMPLETION go through the Aged state if the Job has not been Superceded or Deleted when a sufficient time has passed after the Job completes. A Job of any lifetime can enter the Uncached state, but only Jobs for which the isPersisted method returns true are actually saved and can return to the Saved state via the call to the getJob method. Other Jobs, which are not persisted, enter the Uncached state and disappear. The states in this diagram are generally descriptive of what is going on, but not completely accurate in the usual state diagram sense.

The Tables Updated column lists the Job Manager global tables that are updated on entry to the state. The To State column specifies a possible new state entered when the method in the Method column is called.


State        Tables Updated                          Method                           To State

Start        --                                      addJob                           Created

Created      ourJobIDs, ourCache, ourTaskToJobMap    updateTargetStatusInRepository   Updated

Updated      --                                      updateTargetStatusInRepository   Updated

Updated      --                                      processCompletedJobs             Completed

Completed    ourDateToJobMap                         updateJobInRepository            Saved

Saved        --                                      testCanUnload                    Uncached

Uncached     --                                      getJob                           Saved

Saved        --                                      removeOldJobs                    Aged

Aged         --                                      deleteJob                        Deleted

Saved        --                                      deleteOldJobsForTask             Superceded

Superceded   --                                      deleteJob                        Deleted

Deleted      ourTaskToJobMap, ourDateToJobMap,       --                               --
             ourJobIDs, ourCache
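
The transitions in the table can also be written down as a lookup, which may help when reasoning about which method moves a Job where. This is purely illustrative: as noted above, these states are not encoded in the Job object, and the class JobLifecycleSketch and its next method are hypothetical. The state and method names are copied from the table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical encoding of the state table; the real Job Manager has no such class.
final class JobLifecycleSketch {
    enum State { START, CREATED, UPDATED, COMPLETED, SAVED, UNCACHED, AGED, SUPERCEDED, DELETED }

    private static final Map<String, State> TRANSITIONS = new HashMap<>();

    private static void add(State from, String method, State to) {
        TRANSITIONS.put(from + "/" + method, to);
    }

    static {
        add(State.START, "addJob", State.CREATED);
        add(State.CREATED, "updateTargetStatusInRepository", State.UPDATED);
        add(State.UPDATED, "updateTargetStatusInRepository", State.UPDATED);
        add(State.UPDATED, "processCompletedJobs", State.COMPLETED);
        add(State.COMPLETED, "updateJobInRepository", State.SAVED);
        add(State.SAVED, "testCanUnload", State.UNCACHED);
        add(State.UNCACHED, "getJob", State.SAVED);
        add(State.SAVED, "removeOldJobs", State.AGED);
        add(State.AGED, "deleteJob", State.DELETED);
        add(State.SAVED, "deleteOldJobsForTask", State.SUPERCEDED);
        add(State.SUPERCEDED, "deleteJob", State.DELETED);
    }

    // Returns the state entered when 'method' is called in state 'from'.
    static State next(State from, String method) {
        State to = TRANSITIONS.get(from + "/" + method);
        if (to == null) {
            throw new IllegalStateException("no transition from " + from + " via " + method);
        }
        return to;
    }
}
```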

JobEventListener

This internal static class is used to process events from the MxDTFJobImpl and MxAutomationJobImpl classes. It places events on a list and notifies the Job Manager’s internal thread, which runs the waitForEvent method.
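The hand-off between the listener and the internal thread can be sketched with a guarded queue. The class name JobEventQueueSketch and the handleEvent method are assumptions. Only the idea of placing events on a list and notifying a thread blocked in waitForEvent comes from the description, and the real waitForEvent does more than return one payload.

```java
import java.util.LinkedList;

// Hypothetical sketch of the listener/worker hand-off; not the real JobEventListener.
final class JobEventQueueSketch {
    private final LinkedList<Object> events = new LinkedList<>();

    // Called on the sender's thread: queue the payload and wake the worker.
    synchronized void handleEvent(Object payload) {
        events.addLast(payload);
        notifyAll();
    }

    // Called on the internal thread: block until an event is available.
    // Returns null if the thread is interrupted while waiting.
    synchronized Object waitForEvent() {
        while (events.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return null;
            }
        }
        return events.removeFirst();
    }
}
```

The loop around wait guards against spurious wake-ups, and events are delivered in the order they were queued.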

CachedJobLoader

This class is required by the ItemCacher class used by the Job Manager to implement the Job cache. An instance of this class is passed to ItemCacher when the cache is created in the initializeCache method.

The get method of this class is responsible for retrieving an object (a Job) to be placed into the cache after it has been uncached. It just reads it from the repository using the Job ID (the key of the cache) passed in, and reads the TaskID from the Task Manager to get a copy of the TaskID with both the name and the GUID. Only the GUID is stored in the repository, but the task name is needed for the Job Manager’s tables. After reading the Job from the repository, it is wrapped in a CompletedJobImpl object. The purpose of this class is to emulate an instance of MxDTFJobImpl or MxAutomationJobImpl so that a completed job that is read from the repository looks the same to the rest of the Job Manager as a job that has just completed.

When a MxJob is read from the repository along with the associated MxTargetStatus objects, the MxTargetOutput objects (if any) are not created, so the output files for the Job are not read. When the target status is retrieved from the client interface, the lookupTargetStatus method calls the restoreOutput method of MxJob to read the files into Java strings so that the output is available on the client side. The read is deferred because the output from a Job may be quite large (megabytes) and it is best to minimize the time the Job output is represented as strings in Java.

JobCacheCallback

This is another class required by the ItemCacher class. An instance of this class is passed to the ItemCacher constructor created in the initializeCache method.

The testCanUnload method is used by ItemCacher to determine if it can uncache an object because of either its age relative to other items in the cache, or because of the time it has been in the cache. The only thing preventing a Job from being unloaded is if it has not yet completed and been written to the repository.

The unloading method is used by the ItemCacher to perform the tasks needed to save an object when it is uncached. Since Jobs are saved when they are created (addJob method) and updated when they complete (updateJobInRepository method) and are not allowed to be uncached until they have been written (testCanUnload method) nothing has to be done at this time.

Key Attributes

I will point out the more important attributes of the Job Manager.

Even though the Job Manager is called a “manager”, it does not fit the pattern of the tool manager, tool box manager, user manager etc., in that it does not derive from the MxObjectManager class and defer most or all repository operations to it. The Job Manager has its own reference to the Persistent Data Manager (ourRepository) and explicitly reads and writes Jobs and Target Statuses to it.

Herein lies a bug that needs to be fixed for performance. The PDM can handle “secondary objects”, which are attributes of an object implementing MxPersistentObjectIfc that themselves implement MxPersistentObjectIfc (such as the myTargetStatusArray attribute of MxJob, which contains an array of MxTargetStatus objects). When the primary object (MxJob) is written to the repository, the associated MxTargetStatus objects are written as well. The problem is that the Job Manager receives updates to the MxTargetStatus objects independently of updates to the MxJob object, and must update the secondary objects in the repository independently of the MxJob object. The result is some unnecessary updates. For example, if a Job runs on two targets, the status for each target is updated (updateTargetStatusInRepository method) as it is received, and when the Job itself completes it is written to the repository (updateJobInRepository method), which also writes the two Target Status objects again even though they have not changed. The likely fix is to remove myTargetStatusArray as a persistent attribute of MxJob (PERSISTENT_ATTRIBUTES) and rely on the Job Manager to keep the Target Status objects updated. This problem was discovered late in the release cycle, and because of its limited impact it was not fixed.

Other key attributes include ourCache, which has references to the Jobs kept in the cache (actually MxJobIfc references), and several other tables that must be kept consistent with the cache:

  • ourJobIDs, used to enumerate all the JobIDs being stored and to determine if a JobID exists without loading it into the cache.
  • ourDateToJobMap, containing only completed Jobs by completion date and by JobID.
  • ourTaskToJobMap, containing only jobs associated with a scheduled task (not “run now” tasks). It is used to enforce limits on how many Jobs to keep around at once for a scheduled task.
  • ourJobLock, the lock used to protect all these tables.

Initialization

The facility for running the Job Manager without the use of the repository dates from early days of Gryphon. It is probably pretty useless now, because of how intertwined the database is with every aspect of the system (e.g. GUIDs). It would simplify things (initializeCache) if the associated code were removed (isDebugRepositoryDisabled and initializeRepositoryFlag methods).

The initializeCache method reads every Job in the repository (but does not load them into the cache) to initialize its secondary tables listed in the bulleted list above, and to recover any Jobs (and their Target Statuses) that were running to a consistent state. It updates the Job and Target status in the repository if any changes were made during the recovery.

Because this recovery could result in several Jobs appearing to complete at the same time, it must handle duplicate end times for Jobs in ourDateToJobMap.

addJob

This method enters a newly created Job into the Job Manager's tables and cache. A Job is only added to the Task table if it is based on a scheduled Task. Task names for “run now” Tasks are created dynamically and never re-used, so there will never be another Job for that task. Some jobs should never be visible to users listing the running Jobs, but need to be maintained by the Job Manager and their status returned when specifically requested. Currently only Jobs for Web Launch tasks without status URLs fall into this category: these jobs have the UNCACHE_UNLISTED lifetime specified. Note that the Job is added to the repository at this time, but not updated until all its targets complete.
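The table bookkeeping performed by addJob might look roughly like the sketch below. The attribute names ourJobLock, ourJobIDs, ourCache and ourTaskToJobMap come from this document. Everything else is an assumption: the class name, the use of a long job ID, a task name standing in for the cached Job object, and the exists and jobsForTask helpers. Repository access is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the table updates in addJob; repository access is omitted.
final class JobTablesSketch {
    private final Object ourJobLock = new Object();
    private final Set<Long> ourJobIDs = new HashSet<>();
    private final Map<Long, String> ourCache = new HashMap<>(); // task name stands in for the Job
    private final Map<String, List<Long>> ourTaskToJobMap = new HashMap<>();

    void addJob(long jobId, String taskName, boolean scheduledTask) {
        synchronized (ourJobLock) {          // one lock protects all the tables
            ourJobIDs.add(jobId);
            ourCache.put(jobId, taskName);
            if (scheduledTask) {             // "run now" jobs never enter the task table
                ourTaskToJobMap.computeIfAbsent(taskName, k -> new ArrayList<>()).add(jobId);
            }
        }
    }

    // Answers "does this job exist?" without touching the cache.
    boolean exists(long jobId) {
        synchronized (ourJobLock) {
            return ourJobIDs.contains(jobId);
        }
    }

    // Returns a copy so callers cannot modify the table outside the lock.
    List<Long> jobsForTask(String taskName) {
        synchronized (ourJobLock) {
            List<Long> jobs = ourTaskToJobMap.get(taskName);
            return jobs == null ? new ArrayList<>() : new ArrayList<>(jobs);
        }
    }
}
```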

deleteJob

This method only allows Jobs which have completed and been written to the repository to be deleted.

getJob

This method contains debug code that verifies a Job ID is not in any of the other tables (and removes it if found) if it is not in the cache. It was added after problems with inconsistencies were detected and fixed. I have never seen a debug log where this code was executed.

Note that after getting the Job from the cache, which possibly involved reading it from the repository, the MxJob restoreAllOutput method is called to fill in the contents of the Target output if needed.

getJobIfc

Unlike other public methods in this class, this method is not called from the client interface. It is used for automation jobs to get a reference to an object that can be used to update the Job and Target status (MxAutomationJobImpl) while passing events to the Job Manager about the updates so it can take action as needed.

listJobs

This method is available from the client interface, but is not currently used. It is left over from the old days when few Jobs existed at any time in the system. With Nimbus Jobs being retained for a month or more, there could be quite a few jobs in the repository but not in the cache. Calling this method loads all those jobs into the cache, possibly causing quite a performance hit. Maybe this should be removed.

getTargetStatus

This is one of the methods that relies on the getTargetStatus method of MxJob returning a clone instead of the actual object. This allows this method to set the target output of the object being returned to null if needed without affecting the copy in the cache. The problem is, the cloning is always performed, instead of only when it is actually needed, causing (according to Geoff) a significant performance impact. Mike says that the cloning should only be performed when needed, and should be done at the Controller level, not the manager. This issue came up during the code review, but I decided it was too major to fix at that time.

processEventPayload

This method is run by the Job Manager internal thread to handle events containing Jobs or Target Statuses as payloads. In general (especially from automation tasks) we can get many updates from a Job or Target Status before it is actually complete. The payload is only queued for further processing (ourTargetUpdateList or ourJobObjectList) if it has reached a terminal state.
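The filtering step can be sketched as follows. The list names ourJobObjectList and ourTargetUpdateList are from the text. The class name and the JobUpdate and TargetUpdate payload types are hypothetical stand-ins for the real Job and Target Status payloads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: only payloads in a terminal state are queued for processing.
final class PayloadFilterSketch {
    static final class JobUpdate {
        final long jobId;
        final boolean terminal;
        JobUpdate(long jobId, boolean terminal) {
            this.jobId = jobId;
            this.terminal = terminal;
        }
    }

    static final class TargetUpdate {
        final long jobId;
        final String hostname;
        final boolean terminal;
        TargetUpdate(long jobId, String hostname, boolean terminal) {
            this.jobId = jobId;
            this.hostname = hostname;
            this.terminal = terminal;
        }
    }

    final List<JobUpdate> ourJobObjectList = new ArrayList<>();
    final List<TargetUpdate> ourTargetUpdateList = new ArrayList<>();

    void processEventPayload(Object payload) {
        if (payload instanceof JobUpdate) {
            JobUpdate job = (JobUpdate) payload;
            if (job.terminal) {
                ourJobObjectList.add(job);        // handled later as a completed job
            }
        } else if (payload instanceof TargetUpdate) {
            TargetUpdate target = (TargetUpdate) payload;
            if (target.terminal) {
                ourTargetUpdateList.add(target);  // handled later as a completed target
            }
        }
        // Non-terminal updates are intentionally dropped here.
    }
}
```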

The waitForEvent method, which calls this method, also calls processCompletedJobs or processCompletedTargets if any were queued by this method.

processCompletedJobs

This method is run by the Job Manager internal thread to handle an event indicating a Job has completed. It fudges the Job completion time to prevent duplicates in the table indexed by Completion date. It writes the Job (and its Target Statuses) to the repository and adds it to ourDateToJobMap based on the completion time. It deletes any older Jobs for the same task that need to be removed and marks the Job as being saved in the repository so that it can be deleted or uncached if needed.
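One way to avoid duplicate keys in a table indexed by completion time is to nudge the time forward until it is unique. The sketch below assumes a millisecond timestamp key and a one-millisecond step. Both are assumptions, as are the class and method names. The actual adjustment used by processCompletedJobs is not described here.

```java
import java.util.TreeMap;

// Hypothetical sketch of keeping completion times unique in ourDateToJobMap.
final class DateToJobMapSketch {
    // Completion time in milliseconds -> job ID, ordered by time.
    private final TreeMap<Long, Long> ourDateToJobMap = new TreeMap<>();

    // Adds the job and returns the (possibly adjusted) completion time used as key.
    long addCompleted(long jobId, long endTimeMillis) {
        long key = endTimeMillis;
        while (ourDateToJobMap.containsKey(key)) {
            key++;                       // assumed step: one millisecond
        }
        ourDateToJobMap.put(key, jobId);
        return key;
    }

    int size() {
        return ourDateToJobMap.size();
    }
}
```

The same technique covers the startup case, where recovery can make several Jobs appear to complete at the same instant.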

processCompletedTargets

This method is also run by the Job Manager internal thread to handle events indicating a Target has completed. There could be multiple targets for the same job being processed at once, so to improve performance by decreasing the number of calls to the repository manager, all Targets for the same Job are written in one call to updateTargetStatusInRepository.
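The batching described above amounts to grouping the queued target updates by Job before writing. A minimal sketch, with a hypothetical TargetUpdate type and groupByJob helper, is shown below. The call to updateTargetStatusInRepository itself is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of grouping completed targets so each Job needs one write.
final class TargetBatchSketch {
    static final class TargetUpdate {
        final long jobId;
        final String hostname;
        TargetUpdate(long jobId, String hostname) {
            this.jobId = jobId;
            this.hostname = hostname;
        }
    }

    // Groups updates by job ID, preserving the order in which jobs were first seen.
    static Map<Long, List<TargetUpdate>> groupByJob(List<TargetUpdate> completed) {
        Map<Long, List<TargetUpdate>> byJob = new LinkedHashMap<>();
        for (TargetUpdate update : completed) {
            byJob.computeIfAbsent(update.jobId, k -> new ArrayList<>()).add(update);
        }
        return byJob;
    }
}
```

Each map entry would then correspond to a single repository call for that Job, instead of one call per target.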

deleteOldJobsForTask

This method runs in the internal thread to remove Jobs that should be removed because enough newer Jobs have completed for the same task. It uses ourTaskToJobMap to find the other Jobs for the same task, and ourDateToJobMap to get the Jobs for the task ordered by completion date.
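The selection of Jobs to remove can be sketched by combining the two tables. The class OldJobFilterSketch, its jobsToDelete method and the keep parameter are hypothetical, and so is the assumption that the newest Jobs are the ones retained. The real retention limits depend on the Job lifetime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of picking superseded Jobs for one task.
final class OldJobFilterSketch {
    private OldJobFilterSketch() { }

    /**
     * completedByDate: completion time -> job ID (as ourDateToJobMap might hold).
     * jobsForTask:     job IDs belonging to one task (as found via ourTaskToJobMap).
     * keep:            assumed number of most recent Jobs to retain.
     */
    static List<Long> jobsToDelete(TreeMap<Long, Long> completedByDate,
                                   Set<Long> jobsForTask, int keep) {
        List<Long> newestFirst = new ArrayList<>();
        for (Long jobId : completedByDate.descendingMap().values()) {
            if (jobsForTask.contains(jobId)) {
                newestFirst.add(jobId);
            }
        }
        if (newestFirst.size() <= keep) {
            return new ArrayList<>();    // nothing is old enough to remove
        }
        return new ArrayList<>(newestFirst.subList(keep, newestFirst.size()));
    }
}
```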

updateJobInRepository / updateTargetStatusInRepository / addJobToRepository

These methods hide from the rest of the Job Manager the existence of Jobs that do not actually get written to the repository.

removeOldJobs

This method is an important user of ourDateToJobMap. It delegates the identification of the jobs for each task to the filterJobsPerTask method, and then removes them. This method has package visibility so it can be called from MxJobCleanupConstruction, which is run every hour.