What follows is a class-by-class description of the role of each class used by the Job Manager.
MxJob
class to determine when a Job has completed, and whether it
completed successfully or failed. The fromValue
and
getValue
methods are used to persist the enumeration in the
repository as an integer.
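The integer round trip described above can be sketched as follows. This is only an illustration: the enumeration name `SketchJobStatus`, its constants, and the integer codes are assumptions, not the values actually stored in the repository.

```java
// Hypothetical status enumeration; constant names and integer codes are
// illustrative only, not the real values used in the repository.
enum SketchJobStatus {
    RUNNING(0), COMPLETED(1), FAILED(2);

    private final int value;

    SketchJobStatus(int value) {
        this.value = value;
    }

    // Integer form that would be written to the repository.
    int getValue() {
        return value;
    }

    // Rebuilds the constant from the integer read back from the repository.
    static SketchJobStatus fromValue(int value) {
        for (SketchJobStatus s : values()) {
            if (s.value == value) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown status value: " + value);
    }
}
```

Keeping the code explicit, rather than relying on `ordinal()`, means constants can be reordered later without changing what is already stored.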
MxTargetStatus
object it is a part of (hostname, exit code, job ID).
The stdOut
and stdErr
attributes are saved in a file
when the parent MxTargetStatus
object
is persisted.
The methods used by the parent object for persistence are fileInFrom
(initialize the object from a file), fileOutTo
(write the output and
error attribute values to files) and removeFiles
(remove the files
used to persist the object). Note that the directory containing the files is
passed in by the caller. On HP-UX and Linux systems the files are in
/var/opt/mx/output/task-name
for scheduled tasks and in
/var/opt/mx/output/runnow
for "run now" tasks. On Windows, the
corresponding directories are
\Program Files\HP\SIM\output\task-name
and
\Program Files\HP\SIM\output\runnow
. The file names in
these directories are of the form
job-id.cms-name.target-name.out
and
job-id.cms-name.target-name.err
for the standard
output and standard error respectively.
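As a rough illustration of the naming scheme above, the following sketch builds the two file paths from a caller-supplied directory. The class `OutputFileNames` and its method are hypothetical helpers, not part of the actual code.

```java
import java.nio.file.Path;

// Hypothetical helper illustrating the documented naming scheme
// job-id.cms-name.target-name.out / .err; not part of the real code base.
final class OutputFileNames {
    private OutputFileNames() {
    }

    // The directory is passed in by the caller, as in the real classes.
    static Path outputFile(Path dir, String jobId, String cmsName,
                           String targetName, boolean stdErr) {
        String suffix = stdErr ? "err" : "out";
        return dir.resolve(jobId + "." + cmsName + "." + targetName + "." + suffix);
    }
}
```

For a "run now" task, the caller would pass the `runnow` output directory and get back, for example, a path whose file name is `42.cms1.hostA.out`.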
This object is passed via RMI from the DTF to the Domain manager as part of the target status when the job completes on a target system.
MxTargetOutput
object.
One peculiarity of this class is the way the exception information (if present)
is persisted. When passed via RMI from the DTF, the MxTargetStatus
object contains an actual exception object (myException
). When the
MxTargetStatus
object is persisted, the exception class and message
text (localized for the locale of the CMS) are saved. When the object is
restored from the repository, the values for these attributes are restored,
but no exception object is constructed, so the myException
attribute is still null. Because of this, the get methods
(getExceptionClass
, getExceptionText
) return either
information from the exception itself (if it exists) or from the corresponding
attributes (myExceptionText
, myExceptionClass
). This
way the users of this class that only use these get methods don’t need to know
if there is an exception object associated with the class.
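A minimal sketch of this fallback follows. The attribute and getter names come from the description above; the class name `SketchTargetStatus` and the two factory methods are assumptions added for illustration.

```java
// Sketch of the getter fallback; SketchTargetStatus and its factory
// methods are hypothetical, only the attribute/getter names are from the text.
class SketchTargetStatus {
    private Exception myException;   // set only when received live (e.g. via RMI)
    private String myExceptionClass; // restored from the repository
    private String myExceptionText;  // restored from the repository

    static SketchTargetStatus live(Exception e) {
        SketchTargetStatus s = new SketchTargetStatus();
        s.myException = e;
        return s;
    }

    static SketchTargetStatus restored(String exceptionClass, String exceptionText) {
        SketchTargetStatus s = new SketchTargetStatus();
        s.myExceptionClass = exceptionClass;
        s.myExceptionText = exceptionText;
        return s;
    }

    // Prefer the live exception; otherwise use the persisted attribute.
    String getExceptionClass() {
        return myException != null ? myException.getClass().getName() : myExceptionClass;
    }

    String getExceptionText() {
        return myException != null ? myException.getLocalizedMessage() : myExceptionText;
    }
}
```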
The recover
method is provided to restore the object to a consistent
(non-running) state after it is restored from the repository. This could occur
if the object were persisted in a non-terminal state (while the job was running)
and the Domain Manager daemon (containing the Job Manager’s cache) were stopped
and restarted. When the Job Manager starts it reads all the jobs and recovers
them (updates and rewrites them) to a consistent state using this method.
The restoreOutput and saveOutput methods are called by the containing
MxJob
object when it is restored from or saved to the repository, to
operate on the associated MxTargetOutput
object. The
removeOutput
method is called when the containing Job is removed.
The getValue
and fromValue
methods are used to persist
the object in integer form.
MxTargetStatus
objects (via the myTargetStatusesBy
attributes). It also has a reference to the Runnable Task and the Source Task
associated with the Job. The Job may have a longer lifetime than the Runnable
Task (that is why the constructor saves many attributes from the Runnable Task),
but the Source Task will always exist while the Job does. When the Source Task
is removed, any Jobs associated with it are removed (MxJobManager
removeJobsForTask method).
The constructor for this class sets the target count for the Job, which is used
in determining the “percent complete” value (getPercentComplete
,
myCalculatedPercentComplete
) and in determining the state of the
Job based on the state of the added Target Statuses. For tools that run on the
CMS (Application Launch and MSA), this count does not correspond to the actual
number of targets from the user's point of view, but it does describe the number
of Target Status objects expected. Most other attributes initialized in the
constructor were added because they were needed by the UI, and are not used
within the class. The Job lifetime is also determined at this point, which in
turn helps determine whether the Job is written to the repository
(isPersisted
method).
The use of tables of MxTargetStatus
objects indexed by hostname and
node ID predates the addition of object IDs to Gryphon. One or the other of
these is probably not needed anymore.
The Job object also may contain an exception if the Job did not complete
successfully. See the discussion of the exception attributes in
MxTargetStatus
.
The MxJob
object is responsible for maintaining its state to
reflect the state of the Job considering the MxTargetStatus
objects
that have been added from completed targets. The methods involved in this are the
addTargetStatus
method (called by MxDTFJobImpl
and
MxAutomationJobImpl
when reporting a new or changed Target Status),
the recover method (called by MxJobManager
when reading Jobs from
the repository on startup), and the updateJobTargetState
method
called by the other two methods.
Note that since the status for a target may be added more than once (this
currently happens only for Automation jobs), the addTargetStatus
method must be able to handle this.
Despite their similar names, the addTargetStatus
and
updateStatus
methods have very different effects. The
addTargetStatus
method adds a MxTargetStatus
object
for a new target system associated with the job, or updates the status for
an existing target system. The method indirectly affects the state of the Job
and its "percent complete" (unless a specific value for this attribute has been
set) but does not change any of its other attributes. The updateStatus
method affects only the Job itself, not its Target Statuses: it sets the Job state, start and
end times, percent complete, and exception attributes. It is used by the DTF,
whose MxDTFJobImpl
class maintains the consistency of the Job state
and the target states. The corresponding class for Automation Tools,
MxAutomationJobImpl
does not attempt to maintain this consistency,
so for Automation tasks this is maintained through addTargetStatus
.
The recover
method is called after reading a Job and its associated
target statuses from the repository when the Job manager starts. Since the Job
may have been running when the Domain Manager was last terminated, the job may
be in an inconsistent state. This method makes the Job state consistent with the
Target Statuses. Note that the Job Manager initializeCache
method
must update the Job in the repository after calling this method.
The “percent complete” mechanism for Jobs is intended to allow some Jobs
(currently only Automation jobs) to set the “percent complete” value
(setPercentComplete
) but also to allow other Jobs (currently only
DTF jobs) to maintain their own “percent complete” based on the Target Statuses
added if the “percent complete” value has never been explicitly set (see the
updateJobTargetState
method).
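The two modes described above (an explicitly set value versus a value calculated from the Target Statuses) can be sketched as below. The class `SketchJobProgress` is hypothetical, and the calculation (completed targets divided by the target count) is an assumption about how the calculated value might be derived.

```java
// Hypothetical sketch: an explicitly set percentage always wins; otherwise
// the value is derived from completed targets (an assumed formula).
class SketchJobProgress {
    private final int targetCount;
    private int completedTargets;
    private Integer explicitPercent; // null until setPercentComplete is called

    SketchJobProgress(int targetCount) {
        this.targetCount = targetCount;
    }

    // Used by jobs that report their own progress.
    void setPercentComplete(int percent) {
        explicitPercent = Math.max(0, Math.min(100, percent));
    }

    // Called when a target reaches a terminal state.
    void targetCompleted() {
        if (completedTargets < targetCount) {
            completedTargets++;
        }
    }

    int getPercentComplete() {
        if (explicitPercent != null) {
            return explicitPercent;
        }
        return targetCount == 0 ? 0 : (100 * completedTargets) / targetCount;
    }
}
```

Once a value has been set explicitly, later target completions no longer change the reported percentage in this sketch.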
The MxJob
class, like the classes it contains, has methods to create,
restore from and remove the files associated with the Job output
(saveOutput
, restoreAllOutput
, restoreOutput
and removeFiles
). Since the Job object knows part of the path
(see the getDirName
method) to the directory where the output
files are stored, all requests related to the output file must go through the
Job object. It is passed the top-level directory, appends to it the path for
this specific job and passes the resulting directory to the Target Status
object(s). The Job object manages creating and removing the directories it knows
about.
MxDTFJobImpl
This class provides the interface between the Domain Manager and the DTF to
manage the execution of a particular DTF Job. It is created with a reference to
the DTF which it uses to forward requests to execute, cancel or kill the Job.
While the Job is active this object maintains a thread that receives remote events
from the DTF containing updates to the status of the Job or its Targets
(processEventPayload
method). When the Job completes the internal
thread is terminated, since no more updates will be received. The events are
also forwarded to the Job Manager. The number of threads used in the Domain Manager
for this purpose is limited by the mx.properties
configuration variable
MX_DTF_MAX_NUM_RUNNING_JOBS
, which by default is 10.
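Reading such a limit with its default might look like the sketch below. The key name and the default of 10 come from the text above; the class `DtfThreadLimit` and the choice to fall back to the default on a missing, non-numeric, or non-positive value are assumptions.

```java
import java.util.Properties;

// Hypothetical reader for the documented limit. Falling back to the default
// on invalid input is an assumption, not confirmed behaviour.
final class DtfThreadLimit {
    static final String KEY = "MX_DTF_MAX_NUM_RUNNING_JOBS";
    static final int DEFAULT_LIMIT = 10;

    private DtfThreadLimit() {
    }

    static int fromProperties(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_LIMIT;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_LIMIT;
        } catch (NumberFormatException e) {
            return DEFAULT_LIMIT;
        }
    }
}
```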
MxAutomationJobImpl
Since automation jobs are executed within the Domain Manager, the operation of
this class is much simpler than that of MxDTFJobImpl. No events are
involved in updating this class, so this class must generate events to send to
the Job Manager (the updateJobStatus
and
updateTargetStatus
methods).
MxJobManager
The states a Job goes through during its lifecycle are described in the diagram
below. The labels on the lines are the methods involved in changing the state
of the Job. The note text boxes describe some of the Job Manager attributes
affected.
These states are not explicitly encoded in the Job object, but are useful in
understanding what the Job Manager is doing. All Jobs go through the Created,
Updated, Saved and Deleted states (although not all types of
Jobs actually get written to the repository in the Saved state), but use of the
remaining states depends on the Job lifetime.
Jobs with a lifetime of NEXT_JOB_COMPLETION
or
UNCACHE_OR_NTH_JOB_COMPLETION
or
AGED_OUT_OR_NTH_JOB_COMPLETION
go through the Superceded
state when a sufficient number of new Jobs for the same task complete. Jobs
with a lifetime of AGED_OUT_OR_NTH_JOB_COMPLETION
go through the
Aged state if the Job has not been Superceded or Deleted
when a sufficient time has passed after the Job completes. A Job of any lifetime
can enter the Uncached state, but only Jobs for which the
isPersisted
method returns true
are actually saved and
can return to the Saved state via the call to the getJob
method. Other Jobs, which are not persisted, enter the Uncached state
and disappear. The states in this diagram are generally descriptive of what is
going on, but not completely accurate in the usual state diagram sense.
The Tables Updated column lists the Job Manager global tables that are
updated on entry to the state. The To State column specifies a possible
new state entered when the method in the Method column is called.
State       Tables Updated                        Method                          To State
Start       --                                    addJob                          Created
Created     ourJobIDs, ourCache, ourTaskToJobMap  updateTargetStatusInRepository  Updated
Updated     --                                    updateTargetStatusInRepository  Updated
Updated     --                                    processCompletedJobs            Completed
Completed   ourDateToJobMap                       updateJobInRepository           Saved
Saved       --                                    testCanUnload                   Uncached
Uncached    --                                    getJob                          Saved
Saved       --                                    removeOldJobs                   Aged
Aged        --                                    deleteJob                       Deleted
Saved       --                                    deleteOldJobsForTask            Superceded
Superceded  --                                    deleteJob                       Deleted
Deleted     ourTaskToJobMap, ourDateToJobMap,     --                              --
            ourJobIDs, ourCache
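Although these states are not encoded in the Job object, the transitions in the table can be written down as a lookup for reference. The sketch below is purely illustrative: the `JobPhase` enum and `JobPhaseTable` class are hypothetical, and the method names that wrap across table cells are read here as `updateTargetStatusInRepository`, `processCompletedJobs`, and `updateJobInRepository`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical encoding of the descriptive states; the real Job object
// does not store these.
enum JobPhase { START, CREATED, UPDATED, COMPLETED, SAVED, UNCACHED, AGED, SUPERCEDED, DELETED }

final class JobPhaseTable {
    private static final Map<String, JobPhase> TRANSITIONS = new HashMap<>();

    private JobPhaseTable() {
    }

    private static void add(JobPhase from, String method, JobPhase to) {
        TRANSITIONS.put(from + "/" + method, to);
    }

    static {
        add(JobPhase.START, "addJob", JobPhase.CREATED);
        add(JobPhase.CREATED, "updateTargetStatusInRepository", JobPhase.UPDATED);
        add(JobPhase.UPDATED, "updateTargetStatusInRepository", JobPhase.UPDATED);
        add(JobPhase.UPDATED, "processCompletedJobs", JobPhase.COMPLETED);
        add(JobPhase.COMPLETED, "updateJobInRepository", JobPhase.SAVED);
        add(JobPhase.SAVED, "testCanUnload", JobPhase.UNCACHED);
        add(JobPhase.UNCACHED, "getJob", JobPhase.SAVED);
        add(JobPhase.SAVED, "removeOldJobs", JobPhase.AGED);
        add(JobPhase.AGED, "deleteJob", JobPhase.DELETED);
        add(JobPhase.SAVED, "deleteOldJobsForTask", JobPhase.SUPERCEDED);
        add(JobPhase.SUPERCEDED, "deleteJob", JobPhase.DELETED);
    }

    // Returns the state entered when the given method is called in the given state.
    static JobPhase next(JobPhase from, String method) {
        JobPhase to = TRANSITIONS.get(from + "/" + method);
        if (to == null) {
            throw new IllegalStateException("No transition from " + from + " via " + method);
        }
        return to;
    }
}
```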
JobEventListener
This internal static class is used to process events from the
MxDTFJobImpl and MxAutomationJobImpl
classes. It places
events on a list and notifies the Job Manager’s internal thread, which runs the
waitForEvent
method.
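The list-plus-notification pattern described here is a standard producer/consumer arrangement. A minimal sketch using Java's built-in monitor methods is shown below; the class `SketchEventQueue` and the `postEvent` method are hypothetical, and only the `waitForEvent` name is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical queue: producers append events and notify; the internal
// thread blocks in waitForEvent until something is available.
class SketchEventQueue<E> {
    private final List<E> pending = new ArrayList<>();

    // Producer side: called when a job implementation reports an update.
    synchronized void postEvent(E event) {
        pending.add(event);
        notifyAll();
    }

    // Consumer side: blocks until at least one event is queued, then
    // returns and clears the whole batch. Returns an empty list if interrupted.
    synchronized List<E> waitForEvent() {
        try {
            while (pending.isEmpty()) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        }
        List<E> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

The `while` loop around `wait()` guards against spurious wakeups, which the Java documentation for `Object.wait` says callers must tolerate.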
CachedJobLoader
This class is required by the ItemCacher
class used by the Job
Manager to implement the Job cache. An instance of this class is passed to
ItemCacher
when the cache is created in the
initializeCache
method.
The get
method of this class is responsible for retrieving an
object (a Job) to be placed into the cache after it has been uncached. It just
reads it from the repository using the Job ID (the key of the cache) passed in,
and reads the TaskID from the Task Manager to get a copy of the TaskID with both
the name and the GUID. Only the GUID is stored in the repository, but the task
name is needed for the Job Manager’s tables. After reading the Job from the
repository, it is wrapped in a CompletedJobImpl
object. The purpose
of this class is to emulate an instance of MxDTFJobImpl
or
MxAutomationJobImpl
so that a completed job that is read from the
repository looks the same to the rest of the Job Manager as a job that has
just completed.
When a MxJob
is read from the repository along with the associated
MxTargetStatus
objects, the MxTargetOutput
objects
(if any) are not created, so the output files for the Job are not read. When
the target status is retrieved from the client interface, the
lookupTargetStatus
method calls the restoreOutput
method of MxJob
to read the files into Java strings so that it
will be available on the client side. This is because the output from a Job
may be quite large (megabytes) and it is best to minimize the time the Job
output is represented as strings in Java.
JobCacheCallback
This is another class required by the ItemCacher
class. An instance
of this class is passed to the ItemCacher
constructor created in
the initializeCache
method.
The testCanUnload
method is used by ItemCacher
to
determine if it can uncache an object because of either its age relative to
other items in the cache, or because of the time it has been in the cache. The
only thing preventing a Job from being unloaded is if it has not yet completed
and been written to the repository.
The unloading
method is used by the ItemCacher to perform the tasks
needed to save an object when it is uncached. Since Jobs are saved when they are
created (addJob
method) and updated when they complete
(updateJobInRepository
method) and are not allowed to be uncached
until they have been written (testCanUnload
method) nothing has to
be done at this time.
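The division of labour between the two callback methods can be sketched as follows. The actual ItemCacher callback interface is not shown in this document, so `SketchCacheCallback`, `SketchCachedJob`, and the method signatures below are assumptions made for illustration.

```java
// Hypothetical callback interface; the real ItemCacher signatures may differ.
interface SketchCacheCallback<V> {
    boolean testCanUnload(V item);

    void unloading(V item);
}

// Minimal stand-in for a cached Job.
class SketchCachedJob {
    boolean completed;
    boolean savedInRepository;

    SketchCachedJob(boolean completed, boolean savedInRepository) {
        this.completed = completed;
        this.savedInRepository = savedInRepository;
    }
}

class SketchJobCacheCallback implements SketchCacheCallback<SketchCachedJob> {
    // A Job may leave the cache only once it has completed and been written.
    @Override
    public boolean testCanUnload(SketchCachedJob job) {
        return job.completed && job.savedInRepository;
    }

    // Nothing to do: the Job was already written when it completed.
    @Override
    public void unloading(SketchCachedJob job) {
    }
}
```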
Key Attributes
I will point out the more important attributes of the Job Manager.
Even though the Job Manager is called a “manager” it does not fit the pattern of
the tool manager, tool box manager, user manager etc., in that it does not
derive from the MxObjectManager
class and defer most or all
repository operations to it. The Job Manager has its own reference to the
Persistent Data Manager (ourRepository
) and explicitly reads and
writes Jobs and Target Statuses to it. Herein lies a bug that needs to be
fixed for performance. The PDM can handle “secondary objects” which are
attributes of an object implementing MxPersistentObjectIfc
that
themselves implement MxPersistentObjectIfc
(such as the
myTargetStatusArray
attribute of MxJob
, which contains
an array of such objects (MxTargetStatus
)). When the primary object
(MxJob
) is written to the repository, the associated
MxTargetStatus
objects are written as well. The problem is that
the Job Manager receives updates to the MxTargetStatus
objects
independently of updates to the MxJob object, and must update the secondary
objects in the repository independently of the MxJob
object. The
result is some unnecessary updates. For example, if a Job runs on two targets,
the status for each target will get updated
(updateTargetStatusInRepository
method) as it is received, and when
the Job itself completes it will be written to the repository
(updateJobInRepository
method) which will also write the two
Target Status objects again, even though they have not changed. The fix seems
to be to remove myTargetStatusArray
as a persistent attribute of
MxJob
(PERSISTENT_ATTRIBUTES) and rely on the Job Manager to keep
the Target Status objects updated. This problem was discovered late in the
release cycle so it was not fixed because of the limited impact of the defect.
Other key attributes include ourCache, which has references to the Jobs kept in
the cache (actually MxJobIfc references), and several other tables that must be
kept consistent with the cache:
- ourJobIDs, used to enumerate all the JobIDs being stored and to determine
  if a JobID exists without loading it into the cache.
- ourDateToJobMap, containing only completed Jobs by completion date and by JobID.
- ourTaskToJobMap, containing only jobs associated with a scheduled task (not
  “run now” tasks). It is used to enforce limits on how many Jobs to keep around
  at once for a scheduled task.
- ourJobLock, the lock used to protect all these tables.
Initialization
The facility for running the Job Manager without the use of the repository dates
from the early days of Gryphon. It is probably pretty useless now, because of how
intertwined the database is with every aspect of the system (e.g. GUIDs). It
would simplify things (initializeCache
) if the associated code were
removed (isDebugRepositoryDisabled
and
initializeRepositoryFlag
methods).
The initializeCache
method reads every Job in the repository
(but does not load them into the cache) to initialize its secondary tables
listed in the bulleted list above, and to recover any Jobs (and their Target
Statuses) that were running to a consistent state. It updates the Job and
Target status in the repository if any changes were made during the recovery.
Because this recovery could result in several Jobs appearing to complete at the
same time, it must handle duplicate end times for Jobs in ourDateToJobMap.
addJob
This method enters a newly created Job into the Job Manager's tables and cache. A
Job is only added to the Task table if it is based on a scheduled Task. Task
names for “run now” Tasks are created dynamically and never re-used, so there
will never be another Job for that task. Some jobs should never be visible to
users listing the running Jobs, but need to be maintained by the Job Manager and
their status returned when specifically requested. Currently only Jobs for Web
Launch tasks without status URLs fall into this category: these jobs have the
UNCACHE_UNLISTED
lifetime specified. Note that the Job is added to
the repository at this time, but not updated until all its targets complete.
deleteJob
This method only allows Jobs which have completed and been written to the
repository to be deleted.
getJob
This method contains debug code that verifies a Job ID is not in any of the
other tables (and removes it if found) if it is not in the cache. It was added
after problems with inconsistencies were detected and fixed. I have never seen
a debug log where this code was executed.
Note that after getting the Job from the cache, which possibly involved reading
it from the repository, the MxJob restoreAllOutput
method is called
to fill in the contents of the Target output if needed.
getJobIfc
Unlike other public methods in this class, this method is not called from the
client interface. It is used for automation jobs to get a reference to an object
that can be used to update the Job and Target status
(MxAutomationJobImpl
) while passing events to the Job Manager about
the updates so it can take action as needed.
listJobs
This method is available from the client interface, but is not currently used.
It is left over from the old days when few Jobs existed at any time in the
system. With Nimbus Jobs being retained for a month or more, there could be
quite a few jobs in the repository but not in the cache. Calling this method
loads all those jobs into the cache, possibly causing quite a performance hit.
Maybe this should be removed.
getTargetStatus
This is one of the methods that relies on the getTargetStatus
method of MxJob
returning a clone instead of the actual object.
This allows this method to set the target output of the object being returned
to null if needed without affecting the copy in the cache. The problem is, the
cloning is always performed, instead of only when it is actually needed, causing
(according to Geoff) a significant performance impact. Mike says that the cloning
should only be performed when needed, and should be done at the Controller level, not
the manager. This issue came up during the code review, but I decided it was too
major to fix at that time.
processEventPayload
This method is run by the Job Manager internal thread to handle events
containing Jobs or Target Statuses as payloads. In general (especially from
automation tasks) we can get many updates from a Job or Target Status before
it is actually complete. The payload is only queued for further processing
(ourTargetUpdateList
or ourJobObjectList
) if it has
reached a terminal state.
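The filtering step can be sketched as below. The state names, the class `SketchPayloadFilter`, and the use of a plain Job ID as the payload are assumptions; the real method works with Job and Target Status objects and two separate lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter: intermediate updates are dropped, terminal ones queued.
class SketchPayloadFilter {
    // Illustrative states only; the real status values are not listed here.
    enum State {
        PENDING, RUNNING, COMPLETED, FAILED, CANCELED;

        boolean isTerminal() {
            return this == COMPLETED || this == FAILED || this == CANCELED;
        }
    }

    private final List<String> completedJobIds = new ArrayList<>();

    // Returns true if the payload was queued for further processing.
    boolean processEventPayload(String jobId, State state) {
        if (!state.isTerminal()) {
            return false;
        }
        completedJobIds.add(jobId);
        return true;
    }

    // Hands the queued IDs to the caller and empties the queue.
    List<String> drainCompleted() {
        List<String> batch = new ArrayList<>(completedJobIds);
        completedJobIds.clear();
        return batch;
    }
}
```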
The waitForEvent
method, which calls this method, also calls
processCompletedJobs
or processCompletedTargets
if any
were queued by this method.
processCompletedJobs
This method is run by the Job Manager internal thread to handle an event
indicating a Job has completed. It fudges the Job completion time to prevent
duplicates in the table indexed by Completion date. It writes the Job (and its
Target Statuses) to the repository and adds it to ourDateToJobMap
based on the completion time. It deletes any older Jobs for the same task that
need to be removed and marks the Job as being saved in the repository so that
it can be deleted or uncached if needed.
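How the completion time is adjusted is not spelled out above. One simple way to keep keys unique in a map ordered by completion time is to nudge a colliding timestamp forward by a millisecond until it is free, as in this sketch. The class `SketchDateToJobMap` is hypothetical and the one-millisecond increment is an assumption.

```java
import java.util.TreeMap;

// Hypothetical stand-in for a map keyed by completion time. The collision
// strategy (add 1 ms until the key is unused) is an assumption.
class SketchDateToJobMap {
    private final TreeMap<Long, String> byEndTime = new TreeMap<>();

    // Stores the Job and returns the end time actually used as the key.
    long put(long endTimeMillis, String jobId) {
        long key = endTimeMillis;
        while (byEndTime.containsKey(key)) {
            key++;
        }
        byEndTime.put(key, jobId);
        return key;
    }

    int size() {
        return byEndTime.size();
    }

    // The Job with the earliest completion time, or null if the map is empty.
    String oldestJobId() {
        return byEndTime.isEmpty() ? null : byEndTime.firstEntry().getValue();
    }
}
```

The caller would then store the returned key back on the Job so that the Job's end time and its map key stay in agreement.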
processCompletedTargets
This method is also run by the Job Manager internal thread to handle events
indicating a Target has completed. There could be multiple targets for the same
job being processed at once, so to improve performance by decreasing the number
of calls to the repository manager, all Targets for the same Job are written in
one call to updateTargetStatusInRepository
.
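The batching idea can be illustrated by grouping queued target updates by Job ID, so that each Job needs a single repository call. The types and names in this sketch (`SketchTargetBatcher`, `TargetUpdate`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grouping step; one map entry corresponds to one repository call.
class SketchTargetBatcher {
    static final class TargetUpdate {
        final String jobId;
        final String hostname;

        TargetUpdate(String jobId, String hostname) {
            this.jobId = jobId;
            this.hostname = hostname;
        }
    }

    // Groups updates by Job ID, preserving the order in which Jobs first appear.
    static Map<String, List<TargetUpdate>> groupByJob(List<TargetUpdate> updates) {
        Map<String, List<TargetUpdate>> byJob = new LinkedHashMap<>();
        for (TargetUpdate update : updates) {
            byJob.computeIfAbsent(update.jobId, k -> new ArrayList<>()).add(update);
        }
        return byJob;
    }
}
```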
deleteOldJobsForTask
This method runs in the internal thread to remove Jobs that should be removed
because enough newer Jobs have completed for the same task. This method is the
user of ourTaskToJobMap
to find other Jobs for the same task. It also uses
ourDateToJobMap
to get the Jobs for the task ordered by completion date.
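A simplified version of this selection (keep the most recently completed Jobs for a task, delete the rest) might look like the following. The class names and the single `keep` parameter are assumptions; the real limits depend on the Job lifetime.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical retention rule: keep the `keep` newest completed Jobs of a task.
class SketchJobRetention {
    static final class CompletedJob {
        final String jobId;
        final long endTime;

        CompletedJob(String jobId, long endTime) {
            this.jobId = jobId;
            this.endTime = endTime;
        }
    }

    // Returns the Jobs that should be deleted, oldest last.
    static List<CompletedJob> jobsToDelete(List<CompletedJob> jobsForTask, int keep) {
        List<CompletedJob> sorted = new ArrayList<>(jobsForTask);
        // Newest first, so everything past index `keep` is superseded.
        sorted.sort(Comparator.comparingLong((CompletedJob j) -> j.endTime).reversed());
        int firstToDelete = Math.min(Math.max(0, keep), sorted.size());
        return new ArrayList<>(sorted.subList(firstToDelete, sorted.size()));
    }
}
```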
updateJobInRepository / updateTargetStatusInRepository / addJobToRepository
These methods hide from the rest of the Job Manager the existence of Jobs that
do not actually get written to the repository.
removeOldJobs
This method is an important user of ourDateToJobMap
. It delegates the identification of the jobs for each task to the
filterJobsPerTask
method, and then it
removes them. This method has package visibility so it can be called from the
MxJobCleanupConstruction
, which is run every hour.