Contents of /nl.nikhef.pdp.dynsched/trunk/lrmsinfo-generic.txt

Revision 2009
Fri Oct 8 12:52:21 2010 UTC by templon
added header lines

$Id$
Source: $URL$

The output of the LRMS-specific part needs to contain a snapshot
of the state of the LRMS. This state should be as faithful as
possible; 'massaging' of the state should be left to higher-level
programs such as the ERT system (which handles mapping of unix
group names to VO names). Placing the massaging at a higher
level and keeping the LRMS-specific part pristine has two main
benefits:

1) the massaging is uniform across LRMS types, so one can at least
   hope that there won't be some LRMS bias in the estimates

2) if the LRMS tool reports the real information, it might well be
   useful for some purpose besides predicting ERTs.

==========================================================

The required format of this file is described below.

nactive 240
nfree 191
now 1119073982
schedCycle 120
{'queue': 'atlas', 'start': 1119073982.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119073781.0, 'jobid': '612049.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}
{'queue': 'atlas', 'start': 1119060910.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119060759.0, 'jobid': '612039.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119136200.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119135972.0, 'jobid': '612176.tbn20.nikhef.nl'}
{'queue': 'dzero', 'start': 1119268211.0, 'state': 'running', 'group': 'dzero', 'user': 'dzero004', 'maxwalltime': 345600.0, 'qtime': 1119268047.0, 'jobid': '612241.tbn20.nikhef.nl'}

===========================

The last structure, between the "{}" characters, is repeated, one
line per job, for each job currently either executing or waiting in
the queue. Here are some explanations for the semantics of the values:

nactive is the number of job slots that are actually capable of
running jobs at the snapshot time (let's call the snapshot time t0 for
brevity). By 'actually capable of running jobs' I mean the maximum
number of jobs that could be running on the system at t0. So nactive
counts all job slots, empty or occupied, but does not count the job
slots on CPUs that are 'down' or 'offline'. It's not the theoretical
maximum number of job slots in your farm (unless ALL your WNs are
working); it's the number that are 'up'.

nfree is the number of these active job slots that at t0 do not have
an assigned job. They can potentially accept a new job at t0 (or
at least at the start of the next scheduling cycle).

Note that these numbers don't have anything to do with VOs (unless
each node happens to be exclusively assigned to a single VO). They are
aggregates of all job slots that are being controlled by a single
LRMS.

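As a small illustration of how the two slot counts relate, here is a
sketch in Python. The function name slot_usage is made up for this
example and is not part of any dynsched code; it only assumes that
every active slot is either free or occupied, as described above.

```python
def slot_usage(nactive, nfree):
    """Return (occupied, fraction_occupied) for one LRMS snapshot.

    Hypothetical helper: assumes occupied slots = nactive - nfree.
    """
    if nfree < 0 or nfree > nactive:
        raise ValueError("nfree must lie between 0 and nactive")
    occupied = nactive - nfree
    # Guard against an LRMS with no active slots at all.
    fraction = occupied / nactive if nactive else 0.0
    return occupied, fraction


# Values taken from the example snapshot above.
occupied, fraction = slot_usage(240, 191)
```

With the example values, 49 of the 240 active slots have a job
assigned.
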
'now' is a timestamp in seconds of when the queue was inspected. The
only constraint here is that 'now' has to be in the same units, and
have the same zero reference, as all the times in the per-job lines
(like 'qtime' or 'start'). In the PBS version provided, 'now'
is in local time seconds, meaning seconds since midnight
Jan 1st 1970 local time. Again, as long as the units are seconds
and all times have the same reference point, the actual reference
point does not matter.

'schedCycle' is the cycle time of your batch scheduler, i.e. how often
it starts a new scheduling pass. As of this writing at NIKHEF it is
120 seconds, meaning a new scheduling attempt is started every 120
seconds.

Each line thereafter reports the info for a single job.

{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', \
 'cpucount': 1, 'group': 'atlsgm', 'user': 'atlsm003', \
 'maxwalltime': 259200.0, 'qtime': 1119060774.0, \
 'jobid': '612043.tbn20.nikhef.nl'}

This has a structure { 'key1' : 'attr1', 'key2' : 'attr2' } and
is written in this particular format because it is the string
representation of a python 'dictionary' (same as perl 'hash'),
making the input parsing for the other part very easy. The
order of the various keys is irrelevant; you could write
{ 'key2' : 'attr2', 'key1' : 'attr1' } if you wanted.

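To show why this format makes parsing easy, here is a minimal sketch
that reads one job line back into a dictionary. It uses
ast.literal_eval from the Python standard library, which only accepts
literals; this is a choice made for this example and not necessarily
how the existing consumer reads the file. The name parse_job_line is
hypothetical.

```python
import ast


def parse_job_line(line):
    """Turn one per-job line into a Python dict (hypothetical helper)."""
    job = ast.literal_eval(line.strip())
    if not isinstance(job, dict):
        raise ValueError("job line is not a dictionary: %r" % line)
    return job


# Same job written with two different key orders.
a = parse_job_line("{'queue': 'qlong', 'state': 'running', "
                   "'jobid': '612043.tbn20.nikhef.nl'}")
b = parse_job_line("{'jobid': '612043.tbn20.nikhef.nl', "
                   "'state': 'running', 'queue': 'qlong'}")
```

Both calls give equal dictionaries, which is the sense in which the
key order is irrelevant.
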
Not all the fields are required, but they should be consistent.
All jobs should have a 'qtime' since they must have entered the
queue at some point. If a job is in state 'running' it had better
have a 'start' time; if it is 'queued' then 'start' should be
absent.

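The consistency rules just stated could be checked along these lines.
This is only a sketch: check_job is a made-up name, and it tests
nothing beyond the three rules in the paragraph above.

```python
def check_job(job):
    """Return a list of consistency problems for one job dict.

    Hypothetical helper; an empty list means no problem was found.
    """
    problems = []
    if 'qtime' not in job:
        problems.append("missing 'qtime'")
    state = job.get('state')
    if state == 'running' and 'start' not in job:
        problems.append("running job without 'start'")
    if state == 'queued' and 'start' in job:
        problems.append("queued job with 'start'")
    return problems


ok = check_job({'state': 'running', 'qtime': 1119060774.0,
                'start': 1119060911.0})
bad = check_job({'state': 'queued', 'start': 1119060911.0})
```

Here ok is empty, while bad reports both the missing 'qtime' and the
unexpected 'start'.
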
Here is a bit of explanation of the various fields:

In the example above, the local PBS jobid is 612043.tbn20.nikhef.nl;
this just has to be a unique string (no two jobs should have the same
string).

qtime is the timestamp when the job entered the queue, with the same
ref point as 'now'. now - qtime will tell you how long it has been
since the job entered the queue (was submitted). maxwalltime is the
maximum amount of real time (in seconds) the execution of a job in
this queue may take. 'user' and 'group' are the pool account ids under
which the job runs.

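A short worked example of the time arithmetic, using 'now' and the
qlong job from the example output. The function names are invented
for illustration. Note that walltime_left rests on an assumption not
stated in this file: that a running job may continue until
start + maxwalltime.

```python
def queue_age(now, job):
    """Seconds since the job entered the queue: now - qtime."""
    return now - job['qtime']


def walltime_left(now, job):
    """Seconds of wall time a started job may still use.

    Assumption: the job may run until start + maxwalltime.
    Returns None when either field is absent.
    """
    if 'start' not in job or 'maxwalltime' not in job:
        return None
    return job['start'] + job['maxwalltime'] - now


now = 1119073982
job = {'queue': 'qlong', 'start': 1119060911.0, 'state': 'running',
       'maxwalltime': 259200.0, 'qtime': 1119060774.0,
       'jobid': '612043.tbn20.nikhef.nl'}

age = queue_age(now, job)        # 13208.0 seconds since submission
left = walltime_left(now, job)   # 246129.0 seconds before the limit
```
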
'cpucount' is how many CPUs are assigned to this job.

'state' can be one of 'queued', 'running', 'pending', or 'done'.
'pending' means it is in the queue but has been placed on 'hold'.

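A consumer might tally jobs by state roughly as follows. This is a
sketch only; count_states and KNOWN_STATES are names made up here, and
rejecting a job with a missing or unknown state is an assumption of
this example, not a rule of the format.

```python
from collections import Counter

# The four state values listed above.
KNOWN_STATES = {'queued', 'running', 'pending', 'done'}


def count_states(jobs):
    """Count jobs per state (hypothetical helper).

    Assumption: a missing or unknown 'state' is treated as an error.
    """
    counts = Counter()
    for job in jobs:
        state = job.get('state')
        if state not in KNOWN_STATES:
            raise ValueError("unexpected state %r for job %r"
                             % (state, job.get('jobid')))
        counts[state] += 1
    return counts


# Made-up jobs, reduced to the fields this example needs.
counts = count_states([
    {'jobid': 'a', 'state': 'running'},
    {'jobid': 'b', 'state': 'queued'},
    {'jobid': 'c', 'state': 'running'},
])
```
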
'start' is the time stamp for when the job actually started to
execute. Again, it needs to be measured in the same coords as 'now'.
Finally 'queue' gives the name of the queue in which this job is
running (like 'qlong').
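
Putting the pieces together, a whole snapshot could be read with
something like the sketch below. It assumes one job dictionary per
physical line, "name value" header lines with integer values as in
the example, and blank lines ignored; parse_snapshot is a
hypothetical name, not the parser used by the ERT system.

```python
import ast
from collections import Counter


def parse_snapshot(text):
    """Split snapshot text into (header dict, list of job dicts)."""
    header = {}
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith('{'):
            # Per-job line: string form of a Python dictionary.
            jobs.append(ast.literal_eval(line))
        else:
            # Header line such as "nactive 240" (integer assumed).
            key, value = line.split(None, 1)
            header[key] = int(value)
    return header, jobs


sample = """nactive 240
nfree 191
now 1119073982
schedCycle 120
{'queue': 'atlas', 'start': 1119073982.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119073781.0, 'jobid': '612049.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}
"""

header, jobs = parse_snapshot(sample)
# Tally jobs per queue name.
per_queue = Counter(job['queue'] for job in jobs)
```
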

