This file contains release notes and a change history for the
lcg-info-dynamic-scheduler information provider.

Release 2.3.4
Make it possible to build RPM via Makefile (ETICS changed).

Release 2.3.2 and 2.3.3
ETICS compatibility

Release 2.3.1
Fixes for savannah bugs 25031, 25867, 27171, 27172, 38195.

Release 2.3.0
Change dynamic scheduler so that it prints out JobSlots and FreeCPU info in CE views.

Release 2.2.2
lcg-info-dynamic-scheduler no longer prints ACBRs (this may fix a bug in 2.2.1; unconfirmed)

Release 2.2.1
The dynamic scheduler was changed to cease printing the GlueCEAccessControlBaseRule.
2.2.0 did not work since GIP considers all changes to multivalued attributes (like
ACBRs) to be significant.

Release 2.2.0
The dynamic scheduler was changed in order to deal with the DENY
tags being used in the short-term solution (June 2007) for job priorities.
The dynamic scheduler does the following with ACBRs placed on VOViews:
- it discards any ACBR that does not begin with either "VO:" or "VOMS:"
- if there is more than one ACBR left in the list, it only uses the last one
  in the list, and prints a warning message to standard error and to syslog
- it allows multiple DENY tags
- it does not check the consistency between the ACBR and DENY tags in a view.
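
As an illustration only, the ACBR rules listed above could be sketched in
Python as follows. The function name select_acbr, its arguments, and the
assumption that DENY tags are strings starting with "DENY:" are hypothetical
and are not taken from the lcg-info-dynamic-scheduler source. The real program
also logs the warning to syslog, which this sketch leaves out.

```python
import sys


def select_acbr(acbrs, warn=lambda msg: sys.stderr.write(msg + "\n")):
    """Return (chosen_acbr, deny_tags) for the rules of a single VOView.

    Hypothetical helper: names and the "DENY:" prefix are assumptions.
    """
    # Any number of DENY tags is allowed; they are passed through unchecked.
    deny_tags = [a for a in acbrs if a.startswith("DENY:")]
    # Discard every ACBR that does not begin with "VO:" or "VOMS:".
    candidates = [a for a in acbrs if a.startswith(("VO:", "VOMS:"))]
    if len(candidates) > 1:
        # More than one ACBR left: warn and keep only the last one.
        warn("multiple ACBRs %r in view; using the last one" % (candidates,))
    chosen = candidates[-1] if candidates else None
    return chosen, deny_tags
```

For example, select_acbr(["VO:atlas", "VOMS:/atlas/Role=lcgadmin"]) would
return the VOMS rule and emit one warning.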

Release 2.1.0

lrms.py was changed in order to support caching of search results.
Most of the time spent in lcg-info-dynamic-scheduler was due to
queries like "find all jobs from group 'lhcb', in state 'waiting',
for queue 'qlong'". Queries like this are now cached for future use,
and can also be supplied *before* use, as they now are for
lcg-info-dynamic-scheduler. That program now generates slices
of the job list for the various queue/group/state combinations
that will be needed while running the program.

There were previously two different 'return a list of matching jobs'
functions, with different interfaces. These now have a unified
interface so that result caching can be supported. This does break
backwards compatibility for lrms.py.
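
The caching idea described above can be sketched roughly as follows. This is
not the lrms.py interface: the class JobList, the methods match and preload,
and the representation of jobs as plain dicts are all assumptions made for
the example. It shows a single matching function whose results are cached,
plus a one-pass step that supplies the queue/group/state slices before they
are asked for.

```python
class JobList:
    """Hypothetical stand-in for the job container in lrms.py."""

    def __init__(self, jobs):
        self.jobs = list(jobs)  # each job is a dict of attributes
        self._cache = {}        # query key -> list of matching jobs

    @staticmethod
    def _key(query):
        # Order-independent, hashable form of a query such as
        # {"group": "lhcb", "state": "waiting", "queue": "qlong"}.
        return tuple(sorted(query.items()))

    def match(self, **query):
        """Return the jobs matching every item in query, caching the result."""
        key = self._key(query)
        if key not in self._cache:
            self._cache[key] = [
                job for job in self.jobs
                if all(job.get(k) == v for k, v in query.items())
            ]
        return self._cache[key]

    def preload(self, fields=("group", "state", "queue")):
        """Build every slice over the given fields in one pass over the jobs."""
        slices = {}
        for job in self.jobs:
            key = tuple(sorted((f, job.get(f)) for f in fields))
            slices.setdefault(key, []).append(job)
        self._cache.update(slices)
```

After preload(), a call such as match(group="lhcb", state="waiting",
queue="qlong") is answered from the cache without scanning the job list
again.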

Release 2.0.0

Rather massive changes in parsing logic, to be able to handle VOViews
with VOMS FQANs.

VOMS FQANs are handled both by the input routines, which know what
to do with them when reading the static LDIF file, and by
the group mapping logic, which knows how to associate FQANs
with unix groups. To this end, the vomap construct in the
lcg-info-dynamic-scheduler config file now supports lines like

  lhcbsgm:/VO=lhcb/GROUP=/lhcb/ROLE=lcgadmin

in addition to the original lines like

  atlgrid : atlas

which would map group 'atlgrid' to "VO : atlas".
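
A minimal sketch of how the two kinds of vomap lines could be told apart is
given below. The function parse_vomap and the ("VO" | "VOMS", value) tuples
it returns are hypothetical; the sketch assumes only what is stated above,
namely that each line is "group:target" and that an FQAN target starts with
a "/".

```python
def parse_vomap(lines):
    """Map unix group names to ("VO", name) or ("VOMS", fqan) tuples.

    Hypothetical helper; not the parser used by lcg-info-dynamic-scheduler.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        group, sep, target = line.partition(":")
        if not sep:
            raise ValueError("malformed vomap line: %r" % raw)
        group, target = group.strip(), target.strip()
        # Assumption: a target beginning with "/" is a VOMS FQAN,
        # anything else is a plain VO name.
        kind = "VOMS" if target.startswith("/") else "VO"
        mapping[group] = (kind, target)
    return mapping
```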

The parsing of the GlueCEUniqueID and GlueVOView blocks has
also changed rather drastically, so that numbers, dashes, etc.
in queue names and hostnames are no longer a
problem. Instead of parsing the GlueCEUniqueID field to get the
queue name, the program now reads GlueCEName and uses that for
the queue name.

Also, the file vomaxjobs-generic (documentation) was added,
and the rest of the documentation and example files were
substantially updated for the new release.

Otherwise no changes since 1.6.3.

For people using the test suite: the versions of the test output
included in 2.0.0 will cause tests of older versions to fail. This
is unavoidable since the old parsing logic was based on the order
in which blocks appeared in the LDIF file, while the new version
uses python 'dicts' which have an unpredictable order when
iterated. To make the order predictable (for purposes of the test
harness), the keys are sorted before the program starts to print.
The older versions do not sort the output before printing, hence
tests of the old versions with the new files will fail.
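
The sorting step can be pictured with the short sketch below. The function
print_blocks and the nested-dict layout (block dn -> attribute -> value) are
assumptions made for the example, not the program's actual data structures.
The point is only that iterating over sorted keys makes the output
independent of dict ordering.

```python
import sys


def print_blocks(blocks, out=sys.stdout):
    """Write LDIF-like blocks in a reproducible order.

    Hypothetical helper: `blocks` maps a dn string to a dict of attributes.
    """
    for dn in sorted(blocks):              # blocks in sorted dn order
        out.write("dn: %s\n" % dn)
        for attr in sorted(blocks[dn]):    # attributes in sorted name order
            out.write("%s: %s\n" % (attr, blocks[dn][attr]))
        out.write("\n")                    # blank line ends each block
```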

Release 1.6.3

Fix for GGUS bug 10155 -- had to do with YAIM adding unnecessary lines like
  alice:alice
to the [vomap] stanza. The program did not expect to get these lines so it
of course did something rather silly with them, resulting in the behavior
reported in the GGUS bug.

Release 1.6.1

Bug fix for lcg-info-dynamic-scheduler; fix regexp matching
GlueCEUniqueID. The regexp in 1.6.0 missed
- CEs with a "-" character in the hostname
- queue names with underscores, uppercase letters, and numbers

There are examples of each of these classes on the production system,
so this upgrade is critical.
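
To illustrate the class of problem, here is a sketch of a pattern that
accepts the cases listed above. It is not the regexp shipped in 1.6.1. It
assumes that a GlueCEUniqueID has the form
host:port/jobmanager-<lrms>-<queue> (or blah-<lrms>-<queue>), and the names
CE_ID and parse_ce_unique_id are hypothetical.

```python
import re

# Assumed layout: host:port/(jobmanager|blah)-lrms-queue
CE_ID = re.compile(
    r"^(?P<host>[A-Za-z0-9.-]+)"      # hostname, "-" allowed
    r":(?P<port>\d+)"
    r"/(?:jobmanager|blah)"
    r"-(?P<lrms>[A-Za-z0-9]+)"
    r"-(?P<queue>[A-Za-z0-9_.-]+)$"   # underscores, uppercase, digits allowed
)


def parse_ce_unique_id(ce_id):
    """Return (host, queue) for a CE id, or None if it does not match."""
    m = CE_ID.match(ce_id)
    if m is None:
        return None
    return m.group("host"), m.group("queue")
```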

Release 1.6.0

- changes to parsing of static LDIF file to pick up gLite CEs with "blah"
  instead of "jobmanager". Note this is largely untested!!
- added test suite to prevent bug regression
- some changes to build system (three targets increases aggravation)
- some changes to pbsServer classes to assist in debugging.
- some changes to vomaxjobs-maui to assist in debugging/testing;
  also fixed various unreported bugs discovered during testing.
- Change mapping of pbs/torque job states in pbs classes; until now a job
  was either queued (Q) or running (any other state). Now we have:

  From the qstat (torque 2.0.0p4) man page:

  C - Job is completed after having run (mapped to 'done')
  E - Job is exiting after having run. (mapped to 'running')
  H - Job is held. (mapped to 'pending')
  Q - job is queued, eligible to run or routed. (mapped to 'queued')
  R - job is running. (mapped to 'running')
  T - job is being moved to new location. (mapped to 'pending')
  W - job is waiting for its execution time (mapped to 'queued')
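
The state table above translates directly into a lookup. The sketch below
uses hypothetical names (PBS_STATE_MAP, map_pbs_state) and is not copied from
the pbs classes. Raising an error for an unknown state code is a choice made
for this example.

```python
# qstat state code -> state name, exactly as listed in the table above
PBS_STATE_MAP = {
    "C": "done",
    "E": "running",
    "H": "pending",
    "Q": "queued",
    "R": "running",
    "T": "pending",
    "W": "queued",
}


def map_pbs_state(code):
    """Translate a one-letter qstat job state into the mapped state name."""
    try:
        return PBS_STATE_MAP[code]
    except KeyError:
        # Assumption of this sketch: unknown codes are treated as an error.
        raise ValueError("unknown PBS job state: %r" % code)
```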

Release 1.5.2:

pbs package: Fix to vomaxjobs-maui to deal with cases where there is
extra 'warning' output near the top of the command output from diagnose -g.

generic package: fix bug with logging; undefined variable caused fatal program
exit while trying to print warning message.

Release 1.5.1:

fix dependency problems with RPMs.

Release 1.5.0

* add RELEASE (this file) to docs dir in generic package RPM

* Minor change to build system to make tag events in ChangeLogs
  easier to read.

* lcg-info-dynamic-scheduler:

  - It is possible (e.g. by dramatically reducing MAXPROC config in Maui) for
    a VO to have more running jobs in the LRMS than allowed by MAXPROC.
    In this case a negative value was reported for FreeSlots. Fixed.

  - implemented logging to syslog

* vomaxjobs-maui:

  - adapt to handle MAXPROC specifications like MAXPROC=soft,hard.
    The code reports the 'hard' limit, since this is relevant when the
    system is not full, and this is when it's needed. Maui uses the
    soft limit on a full system, but in this case the info provider will
    drop FreeSlots to zero as soon as jobs remain in the queued state
    instead of executing immediately.
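
The two 1.5.0 changes can be sketched together as follows. The function
names are hypothetical, and clamping at zero is an assumption about how the
negative FreeSlots value was fixed. The release note itself only says
"Fixed".

```python
def parse_maxproc(value):
    """Return the hard limit from a MAXPROC value of the form N or soft,hard."""
    parts = value.split(",")
    # With "soft,hard" the hard limit is the last field; with a single
    # number that number is the only (and therefore the last) field.
    return int(parts[-1])


def free_slots(maxproc, running):
    """Free slots for a VO, never negative (assumed form of the fix)."""
    return max(0, maxproc - running)
```

With MAXPROC=50,100 and 30 running jobs this gives
free_slots(parse_maxproc("50,100"), 30) == 70. If MAXPROC is lowered to 10
while 25 jobs are still running, the result is 0 instead of -15.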

Release 1.4.3

* lcg-info-dynamic-scheduler:

  - fix for Savannah bug 14946: overflow of conversion of response time
    values from float (internal) to int (output representation). Now prints the
    magic value of 2146060842 as an upper limit.
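
A minimal sketch of such a cap is shown below. The constant is the value
quoted above. The function name and the exact comparison are assumptions,
not the code from the fix.

```python
MAX_RESPONSE_TIME = 2146060842  # upper limit quoted in the release note


def response_time_for_output(seconds):
    """Convert an internal float response time to the int that is printed."""
    # Anything at or above the limit (including float("inf")) is capped,
    # so the int conversion below can never overflow.
    if seconds >= MAX_RESPONSE_TIME:
        return MAX_RESPONSE_TIME
    return int(seconds)
```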

Release 1.4.2

* pbsServer.py:

  - included Steve Traylen's patch to deal with jobs for which the
    uid/gid printed by 'qstat' is not listed in the running machine's
    pw DB. This can happen when the CE is not the same physical
    machine as the actual LRMS server.


Estimated Response Time Info Providers (v 1.4.1)
------------------------------------------------

This information provider is new in LCG 2.7.0 and is
contained in two RPMs, lcg-info-dynamic-scheduler-generic
and lcg-info-dynamic-scheduler-pbs. Sites using torque/pbs
as an LRMS and Maui as a scheduler are fully supported by
this configuration; those using other schedulers and/or
LRMS systems will need to provide the appropriate back-end
plugins.

For sites meeting the following criteria, the system should
work out of the box with no modifications whatsoever:

  LRMS == torque
  scheduler == maui
  vo names == unix group names of that vo's pool accounts

Documentation on what to do if this is not the case can be
found in the file

  lcg-info-dynamic-scheduler.txt

in the doc directory

  /opt/lcg/share/doc/lcg-info-dynamic-scheduler

There is also documentation in this directory indicating
the requirements on the backend commands you will need to
provide in the case that you are using a different
scheduler or LRMS. Tim Bell at CERN can help people
using LSF.