On 12/02/2013 12:16 PM, Susanne Oberhauser-Hirschoff wrote:
Robert Schweikert <rjschwei(a)suse.com> writes:
2.) The staging approach
I can only speak from experience and thus this might sound a little
lame, sorry. I have seen two implementations of the staging model in
action at companies that produce large software suites. In both cases
I consider the approach a failure.
The problem in both cases is that the number of staging
trees/branches/projects has an ever increasing slope, thus consuming
ever more manpower to manage the ever increasing number of staging
trees, while the original problem of "how do we deal with unknown
adverse interactions between updates" remains unresolved. The
"solution" to this problem taken in one case was to have intermediate
staging trees where "known risky updates" were tested together. Yes,
staging trees upon staging trees. But this only solves the problem
superficially as the target tree will move ahead and thus the staging
tree by definition is always out of date. Unless the target tree is
frozen until a particular staging tree is merged. Anyway it is a maze
that requires potentially a lot of people.
The other problem with the staging model is that the "potentially
risky interactions" knowledge is an implicit set of interactions that
the staging tree managers happen to know. This is not expressed
anywhere and thus makes it difficult for other people to learn. We
have this problem today and from my point of view this will not be
resolved with more staging trees.
The staging model will not catch adverse interactions reliably. The
reason is that by definition the staging tree is always out of date,
unless the target tree is frozen and after one staging tree is
accepted all other staging trees get rebuilt. This is not conducive to
a fast-moving Factory.
What I like in your little umm rant ;) is the notion of *interactions*
that need to be tested.
Not ranting, not even a little bit or with a wink. All I am going to be
able to bring to the table are the experiences I have had in the past
with the "staging model". I will not have time to help with the
implementation and it is unlikely that I will have the time to volunteer
to chaperone a staging branch if the decision should be made to go that
route. Ultimately those willing to do the work and those willing to be
chaperones decide. I will not complain about the decision as I am not
going to be able to do the work.
The staging model has fundamental design problems and one can implement
processes and procedures to alleviate the impact of those fundamental
design issues. Only time will tell whether this is ultimately
sufficient, should we continue to experience the growth we have seen.
What I'd like to see in this whole discussion in addition is 'cadenced
flow' and 'integration decision'.
Cadence becomes a big deal and coordination of stage projects and their
order of merge into factory also becomes a big deal.
When you add these, the resulting staged flow will actually get both bug
fixes and package or subsystem upgrades available fast and in stable
quality, continuously (bold claim). Here's how:
Well we can argue about the fast part ;)
Let's assume there is some stable code base, call it Factory.
The goal is to get updates in there, reliably, regularly, to get it to
the next level of being a stable code base.
For "leaf packages" that is simple: build the package, test it's
functionality, then release it.
Well, I do not think it is that simple. One could argue that Perl is a
leaf package. But we have perl-bootloader, and thus an update to Perl
could break perl-bootloader, which in turn would be a bad thing with
pretty far reaching effects. Further, perl-bootloader does not stand on
its own; it uses Perl modules that should definitely be considered leaf
packages as well.
A similar argument can be made for KIWI, which depends on a lot of leaf
packages, but KIWI is very important for creating our ISO images. Thus,
the line for leaf packages is blurry at best.
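To make the blurry line concrete, here is a minimal sketch; the
dependency data is invented for illustration and is not real Factory
metadata. The point is that "leaf" status is a property of the whole
dependency graph, not of the package itself:

from collections import defaultdict

# package -> packages it depends on (tiny, hand-made illustration)
depends_on = {
    "perl-bootloader": ["perl"],
    "kiwi": ["perl", "libpng"],
    "gimp": ["gtk3", "libpng"],
}

# invert the map: package -> set of packages that require it
required_by = defaultdict(set)
for pkg, deps in depends_on.items():
    for dep in deps:
        required_by[dep].add(pkg)

def is_leaf(pkg):
    # a package is only a true leaf if nothing else requires it
    return not required_by[pkg]

for pkg in ("perl", "libpng", "gimp"):
    print(f"{pkg}: leaf={is_leaf(pkg)}, required by {sorted(required_by[pkg])}")

An update to "perl" in this toy graph can break perl-bootloader and
kiwi, so treating it as a simple leaf update would miss exactly the
interactions discussed above.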
For, for lack of a better word, "intermediate packages" (say libpng), you
start like above, testing the functionality, until from that perspective
it's good to release. But then you also need to get it integrated with
"the rest of the world" (maybe 50-100 packages for libpng, 44 on my
system).
Then there are these "multi-scope packages", like NetworkManager or the
bluetooth stack, which affect several whole integration scopes, desktops
in this case. They have interactions in two stages (first the desktop,
then all other desktops).
There are the "transversal packages", affecting almost everything, like
gcc or glibc.
And then there are packages affecting few other packages that nonetheless
have a lot of interactions that need to be tested (kernel, xorg, ...).
Now suppose there is a cascade of staging projects, which potentially
'release', say, every week or every other week (that's the cadence).
They build a tree structure, something like this:
[ASCII tree diagram, mangled in transit: gcc at the root, with branches
for auto-tools, xorg, libs, leaf packages, KDE, GNOME and the desktops]
The number of nodes from the root (Factory) to a branch corresponds to
the interactions that need to be tested for what goes into that branch.
This gives growing rings of scope for interaction testing and
integration success. Successful builds and automatic tests are
necessary, sometimes even sufficient for interaction test and
integration success. They propagate automatically, to give a
'tentative' next build. That, however, does not affect the 'last known
good' build --- that last known good state remains available, too.
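To make the propagation idea concrete, here is a minimal sketch of such
a staging node; the class and method names are invented for
illustration and do not correspond to any existing tooling:

class StagingNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent           # next ring inward, Factory at the root
        self.last_known_good = None    # the state the inner ring can rely on
        self.tentative = None          # candidate offered by the outer ring

    def submit(self, state):
        # an outer ring (or a packager) offers a new candidate state
        self.tentative = state

    def integrate(self, tests_pass):
        # cadence point: promote the candidate, or keep the last known good
        if self.tentative is not None and tests_pass:
            self.last_known_good = self.tentative
            if self.parent is not None:
                self.parent.submit(self.last_known_good)
        # on failure nothing changes; the last known good stays available
        self.tentative = None

factory = StagingNode("Factory")
desktops = StagingNode("desktops", parent=factory)
gnome = StagingNode("GNOME", parent=desktops)

gnome.submit("GNOME + new NetworkManager")
gnome.integrate(tests_pass=True)      # promoted, offered to the desktops ring
desktops.integrate(tests_pass=False)  # KDE breaks: nothing reaches Factory
print(factory.last_known_good)        # -> None, Factory keeps what it had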
Yes, however, what is being neglected is that there is a fundamental
problem with the cadence. The cadence itself is influenced by the
process, through rebuild times and other snafus that are inevitable.
For illustration purposes let's say the cadence for the autotools staging
is every other Monday, and the desktops get their say every other
Wednesday on the off weeks, i.e. autotools merges in weeks 1, 3, 5 and so
on and the desktops merge in weeks 2, 4, 6, and so on. At the beginning
of week 2 the autotools merge has to be completed in order to give the
desktop staging tree sufficient time to rebuild to meet its merge
window on Wednesday. During this time (Monday of week 2 until end of
Wednesday in week 2) nothing else is allowed to be merged into factory
or the desktop staging tree would have to be rebuilt again. That's all
fine but we have a time problem....
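To put that toy calendar into something executable (purely illustrative
numbers, nothing measured):

def schedule(weeks):
    # autotools merges Monday of odd weeks, desktops merge Wednesday of
    # even weeks; between those two merges Factory is effectively frozen,
    # because any other merge would invalidate the desktop staging build
    frozen_days = 0
    for week in range(1, weeks + 1):
        if week % 2 == 1:
            print(f"week {week}: Mon      autotools staging merges into Factory")
        else:
            print(f"week {week}: Mon-Wed  Factory frozen, desktops rebuilding")
            print(f"week {week}: Wed      desktop staging merges into Factory")
            frozen_days += 3
    print(f"{frozen_days} of {weeks * 5} working days spent frozen")

schedule(weeks=4)

Even in this friendly two-branch example roughly a third of the working
days are spent with Factory frozen for somebody's rebuild.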
We only have a certain number of days available in a year. Let's be
optimistic and say our staging branch chaperones spend 300 days a year
fiddling with the staging branches. This would provide a theoretical
maximum of 300 staging branches, if we can manage to merge one every
day. This however is not possible as the build time for the project
alone dictates that certain changes require build times > 1 day. This
reduces the maximum number of staging branches we can have further. Let's
say we end up with a theoretical maximum of 250 staging branches. Simple
math dictates that with 6k+ packages we will have things in staging
branches that can break independently. Therefore, one developer is stuck
in the same staging tree as another developer who happens to break
something. The "innocent bystander developer", who happened not to
break anything, will have to wait not just the regular cadence, but the
cadence plus the fix time of the unrelated breakage. This is not very
satisfying for the developer that didn't break anything. In addition, if
the unrelated breakage does not get fixed in time for the merge window
of the staging branch, then the given staging branch has to wait for its
next merge opportunity, which may be weeks away.
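The back-of-the-envelope version of that argument, with all numbers
being illustrative assumptions rather than measurements:

working_days_per_year = 300      # optimistic availability of the chaperones
avg_days_per_merge = 1.2         # some merges need more than a day of builds
max_merges_per_year = int(working_days_per_year / avg_days_per_merge)  # 250

packages_in_factory = 6000
packages_per_merge = packages_in_factory / max_merges_per_year
print(f"~{max_merges_per_year} staging merges per year possible")
print(f"=> on average {packages_per_merge:.0f} packages share each merge")

# the innocent-bystander cost: a developer whose package is fine still
# waits for the cadence plus the fix time of an unrelated breakage
cadence_days = 14
fix_time_days = 5
print(f"worst-case wait for an unaffected update: {cadence_days + fix_time_days} days")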
One way to alleviate some of the problems is to have a very long cadence
for "transversal packages". Let's say we only allow tool chain updates
once every blue moon. But what if there is a bug in the tool chain or
some other unknown undesired interaction? Now we must have a fix, and the
tool chain staging tree must get priority and be merged much sooner than
its next expected merge window. With this the cadence goes out the
window, as all other staging trees will have to be pushed off their
cadence to accommodate the new tool chain merge.
Thus at each branch, every week a decision can be made: is the
combination of the 'new' stuff good enough already to *pull* (!!) it in
together? Or do we --- for the combination! --- have to stick with what
we had so far, the last known good version?
There is no way to know if you can pull things in together because
staging branches do not get cross built against each other. In the
figure above everything is nicely spaced, but that probably does not
reflect the real world. If libs and the desktops are ready at the same
time one can still not merge them into factory at the same time because
they have not built against each other, they have built against
"current" factory. Thus, one would have to send libs for the merge to
the reference branch and rebuild and retest the desktops. Therefore, all
the build and testing effort of the previous desktop staging branch is
out the window and useless. To eliminate the waste in testing and build
one has to wait until the reference branch is "frozen" for the merge
window of a particular staging branch. Then build the staging branch,
then test. Especially the testing is difficult when we talk about the DEs.
I will postulate the following technical requirement:
The only way to protect against adverse interaction is to build and test
a staging tree against a frozen reference tree.
In our case the reference tree would be factory. The technical
requirement for staging work to catch adverse interactions is therefore
a freeze of factory for the duration of each staging branch's build and
test cycle.
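Spelled out as a loop, the requirement serializes the staging branches
against the frozen reference, and every merge invalidates whatever the
remaining branches built before it. This is only a sketch; the function
names are invented and do not correspond to the actual OBS tooling:

def merge(state, branch):
    return state + [branch]

def integrate_serially(factory_state, staging_branches, build, test):
    for branch in staging_branches:
        # factory_state is frozen for this branch's build and test cycle
        result = build(branch, against=factory_state)
        if not test(result):
            continue                  # the branch waits for its next window
        factory_state = merge(factory_state, branch)
        # every other branch's earlier build/test result is now stale,
        # because it was made against the previous factory_state
    return factory_state

# trivially runnable with stub build/test callables
final = integrate_serially(
    factory_state=[],
    staging_branches=["libs", "desktops", "toolchain"],
    build=lambda branch, against: (branch, tuple(against)),
    test=lambda result: result[0] != "toolchain",   # pretend toolchain fails
)
print(final)   # -> ['libs', 'desktops']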
But the technical requirement creates a people problem ;). People hate
the waterfall and the hurry-up-and-wait stuff. Therefore, what tends to
happen is that multiple staging branches get merged into the reference
branch based on heuristic historical data of "no adverse interaction
when merging a given set of branches in the past". This data has a
number of problems:
- past behavior does not guarantee future performance
if perl-bootloader or kiwi depend on a new leaf package the
heuristic data of those staging trees is useless as a new set of
interactions is created
- the heuristic knowledge is intrinsic to the chaperone of the
  reference branch (granted, this is not necessarily much different
  than it is today, but we are looking for improvement and not "the
  status quo")
- the bus factor remains 1, i.e. the chaperone of the reference branch
  is a single point of failure
btw, 'remote' breaks of the
last known good source version, because of some, say, toolchain upgrade,
clearly indicate that said root cause, the toolchain upgrade, needs some
love, too. Integration Manager, set priorities... what do you want in
this week's Factory? maybe the new toolchain will not be there yet...
A GNOME release that needs a new NM or bluetooth may need a number of
cycles to get to a shape where it can be merged with KDE.
This will also lead to races. For example a new gcc might race a
desktop integration, i.e. the new gcc works with the last known good
version of the desktops. Then the new desktop integration will need to
do their homework and pull in the new gcc, and until then, the new gcc
will just build the last known good desktop.
Also a new GNOME will have at least to ensure integration issues are
resolved with the current stable KDE (last known good), and hopefully
the KDE guys are good citizens and are willing to spend the time to make
sure this works. They will. They need the GNOME guys to do the same a
few weeks later...
Sometimes for several weeks the integration master needs to pull some
old versions (the last known good combination), just because the new
combination is not ready yet.
Now how can small leaf package updates skip such a major barrier?
e.g. a new gimp?
At each junction it is clear what needs to be tested. If the new leaf
gtk+ application, gimp, can also be integrated with the 'last known good
version', the one that is still in Factory, then it can be integrated
into that and thus moves on, ahead of the rest of GNOME, at the next
cadence point.
But this implies that gimp has its own staging branch, thus one is
feeding the "ever expanding number of staging branches" monster. One
cannot pull a part of a staging branch without placing the pulled pieces
into a staging tree of its own and building and testing that staging
tree against a "frozen" reference branch.
So if we tilt the above tree and look at it sideways, it almost looks
like git integration:
[ASCII branch diagram, mangled in transit: 'new stuff' branches off the
'last known good' line, the new gimp gets pulled across, and the branch
merges back on success]
True, but one still has to build and test the cherry picked stuff, i.e.
that's where the need for yet another staging project is created. This
rests on the basic assumption that only stuff built and tested against
the reference branch can be merged.
Such a system of 'staging' or integration projects gives a clear flow
of both updates and upgrades into a well integrated and tested Factory,
which is 'released' at a weekly cadence.
The cadence also scopes the size of the integration projects: they
should be small enough to allow something like a weekly or at most
bi-weekly cadence.
Beta users for some integration point or branch can use this a few days
ahead of the release (zypper rr + zypper dup gets them sane again).
A new kernel can be available for pull for a longer while, until it has
enough love and testing to actually be pulled in as The Kernel; how to
manage kernel beta testing is an exercise of its own once we have a
stable flow going.
Branch projects can also be used by those happy with a partial
integration (new KDE even if it breaks GNOME or vice versa), or
experimenting early with completely new feature sets (systemd, ...).
I believe this model overcomes some of the issues Robert has brought up
about the traditional 'staging' model. It gives clear responsibilities
and a clear cadence.
No model can solve the problem *who* is going to do the integration
work, but this staging model at least clearly scopes and cadences what
needs to be done and when to keep the flow going.
There are good ideas here to alleviate some of the fundamental issues
inherent in the staging model.
Robert Schweikert MAY THE SOURCE BE WITH YOU
SUSE-IBM Software Integration Center LINUX
Public Cloud Architect