On 12/02/2013 12:16 PM, Susanne Oberhauser-Hirschoff wrote:
Robert Schweikert writes:

2.) The staging approach

I can only speak from experience and thus this might sound a little lame, sorry. I have seen two implementations of the staging model in action at companies that produce large software suites. In both cases I consider the approach a failure.
The problem in both cases is that the number of staging trees/branches/projects grows with an ever increasing slope, thus consuming ever more manpower to manage the ever increasing number of staging projects.
Meanwhile, the original problem of "how do we deal with unknown adverse interactions between updates" remains unresolved. The "solution" taken in one case was to have intermediate staging trees where "known risky updates" were tested together. Yes, staging trees upon staging trees. But this only solves the problem superficially, as the target tree will move ahead and thus the staging tree is by definition always out of date, unless the target tree is frozen until a particular staging tree is merged. Anyway, it is a maze that potentially requires a lot of people.
The other problem with the staging model is that the "potentially risky interactions" knowledge is an implicit set of interactions that the staging tree managers happen to know. This is not expressed anywhere and thus makes it difficult for other people to learn. We have this problem today and from my point of view this will not be resolved with more staging trees.
The staging model will not catch adverse interactions reliably. The reason is that by definition the staging tree is always out of date, unless the target tree is frozen and after one staging tree is accepted all other staging trees get rebuilt. This is not conducive to parallel development.
What I like in your little umm rant ;) is the notion of *interactions* that need to be tested.
Not ranting, not even a little bit or with a wink. All I am going to be able to bring to the table are the experiences I have had in the past with the "staging model". I will not have time to help with the implementation, and it is unlikely that I will have the time to volunteer to chaperone a staging branch if the decision should be made to go that route. Ultimately those willing to do the work and those willing to be chaperones decide; I will not complain about the decision, as I am not going to be able to do the work. The staging model has fundamental design problems, and one can implement processes and procedures to alleviate the impact of those design issues. Only time will tell whether that is ultimately sufficient if we continue to experience the growth we have seen.
What I'd like to see in this whole discussion in addition is 'cadenced flow' and 'integration decision'.
Cadence becomes a big deal, and so does the coordination of staging projects and the order in which they merge into Factory.
When you add these, the resulting staged flow will actually get both bug fixes and package or subsystem upgrades available fast and in stable quality, continuously (bold claim). Here's how:
Well we can argue about the fast part ;)
Let's assume there is some code stable base, call it Factory.
The goal is to get updates in there, reliably, regularly, to get it to the next level of being a stable code base.
For "leaf packages" that is simple: build the package, test its functionality, then release it.
Well, I do not think it is that simple. One could argue that Perl is a leaf package. But we have perl-bootloader, and thus an update to Perl could break perl-bootloader, which in turn would be a bad thing with pretty far-reaching effects. Further, perl-bootloader does not stand on its own; it uses Perl modules that would otherwise definitely be considered leaf packages. A similar argument can be made for KIWI, which depends on a lot of leaf packages; but KIWI is very important for creating our ISO images. Thus, the line for leaf packages is blurry at best.
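The blurriness can be made concrete with a toy reverse-dependency check: a package only behaves like a leaf if nothing else in the distribution depends on it, and a single reverse dependency (perl-bootloader on Perl) is enough to break the classification. This is a minimal sketch; the graph and the module name perl-Bootloader-Modules are illustrative, not real package data:

```python
# Toy dependency graph: package -> list of packages it needs.
# "Leaf" only means safe-to-update-in-isolation if nothing depends on it.
deps = {
    "perl-bootloader": ["perl", "perl-Bootloader-Modules"],  # hypothetical module name
    "kiwi": ["perl", "libxml2"],
    "gimp": ["gtk+", "libpng"],
}

def reverse_deps(pkg):
    """All packages that would be affected by an update of pkg."""
    return [p for p, needs in deps.items() if pkg in needs]

def is_true_leaf(pkg):
    # A package is only a true leaf if no other package depends on it.
    return not reverse_deps(pkg)

print(is_true_leaf("perl"))   # False: perl-bootloader and kiwi depend on it
print(is_true_leaf("gimp"))   # True, until something grows a dependency on it
```

The point of the sketch is that "leaf" is a property of the current dependency graph, not of the package itself, so the classification can silently change with any new submission.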
For, for lack of a better word, "intermediate packages" (say libpng), you start like above, testing the functionality, until from that perspective it is good to release. But then you also need to get it integrated with "the rest of the world" (maybe 50-100 packages for libpng; 44 on my system).
Then there are these "multi-scope packages", like NetworkManager or the bluetooth stack, which affect several whole integration scopes, desktops in this case. They have interactions in two stages (first their own desktop, then all the other desktops).
There are the "transversal packages", affecting almost everything, like the toolchain.
And then there are packages that affect few other packages but nonetheless have a lot of interactions that need to be tested (kernel, xorg, ...).
Now suppose there is a cascade of staging projects, which potentially 'release', say, every week or every other week (that's the "cadence").
They build a tree structure, something like this:
[ASCII-art diagram, mangled in transit: a tree of staging branches (autotools, gcc, libs, xorg, kernel, KDE, GNOME, ...) whose branches feed into Factory]
The number of nodes from the root (Factory) to a branch corresponds to the interactions that need to be tested for what goes into that branch.
This gives growing rings of scope for interaction testing and integration success. Successful builds and automatic tests are necessary, sometimes even sufficient, for interaction testing and integration success. They propagate automatically, to give a 'tentative' next build. That, however, does not affect the 'last known good' build; that last known good state remains available, too.
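The "growing rings of scope" idea can be sketched as a toy model: the depth of a staging project in the tree is the number of integration layers a change passes through on its way to Factory, which matches the earlier statement that the node count from the root corresponds to the interactions to be tested. The tree shape and project names below are illustrative, not the real Factory layout:

```python
# Hypothetical staging tree: each project maps to its parent; Factory is the root.
parents = {
    "Factory": None,
    "toolchain": "Factory",
    "libs": "Factory",
    "desktops": "Factory",
    "GNOME": "desktops",
    "KDE": "desktops",
    "gimp": "GNOME",
}

def depth(project):
    """Number of integration layers between a project and Factory."""
    d = 0
    while parents[project] is not None:
        project = parents[project]
        d += 1
    return d

print(depth("gimp"))  # 3 rings of scope: GNOME -> desktops -> Factory
print(depth("libs"))  # 1 ring: straight into Factory
```

A leaf like gimp thus buys fast local iteration at the price of more propagation steps before its work lands in Factory.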
Yes; however, what is being neglected is that there is a fundamental problem with the cadence. The cadence itself is influenced by the process, through rebuild times and other snafus that are inevitable.

For illustration purposes let's say the cadence for the autotools staging is every other Monday, and the desktops get their say every other Wednesday on the off weeks, i.e. autotools goes in weeks 1, 3, 5 and so on, and the desktops merge in weeks 2, 4, 6, and so on. At the beginning of week 2 the autotools merge has to be completed in order to give the desktop staging tree sufficient time to rebuild and meet its merge window on Wednesday. During this time (Monday of week 2 until the end of Wednesday of week 2) nothing else is allowed to be merged into Factory, or the desktop staging tree would have to be rebuilt yet again.

That's all fine, but we have a time problem: we only have a certain number of days available in a year. Let's be optimistic and say our staging branch chaperones spend 300 days a year fiddling with the staging branches. This would provide a theoretical maximum of 300 staging branch merges, if we can manage one every day. That is not possible, though, as build times alone dictate that certain changes take more than a day to rebuild, which reduces the maximum further. Let's say we end up with a theoretical maximum of 250 staging branch merges per year.

Simple math dictates that with 6k+ packages we will have things sharing staging branches that can break independently. Therefore, one developer is stuck in the same staging tree as another developer who happens to break something. The "innocent bystander" developer, who broke nothing, has to wait not just the regular cadence, but the cadence plus the fix time of the unrelated breakage. This is not very satisfying for the developer who didn't break anything.
In addition, if the unrelated breakage does not get fixed in time for the merge window of the staging branch, then the given staging branch has to wait for its next merge opportunity, which may be weeks away. One way to alleviate some of these problems is to have a very long cadence for "transversal packages": let's say we only allow toolchain updates once in a blue moon. But what if there is a bug in the toolchain, or some other unknown undesired interaction? Now we must have a fix, and the toolchain staging tree must get priority and be merged much sooner than its next expected merge window. With this the cadence goes out the window, as all other staging trees will have to be pushed off their cadence to accommodate the new toolchain merge.
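The back-of-the-envelope capacity argument above can be written down explicitly. All numbers here are the illustrative assumptions from the text (300 chaperone days, rebuild overhead pushing the effective maximum toward 250 merges, 6000+ packages), not measurements:

```python
# Rough model of yearly staging throughput under the assumptions above.
MERGE_DAYS_PER_YEAR = 300   # days per year chaperones can realistically run merges
AVG_REBUILD_DAYS = 1.2      # some changes need more than one day of build time
PACKAGES = 6000             # rough size of Factory

max_staging_merges = int(MERGE_DAYS_PER_YEAR / AVG_REBUILD_DAYS)
packages_per_staging = PACKAGES / max_staging_merges

print(max_staging_merges)              # 250 merge windows per year
print(round(packages_per_staging, 1))  # ~24 packages sharing each window
```

With roughly two dozen independently breakable packages forced to share each merge window, the "innocent bystander" scenario is not an edge case but the expected steady state.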
Thus at each branch, every week a decision can be made: is the combination of the 'new' stuff good enough already to *pull* (!!) it in together? Or do we (for the combination!) have to stick with what we had so far, the last known good version?
There is no way to know whether you can pull things in together, because staging branches do not get cross-built against each other. In the figure above everything is nicely spaced, but that probably does not reflect the real world. If libs and the desktops are ready at the same time, one still cannot merge them into Factory at the same time, because they have not built against each other; they have built against "current" Factory. Thus, one would have to send libs for the merge to the reference branch and then rebuild and retest the desktops. All the build and testing effort of the previous desktop staging branch is out the window and useless.

To eliminate this waste, one has to wait until the reference branch is "frozen" for the merge window of a particular staging branch, then build the staging branch, then test. Especially the testing is difficult when we talk about the DEs. I will postulate the following technical requirement:

"""
The only way to protect against adverse interaction is to build and test a staging tree against a frozen reference tree.
"""

In our case the reference tree would be Factory. The technical requirement for staging work to catch adverse interactions is therefore pretty simple. But the technical requirement creates a people problem ;) . People hate the waterfall and hurry-up-and-wait stuff. Therefore, what tends to happen is that multiple staging branches get merged into the reference branch based on heuristic historical data of "no adverse interaction when merging a given set of branches in the past".
This data has a number of problems:
- past behavior does not guarantee future performance: if perl-bootloader or kiwi depend on a new leaf package, the heuristic data of those staging trees is useless, as a new set of interactions has been created
- the heuristic knowledge is intrinsic to the chaperone of the reference branch (granted, this is not necessarily much different than it is today, but we are looking for improvement and not "the same")
- the bus factor remains 1, i.e. the chaperone of the reference branch
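The postulated requirement boils down to a single rule: a staging project's test results only count if they were obtained against the current Factory revision, so every merge invalidates every other in-flight staging project. A minimal sketch of that rule, with illustrative revision numbers and project names:

```python
# Minimal model of "build and test against a frozen reference tree".
factory_rev = 100  # current revision of the reference branch (Factory)

class Staging:
    def __init__(self, name, built_against):
        self.name = name
        self.built_against = built_against  # Factory revision it was built/tested on
        self.tests_passed = True

def can_merge(staging):
    # Tests run against an older Factory prove nothing about the current one.
    return staging.tests_passed and staging.built_against == factory_rev

libs = Staging("libs", built_against=100)
desktops = Staging("desktops", built_against=100)

assert can_merge(libs)      # libs goes first...
factory_rev += 1            # ...and Factory moves ahead
print(can_merge(desktops))  # False: desktops must rebuild and retest first
```

The heuristic shortcut criticized above amounts to relaxing `can_merge` to accept stale `built_against` values when history suggests the combination was harmless before.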
By the way, 'remote' breaks of the last known good source version, caused by, say, a toolchain upgrade, clearly indicate that the root cause, the toolchain upgrade, needs some love, too. Integration Manager, set priorities... what do you want in this week's Factory? Maybe the new toolchain will not be there yet...
A GNOME release that needs a new NM or bluetooth may need a number of cycles to get to a shape where it can be merged with KDE.
This will also lead to races. For example, a new gcc might race a desktop integration, i.e. the new gcc works with the last known good version of the desktops. Then the new desktop integration will need to do its homework and pull in the new gcc, and until then, the new gcc will just build against the last known good desktops.
Also, a new GNOME will at least have to ensure integration issues are resolved with the current stable KDE (last known good), and hopefully the KDE guys are good citizens and are willing to spend the time to make sure this works. They will. They need the GNOME guys to do the same a few weeks later...
Sometimes for several weeks the integration master needs to pull some old versions (the last known good combination), just because the new combination is not ready yet.
Now how can small leaf package updates skip such a major barrier? e.g. a new gimp?
At each junction it is clear what needs to be tested. If the new leaf gtk+ application, gimp, can also be integrated with the 'last known good version', the one that is still in Factory, then it can be integrated into that and thus moves on, ahead of the rest of GNOME, at the next cycle.
But this implies that gimp has its own staging branch; thus one is feeding the "ever expanding number of staging branches" monster. One cannot pull out part of a staging branch without placing the pulled pieces into a staging tree of their own and building and testing that staging tree against a "frozen" reference branch.
So if we tilt the above tree and look at it sideways, it almost looks like git integration:
[ASCII-art diagram, mangled in transit: two parallel timelines, "new stuff" above and "last known good" below; the new gimp is pulled from the last known good line and merged into the new stuff line ("merge success"), while the old combination eventually dies off ("R.I.P.")]
True, but one still has to build and test the cherry-picked stuff, i.e. that's where the need for yet another staging project is created. This rests on the basic assumption that only stuff built and tested against the reference branch can be merged.
Such a system of 'staging' or integration projects gives a clear flow of both updates and upgrades into a well integrated and tested Factory, which is 'released' at a weekly cadence.
The cadence also scopes the size of the integration projects: they should be small enough to allow something like a weekly or at most bi-weekly lock-step.
Beta users for some integration point or branch can use this a few days ahead of the release (zypper rr + zypper dup gets them sane again).
A new kernel can be available for pull for a longer while, until it has had enough love and testing to actually be pulled in as The Kernel. How to manage kernel beta testing is an exercise of its own, once we get a stable Factory.
Branch projects can also be used by those happy with a partial integration (new KDE even if it breaks GNOME or vice versa), or experimenting early with completely new feature sets (systemd, ...).
I believe this model overcomes some of the issues Robert has brought up about the traditional 'staging' model. It gives clear responsibilities and a clear cadence.
No model can solve the problem of *who* is going to do the integration work, but this staging model at least clearly scopes and cadences what needs to be done, and when, to keep the flow going.
There are good ideas here to alleviate some of the fundamental issues inherent in the staging model.

Later,
Robert

--
Robert Schweikert                       MAY THE SOURCE BE WITH YOU
SUSE-IBM Software Integration Center    LINUX
Tech Lead
Public Cloud Architect
rjschwei@suse.com
rschweik@ca.ibm.com
781-464-8147