On 12/02/2013 12:16 PM, Susanne Oberhauser-Hirschoff wrote:
Robert Schweikert writes:

2.) The staging approach

I can only speak from experience and thus this might sound a little lame, sorry. I have seen two implementations of the staging model in action at companies that produce large software suites. In both cases I consider the approach a failure.
The problem in both cases is that the number of staging trees/branches/projects grows with an ever increasing slope, thus consuming ever more manpower to manage the ever increasing number of staging projects.
Meanwhile, the original problem of "how do we deal with unknown adverse interactions between updates" remains unresolved. The "solution" taken in one case was to have intermediate staging trees where "known risky updates" were tested together. Yes, staging trees upon staging trees. But this only solves the problem superficially, as the target tree will move ahead and thus the staging tree is by definition always out of date, unless the target tree is frozen until a particular staging tree is merged. Anyway, it is a maze that potentially requires a lot of people.
The other problem with the staging model is that the "potentially risky interactions" knowledge is an implicit set of interactions that the staging tree managers happen to know. This is not expressed anywhere and thus makes it difficult for other people to learn. We have this problem today and from my point of view this will not be resolved with more staging trees.
The staging model will not catch adverse interactions reliably. The reason is that by definition the staging tree is always out of date, unless the target tree is frozen and after one staging tree is accepted all other staging trees get rebuilt. This is not conducive to parallel development.
What I like in your little umm rant ;) is the notion of *interactions* that need to be tested.
Not ranting, not even a little bit or with a wink. All I am going to be able to bring to the table are the experiences I have had in the past with the "staging model". I will not have time to help with the implementation, and it is unlikely that I will have the time to volunteer to chaperone a staging branch if the decision should be made to go that route. Ultimately those willing to do the work and those willing to be chaperones decide; I will not complain about the decision, as I am not going to be able to do the work. The staging model has fundamental design problems, and one can implement processes and procedures to alleviate the impact of those design issues. Only time will tell whether that is ultimately sufficient if we continue to experience the growth we have seen.
What I'd like to see in this whole discussion in addition is 'cadenced flow' and 'integration decision'.
Cadence becomes a big deal, and so does the coordination of staging projects and the order in which they merge into Factory.
When you add these, the resulting staged flow will actually get both bug fixes and package or subsystem upgrades available fast and in stable quality, continuously (bold claim). Here's how:
Well we can argue about the fast part ;)
Let's assume there is some code stable base, call it Factory.
The goal is to get updates in there, reliably, regularly, to get it to the next level of being a stable code base.
For "leaf packages" that is simple: build the package, test its functionality, then release it.
Well, I do not think it is that simple. One could argue that Perl is a leaf package. But we have perl-bootloader, and thus an update to Perl could break perl-bootloader, which in turn would be a bad thing with pretty far-reaching effects. Further, perl-bootloader does not stand on its own; it uses Perl modules that would otherwise definitely be considered leaf packages. A similar argument can be made for KIWI, which depends on a lot of leaf packages; but KIWI is very important for creating our ISO images. Thus, the line for leaf packages is blurry at best.
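The blurriness can be made concrete with a toy reverse-dependency check: a package only behaves like a leaf if nothing else in the distribution depends on it, and a single reverse dependency (perl-bootloader on Perl) is enough to break the classification. This is a minimal sketch; the graph and the module name perl-Bootloader-Modules are illustrative, not real package data:

```python
# Toy dependency graph: package -> list of packages it needs.
# "Leaf" only means safe-to-update-in-isolation if nothing depends on it.
deps = {
    "perl-bootloader": ["perl", "perl-Bootloader-Modules"],  # hypothetical module name
    "kiwi": ["perl", "libxml2"],
    "gimp": ["gtk+", "libpng"],
}

def reverse_deps(pkg):
    """All packages that would be affected by an update of pkg."""
    return [p for p, needs in deps.items() if pkg in needs]

def is_true_leaf(pkg):
    # A package is only a true leaf if no other package depends on it.
    return not reverse_deps(pkg)

print(is_true_leaf("perl"))   # False: perl-bootloader and kiwi depend on it
print(is_true_leaf("gimp"))   # True, until something grows a dependency on it
```

The point of the sketch is that "leaf" is a property of the current dependency graph, not of the package itself, so the classification can silently change with any new submission.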
For, for lack of a better word, "intermediate packages" (say libpng), you start like above, testing the functionality, until from that perspective it is good to release. But then you also need to get it integrated with "the rest of the world" (maybe 50-100 packages for libpng; 44 on my system).
Then there are these "multi-scope packages", like NetworkManager or the bluetooth stack, which affect several whole integration scopes, desktops in this case. They have interactions in two stages (first their own desktop, then all the other desktops).
There are the "transversal packages", affecting almost everything, like the toolchain.
And then there are packages that affect few other packages but nonetheless have a lot of interactions that need to be tested (kernel, xorg, ...).
Now suppose there is a cascade of staging projects, which potentially 'release', say, every week or every other week (that's the "cadence").
They build a tree structure, something like this:
[ASCII-art diagram, mangled in transit: a tree of staging branches (autotools, gcc, libs, xorg, kernel, KDE, GNOME, ...) whose branches feed into Factory]
The number of nodes from the root (Factory) to a branch corresponds to the interactions that need to be tested for what goes into that branch.
This gives growing rings of scope for interaction testing and integration success. Successful builds and automatic tests are necessary, sometimes even sufficient, for interaction testing and integration success. They propagate automatically, to give a 'tentative' next build. That, however, does not affect the 'last known good' build; that last known good state remains available, too.
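The "growing rings of scope" idea can be sketched as a toy model: the depth of a staging project in the tree is the number of integration layers a change passes through on its way to Factory, which matches the earlier statement that the node count from the root corresponds to the interactions to be tested. The tree shape and project names below are illustrative, not the real Factory layout:

```python
# Hypothetical staging tree: each project maps to its parent; Factory is the root.
parents = {
    "Factory": None,
    "toolchain": "Factory",
    "libs": "Factory",
    "desktops": "Factory",
    "GNOME": "desktops",
    "KDE": "desktops",
    "gimp": "GNOME",
}

def depth(project):
    """Number of integration layers between a project and Factory."""
    d = 0
    while parents[project] is not None:
        project = parents[project]
        d += 1
    return d

print(depth("gimp"))  # 3 rings of scope: GNOME -> desktops -> Factory
print(depth("libs"))  # 1 ring: straight into Factory
```

A leaf like gimp thus buys fast local iteration at the price of more propagation steps before its work lands in Factory.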
Yes; however, what is being neglected is that there is a fundamental problem with the cadence. The cadence itself is influenced by the process, through rebuild times and other snafus that are inevitable.

For illustration purposes let's say the cadence for the autotools staging is every other Monday, and the desktops get their say every other Wednesday on the off weeks, i.e. autotools goes in weeks 1, 3, 5 and so on, and the desktops merge in weeks 2, 4, 6, and so on. At the beginning of week 2 the autotools merge has to be completed in order to give the desktop staging tree sufficient time to rebuild and meet its merge window on Wednesday. During this time (Monday of week 2 until the end of Wednesday of week 2) nothing else is allowed to be merged into Factory, or the desktop staging tree would have to be rebuilt yet again.

That's all fine, but we have a time problem: we only have a certain number of days available in a year. Let's be optimistic and say our staging branch chaperones spend 300 days a year fiddling with the staging branches. This would provide a theoretical maximum of 300 staging branch merges, if we can manage one every day. That is not possible, though, as build times alone dictate that certain changes take more than a day to rebuild, which reduces the maximum further. Let's say we end up with a theoretical maximum of 250 staging branch merges per year.

Simple math dictates that with 6k+ packages we will have things sharing staging branches that can break independently. Therefore, one developer is stuck in the same staging tree as another developer who happens to break something. The "innocent bystander" developer, who broke nothing, has to wait not just the regular cadence, but the cadence plus the fix time of the unrelated breakage. This is not very satisfying for the developer who didn't break anything.
In addition, if the unrelated breakage does not get fixed in time for the merge window of the staging branch, then the given staging branch has to wait for its next merge opportunity, which may be weeks away. One way to alleviate some of these problems is to have a very long cadence for "transversal packages": let's say we only allow toolchain updates once in a blue moon. But what if there is a bug in the toolchain, or some other unknown undesired interaction? Now we must have a fix, and the toolchain staging tree must get priority and be merged much sooner than its next expected merge window. With this the cadence goes out the window, as all other staging trees will have to be pushed off their cadence to accommodate the new toolchain merge.
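The back-of-the-envelope capacity argument above can be written down explicitly. All numbers here are the illustrative assumptions from the text (300 chaperone days, rebuild overhead pushing the effective maximum toward 250 merges, 6000+ packages), not measurements:

```python
# Rough model of yearly staging throughput under the assumptions above.
MERGE_DAYS_PER_YEAR = 300   # days per year chaperones can realistically run merges
AVG_REBUILD_DAYS = 1.2      # some changes need more than one day of build time
PACKAGES = 6000             # rough size of Factory

max_staging_merges = int(MERGE_DAYS_PER_YEAR / AVG_REBUILD_DAYS)
packages_per_staging = PACKAGES / max_staging_merges

print(max_staging_merges)              # 250 merge windows per year
print(round(packages_per_staging, 1))  # ~24 packages sharing each window
```

With roughly two dozen independently breakable packages forced to share each merge window, the "innocent bystander" scenario is not an edge case but the expected steady state.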
Thus at each branch, every week a decision can be made: is the combination of the 'new' stuff good enough already to *pull* (!!) it in together? Or do we (for the combination!) have to stick with what we had so far, the last known good version?
There is no way to know whether you can pull things in together, because staging branches do not get cross-built against each other. In the figure above everything is nicely spaced, but that probably does not reflect the real world. If libs and the desktops are ready at the same time, one still cannot merge them into Factory at the same time, because they have not built against each other; they have built against "current" Factory. Thus, one would have to send libs for the merge to the reference branch and then rebuild and retest the desktops. All the build and testing effort of the previous desktop staging branch is out the window and useless.

To eliminate this waste, one has to wait until the reference branch is "frozen" for the merge window of a particular staging branch, then build the staging branch, then test. Especially the testing is difficult when we talk about the DEs. I will postulate the following technical requirement:

"""
The only way to protect against adverse interaction is to build and test a staging tree against a frozen reference tree.
"""

In our case the reference tree would be Factory. The technical requirement for staging work to catch adverse interactions is therefore pretty simple. But the technical requirement creates a people problem ;) . People hate the waterfall and hurry-up-and-wait stuff. Therefore, what tends to happen is that multiple staging branches get merged into the reference branch based on heuristic historical data of "no adverse interaction when merging a given set of branches in the past".
This data has a number of problems:
- past behavior does not guarantee future performance: if perl-bootloader or kiwi depend on a new leaf package, the heuristic data of those staging trees is useless, as a new set of interactions has been created
- the heuristic knowledge is intrinsic to the chaperone of the reference branch (granted, this is not necessarily much different than it is today, but we are looking for improvement and not "the same")
- the bus factor remains 1, i.e. the chaperone of the reference branch
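The postulated requirement boils down to a single rule: a staging project's test results only count if they were obtained against the current Factory revision, so every merge invalidates every other in-flight staging project. A minimal sketch of that rule, with illustrative revision numbers and project names:

```python
# Minimal model of "build and test against a frozen reference tree".
factory_rev = 100  # current revision of the reference branch (Factory)

class Staging:
    def __init__(self, name, built_against):
        self.name = name
        self.built_against = built_against  # Factory revision it was built/tested on
        self.tests_passed = True

def can_merge(staging):
    # Tests run against an older Factory prove nothing about the current one.
    return staging.tests_passed and staging.built_against == factory_rev

libs = Staging("libs", built_against=100)
desktops = Staging("desktops", built_against=100)

assert can_merge(libs)      # libs goes first...
factory_rev += 1            # ...and Factory moves ahead
print(can_merge(desktops))  # False: desktops must rebuild and retest first
```

The heuristic shortcut criticized above amounts to relaxing `can_merge` to accept stale `built_against` values when history suggests the combination was harmless before.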
By the way, 'remote' breaks of the last known good source version, caused by, say, a toolchain upgrade, clearly indicate that the root cause, the toolchain upgrade, needs some love, too. Integration Manager, set priorities... what do you want in this week's Factory? Maybe the new toolchain will not be there yet...
A GNOME release that needs a new NM or bluetooth may need a number of cycles to get to a shape where it can be merged with KDE.
This will also lead to races. For example, a new gcc might race a desktop integration, i.e. the new gcc works with the last known good version of the desktops. Then the new desktop integration will need to do its homework and pull in the new gcc, and until then, the new gcc will just build against the last known good desktops.
Also, a new GNOME will at least have to ensure integration issues are resolved with the current stable KDE (last known good), and hopefully the KDE guys are good citizens and are willing to spend the time to make sure this works. They will. They need the GNOME guys to do the same a few weeks later...
Sometimes for several weeks the integration master needs to pull some old versions (the last known good combination), just because the new combination is not ready yet.
Now how can small leaf package updates skip such a major barrier? e.g. a new gimp?
At each junction it is clear what needs to be tested. If the new leaf gtk+ application, gimp, can also be integrated with the 'last known good version', the one that is still in Factory, then it can be integrated into that and thus moves on, ahead of the rest of GNOME, at the next cycle.
But this implies that gimp has its own staging branch; thus one is feeding the "ever expanding number of staging branches" monster. One cannot pull out part of a staging branch without placing the pulled pieces into a staging tree of their own and building and testing that staging tree against a "frozen" reference branch.
So if we tilt the above tree and look at it sideways, it almost looks like git integration:
[ASCII-art diagram, mangled in transit: two parallel timelines, "new stuff" above and "last known good" below; the new gimp is pulled from the last known good line and merged into the new stuff line ("merge success"), while the old combination eventually dies off ("R.I.P.")]
True, but one still has to build and test the cherry-picked stuff, i.e. that's where the need for yet another staging project is created. This rests on the basic assumption that only stuff built and tested against the reference branch can be merged.
Such a system of 'staging' or integration projects gives a clear flow of both updates and upgrades into a well integrated and tested Factory, which is 'released' at a weekly cadence.
The cadence also scopes the size of the integration projects: they should be small enough to allow something like a weekly or at most bi-weekly lock-step.
Beta users for some integration point or branch can use this a few days ahead of the release (zypper rr + zypper dup gets them sane again).
A new kernel can be available for pull for a longer while, until it has had enough love and testing to actually be pulled in as The Kernel. How to manage kernel beta testing is an exercise of its own, once we get a stable Factory.
Branch projects can also be used by those happy with a partial integration (new KDE even if it breaks GNOME or vice versa), or experimenting early with completely new feature sets (systemd, ...).
I believe this model overcomes some of the issues Robert has brought up about the traditional 'staging' model. It gives clear responsibilities and a clear cadence.
No model can solve the problem of *who* is going to do the integration work, but this staging model at least clearly scopes and cadences what needs to be done, and when, to keep the flow going.
There are good ideas here to alleviate some of the fundamental issues inherent in the staging model.

Later,
Robert

--
Robert Schweikert                       MAY THE SOURCE BE WITH YOU
SUSE-IBM Software Integration Center    LINUX
Tech Lead
Public Cloud Architect
rjschwei@suse.com
rschweik@ca.ibm.com
781-464-8147