On 10/8/18 12:07 PM, Alberto Planas Dominguez wrote:
> On Monday, October 8, 2018 4:55:33 PM CEST Robert Schweikert wrote:
>> On 10/8/18 7:35 AM, Alberto Planas Dominguez wrote:
>>> On Saturday, October 6, 2018 11:24:46 AM CEST Robert Schweikert wrote:
>>>> On 10/5/18 4:23 AM, Alberto Planas Dominguez wrote:
>
> [Dropping some very unproductive content]
I am really trying hard to leave out "color commentary" and adjectives;
it would be nice to see the effort reciprocated.
>
>>> In any case let me be clear: my goal is to decrease the size of the Python
>>> stack, and my proposal is to remove the pyc files from the initial install and
>>> to backport a feature from 3.8 so that the pyc files can live in a different
>>> file system.
>>
>> OK, this is different from the original e-mail where it was implied that
>> "image size" was the primary target.
>
> That was my initial motivation, yes. But the goal, I hope, is now clear from
> the previous paragraph.
>
>>> My tests on my machine give a compilation speed of 6.08 MB/s. I tested it by
>>> installing Django with Python 3.6 in a venv and doing this:
>>>
>>> # To avoid measuring the dir crawling
>>> # find . -name "*.py" > LIST
>>> # time python -m compileall -f -qq -i LIST
>>>
>>> real 0m1.406s
>>> user 0m1.257s
>>> sys 0m0.148s
>>>
>>> # du -hsb
>>> 44812156 .
>>>
>>> # find . -name "__pycache__" -exec rm -rf {} \;
>>> # du -hsb
>>> 35888321 .
>>>
>>> (44812156 - 35888321) / 1.4 ~= 6.08 MB/s
>>>
>>>> But let's put some perspective behind that and look at data rather than
>>>> taking common beliefs as facts.
>>>>
>>>> This was done on a t2.micro instance in AWS, running the stock SUSE SLES 15
>>>> BYOS image. The instance was booted (first boot), then the cloud-init cache
>>>> was cleared with
>>>>
>>>> # cloud-init clean
>>>>
>>>> then shutdown -r now, i.e. a soft reboot of the VM.
>>>>
>>>> # systemd-analyze blame | grep cloud
>>>>
>>>> 6.505s cloud-init-local.service
>>>> 1.013s cloud-config.service
>>>>
>>>> 982ms cloud-init.service
>>>> 665ms cloud-final.service
>>>>
>>>> All these services are part of cloud-init
>>>>
>>>> Clear the cloud-init cache so it will re-run
>>>> # cloud-init clean
>>>>
>>>> Clear out all Python artifacts:
>>>>
>>>> # cd /
>>>> # find . -name '__pycache__' | xargs rm -rf
>>>> # find . -name '*.pyc' | xargs rm
>>>> # find . -name '*.pyo' | xargs rm
>>>>
>>>> This should reasonably approximate the state you are proposing, I think.
>>>> Reboot:
>>>>
>>>> # systemd-analyze blame | grep cloud
>>>>
>>>> 7.469s cloud-init-local.service
>>>> 1.070s cloud-init.service
>>>>
>>>> 976ms cloud-config.service
>>>> 671ms cloud-final.service
>>>>
>>>> so a 13% increase in the runtime of the cloud-init-local service. And
>>>> this is just a quick and dirty test with a soft reboot of the VM. The numbers
>>>> would probably be worse with a stop-start cycle. I'll leave that to be
>>>> disproven by those interested.
>>>
>>> This is a very nice contribution to the discussion.
>>>
>>> I tested it in engcloud and I see a 9.3% overhead during the boot. It
>>> spends 0.205s to create the initial pyc files needed for cloud-init:
>
>> That would not be sufficient; all pyc files in the dependency tree would need
>> to be generated, so you cannot just measure the creation of the cloud-init
>> pyc files. cloud-init is going to be one of the first, if not the first, Python
>> processes running in the boot sequence, which implies that no pyc files
>> exist for the cloud-init dependencies.
>
> Of course it is enough for the argument. In fact it is a critical part of the
> discussion: you defer the pyc creation until the files are needed, and once
> required they will be stored in the cache.
>
> When cloud-init is loaded, Python will read all the `import`s, and the required
> subtree of pyc files will be generated before the execution of _any_ Python
> code. You are not compiling only the pyc files for cloud-init, but those for
> all the dependencies that are required.
>
> Unless there is some lazy loading in cloud-init based on something like
> stevedore (which I do not see), or it is full of `import`s inside functions and
> methods, the pyc generation for the required subtree will be the first thing
> that Python does.
I think we are in agreement; the problem appears to be that we have a
different idea about what
"create the initial pyc files needed for cloud-init"
means. For me this arrived as exactly what you stated, i.e. "pyc needed for
cloud-init", which says nothing about dependencies. For you this
statement appears to imply that the pyc files for the dependencies were
also generated in this test.
More explicit and concise communication would certainly help.
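For reference, the eager vs. deferred pyc generation you describe is easy to
see with a toy module (hypothetical file and module names, nothing taken from
the actual cloud-init sources):

# eager_vs_lazy.py (hypothetical illustration only)
import json                 # json is loaded, and its pyc written to __pycache__
                            # if it is missing, as soon as this module is imported

def dump_config(cfg):
    # yaml is only imported, and its pyc only generated, the first time
    # this function is called, not at module import time
    import yaml
    return yaml.safe_dump(cfg)

Since, as you say, cloud-init does not appear to defer its imports this way,
the whole dependency subtree should indeed get compiled up front.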
>
>>> * With pyc in place
>>>
>>> # systemd-analyze blame | grep cloud
>>>
>>> 1.985s cloud-init-local.service
>>> 1.176s cloud-init.service
>>>
>>> 609ms cloud-config.service
>>> 531ms cloud-final.service
>>>
>>> * Without pyc in place
>>>
>>> # systemd-analyze blame | grep cloud
>>>
>>> 2.190s cloud-init-local.service
>>> 1.165s cloud-init.service
>>>
>>> 844ms cloud-config.service
>>> 528ms cloud-final.service
>>>
>>> The sad thing is that the __real__ first boot is a bit worse:
>>>
>>> * First boot, with pyc in place
>>>
>>> # systemd-analyze blame | grep cloud
>>>
>>> 36.494s cloud-init.service
>>>
>>> 2.673s cloud-init-local.service
>>> 1.420s cloud-config.service
>>>
>>> 730ms cloud-final.service
>>>
>>> Compared to this real first boot, the pyc generation cost represents
>>> 0.54% for cloud-init (not in relation to the total boot time). We can
>>> ignore it, as I guess that the images used for EC2 have some tweaks
>>> to avoid the file system resize, or some other magic that makes the boot
>>> more similar to the second boot.
>>
>> First boot execution of cloud-init is also significantly slower in the
>> Public Cloud. However, not as bad as in your example.
>
> If this is the case, this first boot is the one that will generate the pyc
> files, not a later one.
>
>> In any case your
>> second comparison appears to be making a leap that I, at this point, do not
>> agree with. You are equating the generation of pyc code in a "hot
>> system" to the time it takes to load everything in a "cold system". A
>> calculation of the percentage contribution of pyc creation in a "cold
>> system" would only be valid if that scenario were tested, which we have
>> not done, but it would certainly not be too difficult to test.
>
> I do not get the point. In the end we measured the proportion of the time
> Python spends generating the pyc files for cloud-init and all the dependencies
> needed for the service, in relation to the overall time that cloud-init spends
> during the initialization of the service.
>
> I am not sure what you mean by hot and cold here, as I removed all the pyc
> files from site-packages to measure the ratio of the pyc generation time to
> the time that cloud-init uses to start the service.
OK, maybe I can explain this better. We agree that there is a
significant difference in the cloud-init execution time between the initial
start up of the VM vs. a reboot, even if the cloud-init cache (not the
pyc files) is cleared. This implies that something is working behind the
scenes to our advantage and makes a VM reboot faster w.r.t. cloud-init
execution when compared to the start-a-new-instance scenario. Given that
we do not know what this "makes it work faster" part is, we should not
conclude that the pyc build will take just as short/long a time on initial
start up as it does in a "reboot the VM" scenario.
This will have to be tested.
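A rough sketch of how that could be measured (the site-packages path assumes
the SLES Python 3.6 layout, and `cloud-init analyze blame` is only there if
the shipped cloud-init version supports it):

# On a freshly launched instance, not a rebooted one, confirm that no pyc
# files exist for the cloud-init dependency tree:
# find /usr/lib/python3.6/site-packages -name '__pycache__' | head

# Then read the cloud-init timings for this very first boot:
# systemd-analyze blame | grep cloud
# cloud-init analyze blame

Running the same thing against an image that ships the pyc files would give
the cold-start difference directly.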
>
>>> The cost is amortized, and the corner case, IMHO, is more yours than mine.
>>> Your case is a fresh boot of a just-installed EC2 VM. I agree that there
>>> is a penalty of ~10% (or 0.54% in my SLE12 SP3 OpenStack case), but
>>> this is only for the first boot.
>>
>> Which is a problem for those users that start a lot of instances to
>> throw them away and start new instances the next time they are needed.
>> This would be a typical autoscaling use case or a typical test use case.
>
> Correct. The 0.205s will be added for each new fresh VM. Am I correct to
> assume that this is also a scenario where the resize in the initial boot is
> happening? If so, the overall impact is much less than the 10% that we are
> talking about, and closer to the 0.5% that I measured in OpenStack.
The data I presented as an example was generated with a 10GB image size
for the creation of an instance with a 10GB root volume size. So there
is a contribution, which should be negligible, from the growpart
script, which is called by cloud-init and runs in a subprocess.
I call it "negligible" in this case because growpart will exit very fast if
no resizing is required. It still takes time for process start up etc.,
but again I consider this negligible.
But you are correct: by increasing the time it takes for other things
cloud-init calls, such as the root volume resize, one can decrease the
percentage of time attributed to pyc creation.
If I wanted to take this to an extreme, I could start a process
during user data processing that runs for several minutes and thus
argue that pyc creation takes almost no time. However,
that would be misleading.
I think, in an effort to arrive as close as reasonably possible at the
"real cost" of the pyc generation for the cloud-init example, we should
minimize the processes cloud-init executes externally, for example by
minimizing the runtime of growpart, which is done by not changing the
instance root volume size compared to the image size.
The other thing that comes into play when comparing different frameworks
is that different modules get loaded by cloud-init. Also, if we do not
have exactly the same config file, that would cause differences, as
cloud-init only loads the configuration modules that are needed for the
given configuration, a certain "lazy load" mechanism.
>
>> It is relatively easy to calculate a cost for this with some estimates.
>> If my test for my application needs 2000 (an arbitrary number I picked)
>> test instances, and every test instance takes .2 seconds longer to boot,
>> to use your number, then the total time penalty is ~6.7 minutes. If this
>> test uses an instance type that costs me $10 per hour, the slowdown
>> costs me ~$1.1 every time I run my test. So if the test case runs once
>> a week it would amount to ~$57 per year.
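To spell out the arithmetic behind those quoted numbers, using the assumed
figures: 2000 instances * 0.2s = 400s, i.e. ~6.7 minutes of extra billed
instance time per test run; 400s / 3600s * $10/hour is ~$1.11 per run; and
~$1.11 * 52 weekly runs is roughly $57-58 per year.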
>
> Imagine the cost of the kiwi resize operation; it must be around some
> thousands of dollars.
>
> But you are right. If there is a weekly re-scaling of 2000 instances during
> the 52 weeks of a year, you can measure the cost of the pyc generation.
>
> It is my understanding that CPU cost is cheap relative to network
> transfer and storage. Can we measure the savings in network and storage here?
Not in the Public Cloud; there are no such savings. Network data into the
framework is always free, and the size of the root volumes in our images
is already 10GB (30GB in Azure), which offers ample space for the
pyc files. Meaning there is no gain if the actual disk space used by the
packages we install is smaller and we simply have more empty space in the
10GB (30GB in Azure) image.
> I have the feeling that they are more than $57 per year.
Nope, the cost of this to the user in the Public Cloud is 0, see above.
> We are trading CPU
> for network and storage savings, which with those 2000 instances per week
> over a full year will also be measurable.
>
>>>> This is penalizing the majority to cover one specific use case. Sorry it
>>>> is hard for me to see this any other way.
>>>
>>> Again, booting fresh VMs is hardly the majority use case here.
>>
>> That is an assumption on your part, from my perspective.
>
> Do you really think that TW and Leap are optimized for boot speed in a cloud
> scenario?
No, but whatever we put into TW inevitably ends up in SLE and there it
does matter.
> If the majority of users are launching VMs all day and night, I vote
> to optimize this use case before anything else.
>
>> If we pursue the approach of multiple packages, as suggested in one or
>> two messages in this thread, then we could build Public Cloud images
>> with pyc files included.
>
> Or better, the user can add a `python -m compileall` in the kiwi config.sh,
> which will populate /var/cache for the cloud images only.
Well, I could turn around and state that this is a "hack" and "...we
don't want any of your proposed hacks..." in our image creation process.
Hopefully that sounds familiar... ;)
Anyway, on a more serious note, if we can resolve the security concerns
and properly handle the upgrade mechanism while not generating multiple
packages, I am not categorically opposed to such an addition in our
Public Cloud image builds.
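For concreteness, a sketch of what such a config.sh addition could look like
(the paths are assumptions, and the cache-prefix line only makes sense if the
Python 3.8 feature you mentioned actually gets backported; this is not a
tested recipe):

# Pre-compile the installed Python stack at image build time so a freshly
# booted cloud instance does not pay the pyc generation cost. With the 3.8
# cache-prefix feature backported, the output could go to /var/cache instead
# of the package directories; the interpreter would need the same prefix
# configured at run time to pick the files up.
export PYTHONPYCACHEPREFIX=/var/cache/pycache
python3 -m compileall -qq /usr/lib/python3.6/site-packages \
                          /usr/lib64/python3.6/site-packages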
>
> I think that we need productive argumentation here.
And I thought we were having that for the most part. (My contribution to
color commentary)
> All engineering
> decisions are based on a trade-off. Sometimes the trade-off does not pay for
> itself, but I really think that this is not the case here. Or at least your
> arguments so far do not point in that direction.
Sorry, I am not following you here; are you dismissing the example of
paid CPU for large test cases as invalid?
If that is the case, then yes, we are not having a productive discussion :(.
>
> If / when the use case is such that the trade-off between space and CPU is so
> critical (a scenario that is not the one you described, I am sorry), the
> opportunities for optimization are in a different place. And we need to
> address those too.
>
> In a classical use case the savings in space in the different places are
> justified IMHO, and the pyc generation will be amortized in less than
> 2 seconds once the service runs for the first time.
>
> In a cloud use case the user can boot the image, prepare it (avoiding the
> penalty of the resize step and other details), update it, and save it as a
> volume that can be cloned those 2000 times, completely avoiding any
> penalty for the pyc generation too.
>
This is not the predominant use case in the Public Cloud, based on
information we have from our partners. I would appreciate it if, rather
than proposing solutions about "how to change people's habits", which
almost never works, we discussed solutions that address the topics that
have been pointed out as considerations in arriving at a solution.
Later,
Robert
--
Robert Schweikert MAY THE SOURCE BE WITH YOU
Distinguished Architect LINUX
Team Lead Public Cloud
rjschwei(a)suse.com
IRC: robjo