Mailinglist Archive: opensuse-factory (368 mails)

[opensuse-factory] Announcing download.o.o access metrics
  • From: Jimmy Berry <jberry@xxxxxxxx>
  • Date: Fri, 22 Jun 2018 11:39:11 -0500
  • Message-id: <15454138.uO14IR0i6n@boomba.local>
The full announcement can be viewed at release-tools.opensuse.org [1] along
with images.

Adding to the variety of metrics already captured at metrics.o.o [2], I have
added download.o.o access metrics [3]. These metrics are sourced from the
Apache access logs produced by the download.o.o machine. The goal of parsing
the logs was to provide some insight into product adoption and long-term
usage, in addition to overall project health.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-stacked.png

The logs cover data from 2010-01-03 to 2018-06-20 (with new logs ingested
daily going forward) and amount to roughly 24TB of raw data. I explored a few
existing tools, such as telegraf [4] (commonly paired with influxdb [5]), but
found them lacking in the speed department. For example, telegraf could not
even handle 1000 entries per second [6], which would require well over three
years to parse the data (reduced to over 6 months using concurrency, if it
supported that). Influxdb also could not handle the raw data (even a single
day), which ruled out my hope of using it to perform the aggregations. As
such, and since any off-the-shelf tool would still require customization for
the custom log fields and their meaning, I opted to write a tool [7].

Given the speed-sensitive nature of the problem, I tested the primary
scripting language of the openSUSE release tools, Python, and compared it to
PHP, which I knew to be generally faster. A simple test running a "starts
with" check on each log file line was an order of magnitude faster in PHP, and
the difference widened as more processing was added. As such, I opted for PHP,
which was fast enough for the job while providing scripting language
convenience. The end result was ~500,000 entries per second per core with full
concurrency supported. Using this solution, the last 8 years of data were
processed and summarized in ~23 hours using 7 cores of an office machine.
Going forward, only the last day needs to be summarized, which takes a minute
or so.
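
To illustrate the general shape of such a parser, here is a minimal sketch in
Python (not the PHP tool itself). It assumes the standard Apache "combined"
log format, whereas the real download.o.o logs carry custom fields, and the
names LINE_RE, summarize_lines, summarize_file, and summarize_files are made
up for this example. Each file is summarized independently, so files can be
spread across worker processes and the per-file counts merged afterwards.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumption: plain Apache "combined" format. The real logs have extra
# custom fields that this pattern does not model.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def summarize_lines(lines):
    """Count requests per first path segment, plus unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            counts["invalid"] += 1
            continue
        segments = match.group("path").strip("/").split("/")
        counts["path:" + segments[0]] += 1
    return counts


def summarize_file(path):
    """Summarize one log file, streaming it line by line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_lines(handle)


def summarize_files(paths, workers=7):
    """Summarize each file in a worker process and merge the counts."""
    total = Counter()
    with Pool(workers) as pool:
        for counts in pool.imap_unordered(summarize_file, paths):
            total.update(counts)
    return total
```

Because every file yields its own small Counter, only the summaries cross
process boundaries, never the raw log lines.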

For those interested, the 24TB was summarized to roughly 12GB of data, which
is then aggregated to roughly 8MB in influxdb. The 12GB lives on metrics.o.o
in order to aggregate new days against previous data. The tool could be
changed to drop data past the largest aggregation interval (i.e. a month), but
the summary data would be needed again if the aggregation algorithm were ever
changed.
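
One reason per-day summaries need to stick around is that unique counts
cannot simply be added up across days: a system seen on two days must only be
counted once per week or month. The following Python sketch shows the idea
under the assumption that each daily summary holds a set of system
identifiers. The layout of the real summary files may well differ, and
aggregate_unique, by_month, and by_week are hypothetical names.

```python
from collections import defaultdict
from datetime import date


def aggregate_unique(daily_ids, key):
    """Union per-day ID sets into buckets; return unique count per bucket."""
    buckets = defaultdict(set)
    for day, ids in daily_ids.items():
        buckets[key(day)] |= ids
    return {bucket: len(ids) for bucket, ids in buckets.items()}


def by_month(day):
    """Bucket key for monthly aggregation."""
    return (day.year, day.month)


def by_week(day):
    """Bucket key for ISO-week aggregation."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


# Made-up identifiers, purely for illustration.
daily = {
    date(2018, 6, 20): {"a", "b"},
    date(2018, 6, 21): {"b", "c"},
    date(2018, 7, 1): {"a"},
}
monthly = aggregate_unique(daily, by_month)
```

In this made-up sample, June contains three distinct systems even though the
two daily counts add up to four, which is why a new day has to be merged
against the earlier summaries rather than against earlier totals.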

For further details about the tool, or to review it, see the metrics/access
directory [8] and its README.

One of the areas of interest was the number of beta systems Leap receives. The
release schedule for the last three releases of Leap may be used to annotate
the graphs by enabling the corresponding annotation at the top of the
dashboard. The individual product series may also be isolated by clicking the
product in the legend (ctrl+click to select more than one to isolate). The
time range may also be changed using the tool in the top right (next to
refresh button) or by selecting the area on graph (left click, hold, and drag
to end of area desired). After focusing on the 42.2 and 42.3 Beta phases we
can see several thousand systems for both, but fewer for 42.3. It would be
interesting to know if that reduction is a result of the rolling release model
or something else.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-beta.png

One item to note is that SUSE IPs (such as openQA) are not currently filtered
out of the data and, depending on usage, may bump up the beta numbers. This is
something I have not yet explored, but it should not be too difficult to
filter given an IP list or user-agent.
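
As a rough idea of what such a filter could look like, here is a Python sketch
using the standard ipaddress module. The networks below are placeholder
documentation ranges (RFC 5737 and RFC 3849), not real SUSE ranges, and the
user-agent marker and the name is_excluded are likewise assumptions made for
the example.

```python
import ipaddress

# Placeholder documentation ranges; a real list of ranges to exclude
# would have to be supplied.
EXCLUDED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]
# Hypothetical user-agent substring to exclude.
EXCLUDED_AGENT_MARKERS = ("openQA",)


def is_excluded(ip, user_agent=""):
    """Return True if the entry matches an excluded network or agent."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Unparseable addresses are left to the invalid-entry handling.
        return False
    if any(address in network for network in EXCLUDED_NETWORKS):
        return True
    return any(marker in user_agent for marker in EXCLUDED_AGENT_MARKERS)
```

Such a check could run while each log line is summarized, so excluded entries
never reach the unique system counts.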

The extreme long-tail of systems on old products is interesting and would
seemingly indicate either neglected installs, laziness, or fear of updating.
Given that around a quarter of openSUSE systems are on releases beyond
end-of-life [9], it is a bit concerning. :/ It may make sense to add an
annotation containing product end-of-life dates. When compared to the last two
versions of Leap, Tumbleweed usage amounts to nearly half of one Leap release
or a fifth of systems on supported releases.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-stacked.png

For those interested in more details, there are three collapsed sections at
the bottom of the dashboard which contain additional breakdowns of the data
and output from the tool. For example, you can see the request counts by
unique system by product. Although the averages are reasonable, the maximums
are extremely high. Such maximums seemingly indicate either spam or heavy UUID
reuse. Changing the aggregation frequency to day shows a very flat series that
seemingly indicates automation.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-average-unique.png

Another area of interest is the steady increase in IPv6 traffic to roughly 10%
of current unique systems.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-unique-proto.png

The tool output includes the raw log size the metrics represent for the
current time interval in addition to the number of invalid entries
encountered. From reviewing a large number of the entries marked invalid, they
are indeed generally bogus, attack attempts, or incomplete requests. If we see
a large decline in system counts and a huge spike in invalid counts, that
should make it clear there is a problem with the logs or the tool going
forward, but the most recent numbers, before the log format was broken, show
the lowest invalid counts.

The invalid log entry counts line up nicely with the big hole in the data.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-invalid.png

If the time range is changed to a year and the aggregation frequency (top left)
is changed to a day we can very clearly see the correlation. It is even clear
that the day before the big hole is the day the error was made as half the
entries are invalid and log size is in between the day before and after.

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-log-size.png

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-invalid-day.png
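
The same check can be done on exported data rather than by eye. The Python
sketch below flags days whose share of invalid entries exceeds a threshold.
The 25% threshold, the input layout, and the name flag_suspect_days are all
assumptions made for the example, not something the tool or dashboard
provides.

```python
def flag_suspect_days(daily_counts, max_invalid_share=0.25):
    """Return days whose share of invalid entries exceeds the threshold.

    daily_counts maps a day label to (valid_entries, invalid_entries).
    """
    suspect = []
    for day, (valid, invalid) in sorted(daily_counts.items()):
        total = valid + invalid
        if total == 0:
            # A day with no entries at all is itself a hole in the data.
            suspect.append(day)
        elif invalid / total > max_invalid_share:
            suspect.append(day)
    return suspect
```

A day on which half the entries are invalid, like the one before the big hole,
would be flagged by any threshold below 50%.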

Similarly, if the unique by product (stacked) graph is reviewed by day,
another pattern exposes itself: a consistent drop in unique counts of nearly
20%. In other words, 20% of systems have weekends. :)

http://release-tools.opensuse.org/image/metrics.opensuse.org-access-stacked-day.png
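
For anyone wanting to quantify that drop from exported daily data, a small
Python sketch follows. It assumes the drop falls on Saturdays and Sundays and
takes a mapping of dates to daily unique counts. The name weekend_drop is
made up for this example.

```python
from datetime import date
from statistics import mean


def weekend_drop(daily_unique):
    """Fractional drop of the weekend average relative to weekdays."""
    weekdays = [n for day, n in daily_unique.items() if day.weekday() < 5]
    weekends = [n for day, n in daily_unique.items() if day.weekday() >= 5]
    return 1 - mean(weekends) / mean(weekdays)
```

The function expects at least one weekday and one weekend day in its input,
since statistics.mean raises an error on an empty list.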

Also note that one can export the data as CSV in addition to viewing a graph
full screen by clicking on the graph title. I look forward to receiving
feedback and insight after people explore the data.

While reviewing some of the raw log data I discovered a fair number of
interesting and odd entries. I will summarize some of the highlights below
(excluded from mailing list announcement).

Enjoy!

[1] http://release-tools.opensuse.org/2018/06/22/download.o.o-access-metrics.html
[2] https://metrics.opensuse.org
[3] https://metrics.opensuse.org/d/osrt_access/osrt-access
[4] https://github.com/influxdata/telegraf
[5] https://github.com/influxdata/influxdb
[6] https://github.com/influxdata/telegraf/issues/3539
[7] https://github.com/openSUSE/openSUSE-release-tools/pull/1578
[8] https://github.com/openSUSE/openSUSE-release-tools/tree/master/metrics/access
[9] https://en.opensuse.org/Lifetime

--
Jimmy

