Hi Jimmy, I have to admit that I don't understand which exact statistics you want to get. It would be important for me to have a concrete description of each measurement you want to make. We can then decide individually whether we can provide these numbers. Please create a separate document for each of them. If it is critical for the project, please use Fate; otherwise a wiki page or GitHub issue might be sufficient. Please describe what these numbers should tell us and, from your POV, what the basis for these numbers should be. We can discuss the implementation details in a later step. Thanks, Adrian. On Tuesday, March 14, 2017, 20:56:59 CET, Jimmy Berry wrote:
On Tuesday, March 7, 2017 3:27:11 PM CDT Henne Vogelsang wrote:
Hey,
On 01.03.2017 22:23, Jimmy Berry wrote:
On Wednesday, March 1, 2017 5:44:58 PM CST Henne Vogelsang wrote:
If you need to record some extra time series data for your staging workflow engine you can do that, as your engine always runs in the context of the OBS instance it's mounted on top of. So it will also have access to the influxdb instance etc.
Same is BTW true for access to the SQL database, your engine has the same access as the Rails app it's mounted from.
As I would expect. I was looking for access to develop against since it is difficult to recreate an accurate facsimile of the OBS instance and near impossible to simulate the variety of workflows through which requests have gone.
I very much doubt that. We have an extensive test suite that is already 'simulating' all major workflows, including requests of the various kinds. For creating data you can use the tooling that exists, like our data factories[1]. If you need help with this do not hesitate to contact me :-)
I skimmed through the files and I did not see anything similar to the Factory staging workflow managed by openSUSE/osc-plugin-factory. The components of that workflow would be covered by such tests and data creation, but that is not terribly helpful for trying to build something to extract specific statistics. The staging workflow creates reviews when requests are staged in a particular staging and records in which staging the request was placed.
The statistics of interest need to look for specific types of reviews related to the staging process and the spacing between them and other events. The data needs to be structured exactly like that in the real instance.
To be clear, I already wrote a few queries locally against records I created by hand that extract the desired information. As an example of tricky data, a request can be staged, denied, unstaged, and then reinstated. During the time it was denied, no review changes will be recorded (i.e. the fact that it was unstaged). This is one of the cases the tooling has to handle, and since I can recreate it locally I have no doubt it occurs. Making sure the statistics properly handle all the intricacies of the real data cannot easily be simulated. Having done this sort of work on other live systems, I know it is nearly impossible to predict the interesting edge cases in real data, and trying to do so is not particularly productive compared to running against the real thing.
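To make the denied-gap case concrete, here is a minimal sketch of the kind of interval reconstruction involved. The event records and event names are hypothetical and simplified; the real history is spread across OBS review and request-state tables:

```python
from datetime import datetime

# Hypothetical, simplified event stream for a single request.
# Note there is NO explicit "unstaged" event between the decline and
# the reinstatement -- that gap is the edge case described above.
events = [
    ("2017-03-01 10:00", "review_opened"),    # staged: staging review created
    ("2017-03-02 09:00", "declined"),          # request denied while staged
    ("2017-03-03 12:00", "reopened"),          # request reinstated
    ("2017-03-04 15:00", "review_accepted"),   # staging review accepted
]

def staged_intervals(events):
    """Pair each staging start with the next event that ends it.

    A 'declined' event implicitly ends the staged interval even though
    no review change was recorded for the unstaging.
    """
    intervals = []
    start = None
    for ts, kind in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        if kind in ("review_opened", "reopened"):
            start = t
        elif kind in ("declined", "review_accepted") and start is not None:
            intervals.append((start, t))
            start = None
    return intervals

for begin, end in staged_intervals(events):
    print(begin, "->", end, "=", end - begin)
```

This yields two staged intervals (23 hours, then 1 day 3 hours) rather than one continuous span, which is the distinction the statistics have to get right.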
It would also be good to see if pulling certain metrics directly from the source tables is performant enough.
Aren't you getting ahead of yourself? Why don't you first figure out what you want to do and how and then worry about performance of the production DB :-)
As noted in the original post, I have quite a bit of detail on what I want to do, along with a few approaches whose viability depends on their performance. If the simplest approach performs well enough, why spend extra time on a more complex one?
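The "simplest approach first" check could be as small as timing an aggregate query against the source tables. A throwaway sketch of the pattern, using an in-memory SQLite database with stand-in table and column names (not the real OBS schema):

```python
import sqlite3
import time

# Stand-in schema: the real OBS tables and columns differ; only the
# measurement pattern matters here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reviews (request_id INTEGER, state TEXT, created_at TEXT)")
db.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(i, "accepted" if i % 2 else "new", "2017-03-01") for i in range(10000)],
)

# Time a representative aggregate query.
start = time.perf_counter()
(count,) = db.execute(
    "SELECT COUNT(*) FROM reviews WHERE state = 'accepted'"
).fetchone()
elapsed = time.perf_counter() - start
print(f"{count} matching rows in {elapsed:.4f}s")
```

If numbers like these stay acceptable at production scale, the direct-query approach wins by default; only if they don't is the extra complexity of a separate store justified.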
If others have time to get more directly involved I can document more publicly the specifics of what I have already done, but otherwise I'll save that for when I have a final solution.
When I worked on tooling used by the development sites of other open source projects, it was possible to get a sanitized database dump or a staging environment that had access to both a clone of production and read access to production. These resources were invaluable for validating data migrations and tools before deployment.
This is a good practice that we also follow. But what has this to do with your tool? You are neither migrating nor deploying...
I am looking for the edge cases in the data, especially when requests are operated on while in a denied state (as noted above).
Without such access it was impossible to predict all the ways in which data can be inconsistent, corrupted, or contain odd edge cases.
Again you are getting ahead of yourself I think. We have a very well documented data structure. If something is inconsistent, corrupted or an odd edge case it is by our definition broken. If you come across such a case you should tell us or better yet fix that case :-)
I agree the data structure is documented. As noted I already wrote queries for some of the desired information. Without running queries and scripts against the real data I cannot find edge-cases.
Given that storing additional information will not cover all the desired metrics, it is likely more effective to just record time series data. I'll have to look at the tool in question, but I would expect a background job that periodically writes a record to the time series database.
No, the contrary. Every time something happens, a data point gets recorded into a data set in the time series DB. So let's say a request is closed. You would record the fact, the time, and add some tags describing the resolution (accepted, declined), the user who did this, etc. Once you have this data in the time series DB you can query and display it :-)
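Given that InfluxDB was the tool selected, such a per-event data point would presumably end up as one line-protocol write. A sketch of building that line for a closed request; the measurement name, tag names, and field names are illustrative, not an agreed schema:

```python
import time

def request_closed_point(request_id, resolution, user, timestamp_ns=None):
    """Build an InfluxDB line-protocol point for a closed request.

    Format: measurement,tag_set field_set timestamp
    Names used here are illustrative placeholders, not an agreed schema.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()  # line protocol defaults to nanoseconds
    tags = f"resolution={resolution},user={user}"
    fields = f"request_id={request_id}i"  # trailing 'i' marks an integer field
    return f"request_closed,{tags} {fields} {timestamp_ns}"

line = request_closed_point(12345, "accepted", "jberry", 1489522619000000000)
print(line)
# request_closed,resolution=accepted,user=jberry request_id=12345i 1489522619000000000
```

Each closed request would append one such point, and the resolution/user tags become the dimensions you can later group and filter by when querying.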
I contrasted storing additional data (in the OBS structure) with storing everything of interest in a time series database. Indeed, having the data in a time series database would work, but it represents a lot of data duplication and an entire process that, as I understand it, does not currently exist. As such I was hoping to avoid it and pull at least a subset directly from the existing data structure.
On that note, are the various Influx software pieces set up and hosted, or has nothing been done beyond selecting the desired tool?
No, nothing is done yet. Just planned, sorry.
Henne
[1] https://github.com/openSUSE/open-build-service/tree/master/src/api/spec/factories
At this point, I am not sure what is needed to move this forward. My goal, the specific metrics I would like to extract and present, is documented in the original post. I have done work on a local instance to see what metrics can be extracted from the existing data and wrote queries to do so. I have determined what information is lacking and concluded that it is likely best to have a new process for writing such time series data, which sounds similar to what was planned.
There are certain trends in the metrics I would expect to be present in the real data that I would like to confirm. In fact, the metrics that can be extracted from the existing data may suffice if they demonstrate the things in which I am interested, but I cannot confirm that by running queries against fake data.
I had hoped to avoid creating a data scraping tool, but if it is not possible to gain some sort of access to the data I may just do that to avoid being blocked. I will likely write the data into the same structure used by OBS so the tool remains compatible if ever deployed properly.
Some of the data is generic to all requests, while the rest is specific to openSUSE/obs_factory and openSUSE/osc-plugin-factory. I have considered building some additional API calls, perhaps some in obs_factory, that could expose certain aggregate query results. That may be useful, but at the moment this project is somewhat exploratory: what is interesting in the data will only become clear as it is explored. As such, a more fluid setup that allows for developing queries and metrics until a full picture is clear seems to make more sense than trying to build code and have it deployed before even an initial result can be seen.
--
Adrian Schroeter
email: adrian@suse.de
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Maxfeldstraße 5, 90409 Nürnberg, Germany