[zypp-devel] Re: [zypp-commit] r7678 - in /trunk/sat-solver/tools: ./ CMakeLists.txt attr_store.c attr_store.h attr_store_p.h dumpattr.c repo_susetags.c repo_susetags.h susetags2solv.c

29 Oct 2007

      Hi,

On Mon, 29 Oct 2007, Duncan Mac-Vicar Prett wrote:
...
...
It's not finished yet, neither the on-disk not the in-memory layout.  
Not the interfaces :-)  You can be sure that I'll document everything 
when I'm more satisfied with the whole thing.
I am really interested in the discussion about putting non-solvable data 
in the solver, or how to define solvable data, for example vendor can be 
used for solver policies (cross vendor upgrades), sizes can be use by 
solver for installation summary, etc. and at the same time they cost 
time and space.
I agree that an attribute store and the solvable data structures need to 
be integrated somewhat, exactly for the reason you state, sometimes a 
solver (or something sitting on top of it) needs to know assorted 
information about solvables for whatever it wants to achieve.

A common theme for such information is that it's used only relatively 
seldomly (e.g. only when there are several alternatives, which are 
completely equivalent from a dependency perspective, but for which other 
information could provide guidance which of the alternatives would be 
preferred).  I.e. it's used only when the depsolver itself can't determine 
everything on it's own.  To that end such additional information usually 
is only accessed on demand.

This property (seldomly used) is shared with purely diagnostic information 
(for instance to show in a GUI, like package summary or description), 
which also only needs to be retrieved when needed.

There's another property of all this information: once determined for a 
repository it usually doesn't change, i.e. is read-only mostly.

So, we have already one separation of all the information bits: necessary 
for the depsolver, hence accessed all the time, and all of it, and "rest".

My intent is to implement all this "rest" information as side-band info to 
the solver itself, so that it doesn't take much space when it isn't 
actually accessed.  Collectively I call this rest information 
"attributes".  At the same time I want accesses to be cumulatively "fast 
enough", i.e. accessing much of that information should still be fairly 
fast, even though I'm willing to pay a certain setup price (for initially 
accessing it).

Additionally for such rest information constant memory usage (what has to 
be kept in-core once it is accessed at all) is more important to me than 
speed of retrieval, as long as retrieval is "fast enough".

For some of these attributes it makes sense to be able to "grep" them, 
i.e. search for the set of attributes which match certain criteria.

Now, at this point this all looks much like I want a database :-)  If it 
weren't for memory requirements (and perhaps ease of use) :-)

So, the rough design goals I had in my brain were such (in no particular 
order):

* needs to be possible to store all kinds of information bits: numbers, 
  string, chunks of large data (e.g. for descriptions)
* the set of information bits associated with one data-set needs to be 
  dynamic (i.e. can be extended later)
* memory usage should be low
* access reasonable fast
* grepable (i.e. some sort of query interface, but not too elaborate)
* should be able to integrate into sat-solver proper

To reach these goals I made certain design decisions:

* large data blobs are stored separately and are _not_ generally held in 
  core.  Only (offset,length) pairs are stored directly.
* attributes are name,type,value triples, with name being a string, and 
  value being of several basic types (number, string, blob)
* a set of attributes simply is a list of such triples
* attribute names actually are uniquified strings, i.e. represented only 
  as IDs (local to one store, other stores might use the same ID for a 
  different name, but the name strings itself are stored and hence 
  uniquied in the Pool).  This leads to those name IDs being small numbers
* maximum 1<<12 different attribute names are possible per store
* maximum 1<<4 types are possible (so (name,type) fits into one short)
* for space and efficiency reason we support a store local uniquified 
  string pool.  So the values can be two types of strings:
    - an inline string, represented as is in the list of attributes
      the same string in different attributes is represented multiple 
      times
    - an string ID, pointing into the store-local string pool, so all 
      equal strings store in this manner are represented only once (very 
      good e.g. for the License: tag)
  this string pool is _not_ merged with the Pool stringpool, it would just
  cost time in loading an attribute store and wouldn't provide many 
  advantages.
* two internal representations: one read-only (the space efficient one),
  one where you can dynamically add attributes.
* the read-only representation uses schema abbreviations, which are a list 
  of sorted (name,type) pairs, which can be references by the individual 
  attribute lists (which then only need to store the actual data per 
  data-set, not all the (name,type) specifiers again and again, which tend 
  to be equal for many data-sets.  That's like in DWARF (the debug 
  format).

I only started to implement this on saturday, so I'm sure I'll revisit 
some of those decisions over time when I implement the other things 
necessary.  In particular these things are still missing:

* easily usable iterator over the read-only representation
* grep interface
  here I intend to provide some functions with callbacks which then get
  either the matching data-sets (like grep -l), or the data-sets and all
  (name,value) pair which matched (like normal grep).
  I also intend to provide different search strategies:
    * only in certain attributes, or in all of them
    * only a substring search or a regexp or a fileglob (let's see how 
      determined I am during my holidays :-))
* support for big blobs (that means generating them, storing them away, 
  and loading them on demand, I have in mind either a paging scheme with 
  compressed pages, or an uncompressed scheme)

And of course making use of all of this in the several xxx2solv 
generators.  A start can be seen in susetags2solv which (when called with 
'-a') emits a file "test.attr" containing license, authors, group and 
keywords of each solvable.

Ciao,
Michael.
-- 
To unsubscribe, e-mail: zypp-devel+unsubscribe@opensuse.org
For additional commands, e-mail: zypp-devel+help@opensuse.org

[zypp-devel] Re: [zypp-commit] r7678 - in /trunk/sat-solver/tools: ./ CMakeLists.txt attr_store.c attr_store.h attr_store_p.h dumpattr.c repo_susetags.c repo_susetags.h susetags2solv.c

Michael Matz