[openFATE 306379] Use rsync when refreshing repositories

5 Dec 2009

      Feature changed by: Jan Engelhardt (jengelh)
Feature #306379, revision 16
  Title: Use rsync when refreshing repositories

  openSUSE-11.2: Rejected by Michael Löffler (michl19)
  reject date: 2009-08-11 16:01:45
  reject reason: too late for 11.2, moved to 11.3
  Priority
      Requester: Important

  openSUSE-11.3: Evaluation
  Priority
      Requester: Important

  Requested by: Piotrek Juzwiak (benderbendingrodriguez)

Description:
  It would be a great idea to use rsync when refreshing repositories,
  because refresh speed is one of the weak points, and it gets worse when
  people have many repositories. I'm not sure, but I believe zypp already
  checks whether something has changed in the repo; still, rsync would
  speed things up. For example, big repositories like Packman are
  downloaded again every time the default 10 minutes (in the zypp
  settings) have passed, even though nothing much changes there.

  Discussion:
  #1: Roberto Mannai (robermann79) (2009-05-08 14:57:32) 
  The best way to incrementally download only the diff of a binary file,
  to the best of my knowledge, is the GDIFF protocol, which was submitted
  to the W3C ten years ago:
  http://www.w3.org/TR/NOTE-gdiff-19970901
  I know for sure that a commercial configuration management product
  (Marimba, now bought by BMC, see http://www.marimba.com) uses it,
  implemented in Java. It is very useful on low-bandwidth networks, for
  example when downloading a service pack. I don't know whether anyone
  could reuse that Java implementation, though, since it is a commercial
  application.
  Other implementations exist in Perl and Ruby:
  http://search.cpan.org/~geoffr/Algorithm-GDiffDelta-0.01/GDiffDelta.pm
  http://webscripts.softpedia.com/script/Development-Scripts-js/gdiff-gpatch-1...
  An open source .NET (C#) implementation under the MPL license:
  http://gdiff.codeplex.com/
  I cannot understand why that algorithm is not widely used, given its
  quality. It would be useful when downloading large files like ISOs or
  VM images, or repository information.

  #2: Roberto Mannai (robermann79) (2009-05-08 15:08:01) (reply to #1)
  In your use case, the repository could provide a GDIFF file describing
  the change in the content metadata, i.e. the delta between two known
  "versions" of it over time.
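
The idea in #2 can be illustrated with a minimal sketch. It is not the
GDIFF byte encoding from the W3C note; it only models a delta as a list
of "copy a range from the old version" and "insert new data" operations.
The function name, the tuple layout, and the sample metadata are
assumptions made for this illustration.

```python
def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version from the old one plus a list of operations.

    Each operation is a tuple (hypothetical layout, not the GDIFF format):
      ("copy", offset, length)  reuse bytes already present in `old`
      ("data", payload)         bytes that had to be transferred
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        elif op[0] == "data":
            out += op[1]
        else:
            raise ValueError(f"unknown operation: {op[0]!r}")
    return bytes(out)


# Old metadata known to the client, and a delta that changes one entry.
old_metadata = b"pkgA 1.0\npkgB 2.0\n"
delta = [("copy", 0, 9), ("data", b"pkgB 2.1\n")]
new_metadata = apply_delta(old_metadata, delta)
```

Only the "data" payloads travel over the network, so the transfer size
depends on how much changed between the two versions, not on the size of
the metadata.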

  #3: Piotrek Juzwiak (benderbendingrodriguez) (2009-06-18 23:16:55) 
  Hmm, I guess Packman wouldn't implement that just for me ;) It would
  speed things up, though, and it is widely known and often said that
  refreshing repos in openSUSE is slow.

  #4: Luc de Louw (delouw) (2009-07-04 18:21:31) 
  Why not rsync? Because it does not work with http(s). This is important
  since in many companies the only way to get data from the internet is
  via an HTTP proxy.
  The GDIFF approach sounds promising.

  #5: Roberto Mannai (robermann79) (2009-07-04 18:39:56) (reply to #4)
  For a "GDIFF on HTTP" implementation, see
  http://www.w3.org/TR/NOTE-drp-19970825

  #6: Jan Engelhardt (jengelh) (2009-07-05 15:23:26) 
  Making use of rsync would give zypper checksumming and automatic
  download resuming/repairing at no cost ;-)

  #7: Roberto Mannai (robermann79) (2009-11-30 21:58:19) (reply to #6)
  "delouw" says that rsync does not support HTTP. This is a real blocking
  problem.

+ #9: Jan Engelhardt (jengelh) (2009-12-05 13:30:59) (reply to #7)
+ Even if the rsync *program* could do HTTP, it would not help you much,
+ because HTTP does not implement that rolling checksum and all the other
+ fluffy things of rsync.
+ Also, it seems obvious to me that use of rsync is an optional extra
+ feature that you can choose to ignore when refreshing your
+ repositories.
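
For readers unfamiliar with the rolling checksum mentioned in #9, here is
a minimal sketch in the style of the weak checksum described for the
rsync algorithm. The function names, the window size, and the sample data
are assumptions for this illustration; the real rsync program differs in
detail and pairs this weak checksum with a stronger hash.

```python
M = 1 << 16  # both sums are kept modulo 2**16


def weak_checksum(block: bytes) -> tuple:
    """Compute the two running sums (a, b) for a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int) -> tuple:
    """Slide the window one byte to the right in constant time.

    out_byte leaves the window on the left, in_byte enters on the right.
    """
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


# Slide an 8-byte window over some data without re-summing each window.
data = b"repomd.xml primary.xml.gz"
n = 8
a, b = weak_checksum(data[:n])
rolled = [(a, b)]
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k - 1 + n], n)
    rolled.append((a, b))
```

Because each step costs the same regardless of the window size, the
receiver's block checksums can be looked up at every byte offset of the
sender's file, which is what plain HTTP range requests do not provide.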

  #8: Robert Davies (robopensuse) (2009-11-30 22:23:41) 
  Why not define the transferred refresh file as being based on deltas?
  The first time through, you have a "current v empty file" delta; the
  repo can save a delta against the empty file as well as monthly and
  weekly changes, with deltas for changes made against those. A refresh
  can then check for updates in the last week if it has a current weekly
  file (downloading only if one exists), use the monthly if the weekly is
  out of date, and fall back to the delta v empty file if the local
  monthly and weekly are both out of date. Some sanity check, based on
  the server's idea of the date, can prevent clients from getting things
  too horribly wrong.
  Then most of the time it's a small file that's easily cached for a
  short time; the monthly can be cached for longer with a predictable
  TTL, and current v empty could be cached for, say, a day.
  I'm not sure about the repo format, but if delta handling involved
  compression, wouldn't a simple text format be natural and efficient for
  the repo transfer contents? Any binary file should likely be a cache
  generated locally.
  Though this sounds much more complicated, presumably there are tools
  for generating the repo contents file and processing the downloaded
  repo file, so it ought not to be so difficult in principle.
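
The fallback order proposed in #8 could be sketched roughly as follows.
This is only one reading of the proposal: the function name, the return
labels, and the use of ISO calendar weeks and calendar months to decide
what counts as "current" are assumptions, not anything zypp or a
repository defines.

```python
from datetime import date
from typing import Optional


def choose_download(server_today: date,
                    local_weekly: Optional[date],
                    local_monthly: Optional[date]) -> str:
    """Pick which file a client would fetch on refresh (hypothetical labels)."""

    def same_week(d: date) -> bool:
        return d.isocalendar()[:2] == server_today.isocalendar()[:2]

    def same_month(d: date) -> bool:
        return (d.year, d.month) == (server_today.year, server_today.month)

    def sane(d: Optional[date]) -> bool:
        # Sanity check against the server's idea of the date:
        # a local file dated in the future is not trusted.
        return d is not None and d <= server_today

    if sane(local_weekly) and same_week(local_weekly):
        return "weekly-delta"   # small file with this week's changes
    if sane(local_monthly) and same_month(local_monthly):
        return "monthly-delta"  # weekly is stale, monthly still usable
    return "full"               # delta against the empty file


today = date(2009, 11, 30)
fresh = choose_download(today, date(2009, 11, 30), date(2009, 11, 2))
stale_week = choose_download(today, date(2009, 11, 20), date(2009, 11, 2))
first_time = choose_download(today, None, None)
```

In the common case the client lands in the first branch and fetches only
the small weekly file, which is also the one that proxies can cache
cheaply for a short time.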

-- 
openSUSE Feature: 
https://features.opensuse.org/306379

fate_noreply＠suse.de