Re: [suse-mirror] rsync module for the sources - now there

10 Apr 2009


      Eberhard Moenkeberg (emoenke@gwdg.de) wrote on 3 April 2009 22:49:
...
On Fri, 3 Apr 2009, Peter Poeml wrote:
...
You can use a module named opensuse-full-really-everything which 
isn't publicly documented, but it's there; I can't remember exactly if
it has always existed or if I made it at one point in time, but never
announced it. (I think the former.)
Very good (long appreciated), but:
receiving file list ... 
rsync: The server is configured to refuse --hard-links (-H)
rsync error: requested action not supported (code 4) at 
clientserver.c(839) [sender=3.0.5]
You should allow -H, especially at this "top level".
Certainly. It shows that it was indeed not done by Peter, because he
obviously knows much better :-)

I'm waiting for this to be fixed to pull the sources without having to
do a separate sync.
...
...
...
Perhaps the easiest and most effective way is a social one: mirror
tiering. You choose the bigger, better connected and better managed
mirrors spread around the world, and ask them to commit to being
tier-1 mirrors for opensuse. They'd need to have at least
full-with-factory-and-sources [oh!! :-)], plus factory ppc, and allow
public access via rsync. Only these would have access to stage, the
others would use the tier-1s, so that you keep the crowd off your
machine. Tiering would give you a solid distribution network without
consuming your resources. This is what most distros do. There'd be no
changes in using mirror brain to monitor all mirrors and sending
clients to them. You could perhaps also count on the tier-1s to
implement some of the technical methods below.
Yes. It's amazing that we have survived, bandwidth-wise, without tiering
for so long.
No wonder. From the beginning SUSE could rely on the German university
network (DFN). It was simply appropriate to the needs of the university
data centers...
...
An issue that I see with tiering is a slight loss of control, which
doesn't matter much (knowing who syncs; discovering new mirrors that
would go unnoticed otherwise), but some things could be a loss (the
possibility of updating the mirror database tied to rsync log parsing,
which is something that I wanted to explore a little further, but
which is probably better done by some form of triggering or pushing
anyway).
I think this is unrelated. You can continue to request rsync access
for the scanner from all mirrors, as you do now.
...
You already have this tiering, by tradition.
The German DFN network has always been (and still is) the "technological
leader" in bandwidth, so it has always been the best choice to use a
tier-1 mirror within the DFN.
Sure. However this isn't enough; the idea is that only tier-1 mirrors
sync from stage to reduce its load, so more, geographically spread,
tier-1 mirrors are necessary. It seems that the one(s) in DFN are
enough for Europe but it'd be convenient to have others on the other
side of the Atlantic and in Asia and Australia.
...
...
At each of these stages, the sync needs to be timely (triggered pull or
push), and not overlapping a sync at the stage before or after. At the
same time, at each stage load can be wasted, or saved by avoiding
unneeded syncs. At each of the stages, the order of files synced could
matter (packages before repodata is better...).
A good point - Peter, you are the only one who could help to make this 
better...
This is feasible with mirrors that have good connection with stage and
are well managed, that is tier-1's. If you want to be very fancy
you'll have to manage your account on the mirrors, like sourceforge
does. Some of the features above we already do. About packages before
repodata, I think this is largely unnecessary with --delay-updates and
--delete-delay or --delete-after (depending on the mirror rsync version).
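
A minimal Python sketch of that option choice, assuming the mirror knows
its own rsync version as a tuple. The function names and the source and
destination arguments are hypothetical; the only fact relied on is that
--delete-delay exists from rsync 3.0.0 on, so older clients fall back to
--delete-after.

```python
def delete_option(rsync_version):
    """Pick the deletion flag for an rsync version tuple such as (3, 0, 5)."""
    # --delete-delay was introduced in rsync 3.0.0; older versions only
    # offer --delete-after for deleting once the transfer has finished.
    return "--delete-delay" if rsync_version >= (3, 0, 0) else "--delete-after"


def build_rsync_args(rsync_version, source, dest):
    """Build an argument list; source and dest are placeholders (assumption)."""
    return [
        "rsync",
        "-a",
        "--delay-updates",  # put updated files in place only at the end
        delete_option(rsync_version),
        source,
        dest,
    ]
```

With both options the new repodata and the new packages become visible
together at the end of the run, which is why the sync order matters less.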
...
...
I don't know details about how kernel.org detects changes (it may depend
on their tree), but I have two things in mind for that:
1) a generic way to detect changes in a file tree based on the inotify
mechanism. I'm thinking of a recursive directory watch which notices
when changes occur, awaits the end of the changes (by waiting for a
long enough period of time without further changes happening), thereby
detecting the moment when an incoming sync is finished. Then it could
trigger the next-stage mirror, or start a push sync to it.
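
The "wait for a long enough quiet period" part of this idea can be
sketched independently of inotify. The Python below is only an
illustration under assumptions: the class name, the method names and the
choice of feeding timestamps in by hand are all hypothetical, and the
code that would actually receive change events is left out.

```python
class QuietPeriodDetector:
    """Report the end of a burst of changes after a period of silence."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self.last_event = None
        self.pending = False

    def record_event(self, now):
        """Call this for every change seen in the watched tree."""
        self.last_event = now
        self.pending = True

    def sync_finished(self, now):
        """Return True once per burst, after quiet_seconds without changes."""
        if self.pending and now - self.last_event >= self.quiet_seconds:
            self.pending = False
            return True
        return False
```

A watcher would call record_event() on each notification and poll
sync_finished() periodically; a True result is the moment to trigger the
next-stage mirror or start a push.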
...
Can't you watch the incoming syncs by inspecting the logs?
...
2) a more specialized mechanism for too large trees where inotify
doesn't scale (that's the case with the buildservice with its 70,000
directories). This mechanism is specific to the build service tree and
its layout.
The master tree should only be a repository for the tier-1 mirrors.
When you change it you should know what has changed without having to
scan it all or using inotify.
...
I'm sure you already _do_ have a good mechanism for the buildservice tree.
...
I am not sure if 1) and 2) can be combined; but I'm sure that I would
like to have something that is not a hacky local solution but rather
something that's reusable enough for other people to use.
I disagree. This is highly specific to the process used to change the
master repository. Each software distribution has its own.
...
...
...
Another possibility, as you say, is to have write access, which is
equivalent to having an account on the machine. We also do it here;
sourceforge is a very big example. They're very good at keeping a tree
10 times bigger than a distro's in sync with minimal load.
Sourceforge is also very good at selecting what to actually sync to
mirrors. That's also a direction that I want to explore (and talk to
them if we can share something).
I don't understand. You have to send what's in the master tree,
period. I suppose you already know how to build the master tree...
...
...
...
A third method is to use rsync in a better way. We don't do disk
scanning here when we update; only the master is hammered :-) However
if you give me a list of files in your site, such as the one created
by find or
rsync localhost::a-[hidden]module-with-everything > filelist
then we'll do *no* disk scans at *either* end, and only pull the
necessary files. We do this for another distro...
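
A minimal Python sketch of the comparison step, assuming both the
master's file list and the mirror's own list from its previous run have
already been parsed into {path: (size, mtime)} mappings. The function
name and the mapping layout are assumptions, not the script actually
used here.

```python
def plan_sync(master, local):
    """Compare two file lists without touching the disk on either side.

    master and local map a relative path to a (size, mtime) tuple.
    Returns the paths to fetch and the paths to delete locally.
    """
    # Fetch anything that is new or whose size/mtime differs.
    to_fetch = sorted(p for p, meta in master.items() if local.get(p) != meta)
    # Delete anything the master no longer lists.
    to_delete = sorted(p for p in local if p not in master)
    return to_fetch, to_delete
```

The to_fetch list, written one path per line, could then be handed to
rsync through its --files-from option so that only those files are
requested.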
That's a nice idea that I haven't looked into so far. I need to
familiarize myself a bit with that. So far, I tended to believe that it
wouldn't work for us; not for the large trees. Collecting the filelist
locally is a job that takes hours, which makes it a hopeless undertaking
for a tree that changes every 30 seconds. Nice for trees that don't
change I guess...
As I said above, you shouldn't need to scan the whole tree; you should
know what has changed from the internal update process, and patch the
filelist. And you don't need to patch it every 30 seconds; you may
decide that changes will only be visible to mirrors in larger
intervals, like 5 or 15 minutes.
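
As a rough Python sketch of patching the list instead of rescanning: the
function name and the {path: (size, mtime)} layout are assumptions, and
the changes are taken to come from whatever the internal update process
already records.

```python
def patch_filelist(filelist, added_or_changed, removed):
    """Return a new file list with the known changes applied.

    filelist and added_or_changed map a relative path to (size, mtime);
    removed is an iterable of relative paths.
    """
    patched = dict(filelist)  # leave the published list untouched
    for path in removed:
        patched.pop(path, None)
    patched.update(added_or_changed)
    return patched
```

Changes can be accumulated and applied in one go at each publishing
interval, so the list the mirrors see only moves every 5 or 15 minutes.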
...
A full directory scan ("ls -lR") at ftp5.gwdg.de (7.7 TB, 11.5 million 
files) needs 38 minutes - can you add some more RAM to your machine?
ftp5 has 32 GB, and the only bottleneck is network.
The filespace is LVM'ed over a single Transtec/Infortrend raid array 
with two controllers (each serving a raid6 set of 11 disks).
Filesystem type: xfs.
ls -lR does measure the time to scan the disk, so I agree with you.
However, it's always useful to observe that its format is very
inadequate for mirroring. With the right options ls can be improved
but it's always cumbersome. rsync or find are always better. ls -lR
was used more than a decade ago...
...
...
Sounds doable if the filelists are collected not from the filesystem,
but e.g. from the buildservice (which already knows what it updates).
Surely the best way.
Exactly.
...
...
...
Even better, if you
give a list of checksums we'll use it both for updating and for
verifying that our repo is correct. We also do it with another distro.
The disadvantage of this method is that it needs a complicated script,
so mirrors are unlikely to use it.
I have a local tree of checksums for most files, but unfortunately that
tree has as many directories and as many files as the actual file tree.
This is nearly useless. Put them all in a single file.
...
...
Maybe there are smarter ways
The smarter way is to put them in a single file, like other distros
do.
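
A minimal Python sketch of how a mirror could use such a single file,
assuming one "digest  path" pair per line as sha256sum prints it. The
hash algorithm, the file layout and the function names are assumptions;
the thread does not say which format the other distros use.

```python
import hashlib
import os


def parse_checksum_file(text):
    """Parse 'digest  relative/path' lines into a {path: digest} dict."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(None, 1)
        sums[path] = digest.lower()
    return sums


def file_digest(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatches(expected, root):
    """Return the listed paths that are missing or differ under root."""
    bad = []
    for rel, digest in sorted(expected.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_digest(full) != digest:
            bad.append(rel)
    return bad
```

The same parsed list serves both purposes mentioned above: the paths
returned by mismatches() are the ones to pull again, and an empty result
confirms the local repo is correct.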
...
...
to use them for something that I haven't thought about yet.
The master has no use for them; it's just a nice service you offer to
your dear mirrors :-) I'd certainly very much appreciate having them.
In return, [some of] your dear mirrors wouldn't scan you all out
at every update :-)

carlos＠fisica.ufpr.br