Hi,

What do I have to do if I wish to get a directory tree home using wget, recursively? I have tried everything, but I do not know how to specify a tree for download. When I tell it a file, it gets it. When I tell it a directory, it gets only an artificial index.html and that is it.

Can it even be done? I am trying to mirror ftp.suse.com's 8.2/i386 branch on my local machine, for installations etc., and am having a hard time writing a script which would get INDEX.gz from SuSE, then construct a file list, and try to fetch every file absent locally from any of the 12-13 mirrors in a round-robin fashion. There is only one download stream from a site at a time, but all sites are busy all the time. I think that is the way to get maximum throughput on one connection per site.

Anyone tried this before? Please help.

Rohit
On Wed, May 28, 2003 at 11:56:12AM +0530, Rohit wrote:
: What do I have to do if I wish to get a directory tree home using wget,
: recursively? I have tried everything, but I do not know how to specify a
: tree for download. When I tell it a file, it gets it. When I tell it a
: directory, it gets only an artificial index.html and that is it.
:
: Can it even be done? I am trying to mirror ftp.suse.com's 8.2/i386 branch
: on my local machine, for installations etc., and am having a hard time
: writing a script which would get INDEX.gz from SuSE, then construct a file
: list, and try to fetch every file absent locally from any of the 12-13
: mirrors in a round-robin fashion. There is only one download stream from a
: site at a time, but all sites are busy all the time. I think that is the
: way to get maximum throughput on one connection per site.

Take a look at the "--mirror" option. Depending on the file layout you want on the local machine, you may also be interested in the '-nH' and '--cut-dirs=' options. Full details can be found in the wget(1) man and info pages.

However, it may not be what you really want. You can't effectively mirror from multiple servers at the same time; you'll end up stepping on your own toes as the multiple processes try to determine the ever-changing differences between the local and remote filesystems. You're probably better off taking the big hit the first time and then keeping up with a nightly or weekly cron job.

--Jerry

--
Open-Source software isn't a matter of life or death...
                                ...It's much more important than that!
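For illustration, the options Jerry mentions combine along these lines; the exact path and the --cut-dirs depth depend on where the 8.2/i386 tree actually lives on the server, so treat both as placeholders:

  # --mirror turns on recursion and timestamping; -nH drops the host
  # directory, and --cut-dirs=3 strips the leading pub/suse/i386
  # components so the files land under ./8.2/ locally
  wget --mirror -nH --cut-dirs=3 ftp://ftp.suse.com/pub/suse/i386/8.2/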
On 2003-05-28 at 12:08, Rohit wrote:
> Can it even be done? I am trying to mirror ftp.suse.com's 8.2/i386 branch
> on my local machine, for installations etc., and having a hard time writing
If I remember correctly, that server has a policy against mirrors. Why don't you try rsync? I think it is better for the job than wget.

For example, at "ftp.gwdg.de" there were these modules a few weeks ago:

  SuSE            ftp://ftp.gwdg.de/pub/linux/suse (~100 GB)
  SuSE-7.3        ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/7.3
  SuSE-7.3_update ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/update/7
  SuSE-8.0        ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/8.0
  SuSE-8.0_update ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/update/8
  SuSE-8.1        ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/8.1
  SuSE-8.1_update ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/update/8
  SuSE-8.2_update ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/suse/i386/update/8
  SuSE-people     ftp://ftp.gwdg.de/pub/linux/suse/ftp.suse.com/people

And some more.

Then, if I want to download or update the patches dir only, I do:

  rsync --archive --update --compress --delete -P \
    ftp.gwdg.de::SuSE-8.1_update/patches/. \
    /usr/local/update/i386/update/8.1/patches/. \
    | tee /home/cer/rsync.81.patches.log

Or this to download ...rpm/i586:

  rsync --archive --update --compress --size-only -vv -P \
    --exclude-from=/home/cer/bin/rsync.81.exc \
    ftp.gwdg.de::SuSE-8.1_update/rpm/i586/. \
    /usr/local/update/i386/update/8.1/rpm/i586/. \
    | tee /home/cer/rsync.81.rpm.i586.log

Of course, you could configure it to download everything from the tree, but I don't (I don't have the bandwidth). Have a look at the man page.

--
Cheers,
       Carlos Robinson
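As an aside, an rsync daemon will list the modules it exports if you connect with an empty module name, which is how lists like the one above can be refreshed (the output depends entirely on the server's configuration):

  # ask the daemon which modules it currently exports
  rsync ftp.gwdg.de::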
On Wed, 28 May 2003, Carlos E. R. wrote:
> Why don't you try rsync? I think it is better for the job than wget. For
> example, at "ftp.gwdg.de" there were these modules a few weeks ago:
Does rsync handle connections that authenticate via an HTTP proxy for web/FTP access? I know that wget does; I am not sure whether rsync does. The man page says nothing about HTTP proxies or authentication against such a server.

Rohit
On Thu, May 29, 2003 at 11:17:58AM +0530, Rohit wrote:
: On Wed, 28 May 2003, Carlos E. R. wrote:
: > Why don't you try rsync? I think it is better for the job than wget. For
: > example, at "ftp.gwdg.de" there were these modules a few weeks ago:
:
: Does rsync handle connections that authenticate via an HTTP proxy for
: web/FTP access? I know that wget does; I am not sure whether rsync does.
: The man page says nothing about HTTP proxies or authentication against
: such a server.

No, rsync works over rsh/ssh and either spawns an rsync process or (in this case) talks to the rsyncd process on the other end. HTTP-PROXY doesn't even enter into the equation because it's not necessary.

--Jerry

--
Open-Source software isn't a matter of life or death...
                                ...It's much more important than that!
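For the record, the two transports Jerry describes are selected by single versus double colons on the rsync command line; a sketch with placeholder host, user, and path names:

  # remote-shell transport: spawns rsync on the far end over ssh/rsh
  rsync -av -e ssh user@host.example.com:/some/path/ /local/copy/

  # daemon transport: talks to an rsyncd listening on TCP port 873
  rsync -av host.example.com::module/path/ /local/copy/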
* Rohit;
> On Wed, 28 May 2003, Carlos E. R. wrote:
> > Why don't you try rsync? I think it is better for the job than wget. For
> > example, at "ftp.gwdg.de" there were these modules a few weeks ago:
>
> Does rsync handle connections that authenticate via an HTTP proxy for
> web/FTP access? I know that wget does; I am not sure whether rsync does.
> The man page says nothing about HTTP proxies or authentication against
> such a server.
From man rsync:

  "You may establish the connection via a web proxy by setting the
  environment variable RSYNC_PROXY to a hostname:port pair pointing to
  your web proxy. Note that your web proxy's configuration must allow
  proxying to port 873."

So as long as your proxy allows connections to port 873, it is possible.

--
Togan Muftuoglu
Unofficial SuSE FAQ Maintainer
http://dinamizm.ath.cx
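Putting the man-page quote into practice would look something like this; the proxy host and port are placeholders for your own proxy, and the module is the one from Carlos's list above:

  # route the rsync daemon connection through the web proxy
  export RSYNC_PROXY=proxy.example.com:8080
  rsync --archive --compress ftp.gwdg.de::SuSE-8.1_update/patches/. \
    /usr/local/update/i386/update/8.1/patches/.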
Rohit wrote:
| What do I have to do if I wish to get a directory tree home using wget,
| recursively? I have tried everything, but I do not know how to specify a
| tree for download. When I tell it a file, it gets it. When I tell it a
| directory, it gets only an artificial index.html and that is it.

What you have to do is simple: use httrack. I don't really know whether it is in the SuSE distro, however.

Page: http://www.httrack.com/index.php

Grtz
Dries
--
<End of message>
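For completeness, a basic httrack run mirrors a site into a local directory along these lines (the URL and target directory are placeholders; whether httrack can speak FTP at all is questioned below):

  # copy a website into /local/mirror
  httrack "http://www.example.com/" -O /local/mirror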
> | What do I have to do if I wish to get a directory tree home using wget,
> | recursively? I have tried everything, but I do not know how to specify a
> | tree for download. When I tell it a file, it gets it. When I tell it a
> | directory, it gets only an artificial index.html and that is it.
>
> What you have to do is simple: use httrack. I don't really know whether
> it is in the SuSE distro, however.
>
> Page: http://www.httrack.com/index.php
Thanks to all who replied.

o Although it seems that httrack works only for websites, I wonder
  whether it handles ftp sites too. Shall check.

o I still feel that my script structure is good and efficient, and may
  be better than the alternatives suggested. Here it is:

  SuSE_sites    -> Contains the master as well as the mirror sites for
                   SuSE, and the value for wget's --cut-dirs parameter.

  SuSE_init     -> Gets INDEX.gz for DL_VERSION [=8.2 for now].
                   Constructs the directory tree for the paths in
                   INDEX.gz. Gets the MD5SUMS file for every directory
                   which contains any .rpm file [as mentioned in
                   INDEX.gz, again].

  SuSE_download -> Site-id is supplied as an argument; parses
                   SuSE_sites, builds the URL, and downloads a file if
                   it is not already there, is not under download, or
                   fails its md5 checksum [if applicable]. Runs for one
                   site.

  SuSE_verify   -> Recursive md5 checksum for the files which have one.

For N sites, N instances of SuSE_download will be heavy on the system, but I am sure that, operating in parallel, they will give me the best download throughput. The file listing, MD5SUMS, and the filenames are universal, and anything under download would hold a lockfile, so there is no overwriting (see the sketch after this message).

Inputs? I think httrack may not be able to do this - in parallel from 18 sites - but what do I know? I am yet to find out.

Rohit
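A minimal sketch of the SuSE_download locking idea described above, under assumed file formats (SuSE_sites holding "site-id base-url" lines and a pre-built file list with one relative path per line); the names are illustrative, not Rohit's actual scripts, and md5 verification is left out for brevity:

  #!/bin/sh
  # Usage: SuSE_download <site-id>
  SITE_ID="$1"
  # look up this site's base URL in the (assumed) SuSE_sites format
  BASE_URL=$(awk -v id="$SITE_ID" '$1 == id { print $2 }' SuSE_sites)

  while read -r path; do
      # skip anything already fetched by this or another instance
      [ -e "$path" ] && continue
      mkdir -p "$(dirname "$path")"
      # mkdir on a lock directory is atomic, so N parallel instances
      # (one per site) never download the same file twice
      if mkdir "$path.lock" 2>/dev/null; then
          wget -q -O "$path" "$BASE_URL/$path" || rm -f "$path"
          rmdir "$path.lock"
      fi
  done < filelist.txt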
participants (5)
- Carlos E. R.
- dries
- Jerry A!
- Rohit
- Togan Muftuoglu