I'm a bit confused by the behavior of wget. I've issued the command wget -m -k -K -np http://host.domain.tld/directory/ on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right? My intention was that it should do one round and then exit, but it seems to just go on... -- Anders Norrbring Norrbring Consulting
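[Editor's note: for reference, here is the command from the post with each flag annotated per the GNU wget manual; the URL is the poster's placeholder.]

```shell
# -m (--mirror) is shorthand for: -r -N -l inf --no-remove-listing
#    -r      recurse into links found in retrieved pages
#    -N      timestamping: skip files the server reports as no newer
#    -l inf  unlimited recursion depth
# -k (--convert-links)     rewrite links in saved pages to point at local copies
# -K (--backup-converted)  keep a .orig copy of each file before -k rewrites it
# -np (--no-parent)        never ascend above /directory/ on the server
wget -m -k -K -np http://host.domain.tld/directory/
```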
Anders Norrbring wrote:
I'm a bit confused by the behavior of wget. I've issued the command wget -m -k -K -np http://host.domain.tld/directory/ on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
I just tried the very same command on http://www.doubledecker.ch/ - no indefinite looping. /Per Jessen, Zürich -- http://www.spamchek.dk/ - managed anti-spam and anti-virus solution. Sign up for your free 30-day trial now!
On 2005-10-30 09:48 Per Jessen wrote:
Anders Norrbring wrote:
I'm a bit confused by the behavior of wget. I've issued the command wget -m -k -K -np http://host.domain.tld/directory/ on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
I just tried the very same command on http://www.doubledecker.ch/ - no indefinite looping.
I just don't get it. Could it be that the source site isn't static? That wget restarts somehow if there are new files somewhere, or anything like that? I'm puzzled. When I look in the wget log, I can see that it has looped files several times. Just as an example, I used the command above to mirror; if I then look in the directory everything is saved in, I have:

host.domain.tld/directory/ (and lots of files and dirs below this)

If I grep the log for 'directory/subdir/file.html' I get this:

--09:16:00--  http://host.domain.tld/directory/file.html.html
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.
           => `host.domain.tld/directory/file.html.html'
Server file no newer than local file `host.domain.tld/directory/file.html.html' -- not retrieving.

Seems like it has looped this specific file a couple of times. -- Anders Norrbring Norrbring Consulting
I'm a bit confused by the behavior of wget... I've issued the command
wget -m -k -K -np http://host.domain.tld/directory/
on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
My intention was that it should do one round, and then exit, but it seems like it just goes on... --
Anders Norrbring Norrbring Consulting
Try wget -rEKk www.somesite.com/directory/ Goksin Akdeniz
On 2005-10-30 11:54 Goksin Akdeniz wrote:
I'm a bit confused by the behavior of wget... I've issued the command
wget -m -k -K -np http://host.domain.tld/directory/
on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
My intention was that it should do one round, and then exit, but it seems like it just goes on...
Try wget -rEKk www.somesite.com/directory/
But I don't want the -E, since it renames files by appending an .html extension. Neither do I need the -r; all pages are static HTML. So that leaves -k -K. And why would I skip -np and -m? -- Anders Norrbring Norrbring Consulting
Anders, On Saturday 29 October 2005 23:19, Anders Norrbring wrote:
I'm a bit confused by the behavior of wget... I've issued the command
wget -m -k -K -np http://host.domain.tld/directory/
on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
Is it possible that the modification time returned is not that of the underlying files, but rather the current time? I think this would subvert wget's cycle-breaking logic, because when it comes around to download a given file a second time and the file appears to have changed (a later mod time than the previous download), it will retrieve that file again, and the files it refers to again. The result: an infinite loop.
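[Editor's note: the hypothesis above can be sketched as a toy model of wget's -N (timestamping) decision. The function and variable names are illustrative only; this is not wget's actual source.]

```python
# Toy model of wget -N: re-download only if the server's Last-Modified
# timestamp is newer than the local copy's mtime.

def should_redownload(server_mtime: int, local_mtime: int) -> bool:
    """wget -N keeps the local file when the server copy is no newer."""
    return server_mtime > local_mtime

# Healthy server: Last-Modified is the file's real, unchanging mtime,
# so a second visit skips the file and the crawl terminates.
local = 1000
assert not should_redownload(server_mtime=1000, local_mtime=local)

# Broken server: Last-Modified is always "now", so every revisit looks
# like a change, the page is fetched again, its links are crawled again,
# and the mirror never converges.
now = 2000
assert should_redownload(server_mtime=now, local_mtime=local)
```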
My intention was that it should do one round, and then exit, but it seems like it just goes on...
Normally it works, as long as the graph formed by the hyperlinks is bounded, either intrinsically or because you gave options that cut off, say, links that go off-site or outside the hierarchy at which you initiated the retrieval.
--
Anders Norrbring
Randall Schulz
On 2005-10-30 15:26 Randall R Schulz wrote:
Anders,
On Saturday 29 October 2005 23:19, Anders Norrbring wrote:
I'm a bit confused by the behavior of wget... I've issued the command
wget -m -k -K -np http://host.domain.tld/directory/
on a box to back up a site completely, but from what I can see, wget loops indefinitely, am I right?
Is it possible that the modification time returned is not that of the underlying files, but rather the current time? I think this would subvert wget's cycle-breaking logic, because when it comes around to download a given file a second time and the file appears to have changed (a later mod time than the previous download), it will retrieve that file again, and the files it refers to again. The result: an infinite loop.
That would certainly make sense. Looking in the log (with tail), I see:

Length: 4,573 (4.5K) [image/jpeg]
Server file no newer than local file `/dir/file.jpg' -- not retrieving.

I have no idea what the server time stamp would be..
My intention was that it should do one round, and then exit, but it seems like it just goes on...
Normally it works, as long as the graph formed by the hyperlinks is bounded, either intrinsically or because you gave options that cut off, say, links that go off-site or outside the hierarchy at which you initiated the retrieval.
It would be nice if I knew how to make it break after one cycle, but I don't.... Anders
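[Editor's note: wget has no "one pass then stop" switch, but the traversal can at least be bounded by capping recursion depth with -l instead of -m's implied -l inf. A sketch, with the depth value chosen arbitrarily:]

```shell
# -r -N spell out the recursion and timestamping parts of -m explicitly,
# while -l 5 caps recursion at five levels so the crawl must terminate.
wget -r -N -l 5 -k -K -np http://host.domain.tld/directory/
```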
Anders, On Sunday 30 October 2005 06:50, Anders Norrbring wrote:
On 2005-10-30 15:26 Randall R Schulz wrote:
Anders,
...
Is it possible that the modification time returned is not that of the underlying files, but rather the current time? I think this will subvert wget's cycle-breaking logic, ...
That would certainly make sense.. Looking in the log (with tail), I see that:
Length: 4,573 (4.5K) [image/jpeg]
Server file no newer than local file `/dir/file.jpg' -- not retrieving.
Modification times only really matter for HTML files (for the purposes of this discussion), since they're the only ones with hyperlinks.
I have no idea what the server time stamp would be..
You can see what mod time the server is returning by opening the page in a browser and getting the page's information or properties. In Mozilla, it's "View -> Page Info...". The "General" tab of the resulting window includes the modification and expiration times.
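[Editor's note: the same information is available from the command line; wget's -S/--server-response prints the HTTP response headers, and --spider skips the actual download.]

```shell
# Print the response headers (including Last-Modified, if the server
# sends one) without saving the file. Run it twice: if Last-Modified
# changes between runs, the server is reporting the current time.
wget -S --spider http://host.domain.tld/directory/file.html
```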
My intention was that it should do one round, and then exit, but it seems like it just goes on...
Normally it works, as long as the graph formed by the hyperlinks is bounded, either intrinsically or because you gave options that cut off, say, links that go off-site or outside the hierarchy at which you initiated the retrieval.
It would be nice if I knew how to make it break after one cycle, but I don't....
Ordinarily, it's not up to you, but rather wget itself. But if the server is not returning sensible modification times, then it may be difficult. You might want to try a different mirroring tool to see if the same problem exists there, too.
Anders
Randall Schulz
On 2005-10-30 16:06 Randall R Schulz wrote: [8<]
Ordinarily, it's not up to you, but rather wget itself. But if the server is not returning sensible modification times, then it may be difficult.
You might want to try a different mirroring tool to see if the same problem exists there, too.
You wouldn't possibly have any suggestions? ;) It has to be able to set the user agent and an HTTP user/password pair, since the directory is protected with .htaccess. Anders
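[Editor's note: for comparison with any alternative tool, wget itself covers both requirements; the credentials and agent string below are placeholders.]

```shell
# Basic-auth credentials for the .htaccess-protected directory,
# plus a custom User-Agent string for the requests.
wget -m -k -K -np \
     --http-user=USERNAME --http-password=PASSWORD \
     --user-agent="Mozilla/5.0 (mirror-job)" \
     http://host.domain.tld/directory/
```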
Anders, On Sunday 30 October 2005 08:14, Anders Norrbring wrote:
On 2005-10-30 16:06 Randall R Schulz wrote: [8<]
Ordinarily, it's not up to you, but rather wget itself. But if the server is not returning sensible modification times, then it may be difficult.
You might want to try a different mirroring tool to see if the same problem exists there, too.
You wouldn't possibly have any suggestions? ;) It has to be able to set the user agent and http user/password pair since the directory is protected with .htaccess
Anders
What have you already tried by way of locating an alternative mirroring tool? Obviously you searched the Internet, right? What did you think of http://www.softpanorama.org/WWW/mirroring_tools.shtml? Randall Schulz
participants (4)
-
Anders Norrbring
-
Goksin Akdeniz
-
Per Jessen
-
Randall R Schulz