[OT sed awk grep regexp] How to grep URLs out of an html page
Hey,

I want to "grep" out an exact list of URLs from a whole bunch of downloaded html pages. I can get as far as this sort of thing:

www.onlinebible.net/notes.html: href="http://www.answersingenesis.org/TheWord/Files/Notes//mhcc.exe">Matthew Bu

I only want the part from http:// through .exe, not any of the rest of the line. In the example above it would be

http://www.answersingenesis.org/TheWord/Files/Notes//mhcc.exe

It seems to me like this might be possible with RegExp, but I haven't a clue how to do it. Anyone know the magic string? :-)

TIA

----------------------------------------------------
Jonathan Wilson
System Administrator

Cedar Creek Software
http://www.cedarcreeksoftware.com

Central Texas IT
http://www.centraltexasit.com
Today, Jonathan Wilson wrote...
> Hey,

Hi,

> I want to "grep" out an exact list of URLs from a whole bunch of downloaded html pages. I can get as far as this sort of thing:
> www.onlinebible.net/notes.html: href="http://www.answersingenesis.org/TheWord/Files/Notes//mhcc.exe">Matthew
> Anyone know the magic string? :-)
Maybe this can help (it may not be the best answer, but... :)

    grep "your original grep expression" file.html | sed 's/.*<[aA] [Hh][Rr][Ee][Ff]=\("[^"]*"\).*/\1/; s/"//g'

The first command replaces the whole line containing <a href="proto://abc.def.ghi/asd/qwe/asd"> with just "proto://abc.def.ghi/asd/qwe/asd", and the second command strips the quotes. Note that it keeps only one URL per line.

Hope it helps.
Best Regards, Adilson Ribeiro
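
For pages that carry several links on one line, a different approach may be easier. The following is only a minimal sketch, assuming a grep that supports the -o option (GNU and BSD grep do; it is not in POSIX) and assuming every href value is double-quoted and sits on a single line. The function name extract_urls is made up for illustration.

```shell
# Hypothetical helper: print every double-quoted href value read
# from standard input, one URL per line.
extract_urls() {
    # -o prints only the matched text, -i ignores case (HREF, Href, ...)
    grep -io 'href="[^"]*"' |
        # drop the leading href=" and the trailing quote
        sed 's/^[^"]*"//; s/"$//'
}

# Example: the fragment quoted in the question
printf '%s\n' 'href="http://www.answersingenesis.org/TheWord/Files/Notes//mhcc.exe">Matthew' | extract_urls
```

Something like `cat *.html | extract_urls | sort -u` would then give a de-duplicated list across all downloaded pages. To keep only absolute links, the grep pattern could be narrowed to 'href="http://[^"]*"'.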
participants (2)
- Adilson Guilherme Vasconcelos Ribeiro
- wilson@claborn.net