Downloading all html links of a certain extension from a web page

I had reason to download a bunch of PDFs from a website recently and didn’t want to click 80+ links.

Here is how I approached it. On the page, I opened up the browser console and ran:

var links = []; Array.prototype.map.call(document.querySelectorAll("a[href$=\".pdf\"]"), function(e, i){ if ((links || []).indexOf(e.href) == -1) { links.push(e.href); } }); console.log('"' + links.join('" "') + '"');

Note the .pdf in there; that was the file extension I was looking for. This gives me back a nice list of URLs, each wrapped in quotation marks. Then I just used wget to fetch them all.

wget "url1" "url2" ...

HT to this SO post that gave me the solution.

(Originally posted on my blog).

I miss the Firefox extension DownThemAll; it made it so easy to get stuff off sites…

FYI, you can feed wget a text file of URLs. Combined with Tim's hat trick, it might be more efficient if the URLs you are after are too numerous to paste on the command line.
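
Something like this, assuming the URLs have been saved one per line to a text file (urls.txt is just a made-up name here); wget's -i / --input-file option reads its download list from that file:

wget -i urls.txt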

Wget has some fancy footwork in its options. I think you can set it to reject or accept only specific file extensions when it's in its website-mirroring mode. Doing that with a depth of 1 would probably be my immediate go-to for such a task.
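
Roughly along these lines, as an untested sketch (the URL is a placeholder, and the exact flags are worth double-checking in the man page):

wget -r -l 1 -np -A pdf https://example.com/some/page/

Here -r turns on recursive retrieval, -l 1 limits it to one level deep, -np stops it from climbing up to the parent directory, and -A pdf keeps only files with that extension.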

Though it's going to fail on any links that are constructed dynamically, whole cloth, out of JavaScript, which I suspect is becoming more common. So Tim's hat trick might be increasingly future-proof.
