Download entire websites easily

Alexio
  13 years ago

GNU Wget is a nice tool for downloading resources from the internet. The basic usage is wget url:

wget http://linuxreviews.org/

The power of wget is that you may download sites recursively, meaning you also get all pages (and images and other data) linked from the front page:

wget -r http://linuxreviews.org/

But many sites do not want you to download their entire site. To prevent this, they check how the client identifies itself. Such sites may refuse the connection or send a blank page if they detect that you are not using a regular web browser. You might get a message like:

Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as flashget, go!zilla, or getright

There is a very handy -U option for sites like this. Use

-U My-browser

to tell the site that you are using a commonly accepted browser:

wget -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

A website owner will probably get upset if you attempt to download their entire site using a simple

wget http://foo.bar

command. However, the website owner will not even notice you if you limit the download transfer rate and pause between fetching files.

To avoid being manually added to a blacklist, the most important command-line options are --limit-rate= and --wait=.

To pause 20 seconds between retrievals you should add

--wait=20

and to limit the download rate use something like

--limit-rate=20K

as this option takes a value in bytes per second by default; add the K suffix to specify kilobytes per second (here, 20 KB/s).

Example:

wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

A very handy option that ensures wget will not ascend to the parent directory, so it downloads nothing from the folders above the one you want to acquire, is:

--no-parent

Use this to make sure wget does not fetch more than it needs when you only want the files in a single folder.
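
For example, to fetch only what is inside one folder, something along these lines should do it (the folder name below is just a placeholder for illustration):

# "downloads" is a hypothetical directory, substitute the one you actually want
wget -r --no-parent -U Mozilla http://www.stupidsite.com/downloads/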

Read the manual page for wget to learn more about GNU Wget. The full official manual is also available online from the GNU project.
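
For example, to open the manual page straight from the terminal:

man wget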

A Gnome front-end for wget is also available if you prefer a graphical interface.


The original version of this how-to is available at http://linuxreviews.org/quicktips/wget/wget.en.pdf

Copyright (c) 2000-2004 Øyvind Sæther. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Comments
MagicMint 8 years ago

It’s still useful.


shahriyar1369 11 years ago

it's awesome i like it ;)


sunewbie 12 years ago

very useful


Alexio 13 years ago

To use GNU Wget with Firefox, you can follow a small tutorial about using wget from Firefox.


Alexio 13 years ago

@troyM - GNU Wget is a command line utility that downloads files. If you are using only the terminal, the best browsers are Links2 and ELinks.

Also, you should know that GNU Wget can work in the background, even while the user is not logged on. This means that you can start a retrieval and disconnect from the system while wget finishes the download. By contrast, most web browsers require the user's constant presence.
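
As a small sketch of that (assuming a reasonably recent GNU Wget and a placeholder URL), the -b option makes wget detach right after startup and write its progress to a wget-log file in the current directory:

# -b sends wget to the background; the URL below is only a placeholder
wget -b -r --wait=20 --limit-rate=20K -U Mozilla http://www.stupidsite.com/

You can then check wget-log later to see how the download went.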


troyM 13 years ago

Alexio, what browser(s) utilize this GNU Wget best? Thanks for the tutorial.


grim 13 years ago

Pretty awesome, this helped me a lot, ty good sir :)