Geek to Live: Wget local copies of your online research (del.icio.us, digg or Google Notebook)
wget-online-research1.jpg
by Gina Trapani
You've been diligently bookmarking and clipping web pages using an online service like del.icio.us, Google Notebook or digg. Sure, storing your data on the web is great for from-any-online-computer access, but in an age of cheap, enormous hard drives and powerful desktop search, why not replicate the data you keep on the web to your computer? That way you'll have a copy of every web page as it appeared when you bookmarked it, and a searchable archive of your research even when you're offline.
Using my favorite command line tool, wget, you can automatically and efficiently download the contents of a page of del.icio.us links, dugg stories or a public Google Notebook to your hard drive.
Wget 101
Wget newbies, take a gander at my first tutorial on wget. There you'll get some background on how wget works, where to download it, and the format of a wget command.
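For the impatient, the most basic form of a wget command is just the program name followed by the address you want to save (any URL will do in place of this example), which downloads that single page into the current directory:

    wget http://example.com/somepage.html

The commands in this article build on that, adding switches that control how far wget follows links and what it skips.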
Seasoned wgetters, come with me.
Archive del.icio.us bookmarks
Say you've got a presentation due about the current state of software, and you've been collecting research on the topic under the "software" tag in your del.icio.us bookmarks. Download all the documents linked from the http://del.icio.us/ginatrapani/software page using the following command (typed all on one line, without the line breaks shown here):
    wget -H -r --level=1 -k -p -erobots=off -np -N
    --exclude-domains=del.icio.us,doubleclick.net
    http://del.icio.us/ginatrapani/software
How to run this command: Replace http://del.icio.us/ginatrapani/software with your own del.icio.us username and the tag you want to archive. Create a new directory called "del.icio.us archive" and, from that directory at the command line, run your edited version of the command. (Even better, copy and paste the command into a text file, tweak it to your own needs, and save it as a script - .bat for Windows users, .sh for Mac users - then run the script instead of typing out that long thing every time.) When the command has completed, you'll have directories named after each domain in the del.icio.us links, with the downloaded files stored inside them.
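For instance, a Mac or Linux version of that script might look like the sketch below; archive-delicious.sh is just a suggested name, and you'd swap in your own username and tag. Inside a script, trailing backslashes let you split the long command across lines to keep it readable.

    #!/bin/sh
    # archive-delicious.sh - example wrapper for the command above
    # Replace "ginatrapani" and "software" with your own del.icio.us username and tag
    wget -H -r --level=1 -k -p -erobots=off -np -N \
      --exclude-domains=del.icio.us,doubleclick.net \
      http://del.icio.us/ginatrapani/software

Save the file, make it executable with chmod +x archive-delicious.sh, and run it from your "del.icio.us archive" directory.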
The breakdown: This command tells wget to fetch all the documents linked from http://del.icio.us/ginatrapani/software:
* -H: Span hosts - that is, follow links from del.icio.us to other sites
* -r: Recursively
* --level=1: Only go one level deep, so as not to grab all the documents those pages link to as well
* -k: Convert links in the local copies so they point to the other local copies
* -p: Get all the images and other auxiliary files needed to completely reconstruct the pages
* -erobots=off: Ignore robots.txt files and just download
* -np: Don't go up to the parent directory (that is, all of ginatrapani's bookmarks)
* -N: Only download files that are newer than what's already been downloaded
* --exclude-domains=del.icio.us,doubleclick.net: Skip links to other del.icio.us pages and to the ad server at doubleclick.net, because you don't want to download ads.
If that's too much for you to swallow, simply run the command pointed at your own del.icio.us bookmarks. Trust me, it works.
Alternately, instead of limiting the download to one tag, get all your del.icio.us bookmarks using the following command (omit the line breaks):
    wget -H -r --level=1 -k -p -erobots=off -np -N
    --exclude-directories=ginatrapani
    --exclude-domains=del.icio.us,doubleclick.net
    http://del.icio.us/ginatrapani
The only difference between this command and the last is that it includes an "--exclude-directories=ginatrapani" directive, which keeps wget from downloading every tag folder unnecessarily.
Archive someone's diggs
Say you want to archive all the stories Kevin Rose diggs. The wget command would look something like this (without the line breaks):
    wget -H -r --level=1 -k -p -erobots=off -np -N
    --exclude-domains=digg.com,doubleclick.net,doubleclick.com,fastclick.net,fmpub.net,tacoda.net,adbrite.com,sitemeter.com
    http://digg.com/users/kevinrose/dugg
Similar to the command above, this one excludes more ad servers (so you don't fill your hard drive with banner ad images) and is pointed at kevinrose's dugg page.
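If you archive more than one person's diggs, a small wrapper script that takes the digg username as an argument saves some retyping. Here's a rough sketch (the script name and layout are just one way to do it):

    #!/bin/sh
    # digg-archive.sh - example wrapper; pass a digg username as the only argument
    # Usage: ./digg-archive.sh kevinrose
    if [ -z "$1" ]; then
      echo "Usage: $0 digg-username"
      exit 1
    fi
    DIGGUSER="$1"
    wget -H -r --level=1 -k -p -erobots=off -np -N \
      --exclude-domains=digg.com,doubleclick.net,doubleclick.com,fastclick.net,fmpub.net,tacoda.net,adbrite.com,sitemeter.com \
      "http://digg.com/users/$DIGGUSER/dugg"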
Archive a public Google Notebook
Google Notebook's a great way to clip sections of web pages and make notes about them online, and you can make those notebooks public. Say you've got a public Google Notebook of aviation quotes you've collected from all over the web that you want to archive locally for when you're offline. Point wget at that notebook and tell it to save the page to aviationquotes-notebook.html with this command (omit the line breaks):
    wget -k -p -erobots=off -np -N -nd -O aviationquotes-notebook.html
    http://www.google.com/notebook/public/18344006957932515597/BDSKUIgoQ9K_Emdkh
Tips and tricks for archiving the web locally
Use Google Desktop or Mac OS X's Spotlight to search the contents of your downloaded bookmarks and web clips. Serious researchers on a Mac could import the downloaded documents into DevonThink as well.
Make downloaded pages expire after a set amount of time. If you only want to read the stuff Kevin's dugg in the last two weeks, clean up your download folder with the hard drive janitor, which deletes old files.
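If you'd rather handle that cleanup yourself on a Mac or Linux box, a one-line find command does the janitor's job. This sketch assumes your diggs live in a folder called ~/digg-archive; adjust the path and the number of days to taste:

    # Delete archived files that haven't been modified in the last 14 days
    find ~/digg-archive -type f -mtime +14 -delete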
Schedule automatic runs of wget downloads using Windows Task Scheduler, or cron on OS X and Linux.
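On OS X or Linux, for example, adding a line like this to your crontab (edit it with crontab -e) runs an archiving script every night at 1 AM; the script path is just a placeholder:

    # minute hour day-of-month month day-of-week command
    0 1 * * * /path/to/archive-delicious.sh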
Got a trusted wget recipe that you use all the time? Or a question about any of the ones presented here? Hit us up in the comments.
Gina Trapani, the editor of Lifehacker, thinks distributed personal data is the killer app. Her semi-weekly feature, Geek to Live, appears every Wednesday and Friday on Lifehacker. Subscribe to the Geek to Live feed to get new installments in your newsreader.
read more:
* back to school
* command line
* del.icio.us
* desktop search
* digg
* download managers
* feature
* geek to live
* google notebook
* research
* social bookmarking
* top
* web as desktop
* wget
phantomdata says:
Very nice, Gina! I've always been an avid fan of wget. It's saved me any number of times from having a particular site disappear just as I was going to reference it in a report. I always keep a wget mirror of sites I reference for just that reason. However, I had never thought of using it to grab delicious tags. What a wonderful idea!
You could take it further if you've got a UNIX box, and set it up as a cron job to run nightly in order to ensure that you'll always have your favorite sites. When I was on dial-up I would have a cron job fire up the connection and download all my daily reads for me around 1 AM, so I'd always have a local mirror to read in the mornings.
09/13/06 01:02 PM
CorranRogue9 says:
@phantomdata:
Good idea setting it up to run nightly, but if the site goes away, it will overwrite your current data with the blank site, and then you *won't* have it. I would suggest backing up the data right after you've written the essay you needed the websites archived for.
09/13/06 01:34 PM
digdug says:
If you use Firefox, you could also give the Slogger extension a try. It lets you archive a complete copy of a webpage with one click, or even automatically.
09/13/06 02:26 PM
David Burch says:
Gina,
Couldn't you also pass a user name and password, either in the URL or as command-line arguments, to download private Google Notebooks?
09/13/06 02:54 PM
Bassam says:
Great article, Gina! I'll definitely be trying this out.
Any ideas on how to use wget to archive private Google Notebooks? I keep most of my notebooks private, and I'd love to be able to download them.
09/13/06 02:56 PM
cebailey says:
Good stuff. I think this might strike a perfect balance for me in terms of keeping online bookmarks but also having an archive.
What would be fantastic is a way to somehow tie the del.icio.us notes and tags to the downloaded pages, maybe via Spotlight metadata. Anyone have any thoughts on this? It might be possible using the API, or the export to HTML feature...
09/13/06 04:01 PM
Gina Trapani, Lifehacker Editor says:
@David: You'd probably have to pass your Google cookies along with the wget command to authenticate and see private notebooks. I haven't given this a try, but do wget --help to see the cookie options.
And yes, Slogger and Scrapbook are both Firefox-based, non-command-line (so non-schedulable) extensions that do this as well, with a much friendlier GUI.
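Something along these lines might work - assuming you've exported your Google cookies to a Netscape-format cookies.txt file that wget can read, and completely untested:

    wget --load-cookies cookies.txt -k -p -erobots=off -N -nd -O mynotebook.html http://www.google.com/notebook/...

(mynotebook.html and the notebook URL are placeholders.)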
09/13/06 04:17 PM
kjohn says:
testing
09/13/06 08:03 PM
kjohn says:
Good one, Gina! But I can still only get one page's worth of my del.icio.us bookmarks.
One workaround is to use a URL of the form http://del.icio.us/username?setcount=100.
You still need multiple tries if you have more than 100 bookmarks.
09/13/06 08:08 PM
cebailey says:
Gina:
There's an errant space in your example command for getting ALL del.icio.us links. You have:
--exclude-domains=del.icio.us, doubleclick.net
should probably be:
--exclude-domains=del.icio.us,doubleclick.net
09/14/06 12:17 AM
Sander says:
I just found the following command; it's the most concise way I found in two days to back up del.icio.us bookmarks.
    wget http://del... --http-user=YOURUSERNAME --http-passwd=YOURPASSWORD --no-check-certificate
It gets all your bookmarks, not just your first 100 like some other examples, because it accesses the API. The only problem I have is that the results are in XML format; if anyone has an automated way of transforming it into an unordered HTML list (or could adjust the command), that would be great.
09/14/06 04:09 AM
babette says:
Been a fan of wget for a long time and have converted a few people. I have a few family members who are terrified of the command line. For them I recommend Deep Vacuum, sort of a wget with a GUI for OS X. Anybody else used this?
09/14/06 12:3
 
