Saturday, September 16, 2006

Geek to Live: Wget local copies of your online research (del.icio.us, digg or Google Notebook)

 
[Image: wget-online-research1.jpg]
 
by Gina Trapani
 
You've been diligently bookmarking and clipping web pages using an online service like del.icio.us, Google Notebook or digg. Sure, storing your data on the web is great for from-any-online-computer access, but in an age of cheap, enormous hard drives and powerful desktop search, why not replicate the data you keep on the web to your computer? That way you'll have a copy of every web page as it appeared when you bookmarked it, and a searchable archive of your research even when you're offline.
 
Using my favorite command line tool, wget, you can download the contents of a page of del.icio.us links, diggs or a public Google Notebook automatically and efficiently to your hard drive.
Wget 101
 
Wget newbies, take a gander at my first tutorial on wget. There you'll get some background on how wget works, where to download it, and the format of a wget command.
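 
If you just need a refresher on the shape of a wget command, it's the program name, then any options, then the URL to fetch; for example (example.com is just a stand-in here):
 
    # grab one page plus the files needed to display it, with links made local
    wget -p -k http://example.com/some-article.html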
 
Seasoned wgetters, come with me.
Archive del.icio.us bookmarks
 
Say you've got a presentation due about the current state of software and you've been collecting research on the topic in your del.icio.us bookmarks' "software" tag. Download all the documents linked from the http://del.icio.us/ginatrapani/software page using the following command (WITHOUT line breaks):
 
    wget -H -r --level=1 -k -p -erobots=off -np -N 
    --exclude-domains=del.icio.us,doubleclick.net 
    http://del.icio.us/ginatrapani/software
 
How to run this script: Replace http://del.icio.us/ginatrapani/software with your del.icio.us username and desired tag. Create a new directory called "del.icio.us archive" and from that directory at the command line, run your edited version of the script. (Even better, copy and paste the command into a text file, tweak it to your own needs, and save it as a script - .bat for Windows users, and .sh for Mac users. Then run the script instead of typing out that long thing every time.) When the command has completed, you'll have directories set up named after each domain in the del.icio.us links, with the files stored within them.
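 
For the Mac and Linux folks, that saved script might look something like this minimal sketch (the archive folder, username and tag are placeholders to swap for your own):
 
    #!/bin/sh
    # Archive everything linked from one del.icio.us tag page into a local folder.
    mkdir -p ~/delicious-archive
    cd ~/delicious-archive
    # Same command as above, split across lines with backslashes for readability.
    wget -H -r --level=1 -k -p -erobots=off -np -N \
        --exclude-domains=del.icio.us,doubleclick.net \
        http://del.icio.us/ginatrapani/software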
 
The breakdown: This command tells wget to fetch all the documents linked from http://del.icio.us/ginatrapani/software:
 
    * -H: Across hosts, meaning follow the links from del.icio.us out to other sites
    * -r: Recursively
    * --level=1: Only one level deep, so as not to also grab all the documents those pages link to
    * -k: Convert links so they point to the local copies of the pages
    * -p: Get all the images and other auxiliary files needed to completely reconstruct the pages
    * -erobots=off: Ignore robots.txt files and just download
    * -np: Don't go up to the parent directory (which would be all of ginatrapani's bookmarks)
    * -N: Only download files that are newer than what's already been downloaded
    * --exclude-domains=del.icio.us,doubleclick.net: Exclude links to other del.icio.us pages and to the ad server at doubleclick.net, because you don't want to download ads.
 
If that's too much for you to swallow, simply run the command pointed at your own del.icio.us bookmarks. Trust me, it works.
 
Alternately, instead of limiting the download to one tag, get all your del.icio.us bookmarks using the following command (omit the line breaks):
 
    wget -H -r --level=1 -k -p -erobots=off -np -N
    --exclude-directories=ginatrapani 
    --exclude-domains=del.icio.us,doubleclick.net http://del.icio.us/ginatrapani
 
The only difference between this command and the last is that it includes an "--exclude-directories=ginatrapani" directive, which keeps wget from downloading every tag folder unnecessarily.
Archive someone's diggs
 
Say you want to archive all the stories Kevin Rose diggs. The wget command would look something like this (without the line breaks):
 
    wget -H -r --level=1 -k -p -erobots=off -np -N 
    --exclude-domains=digg.com,doubleclick.net,doubleclick.com,fastclick.net,fmpub.net,tacoda.net,adbrite.com,sitemeter.com
    http://digg.com/users/kevinrose/dugg
 
Similar to the command above, this one excludes more ad servers (so you don't fill your hard drive with banner ad images) and is pointed at kevinrose's dugg page.
Archive a public Google Notebook
 
Google Notebook's a great way to clip sections of web pages and make notes about them online, and you can make those notebooks public. Say you've got a public Google Notebook of aviation quotes you've found all over the web that you want to archive locally for when you're offline. Point wget at that notebook and tell it to save the page to aviationquotes-notebook.html with this command. (Omit the line breaks.)
 
    wget -k -p -erobots=off -np -N -nd -O aviationquotes-notebook.html
    http://www.google.com/notebook/public/18344006957932515597/BDSKUIgoQ9K_Emdkh
 
Tips and tricks for archiving the web locally
 
Use Google Desktop or Mac OS X's Spotlight to search the contents of your downloaded bookmarks and web clips. Serious researchers on a Mac could import the downloaded documents into DevonThink as well.
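 
Mac users can also run Spotlight queries from Terminal; this quick sketch (assuming your archive lives in a ~/delicious-archive folder) limits the search to your downloaded pages:
 
    # Search only the archive folder for pages mentioning aviation
    mdfind -onlyin ~/delicious-archive "aviation"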
 
Make downloaded pages expire after a set amount of time. If you only want to read the stuff Kevin's dugg in the last two weeks, clean up your download folder with the hard drive janitor, which will delete old files.
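 
If you'd rather skip the janitor script, a find one-liner can handle the same cleanup; this sketch (assuming a two-week window and the same ~/delicious-archive folder) removes anything that hasn't been modified in 14 days:
 
    # Delete archived files last modified more than 14 days ago
    find ~/delicious-archive -type f -mtime +14 -exec rm {} \;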
 
Schedule automatic runs of wget downloads using Windows Task Scheduler or cron on OS X and Linux.
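 
On OS X or Linux, a crontab entry along these lines would do it (assuming you saved one of the commands above as a script at ~/bin/archive-bookmarks.sh, which is just a placeholder path); run crontab -e and add:
 
    # Run the archive script every night at 1 AM
    0 1 * * * /bin/sh ~/bin/archive-bookmarks.sh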
 
Got a trusted wget recipe that you use all the time? Or a question about any of the ones presented here? Hit us up in the comments.
 
Gina Trapani, the editor of Lifehacker, thinks distributed personal data is the killer app. Her semi-weekly feature, Geek to Live, appears every Wednesday and Friday on Lifehacker. Subscribe to the Geek to Live feed to get new installments in your newsreader.
read more:
 
    * back to school
    * command line
    * del.icio.us
    * desktop search
    * digg
    * download managers
    * feature
    * geek to live
    * google notebook
    * research
    * social bookmarking
    * top
    * web as desktop
    * wget
 
phantomdata says:
 
Very nice Gina! I've always been an avid fan of wget. It's saved me any number of times from having a particular site disappear just as I was going to reference it in a report. I always keep a wget mirror of sites I reference for just that reason. However, I had never thought of using it to grab delicious tags. What a wonderful idea!
 
You could take it further if you've got a UNIX box, and set it up as a cron job to run nightly in order to ensure that you'll always have your favorite sites. When I was on dial-up I would have a cron job fire up the connection and download all my daily reads around 1 AM, so I'd always have a local mirror to read in the mornings.
09/13/06 01:02 PM
CorranRogue9 says:
 
@ phantomdata:
 
Good idea setting it up to run nightly, but if the site goes away, the job will overwrite your current data with the blank site. Then you *won't* have it. I would suggest backing up the data right after you've written the essay you needed the websites archived for.
09/13/06 01:34 PM
digdug says:
 
If you use Firefox, you could also give the Slogger extension a try. It lets you archive a complete copy of a webpage with one click, or even automatically.
09/13/06 02:26 PM
David Burch says:
 
Gina,
 
Couldn't you also pass a user name and password, either in the URL or as command-line arguments, to download private Google Notebooks?
09/13/06 02:54 PM
Bassam says:
 
Great Article Gina! I'll definitely be trying this out.
 
Any ideas on how to use wget to archive private Google Notebooks? I keep most of my notebooks private, and I'd love to be able to download them.
09/13/06 02:56 PM
cebailey says:
 
Good stuff. I think this might strike a perfect balance for me in terms of keeping online bookmarks but also having an archive.
 
What would be fantastic for this is a way to somehow tie the del.icio.us notes and tags to the downloaded pages, maybe via spotlight metadata. Anyone have any thoughts on this? It might be possible using the API, or the export to HTML feature...
09/13/06 04:01 PM
Gina Trapani, Lifehacker Editor says:
 
@David: You'd probably have to pass your Google cookies along with the wget command to authenticate and see private notebooks. I haven't given this a try, but do wget --help to see the cookie options.
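 
Something like this untested sketch might do it, assuming you've exported your Google login cookies to a cookies.txt file (the notebook address below is just a placeholder):
 
    # Untested: send your exported Google cookies along with the request
    wget --load-cookies cookies.txt -O private-notebook.html "http://www.google.com/notebook/..."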
 
And yes, Slogger and Scrapbook are both Firefox-based, non-command-line (so non-schedulable) extensions that do this as well, with a much friendlier GUI.
09/13/06 04:17 PM
kjohn says:
 
Good one Gina! But still, I can only get one page's worth of my del.icio.us bookmarks.
 
One workaround is to use a URL of the form http://del.icio.us/username?setcount=100 .
Still, you need multiple tries if you have more than 100 bookmarks.
09/13/06 08:08 PM
cebailey says:
 
Gina:
 
There's an errant space in your example command for getting ALL del.icio.us links. You have:
 
--exclude-domains=del.icio.us, doubleclick.net
 
should probably be:
 
--exclude-domains=del.icio.us,doubleclick.net
09/14/06 12:17 AM
Sander says:
 
I just found the following command; it's the most concise way I've found in two days to back up del.icio.us bookmarks.
 
    wget http://del... --http-user=YOURUSERNAME --http-passwd=YOURPASSWORD --no-check-certificate
 
It gets all your bookmarks because it accesses the API, not just your first 100 like some other examples. The only problem I have is that the results are in XML format; if anyone has an automated way of transforming it into an unordered HTML list (or could adjust the command), that would be great.
09/14/06 04:09 AM
babette says:
 
Been a fan of wget for a long time and converted a few people. I have a few family members who are terrified of the command line. For them I recommend Deep Vacuum, sort of a wget with a GUI for OS X. Anybody else used this?
09/14/06 12:3
