Scraping data from websites is pretty cool. But what if we could not only grab and parse the markup from external websites, but also take full-blown screenshots, and even modify that markup before we take the screenshot? We can do all that with an excellent package called PhantomJS.
A Browser without the Browser.
The hardest part of working with PhantomJS is getting it installed. Because of its requirements, you are not going to be able to run it on a standard shared host, but an average VPS will handle it fine. The following instructions are for getting things up and running with Ubuntu 11.10 on Linode.
Step 1: Install PhantomJS Requirements
This step installs everything we need to compile PhantomJS, along with Xvfb so we can run it in a virtual display without a real window system.
Step 2: Install Browser Goodies
This step installs the Flash plugin and Windows fonts so sites render more accurately.
Clone the repo and build from source.
After everything finishes, we should have the program installed. Test it at the command line by typing “phantomjs”.
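With PhantomJS installed, we need a script for it to run. Here is a minimal sketch of what a screenshot script, saved as shotty.js, could look like. It uses PhantomJS's webpage module; the viewport size, the output name espn.png, and the capture helper are assumptions made for illustration.

```javascript
// shotty.js: a minimal screenshot sketch.
// The viewport and the output file name are assumptions; adjust to taste.
var TARGET_URL = 'http://espn.com';
var OUTPUT_FILE = 'espn.png';
var VIEWPORT = { width: 1024, height: 768 };

// Load the page, render it to an image, then exit.
// `page` is a PhantomJS webpage object; `exit` ends the process with a code.
function capture(page, exit) {
  page.viewportSize = VIEWPORT;
  page.open(TARGET_URL, function (status) {
    if (status !== 'success') {
      console.log('Unable to load ' + TARGET_URL);
      exit(1);
      return;
    }
    page.render(OUTPUT_FILE); // written relative to the current directory
    exit(0);
  });
}

// Only start when running inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  capture(require('webpage').create(), function (code) {
    phantom.exit(code);
  });
}
```

Wrapping the work in a function that receives the page object keeps the PhantomJS-specific calls in one place at the bottom, which makes the logic easy to poke at with a stand-in page object. Pages that load content late may need a short delay before the render call.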
To run this script, navigate to where the script lives on your server and run the following two commands. We run “Xvfb -screen 0 1024x768x24&” first to set the parameters of our Xvfb screen buffer. This is basically a virtual screen that allows us to emulate a window environment. Then we call the script with “DISPLAY=:0 phantomjs --load-plugins=yes shotty.js” to ensure Phantom runs in the buffer. PhantomJS will execute our script and save our screenshot in the same folder the script is in. If everything went well, our screenshot of espn.com should look just like the real thing. How cool is that?
Muck’n with Markup
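Because PhantomJS's page.evaluate runs a function inside the loaded page, we can change the DOM before rendering. Here is a rough sketch of the idea, which outlines every link in red before taking the screenshot. The choice of modification, the output name espn-modified.png, and the helper names are assumptions for illustration only.

```javascript
// A sketch of modifying markup before the screenshot is taken.
// Output name and the specific DOM change are assumptions.
var TARGET_URL = 'http://espn.com';
var OUTPUT_FILE = 'espn-modified.png';

// Runs inside the loaded page via page.evaluate, so it can only use
// what exists in that page (such as `document`), not outer variables.
function muckWithMarkup() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    links[i].style.border = '2px solid red';
  }
  return links.length; // simple values are passed back to our script
}

// Load the page, alter its markup, render it, then exit.
function captureModified(page, exit) {
  page.viewportSize = { width: 1024, height: 768 };
  page.open(TARGET_URL, function (status) {
    if (status !== 'success') {
      console.log('Unable to load ' + TARGET_URL);
      exit(1);
      return;
    }
    var changed = page.evaluate(muckWithMarkup);
    console.log('Outlined ' + changed + ' links');
    page.render(OUTPUT_FILE);
    exit(0);
  });
}

// Only start when running inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  captureModified(require('webpage').create(), function (code) {
    phantom.exit(code);
  });
}
```

The function handed to page.evaluate is executed in the page's own sandbox, which is why muckWithMarkup keeps everything it needs inside its own body.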
I’m sure by now your gears are already turning on fun uses for this technology. In Part II of this series, I will go over how to integrate with Node.js to create a Phantom-powered web app.