Very Poorly Formatted Data

cURL is a fantastic way to scrape data from websites. It's pretty ubiquitous on LAMP servers nowadays, so you probably don't even have to do anything to enable it and start using it. You can essentially get data that's behind a login form by spoofing a browser logging into the site.

I'm not going to do a dissertation on the nuances of using cURL. Instead, I'd like to discuss how to process the data once you've gotten a page's HTML back as a string using the curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); setting. The general process for using cURL to get data from a password-protected site goes something like this:

  1. Make sure your "cookie jar" is set up and working so that your session can be saved while you're scraping other pages after login.

  2. Get the HTML for the login page.

    1. Get the form post location.

    2. Get any special tokens or hidden variables.

  3. Post to the login script with your authorized username and password.

  4. Scrape any pages you need to get and process them.
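The steps above can be sketched roughly like this. It's a minimal sketch, not a definitive implementation: scrape_options() and scrape_fetch() are hypothetical helper names, and the URLs and form field names in the usage comments are placeholders you'd swap for whatever the real login form uses.

```php
<?php
// Hypothetical helper: builds the cURL options shared by every request in a session.
function scrape_options(string $url, string $cookieFile, ?array $postFields = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the HTML as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow any post-login redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on each request
    ];
    if ($postFields !== null) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

// Hypothetical helper: runs one request and returns the response body.
function scrape_fetch(string $url, string $cookieFile, ?array $postFields = null): string
{
    $ch = curl_init();
    curl_setopt_array($ch, scrape_options($url, $cookieFile, $postFields));
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $html; // $ch is released here, which flushes the cookie jar to disk
}

// Usage (placeholder URLs and field names):
// $jar   = sys_get_temp_dir() . '/cookies.txt';
// $login = scrape_fetch('https://example.com/login', $jar);
// ... pull the form action and hidden fields out of $login ...
// scrape_fetch('https://example.com/login', $jar, ['username' => 'me', 'password' => 'secret']);
// $page  = scrape_fetch('https://example.com/reports', $jar);
```

Using the same cookie file for every call is what keeps the session alive between the login post and the pages you scrape afterwards.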

Getting auth tokens and stuff

Sometimes login pages make it difficult to figure out what to post because of "authentication tokens" or similar variables that change every time a page is hit and then have to be posted back along with the username and password. This is most common among .NET applications since it's built right into the .NET form creation classes. Finding these variables is a lot easier if you use phpQuery, a project that attempts to replicate what jQuery does, but in PHP. You pass phpQuery a string containing all the HTML content, and then you can run selectors and traverse the DOM just like with jQuery. What you do with the HTML after that is up to you.

gist#https://gist.github.com/1752657.js

In the above example, I already had the HTML string of the login page. I used phpQuery's very jQuery-like selector syntax to get all the inputs on the page, then looped through the results to collect the form's input fields so I could use them in my next call to post the data back to the login script.
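For readers who can't load the gist, here's a rough sketch of the same idea. To keep it runnable without extra libraries it swaps in PHP's built-in DOMDocument in place of phpQuery, so treat it as an illustration of the approach and not the code from the gist. The extract_form_inputs() helper, the form markup, and the credentials are all made up.

```php
<?php
// Hypothetical helper: collects name => value for every <input> in an HTML string.
function extract_form_inputs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up login form in the style of a .NET page with a changing token.
$html = '<form action="/login.aspx" method="post">'
      . '<input type="hidden" name="__VIEWSTATE" value="abc123">'
      . '<input type="text" name="username" value="">'
      . '<input type="password" name="password" value="">'
      . '</form>';

$fields = extract_form_inputs($html);
$fields['username'] = 'me';     // placeholder credentials
$fields['password'] = 'secret';
// $fields now holds the hidden token plus the credentials, ready to post back.
```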

Getting poorly formatted data and stuff

I consider "well formatted" data to be in formats like JSON, XML, CSV, etc. that are specifically meant for data transfer. But what happens when you need to scrape data from HTML tables or DIVs?

phpQuery to the rescue! You can do the same thing as above, but this time parse the newly acquired HTML string that contains the data you're looking for. Here's an example that walks a two-column table of data, pulls out all sorts of neat stuff, and formats it into an array. Then, outside of the table loop, I check for one more element and set another variable.

gist#https://gist.github.com/1752759.js
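As a rough illustration of that kind of table scrape, here's a sketch that again uses PHP's built-in DOMDocument and DOMXPath as a stand-in for phpQuery. The markup, the element ids, and the labels are invented for the example and are not taken from the gist.

```php
<?php
// Made-up markup: a two-column table plus one element outside it.
$html = '<div id="status">Active</div>'
      . '<table id="details">'
      . '<tr><td>Name:</td><td> Jane Doe </td></tr>'
      . '<tr><td>Account #:</td><td>12345</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

$data = [];
foreach ($xpath->query('//table[@id="details"]//tr') as $row) {
    $cells = $xpath->query('./td', $row);
    if ($cells->length < 2) {
        continue; // skip header or spacer rows
    }
    // First cell is the label (minus the trailing colon), second is the value.
    $label = rtrim(trim($cells->item(0)->textContent), ':');
    $data[$label] = trim($cells->item(1)->textContent);
}

// An element outside the table loop, stored alongside the table data.
$status = $xpath->query('//div[@id="status"]')->item(0);
$data['status'] = $status !== null ? trim($status->textContent) : null;
```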

Using the data and stuff

After you're done with that, you have all this useful data in an array and you can basically do whatever you want with it. Since this example originated from code used in a real-life client project (with variable names changed to protect the innocent), I went on to take that array and save it out to a JSON file that I could easily read back in whenever I needed the data.
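Saving the array out and reading it back can be as short as this. The array contents and the file path are placeholders.

```php
<?php
// Placeholder data standing in for the scraped array.
$data = ['Name' => 'Jane Doe', 'Account #' => '12345', 'status' => 'Active'];

$path = sys_get_temp_dir() . '/scraped-data.json'; // hypothetical file location

// Write the array out as readable JSON.
file_put_contents($path, json_encode($data, JSON_PRETTY_PRINT));

// Later, read it back in as an associative array.
$loaded = json_decode(file_get_contents($path), true);
```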

You may need to go through some tests to get the right kind of cleaning or data sanitization for your values, but phpQuery makes it really easy to scrape this kind of data if you're used to jQuery's selectors and traversing. Instead of scraping the site each time to get the data, I like saving the HTML string out to a file and then playing with selectors and traversing to get the data I want from a local file so I don't put a lot of strain on the server.
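One way to set up that local-file workflow is a small caching wrapper, sketched below. cached_html() is a hypothetical helper, and the closure stands in for a real cURL request so the example runs offline.

```php
<?php
// Hypothetical helper: returns cached HTML if present, otherwise calls $fetch once and stores the result.
function cached_html(string $cacheFile, callable $fetch): string
{
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }
    $html = $fetch();
    file_put_contents($cacheFile, $html);
    return $html;
}

// Hypothetical cache location, made unique so an old file is never picked up by accident.
$cacheFile = sys_get_temp_dir() . '/scrape-cache-' . uniqid() . '.html';

$calls = 0;
$fetch = function () use (&$calls): string {
    $calls++; // count how often the "server" is actually hit
    return '<table><tr><td>Name</td><td>Jane Doe</td></tr></table>'; // stand-in for a cURL request
};

$first  = cached_html($cacheFile, $fetch); // calls $fetch and writes the file
$second = cached_html($cacheFile, $fetch); // served from disk, no second fetch
```

Delete the cache file whenever you want a fresh copy of the page.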

And don't forget stealth...

If you're not sure how friendly the server (or server admin) is to you doing this, make sure you set the cURL user agent for each request. I like pretending I'm Googlebot, but any common user agent string will do, so that the server doesn't explicitly know you're PHP trying to log in.

By default, cURL spits out a user agent like this (changes based on operating system and version of cURL):

curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5

If a server admin sees that in their logs, they might freak out... so I like these user agent strings for sneaky data pulls:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
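
Setting the user agent is a single option on the cURL handle. A minimal sketch, with a placeholder URL and the request itself left commented out:

```php
<?php
$userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 '
           . '(KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7';

$ch = curl_init('https://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$applied = curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // sent as the User-Agent header
// $html = curl_exec($ch); // run the request once the options are set
```

The option has to be set on every handle you create, since each new handle starts from cURL's defaults.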

Audience participation

How do you use cURL and phpQuery? Any useful scripts or resources to point people to other than their respective documentation sites?