AJAX scraping
airbornflght
Houston, TX Icrontian
So I'm working on a case study project for a programming class.
The idea is a webpage presents data that is updated through AJAX periodically. What I want to do is grab that data when an update occurs, check for changes, and then file it into the database to analyze later. We are dipping our toes into market intelligence applications, if anyone is interested.
I can do everything but grabbing the source data. Can anyone point me in the right direction? I've googled, but have kinda come up short. We have our data models and ERDs ready, but when it comes to actually obtaining the data I'm not well versed. It needs to be as close to real time as possible. Things I'm open to are VB, C#, Python, Ruby, and PHP. I don't really know any of the three web languages, but I'm willing to learn. I'd almost rather use one of those three so we can run it all on Linux. Either way, I need to learn to grab the data and dump it into a PostgreSQL database.
I've heard you can just fire off an XMLHttpRequest to the server, but is it just going to shoot back an XML document willy-nilly? I've read about using extensions such as FireWatir to tap into Firefox's JavaScript engine. I'm trying to learn.
I was also looking at scrubyt but had difficulty setting up Ubuntu with Ruby. I've never done it before and kept running into issues. There's definitely a learning curve to setting up a dev environment; it's not as easy as VS2008's installer.
Comments
Ajax is when the web page sends an "asynchronous request" to the server, gets a reply, and does something with it (like inserting the reply into the page). You could use a JavaScript library like jQuery or Prototype to do this fairly easily.
How the server replies depends on the server-side script you call, something in PHP, Ruby, or whatever. That has nothing to do with Ajax itself; Ajax just sends and receives data.
If you want to store something in a database, you do that on the backend with PHP (or whatever scripting language you want). Ajax (which is really just a particular way of using JavaScript) doesn't have anything to do with database calls.
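If it helps, the request the page's JavaScript fires is just an ordinary HTTP request, so you can make the same call from any language. A rough sketch in Python (the endpoint URL here is a placeholder; find the real one in your browser's network inspector, and the response might be JSON or XML depending on the site):

import requests  # third-party library: pip install requests

# Hit the same endpoint the page's XMLHttpRequest calls.
resp = requests.get("http://example.com/market/updates", timeout=10)
resp.raise_for_status()
print(resp.text)  # use resp.json() instead if the server returns JSON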
The website in question is dynamically getting data back from the server. I'll admit I'm an AJAX virgin, so I'm learning here. But from what I understand, it's more than likely using an XMLHttpRequest, getting all the updated data from the server, and then updating the data on the page with JavaScript. Am I close?
What I want to do is capture that data and store it in a database, continuously, and in near real time. Sounds simple.
Then I will have another part of the system that analyzes the data in the database and makes decisions, thus the need for near-real-time data. The two parts are so loosely coupled that they are essentially two separate systems. I can do the second part, just not the first part I described above.
Overly simplistic model:
[website] --> [scraper] --> [database] <--> [market intelligence app]
Fundamentally, a website is a script that a browser downloads and renders locally on the client machine. That script may (and often does) make calls to remote services running elsewhere on the server (e.g. web applications, databases, etc) but those services aren't considered part of the website. From that perspective, the client viewing your website is making a request to a remote server containing the market data, downloading it to the client machine, uploading it to a program on your server (the scraper), and then loading it to a database on your server. This might work if you only ever have a single user, but websites are usually viewed by a lot of people so your database may rapidly fill up with duplicates. Also, your real-time requirement creates a lot of traffic since everything has to be downloaded and uploaded again.
So the client-in-the-middle solution isn't so good. What you need is a backend running on your server. The backend in this case is a combination of your database and a program running as a daemon that goes out and collects all this market data in real-time and loads it into your database. A simple fetch-and-store script run with the cron scheduler sounds like it might work nicely. Then, your website can just make database calls to the database on your server and render those. When your market intelligence app comes along, it can use the same database. This solution will scale nicely and not eat up more than the minimum bandwidth for your task.
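Something like this is what I mean by a fetch-and-store script, sketched in Python. The endpoint URL, connection string, and snapshots table are all placeholders you'd swap for your own schema, and it assumes the requests and psycopg2 libraries:

# fetch_and_store.py -- run from cron, e.g. once a minute:
#   * * * * * /usr/bin/python /path/to/fetch_and_store.py
import hashlib
import requests   # pip install requests
import psycopg2   # pip install psycopg2

URL = "http://example.com/market/updates"  # placeholder endpoint

def main():
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    payload = resp.text

    # Cheap change detection: hash the payload and skip the insert
    # if it matches the last snapshot we stored.
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()

    conn = psycopg2.connect("dbname=market user=scraper")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT digest FROM snapshots ORDER BY fetched_at DESC LIMIT 1")
    row = cur.fetchone()
    if row is None or row[0] != digest:
        cur.execute(
            "INSERT INTO snapshots (digest, payload, fetched_at) VALUES (%s, %s, now())",
            (digest, payload),
        )
        conn.commit()
    cur.close()
    conn.close()

if __name__ == "__main__":
    main()

Your market intelligence app can then read from that same table (or whatever your ERD calls for) without ever touching the remote site.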
-drasnor
Your server should directly query their site, get the contents, and store it in a database. No browser should be involved at all.
Sorry, that's what I was alluding to; what you said is what I meant. Now my question is: will the web server still serve the data if I make the request from outside the domain?
Start with some reading on how to build a web-crawler in 5 minutes.
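On the cross-domain question: the same-origin restriction only applies to JavaScript running inside a browser page. A script running on your own server can request any URL it likes; the worst case is that the site checks headers or requires a session cookie. A quick sketch, where the header values are only examples:

import requests  # pip install requests

headers = {
    # Some servers refuse requests without a browser-like User-Agent or Referer.
    "User-Agent": "Mozilla/5.0 (compatible; class-project-scraper)",
    "Referer": "http://example.com/markets",  # placeholder
}
resp = requests.get("http://example.com/market/updates", headers=headers, timeout=10)
print(resp.status_code, len(resp.text))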