AJAX scraping
airbornflght
Houston, TX Icrontian
So I'm working on a case study project for a programming class.
The idea is a webpage presents data that is updated through AJAX periodically. What I want to do is grab that data when an update occurs, check for changes, and then file it into the database to analyze later. We are dipping our toes into market intelligence applications, if anyone is interested.
I can do everything but grabbing the source data. Can anyone point me in the right direction? I've googled, but have kinda come up short. We have our data models and ERDs ready, but when it comes to actually obtaining the data I'm not well versed. It needs to be as close to real time as possible. Things I'm open to are VB, C#, Python, Ruby, and PHP. I don't really know any of the three web languages, but I'm willing to learn. I'd almost rather use one of those three so we can run it all on Linux. Either way, I need to learn to grab the data and dump it into a PostgreSQL database.
I've heard you can just fire off an XMLHttpRequest to the server, but is it just going to shoot back an XML document willy-nilly? I've read about using extensions such as FireWatir to tap into Firefox's JavaScript engine. I'm trying to learn.
I was also looking at scrubyt but had difficulty setting up Ubuntu with Ruby. I've never done it before and kept running into issues. There's definitely a learning curve to setting up a dev environment; it's not as easy as VS2008's installer.
Comments
Ajax is when the web page sends an "asynchronous request" to the server, gets a reply, and does something with it (like inserting the reply into the page). You could use a JavaScript library like jQuery or Prototype to do this fairly easily.
How the server replies depends on the server-side script you call, something in PHP, Ruby, or whatever. That has nothing to do with Ajax itself; Ajax just sends and receives data.
If you want to store something in a database, you do that on the backend with PHP (or whatever scripting language you want). Ajax (which is really just a particular way of using JavaScript) doesn't have anything to do with database calls.
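If it helps, the request the page's JavaScript fires is just an ordinary HTTP request, so you can make the same call from any language. A rough sketch in Python (the endpoint URL here is a placeholder; find the real one in your browser's network inspector, and the response might be JSON or XML depending on the site):

import requests  # third-party library: pip install requests

# Hit the same endpoint the page's XMLHttpRequest calls.
resp = requests.get("http://example.com/market/updates", timeout=10)
resp.raise_for_status()
print(resp.text)  # use resp.json() instead if the server returns JSON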
The website in question is dynamically getting data back from the server. I'll admit I'm an AJAX virgin, so I'm learning here. But from what I understand, it's more than likely using an XMLHttpRequest, getting all the updated data from the server, and then updating the data on the page with JavaScript. Am I close?
What I want to do is capture that data and store it in a database, continuously, and in near real time. Sounds simple.
Then I will have another part of the system that analyzes the data in the database and makes decisions, thus the need for near-real-time data. The two parts are so loosely coupled that they are essentially two separate systems. I can do the second part, just not the first part I described above.
Overly simplistic model:
[website] --> [scraper] --> [database] <--> [market intelligence app]
Fundamentally, a website is a script that a browser downloads and renders locally on the client machine. That script may (and often does) make calls to remote services running elsewhere on the server (e.g. web applications, databases, etc) but those services aren't considered part of the website. From that perspective, the client viewing your website is making a request to a remote server containing the market data, downloading it to the client machine, uploading it to a program on your server (the scraper), and then loading it to a database on your server. This might work if you only ever have a single user, but websites are usually viewed by a lot of people so your database may rapidly fill up with duplicates. Also, your real-time requirement creates a lot of traffic since everything has to be downloaded and uploaded again.
So the client-in-the-middle solution isn't so good. What you need is a backend running on your server. The backend in this case is a combination of your database and a program running as a daemon that goes out and collects all this market data in real-time and loads it into your database. A simple fetch-and-store script run with the cron scheduler sounds like it might work nicely. Then, your website can just make database calls to the database on your server and render those. When your market intelligence app comes along, it can use the same database. This solution will scale nicely and not eat up more than the minimum bandwidth for your task.
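Something like this is what I mean by a fetch-and-store script, sketched in Python. The endpoint URL, connection string, and snapshots table are all placeholders you'd swap for your own schema, and it assumes the requests and psycopg2 libraries:

# fetch_and_store.py -- run from cron, e.g. once a minute:
#   * * * * * /usr/bin/python /path/to/fetch_and_store.py
import hashlib
import requests   # pip install requests
import psycopg2   # pip install psycopg2

URL = "http://example.com/market/updates"  # placeholder endpoint

def main():
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    payload = resp.text

    # Cheap change detection: hash the payload and skip the insert
    # if it matches the last snapshot we stored.
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()

    conn = psycopg2.connect("dbname=market user=scraper")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT digest FROM snapshots ORDER BY fetched_at DESC LIMIT 1")
    row = cur.fetchone()
    if row is None or row[0] != digest:
        cur.execute(
            "INSERT INTO snapshots (digest, payload, fetched_at) VALUES (%s, %s, now())",
            (digest, payload),
        )
        conn.commit()
    cur.close()
    conn.close()

if __name__ == "__main__":
    main()

Your market intelligence app can then read from that same table (or whatever your ERD calls for) without ever touching the remote site.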
-drasnor
Your server should directly query their site, get the contents, and store it in a database. No browser should be involved at all.
Sorry, that's what I was alluding to; what you said is what I meant. Now my question is: will the web server still serve the data if I make the request from outside the domain?
Start with some reading on how to build a web-crawler in 5 minutes.
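On the cross-domain question: the same-origin restriction only applies to JavaScript running inside a browser page. A script running on your own server can request any URL it likes; the worst case is that the site checks headers or requires a session cookie. A quick sketch, where the header values are only examples:

import requests  # pip install requests

headers = {
    # Some servers refuse requests without a browser-like User-Agent or Referer.
    "User-Agent": "Mozilla/5.0 (compatible; class-project-scraper)",
    "Referer": "http://example.com/markets",  # placeholder
}
resp = requests.get("http://example.com/market/updates", headers=headers, timeout=10)
print(resp.status_code, len(resp.text))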