
Crawling AJAX




Traditionally, a web spider connects to a server, pulls down the HTML document, scans it for anchor links to other HTTP URLs, and repeats the same process on every URL it discovers. Each URL represents a different state of a traditional web site. In an AJAX application, much of the page content isn't contained in the HTML document but is inserted dynamically by JavaScript during page load. Furthermore, anchor links can trigger JavaScript events instead of pointing to other documents, so the state of the application is defined by the series of JavaScript events triggered after page load. The result is that a traditional spider sees only a small fraction of the site's content and cannot index any of the application's state information.
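To make the limitation concrete, here is a minimal sketch of the traditional approach in Python, using only the standard library. The names (`AnchorCollector`, `extract_links`, `crawl`, `SAMPLE_PAGE`) and the example.com URLs are illustrative assumptions, not taken from any particular spider. The `fetch` argument is any function that maps a URL to HTML text, so the sketch can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute HTTP(S) URLs found in anchor tags, in document order."""
    collector = AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; returns the URLs in the order they were discovered."""
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in extract_links(fetch(url), url):
            if link not in seen and len(order) < max_pages:
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order


# A hypothetical AJAX-style page: only the first anchor leads to another document.
SAMPLE_PAGE = """
<a href="/about.html">About</a>
<a href="javascript:loadTab('news')">News</a>
<a href="#" onclick="showComments()">Comments</a>
<div id="content"></div>
"""

print(extract_links(SAMPLE_PAGE, "http://example.com/index.html"))
```

The spider finds the About page, but the News link is discarded because it is not an HTTP URL, the Comments link only points back to the page itself, and the empty `content` div that JavaScript would fill in is never seen at all.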

So how do we go about fixing the problem?

Crawl AJAX Like A Human Would
To crawl AJAX, the spider needs to understand more about a page than just its HTML. It needs to understand the structure of the document as well as the JavaScript that manipulates it. To investigate the deeper state of an application, the crawling process also needs to recognize and execute events within the document, simulating the paths that a real user might take.

Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of describing the "event-driven" approach to web crawling. It's about creating a smarter class of web crawling software that can retrieve, execute, and parse dynamic, JavaScript-driven DOM content, much as a human would operate a full-featured web browser.

    The "protocol-driven" approach does not work when the crawler comes across an Ajax embedded page. This is because all target resources are part of JavaScript code and are embedded in the DOM context. It is important to both understand and trigger this DOM-based activity. In the process, this has lead to another approach called "event-driven" crawling. It has following three key components

       1. Javascript analysis and interpretation with linking to Ajax
       2. DOM event handling and dispatching
       3. Dynamic DOM content extraction
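One way to picture event-driven crawling is as a search over application states rather than over URLs. The sketch below is an assumption-laden illustration in Python, not the method from Shah's paper: `explore_states`, `list_events`, `fire_event`, and the `TOY_APP` table are all hypothetical names. In a real crawler the two callbacks would be backed by a browser that finds the event handlers in the DOM, dispatches them, and extracts the resulting content; here a small in-memory table stands in for the browser so the search itself can be run.

```python
from collections import deque


def explore_states(initial_state, list_events, fire_event, max_states=100):
    """Breadth-first search over states reachable by firing events.

    list_events(state)       -> iterable of event names available in that state
    fire_event(state, event) -> the state that results from firing the event
    Returns a dict mapping each discovered state to the event path that reached it.
    """
    paths = {initial_state: ()}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        for event in list_events(state):
            new_state = fire_event(state, event)
            if new_state not in paths and len(paths) < max_states:
                paths[new_state] = paths[state] + (event,)
                queue.append(new_state)
    return paths


# Hypothetical stand-in for a browser: each state lists the events it exposes
# and the state each event leads to.
TOY_APP = {
    "home": {"click #news-tab": "news", "click #comments-link": "comments"},
    "news": {"click #more": "news-page-2", "click #home-tab": "home"},
    "comments": {"click #home-tab": "home"},
    "news-page-2": {},
}

paths = explore_states(
    "home",
    list_events=lambda state: TOY_APP[state].keys(),
    fire_event=lambda state, event: TOY_APP[state][event],
)
for state, path in paths.items():
    print(state, "<-", " then ".join(path) or "(page load)")
```

Recording the event path for each state matters because, unlike a URL, an AJAX state can usually be reached again only by replaying the events that produced it.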

The Necessary Tools


The easiest way to implement an AJAX-enabled, event-driven crawler is to use a modern browser as the underlying platform. A couple of available tools, namely Watir and Crowbar, let you control Firefox or IE from code and extract page data after the browser has processed any JavaScript.

Watir is a library that enables browser automation using Ruby. It was originally built for IE, but it's been ported to both Firefox and Safari as well. The Watir API allows you to launch a browser process and then directly extract and click on anchor links from your Ruby application. This application alone makes me want to get more familiar with Ruby.

Crowbar is another interesting tool which uses a headless version of Firefox to render and parse web content. What's cool is that it provides a web server interface to the browser, so you can issue simple GET or POST requests from any language and then scrape the results as needed. This lets you interact with the browser from even simple command line scripts, using curl or wget.
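As a rough sketch of what talking to Crowbar from a script could look like, the Python below builds a GET request for a locally running instance. The host, port 10000, and the `url` and `delay` parameter names are assumptions about a typical Crowbar setup and should be checked against the Crowbar documentation for your version; `crowbar_request_url` and `fetch_rendered` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def crowbar_request_url(target_url, delay_ms=3000, host="127.0.0.1", port=10000):
    """Build the URL asking a Crowbar server to render target_url.

    Assumed parameters: `url` is the page to load, `delay` is how long
    (in milliseconds) to let its JavaScript run before serializing the DOM.
    """
    query = urlencode({"url": target_url, "delay": delay_ms})
    return f"http://{host}:{port}/?{query}"


def fetch_rendered(target_url, **options):
    """Return the rendered page as text; needs a Crowbar server to be running."""
    request_url = crowbar_request_url(target_url, **options)
    with urlopen(request_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


print(crowbar_request_url("http://example.com/app"))
```

The rendered text returned by `fetch_rendered` can then be fed to the same link extraction a traditional spider already uses, which is what makes this approach easy to bolt onto an existing crawler.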

Which tool you use depends on the needs of your crawler. Crowbar has the benefit of being language-agnostic and simple to integrate into a traditional crawler design, where it can extract page information that is only present after a page has finished loading. Watir, on the other hand, gives you deeper, interactive access to the browser, allowing you to trigger subsequent JavaScript events. The downside is that the logic behind a crawler that can dig deep into application state is quite a bit more complicated, and with Watir you are tied to Ruby, which may or may not be your cup of tea.

source: hackszine



neon Says:
Mon Jun 15, 2009 9:50 pm



Greetings from Turkey. I was introduced to your site, found its content and explanations very clear, and should visit more often. I'm especially happy that you share this, and I appreciate the effort you have put into the site. Good work ... AJAX is super.
TV studio film lighting Says:
Fri Jun 26, 2009 2:07 pm
thanks for sharing
deeper voice Says:
Thu Aug 27, 2009 9:22 pm
I'm especially happy to meet with you to give permission to the share's very good.
Joomla hosting Says:
Thu Sep 17, 2009 7:13 pm
I certainly didn't think you could spider <a href="http://www.buyhttp.com/web_hosting.html">web hosting</a> AJAX.
resveratrol supplements Says:
Mon Oct 26, 2009 12:53 pm
Watir, on the other hand, gives you deeper, interactive access to the browser, allowing you to trigger subsequent Javascript events.
电磁铁 Says:
Fri Oct 30, 2009 11:01 am
thanks for sharing.
fix red ring of death Says:
Sun Dec 06, 2009 6:46 pm
Well well folks, I will try to think about this.
Gas4Free Says:
Fri Dec 11, 2009 8:04 am
Some time ago I had the same problem: the spider saw only a fraction of my webpage. I did some research and found a solution similar to the one stated here (a good-quality overview and solution). Good for rookies or people who are just facing the problem. (Now I am drilling down on Flash, so let's see what happens there.)

David Sims - Promoting Renewable Energy Solutions
get pregnant fast Says:
Sun Dec 13, 2009 3:25 pm
Thanks for sharing.
thai silk Says:
Thu Dec 17, 2009 2:12 am
like your article. Thanks for the sharing with us.
Ramon1982 Says:
Mon Dec 21, 2009 12:25 pm
Isn't this a problem Joomla or any other CMS faced? Or did they design around the problem from the start? I remember quite clearly at least Joomla having the ability to generate a sitemap. But making the site more naturally crawlable will of course be the best option.
business card scanner Says:
Wed Dec 23, 2009 9:19 am
Very well written article indeed, thank you so much for sharing such information with us, i hope we will see more from author in the future. Cheers.
Earth4Energy Says:
Tue Dec 29, 2009 8:50 am
Yes, Ramon, I too recall having seen sitemap capabilities in Joomla, but a site certainly should be naturally crawlable. That is the best way for sure, no argument there!
电磁铁 Says:
Wed Dec 30, 2009 8:03 am
Thanks for article. Keep up sharing.
Shannon Says:
Mon Jan 04, 2010 3:42 pm
Thanks for the article. It is still weird to think that the best thing for SEO is to have a simple HTML website rather than a site with a lot of fancy stuff like PHP, AJAX, or flash.
hilarious quotes Says:
Sun Jan 10, 2010 6:09 am
Can you get us some more info on Dynamic DOM content extractions plz. Indeed this is a very good post but without DOM defs its like incomplete. Thx
Dental Solutions Says:
Sun Jan 10, 2010 6:12 am
amazing post, one good post for geeks .. and useful to us also..
Demonstrating Integrity Says:
Sun Jan 10, 2010 6:14 am
one of the best article i read on ajaxprojects...
buy nexium online Says:
Wed Jan 13, 2010 9:14 am
Still after a long time json is not secure yet, am using and found couple of bugs, dont why not getting complete fixation yet. Any ways your page worth reading for me.
Panerai watches Says:
Mon Jan 18, 2010 4:32 am
I agree with what Shannon says. It is still weird to think that the best thing for SEO is to have a simple HTML website rather than a site with a lot of fancy stuff like PHP, AJAX, or Flash.
Web Design in Staten Island Says:
Tue Jan 19, 2010 8:18 pm
As a web designer myself, I am very cautious about using AJAX and fancy effects in my websites because I know it is going to hurt search engine rankings. Flash websites are a great example of this: they look very nice and you can pull off dazzling effects, but the spiders can't read the Flash, or have a very hard time navigating through it. I hope they fix this problem in the future.
Taiwanese Guy Says:
Tue Jan 19, 2010 8:20 pm
Thanks for sharing. Another question: I need to crawl a web board, which uses ajax for dynamic update/hide/show of comments without reloading the corresponding post. Any idea for crawling this data? Thanks in advance!
kan Says:
Wed Jan 20, 2010 10:10 am
Really Thanks alot for your article!

I spent the whole day to google by typing "crawling browser content", "spiding browser content", ... and finally I got your article solving all my questions (how stupid I am without thinking the word 'AJAX'!)
chopper tattoo Says:
Thu Jan 21, 2010 8:07 am
Good indeed. I don't want to miss this opportunity. Thanks for such information.
how to build a solar panel Says:
Tue Feb 02, 2010 3:27 pm
Thanks for this excellent article. I think SEO with AJAX material always looks professional, and PHP and AJAX are compatible.
Childrens Fancy Dress Says:
Tue Feb 02, 2010 3:37 pm
Ajax is the powerful tool for the php. And it is always better when the material upfront required in a minutes
Wedding Invitations Says:
Fri Feb 05, 2010 12:39 pm
Interesting article. But I think some AJAX lessons would be just perfect for a good article. Not to mention that it could get really messy writing the article.
Rapidshare SE Says:
Mon Feb 08, 2010 6:27 pm
Thanks for the explanation. It was very interesting for me about web spiders. It may help a lot.
