Traditionally, a web spider system is tasked with connecting to a server, pulling down the HTML document, scanning the document for anchor links to other HTTP URLs and repeating the same process on all of the discovered URLs. Each URL represents a different state of the traditional web site. In an AJAX application, much of the page content isn't contained in the HTML document, but is dynamically inserted by JavaScript during page load. Furthermore, anchor links can trigger JavaScript events instead of pointing to other documents. The state of the application is defined by the series of JavaScript events that were triggered after page load. The result is that the traditional spider is only able to see a small fraction of the site's content and is unable to index any of the application's state information.
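To make the limitation concrete, here is a minimal sketch of the link-extraction step of a traditional spider, using only Python's standard library. The sample page, the `loadComments()` and `showTab()` handlers, and the `example.com` URL are made up for illustration; the point is only that anchors wired to script yield nothing the spider can follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags, the way a traditional spider does."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Anchors that only fire JavaScript give a plain spider nothing to fetch.
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    """Return the absolute URLs a protocol-driven spider would queue next."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page: one ordinary link, two anchors that only trigger script,
# and an empty container that script would fill in after page load.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <a href="#" onclick="loadComments()">Show comments</a>
  <a href="javascript:showTab(2)">Pricing</a>
  <div id="content"></div>
</body></html>
"""

links = extract_links("http://example.com/", SAMPLE_PAGE)
print(links)  # only the plain /about.html link survives
```

Of the three anchors, only one leads anywhere the spider can go, and the content that would appear in the empty `div` is never seen at all.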
So how do we go about fixing the problem?
Crawl AJAX Like A Human Would
To crawl AJAX, the spider needs to understand more about a page than just its HTML. It needs to be able to understand the structure of the document as well as the JavaScript that manipulates it. To be able to investigate the deeper state of an application, the crawling process also needs to be able to recognize and execute events within the document to simulate the paths that might be taken by a real user.
Shreeraj Shah's paper, "Crawling Ajax-driven Web 2.0 Applications", does a nice job of describing the "event-driven" approach to web crawling. It's about creating a smarter class of web crawling software which is able to retrieve, execute, and parse dynamic, JavaScript-driven DOM content, much like a human would operate a full-featured web browser.
The "protocol-driven" approach does not work when the crawler comes across an Ajax embedded page. This is because all target resources are part of JavaScript code and are embedded in the DOM context. It is important to both understand and trigger this DOM-based activity. In the process, this has led to another approach called "event-driven" crawling. It has the following three key components:
1. JavaScript analysis and interpretation with linking to Ajax
2. DOM event handling and dispatching
3. Dynamic DOM content extraction
The Necessary Tools
The easiest way to implement an AJAX-enabled, event-driven crawler is to use a modern browser as the underlying platform. There are a couple of tools available, namely Watir and Crowbar, that will allow you to control Firefox or IE from code, allowing you to extract page data after the browser has processed any JavaScript.
Watir is a library that enables browser automation using Ruby. It was originally built for IE, but it's been ported to both Firefox and Safari as well. The Watir API allows you to launch a browser process and then directly extract and click on anchor links from your Ruby application. This application alone makes me want to get more familiar with Ruby.
Crowbar is another interesting tool which uses a headless version of Firefox to render and parse web content. What's cool is that it provides a web server interface to the browser, so you can issue simple GET or POST requests from any language and then scrape the results as needed. This lets you interact with the browser from even simple command line scripts, using curl or wget.
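As a rough sketch of what talking to such a web server interface could look like from a script, the snippet below builds a GET request and reads back the rendered page. The local address `127.0.0.1:10000` and the `url` and `delay` query parameters are assumptions here, not something this article specifies, so check them against the documentation of the Crowbar version you run. The helper names `build_crowbar_url` and `fetch_rendered` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Crowbar instance is listening locally at this address.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"


def build_crowbar_url(target_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build the GET request asking the browser to load and render target_url.

    The parameter names "url" and "delay" are assumed, not confirmed here.
    """
    query = urlencode({"url": target_url, "delay": delay_ms})
    return f"{endpoint}?{query}"


def fetch_rendered(target_url, delay_ms=3000, timeout=30):
    """Return the page source after the browser has run the page's scripts."""
    with urlopen(build_crowbar_url(target_url, delay_ms), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Only prints the request URL; fetch_rendered() needs a running server.
    print(build_crowbar_url("http://example.com/app"))
```

The output of `fetch_rendered` can then be handed to the same link-extraction code a traditional crawler already uses, which is what makes this style of integration simple.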
Which tool you use depends on the needs of your crawler. Crowbar has the benefit of being language agnostic and simple to integrate into a traditional crawler design to extract page information that would only be present after a page has completed loading. Watir, on the other hand, gives you deeper, interactive access to the browser, allowing you to trigger subsequent JavaScript events. The downside is that the logic behind a crawler that can dig deep into application state is quite a bit more complicated, and with Watir you are tied to Ruby, which may or may not be your cup of tea.
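To give a feel for that extra logic, here is a minimal sketch of state exploration as a breadth-first search over event sequences. Nothing here drives a real browser: `TOY_APP` is a hypothetical stand-in where states are names and events are clickable element ids, and `list_events` and `fire_event` are placeholders for whatever your browser automation layer provides.

```python
from collections import deque


def crawl_states(initial_state, list_events, fire_event, max_depth=3):
    """Breadth-first exploration of states reachable by firing events.

    Returns a dict mapping each discovered state to the shortest event
    path that reaches it from the initial (page load) state.
    """
    paths = {initial_state: ()}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        path = paths[state]
        if len(path) >= max_depth:
            continue  # do not expand states that are already at the depth limit
        for event in list_events(state):
            next_state = fire_event(state, event)
            if next_state not in paths:
                paths[next_state] = path + (event,)
                queue.append(next_state)
    return paths


# Hypothetical application: state name -> {event id: resulting state}.
TOY_APP = {
    "home": {"tab-news": "news", "tab-about": "about"},
    "news": {"show-comments": "news+comments", "tab-about": "about"},
    "news+comments": {"hide-comments": "news"},
    "about": {"tab-news": "news"},
}

found = crawl_states(
    "home",
    list_events=lambda state: sorted(TOY_APP[state]),
    fire_event=lambda state, event: TOY_APP[state][event],
)
for state, path in found.items():
    print(state, "<-", " > ".join(path) or "(page load)")
```

Storing the event path for each state matters in practice: with a real browser you usually cannot jump straight to a state, so the crawler reloads the page and replays the path. Deciding when two DOM snapshots count as the same state is the hard part that this sketch sidesteps by using names.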
Greetings from Turkey. I was just introduced to your site, found the explanation very clear, and should visit more often. I'm especially happy that you share this kind of material, and I appreciate the work you have put into the site. Good work ... AJAX is super.
Some time ago I had the same problem: a spider saw only a fraction of my web page. I did some research and found a solution similar to the one described here (a good-quality overview and solution). Good for rookies or people who are just facing the problem. (Now I am drilling down on Flash, so let's see what happens there.)
Isn't this a problem Joomla or any other CMS faced? Or did they design around the problem from the start? I remember quite clearly that at least Joomla has the ability to generate a sitemap. But making the site more naturally crawlable will of course be the best option.
Yes, Ramon, I too recall having seen sitemap capabilities in Joomla, but a site certainly should be naturally crawlable. That's the best way for sure, no argument there!
Thanks for the article. It is still weird to think that the best thing for SEO is to have a simple HTML website rather than a site with a lot of fancy stuff like PHP, AJAX, or flash.
Even after a long time, JSON is still not secure. I am using it and have found a couple of bugs, and I don't know why it hasn't been completely fixed yet. Anyway, your page was worth reading for me.
I agree with what Shannon says. It is still weird to think that the best thing for SEO is to have a simple HTML website rather than a site with a lot of fancy stuff like PHP, AJAX, or Flash.
As a web designer myself, I am very cautious about using AJAX and fancy effects in my websites because I know that it is going to hurt search engine rankings. Flash websites are a great example of this: they look very nice and you can pull off dazzling effects, but the spiders can't read the Flash, or have a very hard time navigating through it. I hope they fix this problem in the future.
Thanks for sharing. Another question: I need to crawl a web board which uses AJAX to dynamically update, hide, and show comments without reloading the corresponding post. Any ideas for crawling this data? Thanks in advance!
I spent the whole day googling phrases like "crawling browser content", "spiding browser content", ... and finally found your article, which answered all my questions (how stupid of me not to think of the word 'AJAX'!)
Interesting article. But I think some AJAX lessons would be just perfect for a good article. Not to mention that it could get really messy writing the article.