Scanning the Deep Web

During my undergrad years at Urbana-Champaign, one of the hot topics for search engine scientists was the “Deep Web”. While most of the web was easily accessible as static HTML, a large portion of the dynamically generated content, such as the output of PHP and ASP pages, could not be indexed by search engines.

The key reason dynamic pages could not be indexed was that they were behind a form. Since search engines are not people, they can’t intelligently fill out the form. They can only scan and index what is directly handed to them. Additionally, even if a search engine did make it past the form, most lacked the capability to analyze structured data.
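To make the limitation concrete, here is a minimal sketch of what a simple link-following crawler takes away from a page. It uses only Python's standard `html.parser`; the class name `LinkOnlyParser`, the helper `visible_links`, and the sample page are hypothetical names chosen for illustration, not how any real search engine is built.

```python
from html.parser import HTMLParser


class LinkOnlyParser(HTMLParser):
    """Hypothetical crawler view of a page: it follows links, never forms."""

    def __init__(self):
        super().__init__()
        self.links = []          # URLs the crawler could fetch next
        self.forms_skipped = 0   # forms it saw but cannot fill out

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        elif tag == "form":
            # Content behind the form stays invisible to this crawler.
            self.forms_skipped += 1


def visible_links(html):
    """Return (links found, number of forms skipped) for one page."""
    parser = LinkOnlyParser()
    parser.feed(html)
    return parser.links, parser.forms_skipped


# Illustrative page: one plain link and one search form.
SAMPLE_PAGE = """
<a href="/about.html">About</a>
<form action="/search.php" method="post">
  <input name="q"><input type="submit">
</form>
"""

links, skipped = visible_links(SAMPLE_PAGE)
print(links, skipped)
```

In this sketch the crawler walks away with the link to the About page and nothing else; whatever `/search.php` would return for a given query never enters the index.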

At that time, Google and other search engines were finally starting to surmount the deep web. Query string parameters began to show in Google page URLs. Most importantly, developers started to understand the importance of SEO, and developed “backdoors” for the search engines to read the site content by following curated links.

Interestingly, in 2014, ten years later, the same problem is starting to arise again with the spread of AJAX. AJAX, short for Asynchronous JavaScript and XML, is a technique many websites are adopting to improve their user experience. Instead of the traditional experience of clicking and waiting for a new page to load, AJAX allows web pages to act like desktop programs and respond immediately to user actions.

The central problem with AJAX is that the page state sits in the browser’s memory. As a user interacts with the site, that state changes without necessarily producing a new URL, so it is transient and often gives search engines nothing stable to request and index. Although workarounds exist, such as History.js and hash fragments in the URL, the problem is only increasing in scope as more apps are rewritten in AJAX.
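One form of the hash-based workaround is the “hashbang” convention from Google's AJAX crawling scheme, in which a URL such as `example.com/app#!page=2` is requested by the crawler as `example.com/app?_escaped_fragment_=page=2`, and the server answers with a static snapshot of that state. The sketch below shows only the URL mapping. The function name `crawler_url` is a hypothetical one, and the percent-escaping is simplified compared with the published scheme, so treat it as an illustration rather than a conforming implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def crawler_url(url):
    """Map a '#!' URL to the '_escaped_fragment_' form a crawler would request.

    URLs without a '#!' fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url

    # Simplified escaping (an assumption): keep '=' and '/' readable.
    state = quote(parts.fragment[1:], safe="=/")
    extra = "_escaped_fragment_=" + state

    # Append to an existing query string if there is one.
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(crawler_url("http://example.com/app#!page=2"))
print(crawler_url("http://example.com/app?lang=en#!page=2"))
print(crawler_url("http://example.com/app#section"))
```

The point of the mapping is that the state moves out of the fragment, which browsers never send to the server, and into the query string, which they do.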

In the never-ending cycle of advancement between content producers, platform developers, and search engines, it will be interesting to see what new challenges emerge and where we arrive another ten years from today.

Written by Andrew Palczewski

About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience in managing development of software projects.
