Great question! We recently published a paper at VLDB (which, I believe, stands for ‘Very Large Data Bases’) that covers exactly our criteria and all the ways we try to do this safely, so that if people don’t want their forms to be crawled, we won’t crawl them.
There are some very simple things you can do. Rather than asking for free text to be filled out, like a zip code, make the field a drop-down if you can; that’s much more helpful. And if you can make it so it’s not a huge form with 20 things to fill out, but more like one or two drop-downs, that’s going to be a lot easier to crawl as well.
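To make the drop-down idea concrete, here is a minimal Python sketch that renders a one-field search form with a `<select>` instead of a free-text box. The function name `render_zip_select`, the `/search` action, the GET method, and the sample zip codes are all assumptions made up for illustration; they are not taken from the paper.

```python
from html import escape


def render_zip_select(zip_codes, field_name="zip"):
    """Render a small form whose only input is a drop-down of known values.

    Hypothetical helper: the field name and the /search action are
    illustrative assumptions, not anything a crawler requires.
    """
    options = "\n".join(
        f'    <option value="{escape(z, quote=True)}">{escape(z)}</option>'
        for z in zip_codes
    )
    return (
        '<form method="get" action="/search">\n'
        f'  <select name="{escape(field_name, quote=True)}">\n'
        f"{options}\n"
        "  </select>\n"
        '  <input type="submit" value="Search">\n'
        "</form>"
    )


# Example values only; a real site would list the zip codes it actually serves.
form_html = render_zip_select(["98101", "98102"])
print(form_html)
```

With a drop-down, every valid value is listed in the page itself, so nothing has to be guessed in order to submit the form.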
I definitely encourage you to go and read the paper; there’s nothing super-duper confidential in it.
Of course, if you can make it so that you’re not part of the deep web at all, that’s even better. If you can take those pages that come from your database and have an HTML sitemap so that people can reach all the different pages on your site, like by clicking through categories or geographic areas, then we don’t have to fill out forms. Google is pretty good at indexing the deep web through forms, but not every search engine does that. So, if you can expose that database so that people can get to all the pages of your site just by clicking, not by submitting a form, then you’re going to open yourself up to even wider audiences.
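As a rough sketch of what exposing the database through plain links could look like, the Python below groups records by category and emits an HTML sitemap page. The record fields (`category`, `title`, `url`), the function name `build_html_sitemap`, and the sample data are hypothetical; a real site would pull them from its own database.

```python
from collections import defaultdict
from html import escape


def build_html_sitemap(records):
    """Return HTML that links to every record, grouped by category.

    Hypothetical helper: each record is assumed to be a dict with
    'category', 'title', and 'url' keys.
    """
    by_category = defaultdict(list)
    for rec in records:
        by_category[rec["category"]].append(rec)

    parts = ["<h1>Site map</h1>"]
    for category in sorted(by_category):
        parts.append(f"<h2>{escape(category)}</h2>")
        parts.append("<ul>")
        for rec in sorted(by_category[category], key=lambda r: r["title"]):
            href = escape(rec["url"], quote=True)
            parts.append(f'  <li><a href="{href}">{escape(rec["title"])}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


# Made-up sample rows standing in for pages that sit behind a search form.
records = [
    {"category": "Washington", "title": "Seattle store", "url": "/stores/seattle"},
    {"category": "Oregon", "title": "Portland store", "url": "/stores/portland"},
    {"category": "Washington", "title": "Tacoma store", "url": "/stores/tacoma"},
]
sitemap_html = build_html_sitemap(records)
print(sitemap_html)
```

Every page is then reachable through an ordinary `<a href>` link, so a crawler that never submits forms can still find it.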
If you can do that, that’s what I’d recommend. But if you can’t do that, then I’d say to check out this paper from the VLDB conference where the team talked about it in more detail.
