The GSA crawls every domain and subdomain added to its URL list. As it crawls, it processes each page to compile a large index of all the words it sees and their locations on the page. The crawler may need 100 or even 1,000 hops to reach a page, but as long as the page is linked from another page, it will be indexed.
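The crawl-and-index behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the GSA's actual implementation: the `PAGES` dictionary, the `crawl_and_index` function, and the example.com URLs are all hypothetical, and an in-memory dictionary stands in for real HTTP fetches.

```python
from collections import deque, defaultdict

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": ("welcome to the library", ["http://example.com/a"]),
    "http://example.com/a": ("library opening hours", ["http://example.com/b"]),
    "http://example.com/b": ("contact the library", []),
    "http://example.com/orphan": ("nobody links here", []),
}

def crawl_and_index(start_urls):
    """Breadth-first crawl from start_urls; returns word -> {url: [positions]}."""
    index = defaultdict(dict)
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url not in PAGES:
            continue  # link points outside the known pages
        text, links = PAGES[url]
        # Record every word together with its position on the page.
        for position, word in enumerate(text.split()):
            index[word].setdefault(url, []).append(position)
        # Queue each link that has not been seen yet.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl_and_index(["http://example.com/"])
print(index["library"])   # pages containing "library" and where it appears
print("nobody" in index)  # the orphan page is never reached
```

In this sketch the page at `/b` is two hops from the start URL and is still indexed, while `/orphan` is skipped because no other page links to it.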
The GSA also processes additional information, including meta elements, key content tags (title tags), and attributes (ALT attributes). It follows links contained in PDF documents and Flash files, but it does not follow links in other formats, such as Microsoft Office documents.
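As a rough illustration of the kind of information listed above, the following sketch uses Python's standard `html.parser` module to pull the title, meta elements, ALT attributes, and links out of an HTML page. The `PageExtractor` class and the `SAMPLE` markup are hypothetical; how the GSA extracts these fields internally is not documented here.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects the title, named meta elements, ALT text, and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated here.
        if self._in_title:
            self.title += data

# Hypothetical sample page.
SAMPLE = """<html><head><title>Opening Hours</title>
<meta name="description" content="Library opening hours">
</head><body><img src="map.png" alt="Campus map">
<a href="/contact.html">Contact</a></body></html>"""

extractor = PageExtractor()
extractor.feed(SAMPLE)
print(extractor.title, extractor.meta, extractor.alt_texts, extractor.links)
```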
Exceptions:
How do I exclude unwanted text from being indexed?
How often does the GSA index my site?
Publishing Best Practices
How do I prevent folders from being indexed?
What websites get indexed?
I updated my website but the GSA does not return it in the search result page.
How does the appliance detect the language of a web page?
What file size limits does the GSA use?