There are a lot of methods one can use for web-crawling seed lists:
- Typosquatted domain names
- Search engine queries
- Blacklists
- Spam emails
Today I would like to show you a simple way to perform a Google search using the latest AJAX API provided by Google. As some may know, Google provided a SOAP API in the past, which is still accessible if you have a key. The AJAX API, however, requires no key or registration. The only limitation at the time of writing is the number of search results you get. The information I was able to find says that Google allows 32 results per search, but I am able to extract 40 results per search.
Here's a little script that we use:
import urllib
import simplejson
import sys
import time
import os

# Add a search keyword in the searchKeywords array
# This script takes 40 results from google and saves it to searchResults.txt file
# Only urls for the search are saved.
searchKeywords = ['honeynet', 'python', 'windows']

def main():
    totalResults = 0
    f = open('searchResults.txt', 'w')
    for query in searchKeywords:
        print "Searching %s" % query
        query = urllib.urlencode({'q' : query})
        sCounter = 0
        urls = []
        while ( sCounter <= 36 ):
            url = 'http://ajax.googleapis.com/ajax/services/search/web?start=%d&v=1.0&%s' \
                  % (sCounter, query)
            search_results = urllib.urlopen(url)
            json = simplejson.loads(search_results.read())
            results = json['responseData']['results']
            for i in results:
                f.write('%s\n' % i['url'])
                totalResults += 1
            sCounter += 4
    # close the results file
    f.close()

if __name__=='__main__':
    main()
You can hard-code your desired queries in the searchKeywords list by appending new strings or replacing existing ones: searchKeywords = ['honeynet', 'python', 'windows']
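The script reads json['responseData']['results'] directly, which fails if a response carries no result data. Below is a minimal sketch of a more defensive extraction step, written for Python 3 and the standard json module rather than simplejson. The helper name extract_urls and both sample payloads are made up for illustration, and the idea that the service may answer with a null responseData on errors is an assumption, not something verified here.

```python
import json

def extract_urls(raw_response):
    """Return the 'url' field of every result in one API response.

    Hypothetical helper: falls back to an empty list when 'responseData'
    is missing or null instead of raising an exception.
    """
    payload = json.loads(raw_response)
    data = payload.get('responseData') or {}
    return [item['url'] for item in data.get('results', []) if 'url' in item]

# Made-up payloads that only mimic the fields the script relies on.
sample_ok = ('{"responseData": {"results": ['
             '{"url": "http://example.com/a"}, '
             '{"url": "http://example.com/b"}]}, "responseStatus": 200}')
sample_error = '{"responseData": null, "responseStatus": 403}'

print(extract_urls(sample_ok))
print(extract_urls(sample_error))
```

With a helper like this, a failed request simply contributes no URLs to searchResults.txt and the loop continues with the next offset.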
One interesting thing to note is the URL that we use to perform the search:
http://ajax.googleapis.com/ajax/services/search/web?start=%d&v=1.0&%s
Here, start=%d sets the offset of the first result to return rather than a page number. Every request returns 4 search results, so we increment sCounter by 4, from 0 to 36, which makes 10 requests and a total of 40 results.
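To make the arithmetic concrete, here is a small Python 3 sketch that builds the same list of request URLs without contacting Google. The function name build_search_urls and its parameters are hypothetical; the URL template and the step of 4 come from the script.

```python
from urllib.parse import urlencode

BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web'

def build_search_urls(keyword, total=40, per_request=4):
    """Return one request URL per batch of results (hypothetical helper)."""
    query = urlencode({'q': keyword})
    # range(0, 40, 4) yields 0, 4, ..., 36: the same offsets as the while loop.
    return ['%s?start=%d&v=1.0&%s' % (BASE_URL, start, query)
            for start in range(0, total, per_request)]

urls = build_search_urls('honeynet')
print(len(urls))   # 10 requests of 4 results each
print(urls[0])
print(urls[-1])
```

Ten URLs come out, the first with start=0 and the last with start=36, which matches the while ( sCounter <= 36 ) condition in the script.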
In the default Google search from a web browser you usually get 10 results per page, so setting the 'start' argument to 100 takes you to the 11th page:
http://www.google.ca/search?q=google&hl=en&start=100&sa=N
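The relation between the 'start' offset and the page number is plain integer arithmetic. A short Python 3 sketch, assuming a fixed number of results per page (the function names are made up for illustration):

```python
def page_for_start(start, per_page=10):
    """Return the 1-based page number that contains result offset 'start'."""
    return start // per_page + 1

def start_for_page(page, per_page=10):
    """Return the offset of the first result on a 1-based page number."""
    return (page - 1) * per_page

print(page_for_start(100))               # 11: the browser search example
print(start_for_page(11))                # 100
print(page_for_start(36, per_page=4))    # 10: the last request of the script
```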
The search query itself is URL-encoded as q=<keyword> and substituted for the %s placeholder at the end of the URL. You can insert additional options into the URL; they are easy to find through Google.
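For readers who want to see what ends up in place of %s, the following Python 3 sketch encodes a keyword and then parses the finished query string back. It uses urllib.parse from the standard library (the Python 2 script uses urllib.urlencode for the same purpose); the two-word keyword is just an example.

```python
from urllib.parse import urlencode, parse_qs

# urlencode turns the dictionary into a 'key=value' string and escapes
# characters that are not allowed in a URL; a space becomes '+'.
encoded = urlencode({'q': 'honeynet project'})
print(encoded)   # q=honeynet+project

url = ('http://ajax.googleapis.com/ajax/services/search/web?start=%d&v=1.0&%s'
       % (0, encoded))

# Parsing the query string back shows the three parameters the server sees.
params = parse_qs(url.split('?', 1)[1])
print(params)
```

Because the keyword is encoded before it is substituted, multi-word queries and special characters do not break the URL.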