Google is great, and yes, I know all those tricks that make it greater. But I still want to use regex in some cases. So I figured out a way to do that in theory; all I need now is the code…
Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a Google search through a regex filter and produces a list of hits.
There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at each domain found by Google. The first is quick and involves only limited contact between the regex filter and the web sites. The second is more effective (obviously), but it would take a long time, and it is web-noxious and poor behavior if applied to a large number of sites. The third option is probably redundant, because presumably Google has already checked those pages for your search term, but the ability to do this should be built in just in case.
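Those three depths could be dispatched from one function. This is just a sketch of my own (the names, the result dict, and the injected fetch callable are all my inventions, not anything googlereg actually has), but it shows the idea: test the summary for free, fetch one page if asked, or do a naive same-domain crawl for the rarely wanted third option.

```python
import re
from urllib.parse import urljoin, urlparse

def same_domain_urls(start_url, fetch):
    """Naive sitestream crawl: the start page plus same-domain links on it."""
    host = urlparse(start_url).netloc
    urls = [start_url]
    for href in re.findall(r'href="([^"]+)"', fetch(start_url)):
        full = urljoin(start_url, href)
        if urlparse(full).netloc == host and full not in urls:
            urls.append(full)
    return urls

def regex_hit(result, pattern, depth="summary", fetch=None):
    """result: dict with 'url' and 'summary' keys.
    depth: 'summary' (googlestream), 'page' (pagestream), 'site' (sitestream).
    fetch: callable url -> page text, injected so no network is assumed here."""
    rx = re.compile(pattern)
    if depth == "summary":
        # Cheapest: test only the snippet the search already returned.
        return bool(rx.search(result["summary"]))
    if depth == "page":
        # One extra fetch per hit: test the whole page.
        return bool(rx.search(fetch(result["url"])))
    if depth == "site":
        # Same-domain crawl -- rarely worth the traffic, per the above.
        return any(rx.search(fetch(u)) for u in same_domain_urls(result["url"], fetch))
    raise ValueError("unknown depth: %r" % depth)
```

The fetch-as-parameter design keeps the web-noxious part swappable (and stubbable for testing); any urllib- or requests-based fetcher would slot in.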
So, the following parameters would be used to make the search work and to adjust its behavior:
googleterm: The Google search term that generates a preliminary list of sites.
regex: The regular expression that tests the results.
nsall: The number of total sites (a page and all its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it to a larger number.
npage: The number of pages found by the Google search to fetch completely and test with the regex.
nsum: The number of summaries of hits (from the Google search) that are tested with the regex.
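For what it's worth, that parameter list maps almost directly onto a stock argument parser. A hedged sketch (flag behaviors beyond the -b, -o, and -i mentioned in this post are my guesses):

```python
import argparse

def parse_args(argv):
    """Sketch of the googlereg command line described above."""
    p = argparse.ArgumentParser(prog="googlereg")
    p.add_argument("googleterm", help="Google search term for the preliminary list")
    p.add_argument("regex", help="regular expression applied to the results")
    p.add_argument("npage", type=int, help="pages to fetch whole and test")
    p.add_argument("nsum", type=int, help="result summaries to test")
    p.add_argument("nsall", type=int, nargs="?", default=0,
                   help="whole sites to crawl and test (default 0)")
    p.add_argument("-b", action="store_true", help="open results in a browser")
    p.add_argument("-o", metavar="FILE", help="output file")
    p.add_argument("-i", metavar="FILE", help="input file")
    return p.parse_args(argv)
```

Making nsall an optional trailing positional defaulting to zero matches the "rarely if ever" note above.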
The data streams, then, are the googlestream, sitestream, and pagestream. To make this work, they would need to be pre-filtered to isolate the groups of text to test, each associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with: hyphen/newline combos turned into nothing, and newlines otherwise converted to spaces. And so on.
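The newline cleanup at least is trivial. A minimal sketch of what I mean: delete hyphen/newline combos (rejoining words split across lines), then turn any remaining newline run into a single space.

```python
import re

def flatten_text(text):
    """Strip newlines as described above: hyphen/newline -> nothing,
    any other newline (plus surrounding whitespace) -> one space."""
    text = re.sub(r"-\n", "", text)        # "filt-\ner" becomes "filter"
    text = re.sub(r"\s*\n\s*", " ", text)  # remaining newlines become spaces
    return text.strip()
```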
So typing this at the command line:
googlereg -b googleterm regex npage nsum nsall
…eventually opens up a browser window with web page titles (as links), with the Google summary below each, and below that a 200-character (configurable) context with the regex match embedded in the middle. If -b is left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful.
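The "context with the match in the middle" part might look like this; again just a sketch of my own, with the function name and 200-character default invented to match the description:

```python
import re

def match_context(text, pattern, width=200):
    """Return a width-character window of text with the first regex
    match roughly centered, or None if the pattern does not match."""
    m = re.search(pattern, text)
    if not m:
        return None
    mid = (m.start() + m.end()) // 2   # center of the matched span
    start = max(0, mid - width // 2)   # clamp at the start of the text
    return text[start:start + width]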
This is probably doable as a Perl one-liner. What do you think?