Google is great, and yes, I know all those tricks that make it greater. But I still want to use REGEX in some cases. So I figured out a way to do that, in theory; all I need now is the code...
Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a Google search through a regex filter and produces a list of hits.
There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at each domain found by Google. The first is quick but allows only limited contact between the regex filter and the web site. The second is more effective (obviously) but would take a long time, and it is web-noxious, poor behavior if applied to a large number of sites. The third option is probably redundant, since presumably Google has already checked those pages for your search term, but the ability to do it should be built in just in case.
So, the following parameters would be used to make the search work and to adjust its behavior:
googleterm: The Google search term that generates a preliminary list of sites.
regex: The regular expression used to test the results.
nsall: The number of whole sites (a page plus all of its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it higher.
npage: The number of pages found by the Google search to fetch in full and test with the regex.
nsum: The number of summaries of hits (from the Google search) that are tested with the regex.
The data streams, then, are the googlestream, sitestream, and pagestream. To make this work they would need to be pre-filtered so that each chunk of text to be tested is associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with: hyphen/newline combinations turned into nothing, and newlines otherwise converted to spaces. And so on.
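The newline part, at least, is trivial. Something like this would do it (just a sketch; flatten_text is a made-up name for illustration):

#!/usr/bin/perl
# Sketch of the newline cleanup step: re-join words hyphenated across lines,
# turn the remaining newlines into spaces, and tidy up the whitespace.
use strict;
use warnings;

sub flatten_text {
    my ($text) = @_;
    $text =~ s/(\w)-\n(\w)/$1$2/g;   # "exam-\nple" becomes "example"
    $text =~ s/\s*\n\s*/ /g;         # every other newline becomes a single space
    $text =~ s/ {2,}/ /g;            # collapse runs of spaces
    $text =~ s/\A\s+|\s+\z//g;       # trim the ends
    return $text;
}

print flatten_text("An exam-\nple of text\nsplit across\nlines.\n"), "\n";
# prints: An example of text split across lines.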
So typing this at the command line:
googlereg -b googleterm regex npage nsum nsall
...eventually opens up a browser window with web page titles (as links), the Google summary below each, and below that a 200-character (configurable) context with the regex match embedded in the middle. If -b is left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful.
This is probably doable as a perl one-liner. What do you think?
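Probably more than a one-liner once you get into it, but here is a rough Perl sketch of the plumbing I have in mind. Big assumption up front: it does not talk to Google at all. It expects -i to point at a file of already-fetched results, one tab-separated url/title/summary line per hit (however you obtain those), and the nsall/sitestream case is left as a stub.

#!/usr/bin/perl
# googlereg -- a sketch, not a finished tool.  The search results are assumed
# to arrive via -i as tab-separated "url<TAB>title<TAB>summary" lines.
use strict;
use warnings;
use Getopt::Long;
use List::Util qw(min);
use LWP::UserAgent;

my ($to_browser, $infile, $outfile);
GetOptions(
    'b'   => \$to_browser,   # emit bare-bones HTML instead of plain text
    'i=s' => \$infile,       # pre-fetched search results (see note above)
    'o=s' => \$outfile,      # output file; default is standard output
) or die "usage: googlereg [-b] [-i results.tsv] [-o file] googleterm regex npage nsum nsall\n";

my ($googleterm, $regex, $npage, $nsum, $nsall) = @ARGV;
die "need googleterm and regex\n" unless defined $regex;
$npage //= 0;
$nsum  //= 10;
$nsall //= 0;                # stream 3 (sitestream): deliberately not implemented here
my $re = qr/$regex/i;

die "this sketch needs -i with pre-fetched results\n" unless defined $infile;
open my $in, '<', $infile or die "can't read $infile: $!";
my @results;
while (my $line = <$in>) {
    chomp $line;
    my ($url, $title, $summary) = split /\t/, $line, 3;
    push @results, { url => $url, title => $title // $url, summary => $summary // '' };
}
close $in;

my @hits;

# Stream 1 (googlestream): test the summaries themselves.
for my $r (@results[0 .. min($nsum, scalar @results) - 1]) {
    push @hits, { %$r, context => context($r->{summary}, $re) }
        if $r->{summary} =~ $re;
}

# Stream 2 (pagestream): fetch the first $npage pages and scan their full text.
my $ua = LWP::UserAgent->new(timeout => 15, agent => 'googlereg-sketch/0.1');
for my $r (@results[0 .. min($npage, scalar @results) - 1]) {
    my $resp = $ua->get($r->{url});
    next unless $resp->is_success;
    my $text = $resp->decoded_content;
    $text =~ s/<[^>]+>/ /g;          # crude tag strip; a real version would do better
    $text =~ s/(\w)-\n(\w)/$1$2/g;   # hyphen/newline combos turned into nothing
    $text =~ s/\s+/ /g;              # newlines otherwise converted to spaces
    push @hits, { %$r, context => context($text, $re) } if $text =~ $re;
}

# Output: plain text by default, minimal HTML with -b (actually launching a
# browser on the result is left to a wrapper script).
my $out = \*STDOUT;
if (defined $outfile) {
    open $out, '>', $outfile or die "can't write $outfile: $!";
}
if ($to_browser) {
    print {$out} "<html><body><h1>googlereg: $googleterm / $regex</h1>\n";
    printf {$out} qq{<p><a href="%s">%s</a><br>%s<br><small>...%s...</small></p>\n},
        @{$_}{qw(url title summary context)} for @hits;
    print {$out} "</body></html>\n";
} else {
    printf {$out} "%s\n%s\n  %s\n  ...%s...\n\n", @{$_}{qw(title url summary context)}
        for @hits;
}

# Roughly 200 characters (the configurable width) of context around the first match.
sub context {
    my ($text, $re, $width) = @_;
    $width //= 200;
    return '' unless $text =~ $re;
    my $start = $-[0] - int($width / 2);
    $start = 0 if $start < 0;
    return substr $text, $start, $width;
}

The -i crutch is deliberate: actually pulling results out of Google is the legally and technically wobbly part, so it stays outside the sketch; everything downstream of it is ordinary text munging.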
Interesting. But not a Perl one-liner because the results are not quite simple enough.
HOWEVER
I found this interesting tool you might want to check out:
Yahoo Pipes
http://pipes.yahoo.com/pipes/
It includes, amongst many other result filters, a regex tool, with examples of filtering Google results.
Yes, I know it is not a command line tool. That would take some more research, like: are Google search results available in a pure XML feed format instead of wrapped in human-only, crippled visual HTML syntax? If that is true, then there are CPAN XML modules that can be used along with the regex post-results filter. But if your desired end result is a web page, then Yahoo Pipes may do the trick for you.
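If such a feed did exist, the filtering end would be short. A sketch, assuming XML::RSS, with the feed URL standing in for whatever that hypothetical XML results address would be:

#!/usr/bin/perl
# Hypothetical: pull an RSS feed of search results and keep only the items
# whose title or description matches a regex.  There is no such Google feed
# URL that I know of; this only shows the CPAN side of the job.
use strict;
use warnings;
use LWP::UserAgent;
use XML::RSS;

my ($feed_url, $pattern) = @ARGV;
die "usage: $0 feed-url regex\n" unless defined $pattern;
my $re = qr/$pattern/i;

my $resp = LWP::UserAgent->new(timeout => 15)->get($feed_url);
die "fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

my $rss = XML::RSS->new;
$rss->parse($resp->decoded_content);

for my $item (@{ $rss->{items} || [] }) {
    my $blob = join ' ', grep { defined } $item->{title}, $item->{description};
    printf "%s\n%s\n\n", $item->{title} // '(no title)', $item->{link} // ''
        if $blob =~ $re;
}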
The pipes look interesting. I've seen them before but forgot about them.
As far as dealing with HTML goes, that's fairly easy with the proper text-based web readers and sed, but there should be something in the Google API that will work.
The problem with the Google API might be that they change it now and then.
I've done some Perl code that performs web page text analysis for SEO, and some of that overlaps with what you're interested in doing. Stripping out all but the actual content of the pages is essential.
I can see how googlereg would be helpful for some genealogical searching that I do once in a while.
The Google API has the ability to return XML formatting for the results, but their terms mean that you have to use their API, get a special API key, and have the web site using it be publicly accessible without restrictions. Without their library modules and a valid key, all you get back is encrypted binary.
So back to page scraping.
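For the page-scraping side, the basic fetch-and-strip step might look like the following; this is only the easy half, with the hard part (dropping headers, sidebars, and other boilerplate) not shown:

#!/usr/bin/perl
# Fetch one page and reduce it to plain text with HTML::TreeBuilder.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url  = shift or die "usage: $0 URL\n";
my $resp = LWP::UserAgent->new(timeout => 15)->get($url);
die "fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

my $tree = HTML::TreeBuilder->new_from_content($resp->decoded_content);
$_->delete for $tree->find_by_tag_name('script'), $tree->find_by_tag_name('style');

my $text = $tree->as_text;
$tree->delete;                 # free the parse tree
$text =~ s/\s+/ /g;            # flatten whitespace, as suggested in the post
print "$text\n";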
Interesting idea.
Hmm, or alternatively, I bet a Greasemonkey script doing that could be hacked together...