Software Needed: Enhanced web search

Google is great, and yes, I know all those tricks that make it greater. But I still want to use REGEX in some cases. So I figured out a way to do that, in theory; all I need now is the code...

Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a Google search through a regex filter and produces a list of hits.

There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at the domain found by Google. The first is quick, though the regex only ever touches Google's own summaries rather than the sites themselves; the second is more effective (obviously) but would take a long time, and it is web-noxious and poor behavior if applied to a large number of sites. The third option is probably redundant, because presumably Google has already checked those pages for your search term, but the ability to do this should be built in just in case.
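To make that concrete, here is a rough sketch of how the three streams might all reduce to one common shape: a list of [url, text] pairs for the regex filter to chew on. The result structure and the fetch_page() and same_domain_links() helpers are invented for illustration:

    use strict;
    use warnings;

    # googlestream: just the titles and summaries Google already
    # returned. Cheap, and the target sites never hear from us.
    sub googlestream {
        my @results = @_;    # each result: { url => ..., summary => ... }
        return map { [ $_->{url}, $_->{summary} ] } @results;
    }

    # pagestream: fetch each hit page itself. More thorough, but slow
    # and rude if pointed at many sites. fetch_page() is hypothetical.
    sub pagestream {
        my @results = @_;
        return map { [ $_->{url}, fetch_page( $_->{url} ) ] } @results;
    }

    # sitestream: every same-domain page linked from each hit. Probably
    # redundant, so off by default. same_domain_links() is hypothetical.
    sub sitestream {
        my @results = @_;
        return map {
            my $hit = $_;
            map { [ $_, fetch_page($_) ] } same_domain_links( $hit->{url} );
        } @results;
    }

Whatever the stream, the downstream regex filter never has to know which one it is looking at.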

So, the following parameters would be used to make the search work and to adjust its behavior:

googleterm: The Google search term that generates a preliminary list of sites.

regex: The regular expression used to test the results.

nsall: The number of whole sites (a page and all its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it to a larger number.

npage: The number of pages found by the Google search to completely search and test with the regex.

nsum: The number of hit summaries (from the Google search) to test with the regex.

The data streams, then, are the googlestream, sitestream, and pagestream. To make this work they would need to be pre-filtered to isolate the groups of text to test, each associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with: hyphen/newline combos turned into nothing, and newlines otherwise converted to spaces. And so on.
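That last cleanup step, at least, is easy to sketch in Perl (since that is where this is headed anyway); something like:

    use strict;
    use warnings;

    # Normalize extracted page text as described above: rejoin words
    # that were hyphenated across line breaks, then flatten any
    # remaining newlines into spaces.
    sub normalize_text {
        my ($text) = @_;
        $text =~ s/-\n//g;       # hyphen/newline combos become nothing
        $text =~ s/\n/ /g;       # other newlines become spaces
        $text =~ s/ {2,}/ /g;    # collapse any doubled-up spaces
        return $text;
    }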

So typing this at the command line:

googlereg -b googleterm regex npage nsum nsall

...eventually opens up a browser window with web page titles (as links) with the Google summary below, and below that a 200-character (configurable) context with the regex match embedded in the middle. If -b was left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful.
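Here is a rough sketch of the whole wrapper, testing just the summary stream. Everything that actually talks to Google hides behind fetch_google_results(), which is hypothetical; the rest is plumbing:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Std;

    my %opt;
    getopts( 'bo:i:', \%opt );    # -b browser, -o output file, -i input file

    my ( $googleterm, $regex, $npage, $nsum, $nsall ) = @ARGV;
    $nsum  //= 10;
    $npage //= 0;
    $nsall //= 0;                 # almost always zero, per above

    my $context = 200;            # characters of context, configurable

    # Hypothetical: returns a list of { url, title, summary } hashes.
    my @results = fetch_google_results( $googleterm, $nsum );

    my @hits;
    for my $r (@results) {
        next unless $r->{summary} =~ /$regex/;
        my $start = $-[0] - int( $context / 2 );    # center on the match
        $start = 0 if $start < 0;
        push @hits, { %$r, context => substr( $r->{summary}, $start, $context ) };
    }

    if ( $opt{b} ) {
        # write a simple HTML results page; actually launching the
        # browser is platform-dependent and left out here
        my $out = $opt{o} // 'googlereg.html';
        open my $fh, '>', $out or die "can't write $out: $!";
        print {$fh} "<html><body>\n";
        print {$fh} qq{<p><a href="$_->{url}">$_->{title}</a><br>},
                    qq{$_->{summary}<br><small>...$_->{context}...</small></p>\n}
            for @hits;
        print {$fh} "</body></html>\n";
    }
    else {
        print "$_->{title}\n$_->{url}\n...$_->{context}...\n\n" for @hits;
    }

The npage and nsall streams would hang off the same skeleton; they just swap different text in for the summary before the match.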

This is probably doable as a Perl one-liner. What do you think?


Interesting. But not a Perl one-liner because the results are not quite simple enough.

HOWEVER

I found this interesting tool you might want to check out:

Yahoo Pipes

http://pipes.yahoo.com/pipes/

It includes, among many other results filters, a regex tool, with examples filtering Google results.

By Gray Gaffer (not verified) on 02 Nov 2009 #permalink

Yes, I know it is not a command-line tool. That would take some more research, like: are Google search results available in a pure XML feed format instead of wrapped in human-only, visually crippled HTML syntax? If that is true, then there are CPAN XML modules that can be used along with the regex post-results filter. But if your desired end result is a web page, then Yahoo Pipes may do the trick for you.
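If such a feed did exist, the CPAN route would only take a few lines; something like this, with the feed layout invented for the sake of argument:

    use strict;
    use warnings;
    use XML::LibXML;

    my $regex = qr/whatever/;                # your post-results filter
    my $xml   = do { local $/; <STDIN> };    # the imagined XML feed

    my $doc = XML::LibXML->new->parse_string($xml);
    for my $item ( $doc->findnodes('//item') ) {    # RSS-ish layout, invented
        my $summary = $item->findvalue('description');
        next unless $summary =~ $regex;
        print $item->findvalue('title'), "\n",
              $item->findvalue('link'),  "\n\n";
    }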

By Gray Gaffer (not verified) on 02 Nov 2009 #permalink

The pipes look interesting. I've seen them before but forgot about them.

As far as dealing with HTML goes, that's fairly easy with the proper text-based web readers and sed, but there should be something in the Google API that will work.

The problem with the Google API might be that they change it now and then.
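The text-based-reader route looks roughly like this, with lynx doing the HTML-to-text work and Perl standing in for sed (the URL and pattern are placeholders):

    use strict;
    use warnings;

    my $url   = 'http://example.com/';
    my $regex = qr/some pattern/;

    # lynx renders the page to plain text; -nolist drops the link list
    my $text = `lynx -dump -nolist "$url"`;
    $text =~ s/\s+/ /g;    # flatten whitespace, as described in the post

    if ( $text =~ $regex ) {
        my $start = $-[0] - 100;    # 200 characters centered on the match
        $start = 0 if $start < 0;
        print substr( $text, $start, 200 ), "\n";
    }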

I've done some Perl code that performs web page text analysis for SEO, and some of that connects to what you're interested in doing here. Stripping out all but the actual content of the pages is essential.

I can see how googlereg would be helpful for some genealogical searching that I do once in a while.

The Google API can return the results formatted as XML, but their terms mean that you have to use their API, get a special API key, and have the web site using it be publicly accessible without restrictions. Without their library modules and a valid key, all you get back is encrypted binary.

So back to page scraping.

By Gray Gaffer (not verified) on 02 Nov 2009 #permalink