Software Needed: Enhanced web search

By gregladen on November 2, 2009.

Google is great, and yes, I know all those tricks that make it greater. But I still want to use REGEX in some cases. So, I figured out a way to do that, in theory. All I need now is the code...

Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a google search through a regex filter and produces a list of hits.

There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at the domain found by Google. The first is quick but allows only limited contact between the regex filter and the web site. The second is more effective (obviously) but would take a long time, and it is web-noxious, poor behavior if applied to a large number of sites. The third option is probably redundant, because presumably Google has already checked those pages for your Google search term, but the ability to do this should be built in just in case.

So, the following parameters would be used to make the search work and to adjust its behavior:

googleterm: The Google search term that generates a preliminary list of sites.

regex: The regular expression that is designed to test the results.

nsall: The number of total sites (a page and all its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it to a larger number.

npage: The number of pages found by the Google search to completely search and test with the regex.

nsum: The number of summaries of hits (from the Google search) that are tested with the regex.
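The parameters above could be bundled up something like this. This is only a minimal sketch in Python: the class name GoogleregParams, the function filter_summaries, the shape of the hits list, and the defaults for npage and nsum are all assumptions made for illustration (only the zero default for nsall comes from the description above). It covers just the quick stream, the Google summaries.

```python
import re
from dataclasses import dataclass


@dataclass
class GoogleregParams:
    """Hypothetical container for the googlereg parameters."""
    googleterm: str
    regex: str
    nsall: int = 0   # default of zero, as described above
    npage: int = 0   # assumed default, not stated in the post
    nsum: int = 10   # assumed default, not stated in the post


def filter_summaries(hits, params):
    """Test the first `nsum` summaries and keep the hits that match the regex."""
    pattern = re.compile(params.regex)
    return [hit for hit in hits[:params.nsum] if pattern.search(hit["summary"])]


# Made-up search results standing in for the output of a Google search.
hits = [
    {"url": "http://example.com/a", "summary": "Smith family, born 1842 in Ohio"},
    {"url": "http://example.com/b", "summary": "Smith hardware store hours"},
    {"url": "http://example.com/c", "summary": "Smith, died 1901"},
]
params = GoogleregParams(googleterm="Smith genealogy", regex=r"\b1[89]\d\d\b", nsum=2)
matched = filter_summaries(hits, params)
print([hit["url"] for hit in matched])
```

With nsum set to 2, the third hit is never tested even though it would match, which is the point of the limit.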

The data streams, then, are the googlestream, sitestream, and pagestream. To make this work they would need to be pre-filtered to isolate the groups of text to test, each associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with, with hyphen/newline combos turned into nothing, and newlines otherwise converted to spaces. And so on.
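The newline handling just described might look roughly like this in Python. The function name normalize_text is made up for illustration, and this is a sketch of the two rules only (hyphen/newline becomes nothing, any other newline becomes a space), not a full pre-filter.

```python
import re


def normalize_text(text):
    """Remove line breaks so a regex never has to deal with them."""
    # A hyphen directly before a line break is treated as a split word:
    # "regu-\nlar" becomes "regular".
    text = re.sub(r"-\r?\n", "", text)
    # Every remaining line break becomes a single space.
    text = re.sub(r"\r?\n", " ", text)
    return text


print(normalize_text("regu-\nlar expres-\nsions are\nfun"))
```

One caveat: a genuinely hyphenated compound that happens to break at the end of a line (say "well-" followed by "known") would lose its hyphen under the first rule.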

So typing this at the command line:

googlereg -b googleterm regex npage nsum nsall

...eventually opens up a browser window with web page titles (as links) with the Google summary below, and below that a 200-character (configurable) context with the regex match embedded in the middle. If -b is left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful.
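The context snippet could be cut out along these lines. This is a hedged Python sketch: match_context and its width parameter are hypothetical names, and it only handles the first match in the text.

```python
import re


def match_context(text, regex, width=200):
    """Return about `width` characters of text with the first match in the middle.

    Returns None when nothing matches. A match longer than `width`
    is returned whole rather than truncated.
    """
    match = re.search(regex, text)
    if match is None:
        return None
    # Split whatever room is left after the match evenly on both sides.
    padding = max(width - (match.end() - match.start()), 0) // 2
    start = max(match.start() - padding, 0)
    end = min(match.end() + padding, len(text))
    return text[start:end]


sample = "a" * 300 + "REGEX" + "b" * 300
snippet = match_context(sample, r"REGEX", width=200)
print(len(snippet), snippet[95:104])
```

When the match sits near the start or end of the text, the snippet simply comes out shorter, since there is nothing more to show on that side.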

This is probably doable as a perl one-liner. What do you think?

Interesting. But not a Perl one-liner because the results are not quite simple enough.

HOWEVER

I found this interesting tool you might want to check out:

Yahoo Pipes

http://pipes.yahoo.com/pipes/

It includes, among many other results filters, a regex tool, with examples that filter Google results.

Yes, I know it is not a command line tool. That would take some more research, such as: are Google search results available in pure XML feed formats instead of wrapped in human-only visual HTML crippled syntax? If so, then there are CPAN XML modules that can be used along with the regex post-results filter. But if your desired end result is a web page then Yahoo Pipes may do the trick for you.
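If results did come back as an XML feed, the post-results filter would be short. The sketch below uses Python's standard xml.etree.ElementTree rather than the CPAN modules mentioned above, and it assumes an RSS 2.0 style feed with item, title, link, and description elements. The FEED text is made up, and nothing here says Google actually serves results in this shape.

```python
import re
import xml.etree.ElementTree as ET

# Made-up feed in RSS 2.0 layout, standing in for real search results.
FEED = """<rss version="2.0"><channel>
<item><title>Regex tips</title><link>http://example.com/1</link>
<description>Matching numbers with patterns</description></item>
<item><title>Cooking</title><link>http://example.com/2</link>
<description>Soup recipes</description></item>
</channel></rss>"""


def filter_feed(xml_text, regex):
    """Return (title, link) pairs for items whose title or description matches."""
    pattern = re.compile(regex)
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        if pattern.search(title) or pattern.search(description):
            results.append((title, link))
    return results


print(filter_feed(FEED, r"[Rr]egex"))
```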

The pipes look interesting. I've seen them before but forgot about them.

As far as dealing with HTML, that's fairly easy with the proper text-based web readers and sed, but there should be something in the Google API that will work.

The problem with the Google API might be that they change it now and then.

I've done some Perl code that performs web page text analysis for SEO, and some of that seems connected to some of what you're interested in doing. Stripping out all but the actual content of the pages is essential.
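For illustration, here is one way the content-stripping step might be sketched in Python (not the Perl code mentioned above) using the standard html.parser module. The names TextExtractor and extract_text are hypothetical, and the list of skipped tags is an assumption: many blog templates mark headers, footers, and sidebars with div ids or classes instead, which this sketch does not handle.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside the SKIPPED tags."""

    # Assumed set of non-content containers; real templates vary.
    SKIPPED = {"script", "style", "header", "footer", "nav", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page's remaining text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{}</style></head><body><nav>Home</nav>"
        "<p>Real <b>content</b> here.</p><footer>Copyright</footer></body></html>")
print(extract_text(page))
```

Because the chunks are joined with single spaces, text split by inline tags can pick up an extra space before punctuation in some pages, which is harmless for most regex tests.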

I can see how googlereg would be helpful for some genealogical searching that I do once in a while.

The Google API can return results formatted as XML, but their terms mean that you have to use their API, get a special API key, and have the web site using it be publicly accessible without restrictions. Without their library modules and a valid key, all you get back is encrypted binary.

So back to page scraping.

Interesting Idea.

Hmm, or alternatively, I bet a Greasemonkey script doing that could be hacked together...
