Greg Laden's Blog

Evolution, Life Sciences, Science Education, Human Evolution, and Stuff

The contents of Greg Laden's Blog are copyrighted by Greg Laden.


Software Needed: Enhanced web search

Category: Linux, Software needed, Technology, one-liners
Posted on: November 2, 2009 11:07 AM, by Greg Laden

Google is great, and yes, I know all those tricks that make it greater. But I still want to use regular expressions (regex) in some cases. So I figured out a way to do that, in theory. All I need now is the code...

Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a Google search through a regex filter and produces a list of hits.

There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at each domain found by Google. The first is quick but gives the regex filter only limited contact with each web site. The second is more effective (obviously), but it would take a long time and would be web-noxious, poor behavior if applied to a large number of sites. The third is probably redundant, because presumably Google has already checked those pages for your Google search term, but the ability to do this should be built in just in case.
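Whichever stream is used, the core step is the same: test a chunk of text tied to a URL against the regex and keep the ones that match. Here is a minimal sketch of that step in Python, assuming the search results have already been fetched somehow. The `Hit` record and the `regex_filter` function are made-up names for illustration, not part of any existing tool or Google interface.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result; the fields are assumed to be already fetched."""
    url: str
    title: str
    text: str  # summary, page text, or site text, depending on the stream


def regex_filter(hits, pattern):
    """Return only the hits whose text matches the regular expression."""
    compiled = re.compile(pattern)
    return [hit for hit in hits if compiled.search(hit.text)]


# Hypothetical, hand-written records standing in for a search harvest.
sample = [
    Hit("http://example.com/a", "Page A", "Darwin sailed in 1831."),
    Hit("http://example.com/b", "Page B", "No year is mentioned here."),
]

matches = regex_filter(sample, r"\b18\d\d\b")
print([hit.url for hit in matches])
```

Because the filter only sees `text`, the same function would work on summaries, whole pages, or whole sites. Only the harvesting step differs.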

So, the following parameters would be used to make the search work and to adjust its behavior:

googleterm: The Google search term that generates a preliminary list of sites.

regex: The regular expression used to test the results.

nsall: The number of total sites (a page and all its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it to a larger number.

npage: The number of pages found by the Google search to completely search and test with the regex.

nsum: The number of summaries of hits (from the Google search) that are tested with the regex.
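The parameters above could be bundled into one small record, sketched here in Python. Only the zero default for nsall comes from the description above. The defaults for npage and nsum, and the `SearchParams` name itself, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchParams:
    """Settings for one googlereg run (a hypothetical container)."""
    googleterm: str   # search term that generates the preliminary list
    regex: str        # regular expression used to test the results
    nsall: int = 0    # whole sites to crawl; zero default as described
    npage: int = 0    # assumed default, not specified in the post
    nsum: int = 10    # assumed default, not specified in the post

    def __post_init__(self):
        # Counts of things to fetch cannot be negative.
        for name in ("nsall", "npage", "nsum"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be zero or greater")


params = SearchParams("hornbill", r"casque", nsum=20)
print(params)
```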

The data streams, then, are the googlestream, pagestream, and sitestream. To make this work they would need to be pre-filtered to isolate groups of text to test, each associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with, with hyphen/newline combos turned into nothing, and newlines otherwise converted to spaces. And so on.
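The newline handling just described is simple enough to sketch. This Python version assumes the text has already been pulled out of the HTML, and `normalize_text` is a made-up name.

```python
import re


def normalize_text(text):
    """Remove line breaks so a regex never has to deal with them."""
    # Hyphen followed by a newline: a word split across lines, so join it.
    text = re.sub(r"-\r?\n", "", text)
    # Every remaining newline becomes a single space.
    text = re.sub(r"\r?\n", " ", text)
    return text


print(normalize_text("regular expres-\nsions are\nhandy"))
```

One known cost of this rule: a genuinely hyphenated compound that happens to break at its hyphen (such as "well-" at the end of a line and "known" on the next) loses its hyphen too.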

So typing this at the command line:

googlereg -b googleterm regex npage nsum nsall

...eventually opens up a browser window with web page titles (as links) with the Google summary below, and below that a 200-character (configurable) context with the regex match embedded in the middle. If -b is left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful.

This is probably doable as a Perl one-liner. What do you think?
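The context display could work roughly like this. It is a sketch in Python under stated assumptions: `context` is a made-up helper, it shows only the first match, and it pads evenly on both sides so the match sits in the middle of a window of about `width` characters.

```python
import re


def context(text, pattern, width=200):
    """Return about `width` characters of text centered on the first match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Split whatever room is left after the match evenly between both sides.
    pad = max(0, (width - (match.end() - match.start())) // 2)
    start = max(0, match.start() - pad)
    end = min(len(text), match.end() + pad)
    return text[start:end]


# Hypothetical page text with the match buried in the middle.
text = "x" * 300 + "NEEDLE" + "y" * 300
snippet = context(text, r"NEEDLE")
print(len(snippet))
```

Near the start or end of a page the window simply comes out shorter, which seems better than padding it with blanks.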



Comments

1

Interesting. But not a Perl one-liner because the results are not quite simple enough.

HOWEVER

I found this interesting tool you might want to check out:

Yahoo Pipes

http://pipes.yahoo.com/pipes/

It includes, amongst many other results filters, a regex tool, with examples that filter Google results.

Posted by: Gray Gaffer | November 2, 2009 2:36 PM

2

Yes, I know it is not a command line tool. That would take some more research, like - are Google search results available in pure XML feed formats instead of wrapped in human-only visual HTML crippled syntax? If that is true, then there are CPAN XML modules that can be used along with the regex post-results filter. But if your desired end result is a web page then Yahoo Pipes may do the trick for you.

Posted by: Gray Gaffer | November 2, 2009 2:39 PM

3

The pipes look interesting. I've seen that before but forgot about them.

As far as dealing with HTML, that's fairly easy with the proper text-based web readers and sed, but there should be something in the Google API that will work.

The problem with the Google API might be that they change it now and then.

Posted by: Greg Laden | November 2, 2009 2:51 PM

4

I've done some Perl code that performs web page text analysis for SEO, and some of that seems connected to some of what you're interested in doing. Stripping out all but the actual content of the pages is essential.

I can see how googlereg would be helpful for some genealogical searching that I do once in a while.

Posted by: Dan J | November 2, 2009 5:40 PM

5

Google API has the ability to return XML formatting for the results, but their terms mean that you have to use their API, get a special API key, and have the web site using it be publicly accessible without restrictions. Without their library modules and valid key all you get back is encrypted binary.

So back to page scraping.

Posted by: Gray Gaffer | November 2, 2009 10:12 PM

6

Interesting Idea.

Posted by: Googleverse | November 3, 2009 7:16 AM

7

Hmm, or alternatively, I bet a Greasemonkey script doing that could be hacked together...

Posted by: DrMcCoy | November 3, 2009 7:22 AM


© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.