one-liners https://scienceblogs.com/ en What Linux Distro and Version am I using???? https://scienceblogs.com/gregladen/2009/11/14/what-linux-distro-and-version <span>What Linux Distro and Version am I using????</span> <div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>In the old days, you could just use the "help" menu item to figure this out (drilling down to "about") but now there is so much "helpful" crap in the dialog that opens when you do so, that it has become much less helpful.</p> <p>So just open a command line and cause the contents of the files that contain your release information to be fed to standard output. </p> <p>i.e., type: </p> <blockquote><p>cat /etc/*-release</p></blockquote> </div> <span><a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a></span> <span>Sat, 11/14/2009 - 07:45</span> <div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline"> <div class="field--label">Tags</div> <div class="field--items"> <div class="field--item"><a href="/tag/linux" hreflang="en">Linux</a></div> <div class="field--item"><a href="/tag/one-liners" hreflang="en">one-liners</a></div> </div> </div> <section> <article data-comment-user-id="0" id="comment-1407003" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258206777"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Hey, cool. I think it's more for a program to learn about the system it's installed on, though. I don't really find myself in front of random Linux systems too often. 
But this definitely goes in the toolbox alongside "uname -a".</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Nemo (not verified)</span> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407003">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1407004" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258207374"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>There's a better way: the lsb_release command, which should be in any LSB-compatible distribution (LSB stands for Linux Standard Base).</p> <p>Particularly useful is the "lsb_release -a" command:</p> <p>cyberax@lw1:~/work/app$ lsb_release -a<br /> LSB Version: 
core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:qt4-3.1-amd64:qt4-3.1-noarch<br /> Distributor ID: Ubuntu<br /> Description: Ubuntu 9.10<br /> Release: 9.10<br /> Codename: karmic</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Alex Besogonov (not verified)</span> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407004">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1407005" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258207668"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>cat /etc/*-release</p> <p>does not work with Debian lenny</p> <p>the file is /etc/debian_version</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">alex (not verified)</span> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407005">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1407006" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258211517"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>In Ubuntu, you can also go to System-&gt;About Ubuntu. 
This will bring up a help page with the current version of your operating system, its release date, how long the current version will be supported, a description of the Ubuntu project, and some helpful links.</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">BruceH (not verified)</span> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407006">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1407007" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258213501"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Try installing acpi.</p> <p>sudo apt-get install acpi</p> <p>When you run it, you'll get</p> <p># acpi -V</p> <p> Battery 0: Full, 100%<br /> Battery 0: design capacity 4752 mAh, last full capacity 4752 mAh = 100%<br /> AC Adapter 0: on-line<br /> Thermal 0: ok, 55.0 degrees C<br /> Cooling 0: LCD 0 of 10<br /> Cooling 1: Processor 0 of 10<br /> Cooling 2: Processor 0 of 10</p> <p>This is very useful if your machine runs hot on occasion.</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">mikey.duhhh (not verified)</span> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407007">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1407008" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1258247586"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>There is a feature that Apple used to have on the Classic MacOS (I don't know if it exists on OS X, though there seems to be something of the sort still in there) called the Gestalt Manager that would allow you to inquire about the computer environment. It was extremely useful for certain purposes, so much so that Apple actually included it in the standard libraries for Mac development so it would be available on System 6. 
It would be nice if Linux had something like that, but I don't think the development model is conducive to it, and in any case I don't know how Gestalt handled shared libraries after Apple started using them all over the place.</p> </div> </div> <footer> <em>By <a rel="nofollow" href="http://offseasontv.blogspot.com" lang="" typeof="schema:Person" property="schema:name" datatype="">Brian X (not verified)</a> on 14 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1407008">#permalink</a></em> </footer> </article> </section> Sat, 14 Nov 2009 12:45:51 +0000 gregladen 28065 at https://scienceblogs.com Software Needed: Enhanced web search https://scienceblogs.com/gregladen/2009/11/02/software-needed-enhanced-web-s <span>Software Needed: Enhanced web search</span> <div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>Google is great, and yes, I know all those tricks that make it greater. But I still want to use REGEX in some cases. 
So, I figured out a way to do that, in theory. All I need now is the code...</p> <!--more--><p>Briefly, the software I need, which I shall call googlereg for now, feeds the harvest from a Google search through a regex filter and produces a list of hits.</p> <p>There are three streams of data that could be fed through the regex filter: the basic output of the Google search (what you see as results on a Google search page), the actual pages found by the Google search, and the entire site at the domain found by Google. The first is quick but allows only limited contact between the regex filter and the web site; the second is more effective (obviously) but would take a long time, and is web-noxious and poor behavior if applied to a large number of sites. The third option is probably redundant because presumably Google has already checked those pages for your Google search term, but the ability to do this should be built in just in case. </p> <p>So, the following parameters would be used to make the search work and to adjust its behavior:</p> <p><em>googleterm:</em> The Google search term that generates a preliminary list of sites.</p> <p><em>regex:</em> The regular expression that is designed to test the results.</p> <p><em>nsall:</em> The number of total sites (a page and all its links on the same domain) to search and test with the regex. Default is zero, and one would rarely if ever want to set it to a larger number.</p> <p><em>npage:</em> The number of pages found by the Google search to completely search and test with the regex.</p> <p><em>nsum:</em> The number of summaries of hits (from the Google search) that are tested with the regex. </p> <p>The data streams, then, are the googlestream, sitestream, and pagestream. To make this work they would need to be pre-filtered to isolate the groups of text to test that are associated with a specific URL. Furthermore, it would be cool if blogs produced by commonly used software (Movable Type, WordPress, etc.) 
were also processed to strip out headers, footers, and sidebars. Beyond this, the remaining text would ideally be stripped of newlines so they don't have to be messed with, with hyphen/newline combos turned into nothing, and newlines otherwise converted to spaces. And so on.</p> <p>So typing this at the command line:</p> <blockquote><p> googlereg -b googleterm regex npage nsum nsall </p></blockquote> <p>...eventually opens up a browser window with web page titles (as links) with the Google summary below, and below that a 200 character (configurable) context with the regex match embedded in the middle. If -b is left off, the results go to standard output. Cool parameters like -o for output file and -i for input file could also be useful. </p> <p>This is probably doable as a Perl one-liner. What do you think? </p> </div> <span><a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a></span> <span>Mon, 11/02/2009 - 05:07</span> <div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline"> <div class="field--label">Tags</div> <div class="field--items"> <div class="field--item"><a href="/tag/linux" hreflang="en">Linux</a></div> <div class="field--item"><a href="/tag/one-liners" hreflang="en">one-liners</a></div> <div class="field--item"><a href="/tag/technology" hreflang="en">Technology</a></div> </div> </div> <section> <article data-comment-user-id="0" id="comment-1405683" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257168979"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Interesting. 
But not a Perl one-liner because the results are not quite simple enough.</p> <p>HOWEVER</p> <p>I found this interesting tool you might want to check out:</p> <p>Yahoo Pipes</p> <p><a href="http://pipes.yahoo.com/pipes/">http://pipes.yahoo.com/pipes/</a></p> <p>includes amongst many other results filters a regex tool. With examples filtering Google results.</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Gray Gaffer (not verified)</span> on 02 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405683">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1405684" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257169186"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Yes, I know it is not a command line tool. That would take some more research, like - are Google search results available in pure XML feed formats instead of wrapped in human-only visual HTML crippled syntax? If that is true, then there are CPAN XML modules that can be used along with the regex post-results filter. 
But if your desired end result is a web page then Yahoo Pipes may do the trick for you.</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Gray Gaffer (not verified)</span> on 02 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405684">#permalink</a></em> </footer> </article> <article data-comment-user-id="31" id="comment-1405685" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257169907"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>The pipes look interesting. I've seen them before but forgot about them.</p> <p>As far as dealing with HTML, that's fairly easy with the proper text-based web readers and sed, but there should be something in the Google API that will work.</p> <p>The problem with the Google API might be that they change it now and then.</p> </div> </div> <footer> <em>By <a title="View user profile." 
href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 02 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405685">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1405686" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257180057"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>I've done some Perl code that performs web page text analysis for SEO, and some of that seems connected to some of what you're interested in doing. 
Stripping out all but the actual <em>content</em> of the pages is essential.</p> <p>I can see how <code>googlereg</code> would be helpful for some genealogical searching that I do once in a while.</p> </div> </div> <footer> <em>By <a rel="nofollow" href="http://www.relativelyunrelated.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Dan J (not verified)</a> on 02 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405686">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1405687" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257196362"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Google API has the ability to return XML formatting for the results, but their terms mean that you have to use their API, get a special API key, and have the web site using it be publicly accessible without restrictions. 
Without their library modules and valid key all you get back is encrypted binary.</p> <p>So back to page scraping.</p> </div> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Gray Gaffer (not verified)</span> on 02 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405687">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1405688" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257228976"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Interesting Idea.</p> </div> </div> <footer> <em>By <a rel="nofollow" href="http://www.gtricks.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Googleverse (not verified)</a> on 03 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405688">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1405689" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1257229354"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Hmm, or alternatively, I bet a Greasemonkey script doing that could be hacked together...</p> </div> </div> <footer> <em>By <a rel="nofollow" href="http://drmccoy.de/" lang="" typeof="schema:Person" property="schema:name" datatype="">DrMcCoy (not verified)</a> on 03 Nov 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1405689">#permalink</a></em> </footer> </article> </section> Mon, 02 Nov 2009 10:07:05 +0000 gregladen 27953 at https://scienceblogs.com Win friends and fix up your data with regular expressions https://scienceblogs.com/gregladen/2008/09/12/win-friends-and-fix-up-your-da <span>Win friends and fix up your data with regular expressions</span> <div 
class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><blockquote><p>In many instances, a well thought-out regular expression can convince most non-technical people in the room that you're a computer genius who's brain possesses more synapses, forming more bridges and firing more rapidly than anyone's ever should.</p></blockquote> <p>Oh this is so true. The other day I was working on cleaning up some data with a colleague. We had two simple but common problems with our database, which had a few thousand records and forty or so variables. </p> <p>1) We needed to get a subset of data that included means for all numeric values based on a single factor (a factor = a categorical variable used to divide up or group the other data... in this case, species name); and</p> <p>2) We needed to make sure that blank spaces in this grid were filled with "NA" in order for the statistical software to recognize these blanks as missing data.</p> <p>Chances are, you've had these problems too. Let me tell you how we solved them.</p> <!--more--><p>The subtotal problem seems easy. There are two obvious ways to do this. One is to use brute force ... put the data in a spreadsheet and sort by the factor (in this case 'species'). Insert a row after each set of species data, and put in formulas to calculate the averages. This option is unacceptable because it is laborious, error-prone, and boring, and every time we 'fix' the data (finding errors, etc.) we have to recalculate, so any method should be fast and accurate.</p> <p>The other way is to use the subtotals function on the spreadsheet. Say "we want subtotals of these variables ... but make the totals averages ... and get a subtotal for each different species...." All commonly used spreadsheets have a subtotal calculation function. Just press the subtotal calculation button and you are done.</p> <p>This works, but it does not work. 
You get the subtotals, and you can hide away the raw data and see only the subtotals, but you can't copy and paste these subtotals, you can't save the file as only the subtotals, you can't do anything with these subtotals without the raw data coming along for the ride. </p> <p>Or at least, I have no idea how to do this. If you do, tell me!</p> <p>Here is how I solved the problem: I created the subtotals and saved the file ... with the subtotals ... as a comma-separated file (any kind of text file would work). Then I got up a shell, navigated to the directory holding the text file, and typed something like:</p> <p>grep Result filename.csv &gt; summarydata.txt</p> <p>The subtotal lines all have the string "Result" in them. So this one-liner, which is not even a regular expression (but could be if I wanted it to be), creates a file with only the summary data. You then have to manually copy and paste the headings from your original data file.</p> <p>Then, without skipping a beat, I addressed the problem of the missing data. I typed:</p> <p>sed 's/,,/,NA,/g' summarydata.txt | sed 's/,,/,NA,/g' &gt; summarydatafixed.txt</p> <p>The data were totally fixed, averages only and all the missing data represented as NA. The output was so perfect and elegant, I'm pretty sure my colleague actually stopped breathing for a moment. </p> <p>In the above command, I ran the file through the sed command (which changes ",," to ",NA,") twice because three commas in a row (representing two missing data items in a row) would change to ",NA,,": the second comma is consumed by the first match and so cannot start another match farther down the line. But twice gets them all.</p> <p>There is another way to do this without running the data twice, using "lookahead" capabilities that are in Perl. 
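</p> <p>A minimal sketch of the difference between one pass, two passes, and a lookahead, assuming a POSIX shell with sed and perl installed. The function names and the sample row are made up for illustration, and like the commands above it ignores quoted fields and blanks at the very start or end of a line:</p>

```shell
#!/bin/sh
# Hypothetical helper names; only sed and perl themselves come from the post.

# One pass: the match for the first ",," consumes both commas,
# so a run of three commas is only half fixed.
fill_once() { sed 's/,,/,NA,/g'; }

# Two passes, as in the post: the second pass catches the pair left behind.
fill_twice() { sed 's/,,/,NA,/g' | sed 's/,,/,NA,/g'; }

# One pass with a Perl lookahead: (?=,) checks for the next comma
# without consuming it, so every blank field gets matched.
fill_lookahead() { perl -pe 's/,(?=,)/,NA/g'; }

row='Homo sapiens,1.2,,,3.4'   # made-up sample row with two blanks in a row
printf '%s\n' "$row" | fill_once
printf '%s\n' "$row" | fill_twice
printf '%s\n' "$row" | fill_lookahead
```

<p>The first call leaves one of the two blanks unfilled; the other two fill both, which is the whole argument for either running sed twice or reaching for the lookahead.</p> <p>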
The quote above, which I think I will print out and frame and hang on my wall, <a href="http://linuxshellaccount.blogspot.com/2008/09/finding-overlapping-matches-using-perls.html"><em>comes from a blog post by Mike that I think may explain this </em></a>... but my solution was quicker and remained efficient. I swear to you, everyone in the room thought I was brilliant, and this was a room crammed with scientists, mostly cladists. It is pretty rare to get cladists thinking anyone else is smart. </p> </div> <span><a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a></span> <span>Fri, 09/12/2008 - 14:50</span> <div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline"> <div class="field--label">Tags</div> <div class="field--items"> <div class="field--item"><a href="/tag/linux" hreflang="en">Linux</a></div> <div class="field--item"><a href="/tag/one-liners" hreflang="en">one-liners</a></div> <div class="field--item"><a href="/tag/technology" hreflang="en">Technology</a></div> </div> </div> <section> <article data-comment-user-id="0" id="comment-1379381" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221248412"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>I can do all this in spreadsheets, using conditional links to the output of pivot tables, because I've had to. It's better than the brute force option, because I can do it in such a way that it updates when the data does. But it is not elegant. 
It does, however, inspire some awe.</p> </div> </div> <footer> <em>By <a rel="nofollow" href="http://almostdiamonds.blogspot.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Stephanie Z (not verified)</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379381">#permalink</a></em> </footer> </article> <article data-comment-user-id="0" id="comment-1379382" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221249539"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>I'm sorry, but I feel that this post is just not complete without a reference to this <a href="http://xkcd.com/208/">XKCD</a> comic. 
;-)</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379382&amp;1=default&amp;2=en&amp;3=" token="YagUBcZdsUimobyyy8g37suxKwoMbpWK9bgg-OiPZ0I"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Jim M (not verified)</span> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379382">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379383" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221250042"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Luckily you didn't have any qualified fields in that csv file.</p> <p>What statistical software are you using?</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379383&amp;1=default&amp;2=en&amp;3=" token="p-eswzvuV1Fp0Jr82oDCCjz3LXVDXAiXYnP6B372LEw"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">AnonymousCoward (not verified)</span> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379383">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" 
height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379384" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221250873"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>With the spreadsheet can't you just hide the raw data and then copy only visible cells to a new spreadsheet.<br /> Sometimes you have to write a simple formula to propagate the titles of the averages to a blank cell on the same row as the subtotal.</p> <p>I am pretty sure this is feasible with Excel not sure about other spreadsheet software.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379384&amp;1=default&amp;2=en&amp;3=" token="JVLeyeVIhsncztB43DD_ONVZD2ZtpBzX4faKFytTN8s"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Rohan Smith (not verified)</span> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379384">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379385" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221251068"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Rohan: No! Amazingly enough. I have never gotten this to work with Excel. 
Let me know if you can actually do it. It might well be possible, but there is a point at which trying to figure something out becomes far more time-consuming than banging out a quick sed script. </p> <p>Coward: R</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379385&amp;1=default&amp;2=en&amp;3=" token="DRVkGyHdkQaH20zi8475AkAYEe8JTORmVspinVyuwa4"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379385">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379386" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221251341"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Stephanie: Yes, you can also use database functions. The problem is you still have to redo or adjust when the number (or level) of factors changes. The automatic "make sub totals" works best to get the average (or other calculation), but then you can't get them out. </p> <p>You can copy and paste a pivot table, though. 
That tends to work where it works.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379386&amp;1=default&amp;2=en&amp;3=" token="NlyVLbztIcsFL5yDUTVujhm9925KxMtU-YNzFG3_fGg"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379386">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379387" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221252434"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Greg, you will never see me copying and pasting a pivot table. They are some of the ugliest things on Earth. Even if I'm the only person who will ever see it, presentation counts.</p> <p>It's possible to set up the summary table to make adding a factor as easy as adding a line and entering the factor. A checksum will tell you whether you need to. Adding a level is a nightmare unless you're so well-versed in the syntax of pivot table calls that you can do it with search and replace.</p> <p>What Rohan is talking about should work in Excel as long as you use the filter function to get just the rows you want. 
Just don't use undo after pasting and expect to have the same contents in your clipboard. Blech.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379387&amp;1=default&amp;2=en&amp;3=" token="zDln3OV9bjFyibdGv928IrGhBeiHG2jhbzxcMWAeZYc"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://almostdiamonds.blogspot.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Stephanie Z (not verified)</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379387">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379388" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221253116"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>You can't hide the raw data and then copy the set of visible cells. You need to copy every line (of summary data) one at a time. </p> <p>I have not tried imposing a filter on top of the summary data. That might work, but until I see it happen I'm not buying it. So far my method is still better. </p> <p>Actually, what is even better, and what I'll eventually do, is to write an awk script that eats a data file and spits out a summary on factor file. How hard can that be? 
It can probably be done in fewer than ten lines.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379388&amp;1=default&amp;2=en&amp;3=" token="gzw804RlWPJbSFnigPUuO1rMne1_gW5oqLH4BpYDh0s"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379388">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379389" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221253651"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Oh, by all the useless gods, yes, your way is better. No question. Hands down. Much, much better. I just don't get to use it at work. 
:p</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379389&amp;1=default&amp;2=en&amp;3=" token="ZxMo2XhWb6NndyH6Omu0Js3kDcuLXUDGyH8jOauiSVk"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://almostdiamonds.blogspot.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Stephanie Z (not verified)</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379389">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379390" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221265965"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>I think it says somewhere in the comments that you are using R.</p> <p>This all could have been done within R very simply (if I understand what you are doing). One of my favorite and most widely used R functions is aggregate. You can use it to generate a summary of a data.frame, performing some function on groups of data sorted by factors. For instance</p> <p>aggregate(data.frame$var, list(data.frame$factor), FUN = mean)</p> <p>will give you the mean var for each level of factor. 
It should also automatically put the NAs in there if it can't calculate the means for some reason.</p> <p>Hope it helps.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379390&amp;1=default&amp;2=en&amp;3=" token="DyLzLhnA2FlQ2yna4wix5wjy9aJ_ydA1-s56BKNMqXA"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://skeetersays.blogspot.com" lang="" typeof="schema:Person" property="schema:name" datatype="">Kevin (not verified)</a> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379390">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379391" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221274649"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Is there any reason you have to use a spreadsheet rather than a relational database? If you can go the relational database route then a little SQL will sort you out. 
</p> <p>Most database management systems allow easy import from Excel.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379391&amp;1=default&amp;2=en&amp;3=" token="kCHeVtG8hw4QGQggraFyHxe65kKT5d-gZgk5YbZBBRk"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Matt Penfold (not verified)</span> on 12 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379391">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379392" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221292294"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>You can avoid the problem of the triple comma if you use Perl and split the line on the comma. 
Say I have a file foo.csv with the following contents:</p> <p>bacon,eggs,spam,,foobar,spam,,,foo</p> <p>I can use the following little Perl program to substitute the NAs (which I hope doesn't come out mangled):</p> <p>use warnings;<br /> use strict;</p> <p>while (&lt;&gt;) {<br /> chomp; # strip the newline so the last field compares cleanly<br /> my @line = split /,/, $_, -1; # -1 keeps trailing empty fields</p> <p> for (my $i = 0; $i &lt; @line; $i++) {<br /> $line[$i] = 'NA' if $line[$i] eq '';<br /> }<br /> print join(' ', @line), "\n";<br /> }</p> <p>Fields that are empty strings are replaced by "NA", and the output is</p> <p>bacon eggs spam NA foobar spam NA NA foo</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379392&amp;1=default&amp;2=en&amp;3=" token="X00dNeFPZ7LZZxnnDgxOiN1epV-xGoy91LFL3nxF_6I"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">J. J. Ramsey (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379392">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379393" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221293907"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>I love this. This is why I posted this, to get so many different perspectives.</p> <p>Aggregate in R is the solution for Phase II. 
What we did so far is to prototype the whole process using the spreadsheet data, with the intention of eventually having the whole analysis done on a text file that gets eaten by R and processed entirely with R functions. However, while R is a pretty good math environment and a very good stats environment, it simply is not a very good data manipulation environment.</p> <p>Why spreadsheet instead of sql? Every morning I wake up and ask myself that. But seriously ... there are technical reasons having to do with data collection for the data to have originally been put in a spreadsheet. Then, I seriously considered moving it to a database, but did not because my co-author, who collected all the data and knows the most about the specific quirks and issues of the data, is not using a database and, while she could easily learn it, had not previously done so, etc. etc. So we are going straight from spreadsheet to text file to R. </p> <p>jj: Excellent perl solution, and probably better than the equivalent bash solution using tr, etc. </p> <p>What I would like to see happen, possibly as a nice open-ended open source project, is a port of all of the functionality of R to the bash command line as utilities/commands that make certain assumptions about the data structures (and read sql as well as text), and this sort of perl program would be a perfect utility. What do you call it? csv2na? csv-na?</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379393&amp;1=default&amp;2=en&amp;3=" token="pUTd3nLU4nC9SZcLv684e1rVUB1YmfgFpLhmUriBC3M"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." 
href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379393">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379394" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221295268"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><blockquote><p>Fields that are empty strings are replaced by "NA"</p></blockquote> <p>Care is needed here.</p> <p>A blank string is not the same thing as a NULL value. A blank string is a string that does not have any characters in it. A NULL value means there is no value assigned to that field. 
Searching on blank strings may not return those records with a NULL value in the field.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379394&amp;1=default&amp;2=en&amp;3=" token="9HUbEUoT08SvTOMzSXoniITU89-pRK7PXhub2WFWFYw"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Matt Penfold (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379394">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379395" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221296027"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Excel should be able to copy/paste those subtitles just fine.</p> <p><a href="">"Paste Special"</a>. 
Select "values" as the special thing you want to paste.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379395&amp;1=default&amp;2=en&amp;3=" token="1JkAoqBtT7I8btpIz1GCUOndsStTpJoMSWb_HlUNrzs"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://museinvivo.blogspot.com" lang="" typeof="schema:Person" property="schema:name" datatype="">Muse142 (not verified)</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379395">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379396" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221298305"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Muse: You would think. The behavior is to change the formulas in those rows to values (which save as csv does as well, by the way), but the raw data does not go away.</p> <p>BTW, you CAN paste as values (and get the de-formula'd lines as well as raw data) to a different sheet, then sort to get all the lines with "result" in them in one place, then delete the copied raw data ... essentially the same thing that I did with grep. But it is still a kludge.</p> <p>Matt, clearly we need to field test this. 
At least in perl, blank and null should be handled in similar ways across platforms, yes?</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379396&amp;1=default&amp;2=en&amp;3=" token="s8c8mHxpLZZOsiQkPmeF_M1zVU3zv9N1ENtBwop48N4"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379396">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379397" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221299949"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><blockquote><p>Matt, clearly we need to field test this. At least in perl, blank and null should be handled in similar ways across platforms, yes?</p></blockquote> <p>Greg,</p> <p>It is rather dependent on both application(s) and platform. I raised the point because it has caught me out in the past. But yes, my experiences with Perl lead me to believe blanks and nulls are handled the same. SQL of course is another matter, and you should never assume SQL will treat them the same. It really is one of those things you need to test before using in anger. 
</p> <p>Actually there are valid reasons for treating blank strings different from nulls.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379397&amp;1=default&amp;2=en&amp;3=" token="c5THv1nbcbdHbH98rVYq_dmpHEaGMhBvcrrtBihDHF4"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Matt Penfold (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379397">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379398" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221300476"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Matt Penfold: "A blank string is not the same thing as a NULL value."</p> <p>True, but the "split" command, IIRC, treats an empty string between two delimiters as, well, an empty string, so I wrote the program accordingly. If I were dealing with something other than a CSV file, I'd do it differently.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379398&amp;1=default&amp;2=en&amp;3=" token="ga4v7inZH83Z85tDiMWdpnlOty8tIVUVFnpqjEY72Tg"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">J. J. 
Ramsey (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379398">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379399" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221300485"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Greg Laden,</p> <p>J. J. Ramsey's perl solution, although simple to understand, isn't all that good. Although it works for very simple input, as soon as you hand it data where some of the cells have commas in them, it CHANGES that perfectly valid data. Can you REALLY say that's what you want? Your sed solution is better for that reason alone. Plus, 5+ lines of perl replacing one line of sed? Bah. Check the -e option on how to make it a single sed command. Of course, as AnonymousCoward pointed out, you both are not handling data where consecutive commas actually appear in your data. Finally, there seems to be no concern over how to handle error conditions. 
This could be so much better.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379399&amp;1=default&amp;2=en&amp;3=" token="f_CP0iCpJlhunlyFgb5pddL2eytce1u65mNV55K8l4Y"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Shawn Smith (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379399">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379400" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221309170"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Matt: I'm sure there is an interesting story as to why NA is "missing data"... I've done work with SPSS versions in which missing data was -9999. Which is totally dumb, of course, because that could be a measure. But a blank and a null, since they often look and act the same but sometimes not, are obviously bad choices. </p> <p>Shawn: Well, you can't handle error conditions too easily and stick with the one-liner, but yes, my sed line was the best solution at the time because I wanted a one liner. 
That was not a program, but rather, a command issued under the condition that I knew what the data looked like.</p> <p>In the command line statistical package (I think I'll call it statmagick) there would be utilities to fix up and verify the text-based or sql-based data sets.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379400&amp;1=default&amp;2=en&amp;3=" token="-Cxt3Xvzh59wuaozyHW7wf1WZXH29HhwYflPv9UPTEI"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379400">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379401" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221309172"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Shawn Smith: "Although it works for very simple input, as soon as you hand it data where some of the cells have commas in them, it CHANGES that perfectly valid data."</p> <p>But then, if you are dealing with commas in the data, then CSV format is <i>not</i> what you want. You'd be better off using a data format where there wasn't an issue of delimiters being confused for data. 
If you are collecting and storing data, then it should be stored in a format that makes processing it fairly trivial, either because the format is simple (e.g. CSV, tab-delimited) or because there are libraries to handle the bulk of the parsing and whatnot so that you don't have to do it yourself (e.g. XML, certain binary formats).</p> <p>Shawn Smith: "Plus, 5+ lines of perl replacing one line of sed? Bah."</p> <p>I'll take the 5+ lines of clear, readable Perl over a sed one-liner that has an apparent redundancy that requires an explanation. Further, if you want to handle error conditions, then it is not too difficult to modify and extend the Perl script, which isn't true for the sed one-liner.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379401&amp;1=default&amp;2=en&amp;3=" token="NaX4sAZy_2UIOF6Hd4ZoK8pfMNW_0_7GySJ58GMR4QY"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">J. J. Ramsey (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379401">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379402" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221309527"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>BTW: why did I use CSV instead of tab-delimited? Two reasons: One, I can see a comma. 
Two: a comma is a single keystroke but entering a tab (in a RE) is two. Otherwise, no reason, really. Either one is happily eaten by R. </p> <p>Initially, all the utilities in statmagick will be programmed in various scripting languages such as perl, python, even bash, and of course awk. The question is, do we leave them that way or port them to c eventually?</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379402&amp;1=default&amp;2=en&amp;3=" token="EWnb-QLQVy9wvbTNGdzFCIdEsqcYuE-n5KVHiRqVZeI"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379402">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379403" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221325136"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>J. J. Ramsey,</p> <blockquote><p>But then, if you are dealing with commas in the data, then CSV format is not what you want. 
You'd be better off using a data format where there wasn't an issue of delimiters being confused for data.</p></blockquote> <p>Granted, CSV is probably not ideal, but it is one of the simpler ones that could include <i>any</i> characters in the data--simply quote all data items and double any quotes that are part of the data. It's unlikely you can come up with a format that will avoid the issue of delimiters being confused for data, in the general case, especially if you want to internationalize your code, and still be as simple as CSV. And thanks to CPAN for perl, jakarta for Java, and equivalent repositories for other languages, processing almost any format is quick and easy. In terms of developer time, anyway.</p> <blockquote><p>If you are collecting and storing data, then it should be stored in a format that makes processing it fairly trivial,</p></blockquote> <p>That depends. It is certainly true when "processing" (whatever the hell that is) requires the bulk of the work. But when ease of display or low space usage is a priority, there are good reasons to store it in a way that makes "processing" it more difficult.</p> <blockquote><p>I'll take the 5+ lines of clear, readable Perl over a sed one-liner that has an apparent redundancy that requires an explanation.</p></blockquote> <p>That is probably one of the few times the phrase "readable Perl" does <i>not</i> evoke peals of laughter. 
It's easy for me to believe there are perl programmers (people who use perl on a day-to-day basis for years in CGI and database programming) who don't know what "while (&lt;&gt;)" does, or that you can use an array in a scalar context to get its length, but that really does say more about those programmers than perl.</p> <blockquote><p>Further, if you want to handle error conditions, then it is not too difficult to modify and extend the Perl script, which isn't true for the sed one-liner.</p></blockquote> <p>Very true.</p> <p>To get back on topic, though, I would be concerned by the idea that really smart people are impressed with someone who knows how to deal with regular expressions (REs). REs really aren't that hard to understand, especially by people who are good at memorizing long lists (most biologists). If I did see someone who is impressed, I would have no problem letting them know that it really is no big deal, and do what I could to explain it to them so that they could use REs themselves. I certainly wouldn't want them to think there's anything difficult about REs, and that knowledge of how to use them implies expertise in areas not related to computers.</p> <p>Isn't that one of the problems we have with so many engineers and computer programmers (I am one) being ID advocates (I am not)? They know they are smart and they are told by biologists they are smart because they can process their data so easily using regular expressions. Those programmers, who know how simple REs are to understand, then think they are smarter than biologists but they don't get the whole evolution thing. Therefore, evolution is wrong and ID must be right. 
bah.</p> <p>It seems better to let as many people as possible know how to use REs (it shouldn't take more than a couple hours) so that we can remove one more example of magical thinking from the population.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379403&amp;1=default&amp;2=en&amp;3=" token="3KIvbp40zRtkIAf-Wss6B5Tx0T81YbygGbNN_RQh6do"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Shawn Smith (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379403">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379404" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221325327"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Oops--that was "while (&lt;&gt;)". 
Stupid, stupid mistake on my part.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379404&amp;1=default&amp;2=en&amp;3=" token="ejrtJ9GY4xrTCPr9dcjOrKThHIK3fopNZZPN2cScROU"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Shawn Smith (not verified)</span> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379404">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379405" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221327801"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p><em>That is probably one of the few times the phrase "readable Perl" does not evoke peals of laughter.</em></p> <p>I have to admit that I did chuckle at this. </p> <p><em>I would be concerned by the idea that really smart people are impressed with someone who knows how to deal with regular expressions (REs). REs really aren't that hard to understand,</em></p> <p>Stop! Stop! You are ruining it for everyone!!!</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379405&amp;1=default&amp;2=en&amp;3=" token="rFrIq58eEKAlOu2LXpvO0xK1oRnoJDqHhjBLKQoB-PE"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." 
href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 13 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379405">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1379406" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221374832"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Shawn Smith: "It's unlikely you can come up with a format that will avoid the issue of delimiters being confused for data, in the general case, especially if you want to internationalize your code, and still be as simple as CSV."</p> <p>"General case" is the operative phrase here. We aren't dealing with text processing code designed to be fully general, but rather something just suited to the purposes at hand. 
If you try to get too general, you risk introducing bugs that happen because you are trying to be too clever, and you are doing a lot of work for little benefit because you are designing for unrealistic scenarios that are unlikely to happen.</p> <p>Of course, if you are writing what amounts to library code, that is, code that is meant to be used by lots of people for several somewhat related but different purposes, then the effort expended in being clever or more general is less likely to be wasted.</p> <p>In short, the narrower the purpose of the code and the fewer the users, the more simplifying assumptions that you can make--and vice versa.</p> <p>That said, it is also wise to make even narrow-purpose code straightforward and readable, so that if it <i>does</i> need to be extended, it is more feasible.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379406&amp;1=default&amp;2=en&amp;3=" token="hUw2Qw1W5pNI5b1cJrAye8ZMjZZDxlY_Bg_ug153FRQ"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">J. J. Ramsey (not verified)</span> on 14 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379406">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="31" id="comment-1379407" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1221376590"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>JJ you are dead on here. 
Statmagick is for specific kinds of data. Most of the columns will be numbers. Rules can be imposed on the non-numeric columns.</p> <p>Of course one problem with internationalization and comma delimited is those crazy Europeans who use commas as decimal points.</p> <p>My own programming style, for my own purposes, is: Everything is a macro. But of course StatMagick can't have that philosophy.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1379407&amp;1=default&amp;2=en&amp;3=" token="eStEgW91Tb-O6uboxcUwT60ryjk8g9pTm9elmsw0UYA"></drupal-render-placeholder> </div> <footer> <em>By <a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a> on 14 Sep 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1379407">#permalink</a></em> <article typeof="schema:Person" about="/author/gregladen"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/author/gregladen" hreflang="en"><img src="/files/styles/thumbnail/public/pictures/HumanEvolutionIcon350-120x120.jpg?itok=Tg7drSR8" width="100" height="100" alt="Profile picture for user gregladen" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> </section> Fri, 12 Sep 2008 18:50:27 +0000 gregladen 24947 at https://scienceblogs.com Linux One Liners https://scienceblogs.com/gregladen/2008/08/01/linux-one-liners-1 <span>Linux One Liners</span> <div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>ls | wc -l</p> <!--more--><p>How many files are in the current directory on my hard drive? 
</p> <p>The command ls gives me a list of files (ls stands for "list stuff")</p> <p>The vertical line is a pipe. This means the standard output of the left side of the pipe is sent (like in a pipe) to the standard input of the right side. </p> <p>wc means "word count" ... the default output is the number of lines, number of words, and number of bytes for a file or for standard input. The -l option prints only the number of lines. That, then, is the number of files.</p> <p>Try it!</p> </div> <span><a title="View user profile." href="/author/gregladen" lang="" about="/author/gregladen" typeof="schema:Person" property="schema:name" datatype="">gregladen</a></span> <span>Fri, 08/01/2008 - 14:14</span> <div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline"> <div class="field--label">Tags</div> <div class="field--items"> <div class="field--item"><a href="/tag/one-liners" hreflang="en">one-liners</a></div> </div> </div> <section> <article data-comment-user-id="0" id="comment-1377291" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1217616655"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>And for recursive (list the number of files under the current directory and all subdirectories, etc):</p> <p>find | wc -l</p> <p>If you want to count just normal files (i.e. 
not directories and other weird things):</p> <p>find -type f | wc -l</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377291&amp;1=default&amp;2=en&amp;3=" token="n8CMCf8dIN7nOHoqy0SR8EgBlYAB_11X_33a17HhNbM"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Colin M (not verified)</span> on 01 Aug 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377291">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1377292" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1217660461"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>The command in the article will give you the number of directory entries in the current directory that do not have a leading dot in the name.</p> <p>This isn't quite the number of files in the directory, as it includes things like symbolic links, multiply-linked single files, subdirectories, and other special directory entries, and it excludes those normally hidden directory entries whose names begin with '.'.</p> <p>Also, it is actually legal to have a newline in a directory entry, and an entry like that would be counted as two separate objects. 
In order to avoid that problem, it is usually good to remember that there are two characters that cannot appear in a POSIX directory entry, ASCII NUL and '/'.</p> <p>The ASCII NUL is an unambiguous separator when referring to path names or directory entries. That's why 'find' has an option switch "-print0", which prints the entries with a trailing NUL. And 'xargs' has the "-0" switch which accepts arguments separated by NUL.</p> <p>So, if the question is "how many regular file directory entries are in this directory?", I would use a command like this:</p> <p><code>find . -maxdepth 1 -type f -print0<br /> | xargs -0 PRINT-ARGC-MINUS-1<br /> </code></p> <p>Where PRINT-ARGC-MINUS-1 would be a C program written to print the value of argc less 1.</p> <p>There's a better way, though. How many distinct directory entries for regular files are in this directory, not counting multiple hardlinks more than once?</p> <p><code>find . -maxdepth 1 -type f -printf "%i\n" | sort<br /> | uniq | wc -l<br /> </code></p> <p>That is, don't send filenames around, they're awkward to parse because of the dangers of embedded whitespace and newlines. Instead, let 'find' produce a single line guaranteed to contain only numbers for each regular file. I print the inode number, which is shared among multiple hardlinks. Sort the inode list because 'uniq' requires a sorted list to work, pipe through 'uniq' to get the list of unique inode numbers for directory entries of regular files, with or without a leading '.', in this directory, not descending subdirectories (because of the -maxdepth 1 switch). Count the number of unique inodes, and you have your answer.</p> <p>NOTE: In the above code examples, I've added a line break to make them fit. They are in fact one line each. .... 
[the management]</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377292&amp;1=default&amp;2=en&amp;3=" token="QnV9qf8Z29lQP5vE6WLyyvGXC_shXP4DxUBcvmazoog"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://distrofreelinuxuser.blogspot.com/" lang="" typeof="schema:Person" property="schema:name" datatype="">Winter Toad (not verified)</a> on 02 Aug 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377292">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1377293" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1217670603"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>...and my favorite script<br /> ....<br /> cd /<br /> find -exec DONTDOTHIS rm {} \;</p> <p>(*hint* yes, I'm evil and DO NOT do this!)</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377293&amp;1=default&amp;2=en&amp;3=" token="QlogB_efV-5o2bX2vR_QN7ur-H2J1lsnwrLc9kHfadM"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">TimJ (not verified)</span> on 02 Aug 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377293">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img 
src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1377294" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1217753540"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>/usr/bin/du -sk * | sort -n</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377294&amp;1=default&amp;2=en&amp;3=" token="oUd0eHOn29XWsmDkWECBhB6KoC_BCmhEvOptsCf7PfE"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Virgil Samms (not verified)</span> on 03 Aug 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377294">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1377295" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1217851401"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>The interesting thing about this command is that 'ls' doesn't list one file per line and yet 'wc' sees each file as a line. 
Somebody was thinking ahead.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377295&amp;1=default&amp;2=en&amp;3=" token="GbhjdbZxV11hDQUgKxVNhpG_MfnO6DaoU6MmIq3qMMY"></drupal-render-placeholder> </div> <footer> <em>By <a rel="nofollow" href="http://photography.zvan.net" lang="" typeof="schema:Person" property="schema:name" datatype="">Ben Zvan (not verified)</a> on 04 Aug 2008 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377295">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> <article data-comment-user-id="0" id="comment-1377296" class="js-comment comment-wrapper clearfix"> <mark class="hidden" data-comment-timestamp="1250643845"></mark> <div class="well"> <strong></strong> <div class="field field--name-comment-body field--type-text-long field--label-hidden field--item"><p>Ben: a program can test whether a file handle (such as stdout) is really a file, or a pipe, or a (pseudo)terminal, or ...<br /> ls does this.</p> <p>Winter Toad: as long as we are being pedantic, wouldn't it be better to use "%D,%i\n" in the printf? 
Somebody might be playing with union mounts and I don't believe all union mount implementations "remap" the device numbers, in other words, the inode numbers may clash.</p> </div> <drupal-render-placeholder callback="comment.lazy_builders:renderLinks" arguments="0=1377296&amp;1=default&amp;2=en&amp;3=" token="-X3tyrroJI0X1284aJOr57bELNfUkHx03Uhe4Nry4uw"></drupal-render-placeholder> </div> <footer> <em>By <span lang="" typeof="schema:Person" property="schema:name" datatype="">Peter Lund (not verified)</span> on 18 Aug 2009 <a href="https://scienceblogs.com/taxonomy/term/4633/feed#comment-1377296">#permalink</a></em> <article typeof="schema:Person" about="/user/0"> <div class="field field--name-user-picture field--type-image field--label-hidden field--item"> <a href="/user/0" hreflang="und"><img src="/files/styles/thumbnail/public/default_images/icon-user.png?itok=yQw_eG_q" width="100" height="100" alt="User Image" typeof="foaf:Image" class="img-responsive" /> </a> </div> </article> </footer> </article> </section> Fri, 01 Aug 2008 18:14:00 +0000 gregladen 24731 at https://scienceblogs.com
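<p>The counting commands traded in the comments above can be pulled together in a short sketch. It assumes GNU find (for -maxdepth, -print0 and -printf), and the function names count_entries and count_inodes are made up for illustration. In place of the hypothetical PRINT-ARGC-MINUS-1 program, the first function simply counts the NUL terminators that find emits, which should also hold up if xargs were to split a very long argument list across several invocations. The second uses the device-plus-inode key suggested in the last comment.</p>

```shell
#!/bin/sh
# Sketch only: assumes GNU find; the function names are hypothetical.

# Count regular-file directory entries (dotfiles included) directly in $1.
# find ends every name with a NUL byte, so keeping only the NULs and
# counting bytes counts names, even when a name contains a newline.
count_entries() {
    find "${1:-.}" -maxdepth 1 -type f -print0 | tr -dc '\000' | wc -c
}

# Count distinct regular files directly in $1, so that several hard links
# to the same file count once. Each output line is "device,inode", which
# contains only digits and a comma and is therefore safe to sort.
count_inodes() {
    find "${1:-.}" -maxdepth 1 -type f -printf '%D,%i\n' | sort -u | wc -l
}
```

<p>Calling count_entries with no argument counts in the current directory. Unlike ls | wc -l, both functions skip subdirectories and include names that begin with a dot.</p>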