Finding doubled words using perl

I recently switched to Scrivener for writing my documents. Much more enjoyable interface than Word, with lots of nifty features for writers. One big issue: I’m still getting used to Scrivener’s spellchecker. Microsoft Word finds doubled words right out of the box, but Scrivener does not.

The script below is written in Perl, which comes pre-installed on Macs. Paste it into a text file, make the file executable, and run it from a directory that contains a file called “infile.txt” (a cut/paste from Word into that file will do nicely). It will report your doubled words, each followed by the line it was found on.

*update* – the script won’t catch things like: “bang bang” because the quotes stay attached to the words, so it compares “bang with bang” and sees 2 different patterns. Working on it 🙂

Example input (infile.txt):
This is a line
This is another line
And yet another line
wow I sure do a lot of lines, "Don't I?" he said (in a funny voice)...
Wow it sure is is fune typing all this
I like dogs and cats and stuff.
Big big is funner than small people.
how are are the dodgers doing this year? Nobody knows.
more lines and stuff...
etc. etc.
good things come to those who write scripts in perl and post them on the internet

Example Output:

is is ---->  Wow it sure is is fune typing all this
big big ---->  Big big is funner than small people.
are are ---->  how are are the dodgers doing this year? Nobody knows.
etc. etc. ---->  etc. etc.

And now the script: rep.pl

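#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', 'infile.txt') or die "Can't open infile.txt: $!";
my $section_break = '*';  # I have * * * as section breaks. The script sees them as words and should ignore them.
while (my $a_line = <$fh>) {
   chomp($a_line);
   my $prev = '';   # empty string never equals a real word (0 would match a line starting with "0")
   # split ' ' splits on any run of whitespace, so extra spaces don't create empty "words"
   foreach my $word (split(' ', $a_line)) {
      my $i = lc($word);   # compare case-insensitively, but keep $a_line as typed
      if ($i eq $prev && $i ne $section_break) {
         print "$prev $i ---->  $a_line\n";
      }
      $prev = $i;
   }
}
close($fh);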
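
One possible way to handle the “bang bang” case from the update above is to strip punctuation from both ends of each word before comparing. The sketch below is only an illustration of that idea, not a finished fix: find_doubles is a hypothetical helper name, and the three sample lines are made up for the demo (swap them for the lines of infile.txt in real use). One known trade-off: it will also flag a word repeated across a sentence boundary, such as "that. That".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# doubled words found on one line, ignoring surrounding punctuation.
sub find_doubles {
    my ($line) = @_;
    my @found;
    my $prev = '';
    foreach my $word (split ' ', $line) {
        my $norm = lc $word;
        $norm =~ s/^\W+//;    # drop leading quotes, brackets, etc.
        $norm =~ s/\W+$//;    # drop trailing quotes, periods, commas, etc.
        if ($norm eq '') {    # pure punctuation such as "*" is not a word
            $prev = '';
            next;
        }
        push @found, $norm if $norm eq $prev;
        $prev = $norm;
    }
    return @found;
}

# Made-up sample lines for the demo.
my @sample = (
    'He yelled "bang bang" and ran.',
    '* * *',
    'Wow it sure is is fun',
);

foreach my $line (@sample) {
    print "$_ $_ ---->  $line\n" for find_doubles($line);
}
```

With those sample lines it reports the quoted "bang bang" and the plain "is is", and it stays quiet about the * * * section break, because a token made only of punctuation is skipped rather than compared.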

10 responses to “Finding doubled words using perl”

Mike

July 3, 2015 at 12:35 pm

Bah!

perl -n -e 'while (m{\b(\S+)\b(\s+\1\b)+}migs) { print "dup on line $.\n"; }' infile.txt

- John L. Monk
  
  July 3, 2015 at 12:36 pm
  
  hah, I knew you’d do something like that.
  Nice work 🙂
  
  - John L. Monk
    
    July 3, 2015 at 12:36 pm
    
    Yours doesn’t account for section breaks! ( * * * )
- Mike
  
  July 3, 2015 at 12:44 pm
  
  There are no section breaks in my test file (or yours!)
  
  - Mike
    
    July 3, 2015 at 12:45 pm
    
    I guess if you’re looking to ignore consecutive asterisks you can simply do: perl -n -e 'while (m{\b(\S+)\b(\s+\1\b)+}migs) { print "dup on line $.\n" unless $1 eq "*"; }' infile.txt
kristinemckinley

July 3, 2015 at 2:34 pm

I’ve tried using Scrivener, I like it, but I always go back to Word

- John L. Monk
  
  July 3, 2015 at 3:11 pm
  
  I’m seriously considering going back to Word for 2 reasons:
  1) a more intuitive search/replace system.
  2) a more reliable spellchecker. In Scrivener, if you type “don’k” (for example) it doesn’t flag it as an error.
  
  - casslogan
    
    July 3, 2015 at 3:21 pm
    
    It may not flag it, but if you run a check does it show up? I vaguely remember reading that it doesn’t underline misspellings so that you don’t get caught up in correcting and instead focus on writing.
  - John L. Monk
    
    July 3, 2015 at 3:22 pm
    
    Casslogan: very interesting! I’ll have to try. Thanks for the tip. That’d help a lot 🙂
  - John L. Monk
    
    July 3, 2015 at 3:38 pm
    
    Casslogan: you’re right. If you manually run the check, it DOES find the doubles. I’ve amended my post so I’m not bashing Scrivener 🙂