


In order to have a wide corpus of classical texts from which to draw word-usage examples, I downloaded a massive ebook collection from the Gutenberg project and merged all of the text files into one big file that reached 12.4 gigabytes in size. I then wrote a PHP script that used the grep utility to search through about 250 billion lines of text [1] to find the word usages I needed.

In order to find interesting examples, I use the following regular expression:

egrep -aRh -B 20 -A 20 "\b(she|her|hers|his|he)\b.*taciturn" merged.txt

This finds usages of the word that are preceded by a pronoun such as "she". It also helps find usages that occur mostly in novels, rather than in other types of books (the Gutenberg collection contains many non-novel files, such as encyclopedias and legal works). Here is an example of the results for the word "taciturn":

My first idea for speeding up the grep was to use GNU parallel or xargs, both of which allow grep to make use of multiple CPU cores. This was misguided, since the limiting factor in this grep was not CPU usage but disk usage.

My first step toward actually speeding up the grep was to move the file to an old SSD attached to my desktop, which supports read speeds of up to 200 MB/second. This was not good enough, so I eventually moved the file to my main Samsung SSD, which has read speeds of over 500 MB/second. Below is a screenshot of the iotop utility reporting a read speed of 447 M/s while grep is running:
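For reference, the xargs variant of the parallelization idea looks roughly like the following. This is a minimal sketch, not the script actually used: the stand-in merged.txt, the chunk filenames, and the chunk size are all placeholders (a real 12.4 GB file would be split into large pieces, e.g. with split -b 1G).

```shell
# Toy stand-in for the 12.4 GB merged.txt (contents are made up)
printf 'she was taciturn and grave\nthe weather was fine\nhe seemed taciturn\n' > merged.txt

# Split the corpus into chunks so several greps can run at once
# (here one line per chunk; a real run would use much larger pieces)
split -l 1 merged.txt chunk_

# Run up to 4 greps in parallel, one chunk per process (-n 1, -P 4).
# -a treats the data as text; -H prints the chunk filename.
# grep exits non-zero for chunks with no match, so ignore xargs' status.
ls chunk_* | xargs -n 1 -P 4 grep -a -H -E '\b(she|her|hers|his|he)\b.*taciturn' || true
```

Even with four greps running, all of the processes contend for the same disk, which is why this approach does not help when the grep is I/O-bound.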

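For completeness, the corpus-merging step described at the top can be sketched like this. The directory layout and filenames are assumptions for illustration; the real collection contains thousands of files.

```shell
# Hypothetical layout: one .txt file per Gutenberg ebook
mkdir -p gutenberg
printf 'Call me Ishmael.\n' > gutenberg/moby-dick.txt
printf 'It was the best of times.\n' > gutenberg/two-cities.txt

# Concatenate every text file into one big searchable file.
# -print0 / -0 keeps filenames containing spaces intact.
find gutenberg -name '*.txt' -print0 | xargs -0 cat > merged.txt
```

A single merged file lets grep read one long sequential stream instead of opening thousands of small files one at a time.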