Parsing All That Output – GREP to the Rescue!!!

I will be discussing a useful approach to pulling the desired output out of Gaussian calculations using “grep” on Linux. It is a simple technique inspired by general Linux knowledge and the Gaussian manuals/books (see “Note 2” at the bottom of the post for details). For any newbie Linux users: my quoted-out comments stand in for the Terminal command line and its output. If you see a “$” symbol, that marks the beginning of a command line and is NOT to be typed in. 🙂

Gaussian, first released in 1970 by John Pople’s research group at Carnegie Mellon University and now developed and licensed by Gaussian, Inc., is a computational chemistry software package for simulating molecular systems and calculating their energies, reaction pathways, thermal properties, molecular orbitals, etc… in different solvent and temperature environments. GaussView, its GUI (usually pronounced “gooey”), allows these molecules (up to the size of large proteins, such as Green Fluorescent Protein, GFP) to be modified and examined at will.

However, if a person is interested in writing up their own input files (or “instructions”) for the Gaussian software to follow, that person will likely avoid using the GUI, either (a) because they prefer to work in a Linux Terminal type of environment, or (b) in order to save processor/memory for more difficult calculations (e.g. very large systems, or high-accuracy work). For my own work, the GUI is convenient when I need to check chemical structures that were either (a) drawn in Chem3D (or other programs), or (b) downloaded from PubChem (or other databases). Otherwise, it is faster for me to type up a bunch of input files by copy/pasting replicate instructions into files with a “.gjf” or “.com” extension (plain text files; the extension just marks them as Gaussian input) and submit them all at once to my University’s computer cluster.

Even more of a reason to write one’s own instructions manually is the ability to submit MANY jobs simultaneously to a computer cluster via the cluster’s “workload manager” (ours is SLURM); bash scripts for submitting batch jobs are your friend when it comes to doing many molecules all at once!
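To give a flavor of what such a script can look like, here is a minimal sketch and not my cluster’s actual setup. It assumes that “sbatch” and “g09” are on your path, the file names (mol_a.gjf, mol_b.gjf) are made up, and it only prints the commands unless you flip DRY_RUN to 0:

```shell
# Stand-in input files so the loop has something to find (hypothetical names).
work=$(mktemp -d)
touch "$work/mol_a.gjf" "$work/mol_b.gjf"

# Build one sbatch command per input file. DRY_RUN=1 only prints the
# commands; set DRY_RUN=0 on a cluster where sbatch and g09 are available.
DRY_RUN=1
cmds=$(
    cd "$work" || exit 1
    for inp in *.gjf; do
        name=${inp%.gjf}                  # strip the extension
        job="g09 < $inp > $name.out"      # what each job will run
        if [ "$DRY_RUN" -eq 1 ]; then
            echo "sbatch --job-name=$name --wrap=\"$job\""
        else
            sbatch --job-name="$name" --wrap="$job"
        fi
    done
)
echo "$cmds"
rm -rf "$work"
```

On a real cluster you will most likely also need to request memory/cores/time and load a Gaussian module first; those details vary from site to site, so check your cluster’s documentation.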

So… sounds like a good deal, right? Just do away with the GUI and get a whole bunch of work done in minutes? (system size and calculation type dependent)

Yes, except… you just obtain a rather lengthy output file written in text format that you then have to scour through, looking for one or two numbers of importance. ACK! In the GUI, this is done visually in a nice table under “Summary,” but if you want the benefits of running Gaussian through a Terminal, you probably don’t want to open up the GUI every time just to parse/read your results…

Never fear! Linux is built to help in situations such as these!

Let’s run through some examples for reference, and then feel free to play around. I won’t be discussing how to set up input/instruction files here, but great resources include the Gaussian reference manual and textbook (linked in the Notes at the bottom of post), as well as here on Gaussian’s website (which itself is full of useful materials in troubleshooting and understanding the types of calculations that it performs).

For example input files, googling “Gaussian input files” brings up a few University websites, such as this page from LMU in Germany illustrating a formaldehyde input (the same molecule my examples will be working with below, except they use a Z-matrix* format while I am personally using XYZ coordinates).

Example 1: Getting the SCF energy

$ g09 < formaldehyde_RHF.gjf >& formaldehyde_RHF.out

Running g09 on “formaldehyde_RHF.gjf” simulates formaldehyde at the restricted Hartree-Fock level (due to the instructions in the .gjf file) and sends the output to “formaldehyde_RHF.out” (which can be named anything you would like). Once the run finishes (on a cluster, you can check your job queue; on a personal machine, you will gain control of the Terminal again), we will have a fairly long output file to work with (~49 KB, ~800 lines of text).

“grep” searches for keywords (or regular expressions, i.e. “regex” language**) within a chosen file (or system-wide, but I won’t discuss that here). Conveniently, Gaussian writes its results in a consistent format for all of its calculations, so one can use “grep” to find a particular calculated value by searching for its label; e.g. “SCF Done” is the keyword for finding the self-consistent field energy, i.e. the final ground state energy of the system.

$ grep "SCF Done" formaldehyde_RHF.out

 SCF Done: E(RHF) = -113.863703683 A.U. after 11 cycles

Tada! Got it!

A.U. stands for atomic units, which for energies means hartrees (Gaussian reports energies in hartrees rather than rydbergs***). The number of cycles is how many SCF iterations were needed before the energy stopped changing by more than a set tolerance (Example 3 shows similar tolerances for geometry optimizations).
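If you only want the number itself (say, to paste into a spreadsheet), the matching line can be piped on to other tools. This is a minimal sketch, assuming the energy sits in the fifth whitespace-separated field as in the line above; the temporary file is only a stand-in so that the snippet runs on its own:

```shell
# Stand-in for a real output file, so the snippet is self-contained.
# With a real job, skip this step and point "out" at your own file.
out=$(mktemp)
cat > "$out" <<'EOF'
 SCF Done:  E(RHF) =  -113.863703683     A.U. after   11 cycles
EOF

# Keep only the last "SCF Done" line (an optimization prints one per step),
# then print the fifth field, which is the energy in hartree.
energy=$(grep "SCF Done" "$out" | tail -n 1 | awk '{print $5}')
echo "$energy"

rm -f "$out"
```

The “tail -n 1” does nothing for a single-point run like this one, but it keeps the command usable on optimizations, where only the final energy is usually of interest.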

In order to be even partially certain that the answer obtained is trustworthy, though, one must check (a) whether any errors arose in the calculation (these are usually printed near the end of the output, for quick reference), and (b) whether the forces and displacements of your molecule dropped below a set threshold from one optimization “cycle” to the next. For a geometry optimization, meeting those thresholds means that an energy minimum was located successfully. Failing one or both of these checks could mean that the structure never settled into the minimum you were after (or ended up in a secondary, local minimum instead), which can drastically influence final results!

Let’s check for errors first…

Example 2: Checking for E/errors

$ grep -i "error" formaldehyde_RHF.out

Using the “-i” option with “grep” tells it to ignore the case of the search term, so it will find “Error,” “error,” “eRRor,” etc… Using this option will usually pick up any warnings that mention error (lowercased) and any final convergence/calculation errors themselves (usually “Error”).

Since this calculation was set up correctly and is fairly simple to converge to a minimum, no results are found and nothing is printed.

That’s… a little boring, eh?

Well, here is an example print out from error searching output from a more complex structure using Cadmium and Sulfur/Selenium:

$ grep -i "error" batchCd2Se2_*.log

[I’ve separated the output from a quote because of formatting issues…]
batchCd2Se2_a.log: may cause significant error
batchCd2Se2_a.log: vibrations may cause significant error
batchCd2Se2_a.log: AS THAT OF MAN TO ERROR.
batchCd2Se2_b.log: may cause significant error
batchCd2Se2_b.log: vibrations may cause significant error
batchCd2Se2_c.log: may cause significant error
batchCd2Se2_c.log: vibrations may cause significant error
batchCd2Se2_d.log: vibrations may cause significant error
batchCd2Se2_e.log: may cause significant error
batchCd2Se2_e.log: vibrations may cause significant error
batchCd2Se2_f.log: Error termination request processed by link 9999.
batchCd2Se2_f.log: Error termination via Lnk1e in /gpfs/runtime/opt/gaussian/g09/l9999.exe at Mon Mar 26 16:58:55 2018.

WOAH! Those are some odd looking results. And what’s with that “AS THAT OF MAN TO ERROR” result??

Short and sweet: I searched across all the different structures that I have for these CdS/Se dimers to give a more elaborate example of what an error search can turn up. It illustrates the two types of hits I mentioned above: warnings (e.g. “vibrations may cause significant error,” which comes from what is known as a frequency calculation for thermodynamics) and true errors (here “Error termination request processed by link 9999” for a convergence failure; other link numbers appear if the fatal error is due to something else).

As for that capitalized fragment of a statement, Gaussian ends a successful (or mostly successful) calculation with a random quote from a library it has access to internally. Sometimes, searches WILL pick up pieces of this quote, and since it is random there is no clean way to avoid it (the quotes are not necessarily all uppercase either…). Usually this is not a problem, and it is obvious here that the hit comes from that part, so just be aware.
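If the stray quote and the warnings get in the way, one option is to search for the more specific phrase “Error termination” and ask grep for file names only. A rough sketch, using two throwaway files (good.log and bad.log are hypothetical names) filled with lines copied from the output above:

```shell
# Throwaway directory with two stand-in logs (hypothetical file names).
dir=$(mktemp -d)
printf '%s\n' ' vibrations may cause significant error' \
              ' AS THAT OF MAN TO ERROR.' > "$dir/good.log"
printf '%s\n' ' may cause significant error' \
              ' Error termination request processed by link 9999.' > "$dir/bad.log"

# -l prints only the names of files that contain a match, so warnings and
# quote fragments are skipped and only the failed job is listed.
failed=$(cd "$dir" && grep -l "Error termination" *.log)
echo "$failed"

rm -rf "$dir"
```

Note the missing “-i” here: the search is deliberately case-sensitive so that it only matches the phrase as Gaussian prints it in the lines shown above.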

Now, how about those convergence problems?

Example 3: Checking for convergence problems

$ grep -A 4 -i "item" formaldehyde_RHF.out

IF the instruction “opt” for optimization is not given in the input file, this grep search will result in NO output, as no geometry steps (and therefore no force/displacement checks) will have been taken. IF “opt” is included, then results that look something like this should appear at the beginning stages:

Item                     Value     Threshold   Converged?
Maximum Force            0.059796  0.000450    NO
RMS Force                0.022952  0.000300    NO
Maximum Displacement     0.039392  0.001800    NO
RMS Displacement         0.021436  0.001200    NO

What is visible here are 4 columns labeled “Item,” “Value,” “Threshold,” and “Converged?” Briefly, the maximum force/displacement and the root-mean-square of the force/displacement are the “Item[s],” the “Value” is simply the calculated number for each “Item,” the “Threshold” is the cutoff that each “Value” should fall beneath to be considered converged (i.e., very little change in each item’s value is occurring, which indicates a minimum), and of course the final column is the simple answer to that question of converged or not.

When the calculation has fully completed, grep-ing these results again will yield a whole list of tables, all with NO in the “Converged?” columns until nearing the end when it (hopefully!) becomes all YES (see below).

Item                     Value     Threshold   Converged?
Maximum Force            0.000344  0.000450    YES
RMS Force                0.000183  0.000300    YES
Maximum Displacement     0.001163  0.001800    YES
RMS Displacement         0.001041  0.001200    YES

To avoid getting a giant list of all these tables, here is one last tip, which also applies to any other value for which you only need the final occurrence: pipe the grep results into a command called “tail,” which will print to the Terminal only the last X number of lines requested:

$ grep -A 4 -i "item" formaldehyde_RHF.out | tail -5

Tip: the 5 given to “tail” keeps the last five lines of what grep found, i.e. the four rows of the final table plus its header line; a smaller number would cut the header off.
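For a quick yes/no answer rather than a table to read, the same pipe can be extended to count the YES entries. This is only a sketch, assuming the final table has the four rows shown above; the temporary file simply holds the two tables from this post so that the snippet runs on its own:

```shell
# Stand-in output file holding the first and last tables from this post.
out=$(mktemp)
cat > "$out" <<'EOF'
Item                     Value     Threshold   Converged?
Maximum Force            0.059796  0.000450    NO
RMS Force                0.022952  0.000300    NO
Maximum Displacement     0.039392  0.001800    NO
RMS Displacement         0.021436  0.001200    NO
(other output lines)
Item                     Value     Threshold   Converged?
Maximum Force            0.000344  0.000450    YES
RMS Force                0.000183  0.000300    YES
Maximum Displacement     0.001163  0.001800    YES
RMS Displacement         0.001041  0.001200    YES
EOF

# Last table only: tail -n 4 drops the header and keeps the four rows,
# then grep -c counts how many of those rows are marked YES.
n_yes=$(grep -A 4 -i "item" "$out" | tail -n 4 | grep -c "YES")
if [ "$n_yes" -eq 4 ]; then
    echo "converged"
else
    echo "NOT converged ($n_yes of 4 criteria met)"
fi

rm -f "$out"
```

Anything less than four YES rows means the optimization stopped before all criteria were met, which is worth a closer look at the full output.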

That’s it! Feel free to play with “grep” and see what else it can do, and peruse the output files for search words/terms to use to find your own desired information (one example would be $ grep -A 5 -i "distance matrix" followed by your output file’s name; see what it prints after an optimization!).

Although this post was longer than I’d planned for a simple technique overview on a Linux command, I feel it was more than worth it to illuminate some cloudier ideas of working with Gaussian software from a Terminal (perhaps to well experienced Computer scientists this is simple, but for all those other folk out there trying to get a handle on using computers for their work, I hope this was useful!).

A final note about using “grep”: there are many fancy options and other techniques that one can use to wield the power of “grep” more fully. Use “$ man grep” to see these options in the Terminal, or take a look at various online tutorials/forums (such as this one) for further exciting details and uses. Other useful options include adding the filename to each line of output (-H; grep already does this on its own when searching several files, so the option matters for single files) and printing the line number at which each result sits in the output file (-n).
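As a small illustration of those last two options, here is a sketch on a two-line stand-in file (the first line is a placeholder, the second is the energy line from Example 1):

```shell
# Stand-in output file in a throwaway directory.
out_dir=$(mktemp -d)
printf '%s\n' ' (some earlier output line)' \
              ' SCF Done:  E(RHF) =  -113.863703683     A.U. after   11 cycles' \
              > "$out_dir/formaldehyde_RHF.out"

# -H prefixes each hit with the file name, -n with its line number,
# so every result reads as  file:line:matching text
hit=$(cd "$out_dir" && grep -H -n "SCF Done" formaldehyde_RHF.out)
echo "$hit"

rm -rf "$out_dir"
```

The line number is handy for jumping straight to that spot in a text editor when you want to read the surrounding output.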

There’s always more to learn! Check back next time for something new!

~Len

~~~~~~~~~~~~~~~Notes/Asterisks/Credits~~~~~~~~~~~~~~~~

Note 1: “grep” exists on many/all of the Linux distros, so I will simply refer to my distro/environment as Linux, instead of by its distro name. For those that are curious, we are currently running CentOS on our clusters, and are transitioning to RedHat 7 soon.

Note 2: more detailed instructions on using Gaussian either via Terminal or the GaussView GUI can be found in this great text resource (“Exploring Chemistry with Electronic Structure Methods” by J. B. Foresman and Æ. Frisch), or from their website under “Support” (white papers, manuals, etc…).

*Z-matrices are a notation used for denoting geometries of molecules based on their connectivity, bond lengths, etc… instead of just spatial coordinates. Some optimization methods work more efficiently with Z-matrices versus XYZ, and it may be best to use Z-matrices when working with high-symmetry molecules (or whenever trying to avoid symmetry breaking). For more details and to also learn a bit about redundant internal coordinates, take a peek at this chemistry StackExchange post and selected answer by LordStryker.

**Regex language can get a bit technical, but for some semblance of reference, check out the wikipedia page. 

***I’m beginning to have too many asterisks… anyway, for a quick look at the types of atomic units, I’d recommend this wikipedia page (for the easiest quick reference). Alternatively, feel free to pick up an introductory quantum mechanics textbook, like Griffiths’ “Introduction to Quantum Mechanics” or Landau and Lifshitz’s “Quantum Mechanics.”

 

Image Credit (w/ personal text modification):
Business image created by Kjpargeter – Freepik.com