Computational Materials Science Part 3 — Principal Component Analysis

This is a continuation of Computational Materials Science Part 2 — Multipoint Statistics, so I would highly recommend reading that first. That part covered computing and visualizing multipoint statistics, which tell us how parts of a structure are related.

Now comes the really important part: how we can actually use all of this to determine facts about the material.

Principal Component Analysis

A major challenge in computational materials science is the sheer amount of data, most of which is essentially useless in a given scenario. Think of how many different parameters you could use to detail a material’s descriptors. Just as the classic peanut butter and jelly sandwich demonstration shows how painstaking and exact one must be when coding, each material is essentially a different sandwich with a very long procedure for how to make it.

So here’s where we distill all of the facts about a material into something useful. We are finding the … principal components (i.e. the important ones).

Let’s say we have data points xʲᵣ with r = 1, 2, … R data dimensions and j = 1, 2, … J number of data points. In a 2D plane, R would be 2, and J would be the sample size.

The data can be represented as:

If J << R, that is, if the sample size is much smaller than the number of descriptors, “normal” analysis would make no sense, so the question becomes: how can we analyze the data in a lower dimension to get the main idea?

Here’s a simple example to demonstrate. Currently the dataset below is represented in 2 dimensions, x and y.

We can use a rotational transformation to distill it down into a single dimension as follows:

This process identifies the most objective rotational transformation that reorganizes the data into prioritized directions of maximal variance. In the second diagram, there is more variance in p₁, and almost none in p₂, thus p₁ is the variable of interest. The benefit of this process is that it can be unsupervised.
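To make the rotation concrete, here is a minimal sketch in Python with NumPy. The dataset is synthetic and every number in it (sample size, spreads, the 30 degree tilt) is an assumption for illustration, not the data from the diagrams. The rotation is taken from the eigenvectors of the covariance matrix, which is one common way to carry out this step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data (an assumption for illustration): points scattered
# along a line tilted at 30 degrees, with small scatter across it.
J = 500
t = rng.normal(0.0, 3.0, size=J)        # spread along the main direction
n = rng.normal(0.0, 0.3, size=J)        # spread perpendicular to it
theta = np.deg2rad(30.0)
x = t * np.cos(theta) - n * np.sin(theta)
y = t * np.sin(theta) + n * np.cos(theta)
data = np.column_stack([x, y])          # shape (J, 2)

# Center the data, then get the rotation from the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)    # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder so p1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate the points into the (p1, p2) frame.
rotated = centered @ eigvecs
var_p1, var_p2 = rotated.var(axis=0, ddof=1)
print(f"variance along p1: {var_p1:.3f}, along p2: {var_p2:.3f}")
```

Almost all of the variance ends up along p₁, so keeping only the first column of `rotated` describes the data with one number per point instead of two.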

Mathematically,

where x-hat is the representation in the new reference frame, Q is the orthogonal transformation matrix, and ⟨xᵣ⟩ is the average of all data points.
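Written out with the symbols defined here, one consistent form of this change of reference frame is sketched below. The index k on the new axes and the explicit sums are notational assumptions on my part.

```latex
\hat{x}^{j}_{k} = \sum_{r=1}^{R} Q_{kr}\,\bigl(x^{j}_{r} - \langle x_{r} \rangle\bigr),
\qquad
\langle x_{r} \rangle = \frac{1}{J}\sum_{j=1}^{J} x^{j}_{r},
\qquad
\sum_{r=1}^{R} Q_{kr}\,Q_{lr} = \delta_{kl}
```

The last relation is just the statement that Q is orthogonal: its rows are unit vectors that are perpendicular to each other, so the transformation rotates the data without stretching it.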

Therefore the goal is to select ^xʲ₁ such that it has the highest variance, ^xʲ₂ so that it has the next highest, etc.

To solve this we can make use of some linear algebra. (Note, this isn’t super important, so feel free to skip this next bit). We can redefine x-hat as a vector, a square matrix, and another vector:

And then maximizing the variance is:

The only condition is that {q} has unit length, i.e. {q}ᵗ{q} = 1.

The solution turns out to be that {q} is the eigenvector corresponding to the largest eigenvalue of the symmetric matrix [X]ᵗ[X].
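For anyone curious where the eigenvector comes from, here is a short sketch of the standard argument using a Lagrange multiplier λ. It assumes [X] holds the mean-centered data with one row per data point, and that {q} is constrained to unit length.

```latex
\begin{aligned}
&\max_{\{q\}} \; \{q\}^{T}[X]^{T}[X]\{q\}
  \quad \text{subject to} \quad \{q\}^{T}\{q\} = 1 \\[4pt]
&\mathcal{L}(\{q\}, \lambda) = \{q\}^{T}[X]^{T}[X]\{q\} - \lambda\,\bigl(\{q\}^{T}\{q\} - 1\bigr) \\[4pt]
&\frac{\partial \mathcal{L}}{\partial \{q\}} = 2\,[X]^{T}[X]\{q\} - 2\lambda\,\{q\} = 0
  \;\;\Longrightarrow\;\; [X]^{T}[X]\{q\} = \lambda\,\{q\} \\[4pt]
&\{q\}^{T}[X]^{T}[X]\{q\} = \lambda\,\{q\}^{T}\{q\} = \lambda
\end{aligned}
```

The last line says the quantity being maximized equals the eigenvalue itself, which is why the largest eigenvalue picks out the direction of greatest variance.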

After finding the first principal component, just repeat the process, except redefine the data matrix to remove every component of the data along that first principal component. For example, in xyz space, if the first principal component is x, then to find the next PC, just remove the x component, squashing the data into the yz plane, and repeat the process. Symbolically, this would be:
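Here is a minimal sketch of that "remove and repeat" step in Python with NumPy. The data is synthetic, and the helper name `leading_direction` is hypothetical. Removing the first PC is done by subtracting each point's projection onto it, which is one common way to write this step and not necessarily the exact form used in the original equation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3D data with different spread along each axis, then mixed
# by a random rotation so the principal directions are not x, y, z.
J = 400
raw = rng.normal(size=(J, 3)) * np.array([5.0, 2.0, 0.5])
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = raw @ rotation.T
X = X - X.mean(axis=0)                          # center each column


def leading_direction(M):
    """Unit eigenvector of M^T M with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)  # ascending eigenvalues
    return eigvecs[:, -1]


q1 = leading_direction(X)

# Deflate: subtract each point's projection onto q1, which squashes the
# data into the plane perpendicular to q1.
X_deflated = X - np.outer(X @ q1, q1)
q2 = leading_direction(X_deflated)

print("q1 . q2 =", float(q1 @ q2))
```

The second direction comes out perpendicular to the first (up to floating point error), and it matches the eigenvector of [X]ᵗ[X] with the second largest eigenvalue. In practice this means you can take all of the eigenvectors at once instead of deflating one at a time.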

In summary:

{q} = principal components (PC)
^xʲₖ = PC scores/weights
eigenvalues are proportional to the variance along the PC
sum of the eigenvalues is proportional to the total variance in the ensemble
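The two statements about eigenvalues can be checked numerically. Below is a small sketch on made-up data (the sizes and spreads are assumptions), where the proportionality constant works out to J − 1 when the sample variance is used.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: J = 200 points, R = 5 descriptors with unequal spread.
J, R = 200, 5
X = rng.normal(size=(J, R)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                     # center each descriptor

eigvals, Q = np.linalg.eigh(X.T @ X)       # ascending eigenvalues
eigvals, Q = eigvals[::-1], Q[:, ::-1]     # largest first

scores = X @ Q                             # PC scores, shape (J, R)
score_var = scores.var(axis=0, ddof=1)     # sample variance along each PC
total_var = X.var(axis=0, ddof=1).sum()    # total variance in the data

# Each eigenvalue of X^T X equals (J - 1) times the variance along its PC.
print(np.allclose(eigvals, (J - 1) * score_var))
# The eigenvalues sum to (J - 1) times the total variance.
print(np.isclose(eigvals.sum(), (J - 1) * total_var))
```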

When do we truncate the PCs? How do we know if we need 2 variables or 20? Conveniently, the eigenvalues are proportional to the variance, so we truncate based on the decay of the eigenvalues: once the remaining eigenvalues are small, the remaining PCs add little. This makes it a data-driven process, and the number of PCs does not need to be set ahead of time.
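One simple way to turn "the decay of the eigenvalues" into a rule is to keep the smallest number of PCs that captures a chosen fraction of the total variance. The sketch below does that. The function name `n_components_for`, the 0.95 default, and the example spectrum are all assumptions for illustration, and other cutoff rules (such as looking for an elbow in the plot) are just as valid.

```python
import numpy as np


def n_components_for(eigvals, threshold=0.95):
    """Smallest number of PCs whose eigenvalues capture `threshold` of the total.

    The 0.95 default is an arbitrary choice for illustration.
    """
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # First index where the cumulative fraction reaches the threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Made-up eigenvalue spectrum that decays quickly.
example = [50.0, 30.0, 16.0, 2.0, 1.0, 0.5, 0.5]
print(n_components_for(example))
print(n_components_for(example, threshold=0.97))
```

With this spectrum the first three eigenvalues already hold 96% of the total, so the default threshold keeps 3 PCs, and raising the threshold to 0.97 keeps 4.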

Using PCA on ensembles to classify them

Say we have the data matrix:

Then

where fᵣ is the ensemble mean, α are the PC scores, and ψ are the components of the orthogonal directions from PCA.

Reducing this as before gives:
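Using the symbols from this section, one consistent way to write the full and the reduced representation is sketched below. The overbar on the ensemble mean, the index k, and the cutoff K are notational assumptions on my part.

```latex
f^{j}_{r} = \bar{f}_{r} + \sum_{k=1}^{\min(J-1,\,R)} \alpha^{j}_{k}\,\psi_{kr}
\qquad\longrightarrow\qquad
f^{j}_{r} \approx \bar{f}_{r} + \sum_{k=1}^{K} \alpha^{j}_{k}\,\psi_{kr},
\quad K \ll R
```

Each member j of the ensemble is then described by just its K scores α, since the mean and the directions ψ are shared by the whole ensemble.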

An Example

Imagine we have the following dataset of images. The categories are all similar, but not exactly the same. So how can we clearly see how the groups are similar or different, and to what degree?

Say each micrograph above has 3 × 10⁶ pixels (i.e. R = 3 million) and J is ~150. Using the techniques from above, we are able to distill the information from 3 million variables to 3.
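When J is much smaller than R, forming the R × R matrix [X]ᵗ[X] directly is impractical (3 million × 3 million entries here). A common workaround is a thin singular value decomposition of the centered data matrix, which gives the same directions and scores. The sketch below uses random stand-in data with R shrunk to 3000 so it runs quickly. None of it is the actual micrograph data, and the three-group structure is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the micrograph ensemble (all values are synthetic):
# J = 150 "images" flattened to R = 3000 features instead of 3 million.
# Three groups of 50, each scattered around its own mean image.
J, R, n_groups = 150, 3000, 3
labels = np.repeat(np.arange(n_groups), J // n_groups)
group_means = rng.normal(0.0, 1.0, size=(n_groups, R))
F = group_means[labels] + rng.normal(0.0, 0.5, size=(J, R))

# Center on the ensemble mean, then take a thin SVD. This avoids ever
# forming the R x R matrix F^T F.
F_centered = F - F.mean(axis=0)
U, S, Vt = np.linalg.svd(F_centered, full_matrices=False)

K = 3
scores = U[:, :K] * S[:K]        # PC scores, shape (J, K)
directions = Vt[:K]              # PC directions, shape (K, R)

print(scores.shape, directions.shape)
```

The rows of `Vt` are the eigenvectors of the centered [F]ᵗ[F], and the squared singular values `S**2` are its eigenvalues, so this is the same PCA as before, computed in a cheaper way. Each image is now summarized by the 3 numbers in its row of `scores`.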

As you can see above, the lower-dimensional representation creates patterns and naturally separates the image categories. (This also means that the datasets do not need to be prelabeled.) We are also able to see the relative amount of variance: the green group is much more spread apart than the blue group. This could indicate that the treatment used for the green group is much less predictable than that used for the blue group.
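The "spread" of a group can also be put into a number once the data is in PC space. Here is a small sketch that sums the variance of each group's scores about its own centroid. The scores are randomly generated for illustration (the green group is deliberately made more scattered), and the helper name `spread` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical PC scores for two groups in a 3-PC space. "green" is
# deliberately generated with more scatter than "blue".
scores = {
    "green": rng.normal(loc=[2.0, 0.0, 0.0], scale=1.5, size=(50, 3)),
    "blue": rng.normal(loc=[-2.0, 1.0, 0.0], scale=0.3, size=(50, 3)),
}


def spread(points):
    """Total variance of a group: sum of per-axis variances about its centroid."""
    return float(points.var(axis=0, ddof=1).sum())


for name, pts in scores.items():
    centroid = pts.mean(axis=0).round(2)
    print(f"{name}: centroid = {centroid}, spread = {spread(pts):.2f}")
```

The distance between centroids says how different two groups are on average, while the spread says how consistent each group is within itself.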

Summary:

  • Principal component analysis enables us to convert data with a high number of variables into a simplified form that just shows the major characteristics
  • We do this through repeated rotational transformations and identifying the dimension of maximal variation
  • Nice relationships to eigenvalues and eigenvectors make analysis more convenient and allow for an entirely data-driven approach
  • Through PCA we are able to classify sets of microstructures and determine their relative variance

And the main objective from this entire series is to better understand material process-structure-property (PSP) linkages.

Thank you for reading to the end of this series! Hope you enjoyed learning about principal component analysis and computational materials science. If you’re interested in discussing this or are in the field and would be willing to talk with me about materials science, you can get in touch with me at kiranamak@gmail.com.

At 17 years old, I love learning and am interested in materials science, education, and environmental sustainability.