Handling Memory Errors | Persistent Memory Programming | Intel Software


Hi, I’m Andy Rudoff from Intel. In this video, I’ll
provide an overview on how to handle memory errors
when developing programs for persistent memory, or PMEM. The error handling
strategy for an application is an important part of the
program’s overall reliability. For server applications,
this strategy directly affects availability. In other words, what percentage
of the time the application is expected to be
available to do its job. PMEM programming brings some
interesting new considerations to error handling. So let’s dig into the details. First, some background
on memory errors. The main memory on a server,
often referred to as DRAM, is protected using error
correction codes, or ECC. This is a hardware feature
that can automatically correct many memory
errors that happen due to transient hardware
issues, such as power spikes, media errors, and so on. But if an error
is serious enough, it will corrupt so many bits
that ECC can’t correct it. And the result is known
as an uncorrectable error. Most applications never worry
about uncorrectable errors. Server hardware has
become reliable enough that they are a rare occurrence. Using the Linux operating system
environment as an example, if a program experiences an
uncorrectable error in DRAM, the application is sent a
SIGBUS signal, which kills it. While it’s possible
for an application to catch a SIGBUS
instead of exiting, it’s very rarely done
since the recovery logic would be very complicated. Instead, the application dies. And the DRAM
containing the error is returned to the system
where it is re-initialized before it can be used again. In this way, memory error
handling is very simple. The server application dies. The memory is returned
to the system. And typically,
the application is restarted where it begins
again with fresh DRAM. With that background,
let’s compare how uncorrectable
errors in PMEM differ from uncorrectable
errors in DRAM. The events start out the same. When an uncorrectable error
happens with persistent memory, a SIGBUS is sent to
kill the application. But since persistent memory
is, well, persistent, you may have already figured
out that the error doesn’t just go away because the
application dies. If you restart the application,
the most likely thing to happen next is that
the application hits the exact same uncorrectable
error in persistent memory and gets killed
again with a SIGBUS. For this reason,
the operating system keeps track of areas
in persistent memory where there are known
uncorrectable errors. Here you see an example of
the ndctl command in Linux listing the known bad
blocks in persistent memory. The libraries in the persistent
memory developer kit, or PMDK, automatically look
at this information and will prevent a program from
opening a persistent memory pool if it contains
these errors. In this example, notice
how PMDK’s pmempool command indicates
there are known errors. Putting all this
information together, the simplest way for an
application developer to handle memory errors is
to let the application die when it gets a SIGBUS. This avoids the
complicated programming of trying to handle
SIGBUS at runtime. On restart, the
application can detect that the persistent memory
pool contains errors using PMDK and can repair the data during
application initialization. For many applications,
this repair can be as simple as reverting
to a backup error-free copy of the data. You can see that application
developers are faced with some interesting choices. But it isn’t hard to get started
by using the simplest, most common techniques initially, and only bringing more
complexity into your program if it turns out to be necessary. See the links provided
for example programs and more documentation,
tutorials, and videos on persistent
memory programming. Don’t forget to like
this video and subscribe. Thanks for watching. [MUSIC PLAYING]
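
The video mentions that an application can catch a SIGBUS instead of exiting, but that this is rarely done. For reference, here is a minimal sketch in C of what catching the signal looks like. It is only an illustration: the error is simulated with raise(), and the names run_guarded, simulate_media_error, and no_error are hypothetical, not part of PMDK or any other library. A real handler would also have to work out which data was lost, which is the complicated part the video refers to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Jump target used to resume execution after a SIGBUS is caught. */
static sigjmp_buf recover_point;

/* Signal handler: abandon the faulting operation and jump back. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info; /* info->si_addr holds the faulting address for a real fault */
    (void)ctx;
    siglongjmp(recover_point, 1);
}

/*
 * Hypothetical helper: run fn() with a temporary SIGBUS handler installed.
 * Returns 0 if fn() completed, 1 if a SIGBUS interrupted it,
 * and -1 if the handler could not be installed.
 */
int run_guarded(void (*fn)(void))
{
    struct sigaction sa, old;
    int hit;

    assert(fn != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    /* Second argument 1: save the signal mask so siglongjmp restores it. */
    if (sigsetjmp(recover_point, 1) == 0) {
        fn();
        hit = 0;
    } else {
        hit = 1;
    }

    sigaction(SIGBUS, &old, NULL); /* put the previous handler back */
    return hit;
}

/* Stand-in for touching memory with an uncorrectable error (simulated). */
void simulate_media_error(void)
{
    raise(SIGBUS);
}

/* Stand-in for a memory access that succeeds. */
void no_error(void)
{
}
```

Calling run_guarded(simulate_media_error) returns 1, and run_guarded(no_error) returns 0. Knowing that a SIGBUS happened is the easy part. Deciding what to do with half-finished work is why most programs simply let the signal kill them.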

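The simplest strategy described in the video (die on SIGBUS, then check and repair at startup) can be sketched in C as follows. This is a sketch under stated assumptions, not PMDK code: plain files stand in for the pool and its backup, and pool_check_fn, recover_on_startup, always_bad, and always_ok are hypothetical names. In a real program the check would come from PMDK, which the video says refuses to open a pool with known errors, and the bad blocks themselves would also have to be dealt with, which this sketch does not cover.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical check: returns nonzero if the pool has known errors. */
typedef int (*pool_check_fn)(const char *pool_path);

/* Copy src over dst. Returns 0 on success, -1 on failure. */
static int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    size_t n;
    int rc = 0;
    FILE *in = fopen(src, "rb");
    FILE *out;

    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/*
 * Run once during application initialization.
 * Returns 0 if the pool was healthy, 1 if it was reverted to the
 * backup copy, and -1 if the repair failed.
 */
int recover_on_startup(const char *pool_path, const char *backup_path,
                       pool_check_fn has_errors)
{
    assert(pool_path != NULL && backup_path != NULL && has_errors != NULL);
    if (!has_errors(pool_path))
        return 0;
    return copy_file(backup_path, pool_path) == 0 ? 1 : -1;
}

/* Demo checks standing in for the real error detection. */
int always_bad(const char *pool_path) { (void)pool_path; return 1; }
int always_ok(const char *pool_path) { (void)pool_path; return 0; }
```

With this shape, the runtime code path stays free of signal handling. All of the recovery logic runs in one place, before the application starts using its data.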