Heterogeneous Acceleration and Challenges for Scientific Computing at the Exascale

Good morning, everyone. I'm Mary Hall, and this is our last plenary talk of FCRC 2019. The plenary speakers were nominated by the conferences for the cross-disciplinary impact of their work. The common theme we've seen this week examines future applications that demand unprecedented scaling, and today's talk is going to examine how scaling makes scientific discovery possible, and some of the caveats about how to adapt applications to new architectures.

Today's speaker is Erik Lindahl. He's a professor of biophysics at Stockholm University and a professor of theoretical biophysics at the Royal Institute of Technology. He leads GROMACS, an international molecular dynamics simulation project, for which he's received several awards, and he also chairs the PRACE steering committee.
And so with that, I'll introduce Erik Lindahl.

Thank you so much, Mary, it's great to be here. I have a ton of slides for you, so I'm going to get started pretty much right away. We've had a number of wonderful keynotes focusing on big data, artificial intelligence and everything, but I'm going to try to take you back a bit to number crunching, because that's very much where I'm coming from. Or more correctly, I'm not even coming from number crunching: at heart I'm still a biophysicist. So what really makes my heart beat is
stuff like this: nerve cells, and understanding what's happening in your body. And that's not just figuratively but literally, because every single heartbeat in your body is due to signaling in these cells. Over the last few years we've started to understand in quite a lot of detail what happens. We know how signals travel as electrical signals inside these long nerve cells, and in particular we know roughly what happens in the so-called synapse, when a signal has to jump from one nerve cell to the next. The synapse itself is super exciting because it's our one chance to somehow influence the whole signaling; in particular, if you want to design pharmaceuticals to treat neurological disorders and everything, that's the one opportunity we have. The reason we can do this with computers
have the reason we can do this with computers
is that over the last 10 or 15 years we’ve been able to determine structures
so we know what these small receptors that bind the neurotransmitters work
right and we can actually put them in a computer and simulate them I used to do
this already when I was a PhD student and I I don’t remember how many months
or years I used you spent an eternity and events you I think it might be Steve
eases I had some like 10 nanoseconds of simulation which was fun in a way it
taught me about the field and everything in hindsight it’s an utter embarrassment
because today I could probably run that over lunch on my laptop on the other and
that also says something about what we can do with current supercomputers and
how much biology and chemistry we can do and that is where my I’m not sure
whether I consider myself cross-disciplinary but rather somehow
anti disciplinary I stand with one foot in the physics the other side somehow in
life sciences and then I guess a third leg in computer science computer science
is the one thing I’ve never had any formal training in and today it’s
probably 75% of my group is doing so what I’m gonna take it to a bit is
really focusing not so much on the applications. Oh, I forgot this: the obvious application in the neuro area is that any time you're sedated with volatile anesthetics, they are acting on these receptors, and if you go out and enjoy yourself tonight after the conference and have a drink, that ethanol is also going to bind to and change these receptors. But the tools we use to understand this are based on molecular simulations: with some sort of model we describe a time evolution, and then we ask the computer to predict what the positions of these atoms are going to be a very short time later. I tried to look up some statistics on this, and I think we're in the ballpark where between 10 and 15% of all the HPC resources in the world are used for molecular dynamics, not only our program. That says something about how important this is, not just for the national-lab researchers but also for people doing research in the biomedical sciences and for pharmaceutical companies. And although we couldn't do a whole lot of this when I was a student, today there are all sorts of things we can do. So I'm going to tell you two stories here. I'm going to give you a little bit of an
example of how these applications have evolved and where we're coming from. Once upon a time, long ago, I was sitting as a PhD student in my adviser's lab, running this on a single DEC Alpha workstation, and over the years we've taken this from Linux clusters to local supercomputers to some of the largest computers in the world, such as K or Summit nowadays. It's a wonderful journey, and I hope it's not over yet, but we're increasingly seeing that it's leading to some very big challenges as we try to port these codes to new hardware. So I'm going to share a little bit of that story with you, and also point out where the challenges are, not just for this application but for broad classes of applications, on the so-called exascale. And the other story is really to take a step back and see what the scientific computing challenges are in general as we move to the exascale. There are all these prefixes, right: we were at the gigascale, we were at the terascale, we were at the petascale. The exascale is fundamentally different, I would argue, and we're going to come back to that. I'm not saying it's the end of scaling, but I think it's the end of a whole lot of things we have taken for granted for a very long time, at least in scientific applications. Molecular dynamics is actually a dirt-simple problem, at least if you do it classically, which we tend to do: you
can describe my entire life's work in a flow diagram with five boxes. The type there is a bit small, but basically we start with some sort of state that we typically get from an experiment; then we need to calculate all the interactions between these particles, which we tend to do with classical potentials; then we update the coordinates; and then we just repeat this again and again and again and again. The reason why this works, the part that we emphasize here, is really statistical mechanics. Rather than going for the most accurate representation of one molecule, which would mean quantum chemistry, we focus on large molecules such as proteins, or even larger collections of cells and everything that moves, and in that case it's important for us to sample more than one conformation. So we frequently compromise a bit on the accuracy and use these classical interactions: very simple models, springs, balls, traditional Coulomb electrostatics, nothing quantum anywhere near here. And then we typically use Newton's equations of motion to update this. This is not perfect; we certainly ignore a whole lot of quantum phenomena. But if you look at a protein, for instance, it's very rare that we break any bond, and it's very rare that we form any bonds, so it works surprisingly well. And with these trade-offs, even though it's fairly expensive to calculate all these pairwise interactions, this is orders of magnitude, and when I say orders of magnitude you should probably think like ten orders of magnitude, cheaper than doing it with quantum chemistry. So this means we can actually cover realistic timescales; I can do it at finite temperature, and I can do it with water and everything, to really mimic a biological experiment, which is completely out of the question with quantum chemistry, at least for now.
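The five-box loop can be sketched in a few lines. This is a toy: a one-dimensional Lennard-Jones chain with made-up parameters and a velocity-Verlet update, purely an illustration of the structure of the algorithm, not GROMACS code.

```python
import math

def lj_force(r, eps=1.0, sigma=1.0):
    # Lennard-Jones force magnitude; positive means repulsive
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

def forces(pos):
    # Box 2 of the flowchart: all pairwise interactions
    n = len(pos)
    f = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            fmag = lj_force(abs(pos[j] - pos[i]))
            sign = 1.0 if pos[j] > pos[i] else -1.0
            f[i] -= fmag * sign   # repulsion pushes i away from j
            f[j] += fmag * sign
    return f

def step(pos, vel, dt=0.001, mass=1.0):
    # Box 3: velocity-Verlet update of coordinates and velocities
    f0 = forces(pos)
    vel = [v + 0.5 * dt * fi / mass for v, fi in zip(vel, f0)]
    pos = [p + dt * v for p, v in zip(pos, vel)]
    f1 = forces(pos)
    vel = [v + 0.5 * dt * fi / mass for v, fi in zip(vel, f1)]
    return pos, vel

pos, vel = [0.0, 1.12, 2.24], [0.0, 0.0, 0.0]   # near-equilibrium spacing
for _ in range(100):   # boxes 4-5: repeat; every step depends on the last
    pos, vel = step(pos, vel)
```

The outer loop is the whole point of the story that follows: it is strictly sequential, so every step has to finish before the next can begin.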
The problem, though, is that on the atomic scale things move fast, so you need to take time steps in the ballpark of femtoseconds or so. And if you're going to get anywhere interesting whatsoever: the water wiggling there in the background might be great fun if we were interested in the physics of hydrogen bonds or something, but if you want to study proteins or anything in your cells, we're going to need to reach microseconds or even milliseconds, and you're going to need a huge number of steps. And here is the curse: this is sequential. Every step depends on the previous step, so I need to calculate one step as quickly as I can, very much focusing on latency, so that I can calculate the next step, and this is going to turn out to be a curse on modern supercomputers. The reason it's a curse is that if you look at a small system, a tiny box, or take something slightly larger like one of these ion channels I showed, we might have a hundred thousand atoms. And the problem is I can't pick another system; I'm interested in this particular ion channel, so I can't pick something larger that would scale better. In this system, if I only account for things close in proximity, each atom might have five hundred neighboring atoms I should take into account, and if you do a bit of mathematics here, maintaining a list, you end up with something in the ballpark of 50 million interactions per step. With the typical interactions we have, that might be roughly 2 billion floating-point operations, which is certainly a lot of floating-point operations. Or rather, it used to be a lot of floating-point operations; this is nothing nowadays. So the problem really is that I hardly have enough floating-point operations. This here is our big kernel, and it's very simple, because this is where we spend 90% of the time: calculating one over the square root of a squared distance, so I get 1/r, which gives me Coulomb's law. It's embarrassing what fraction of my career I've spent just trying to optimize that function. But to reach the timescales here, we need this to complete in way under a millisecond, and this is the problem as we're becoming more and more parallel with current computers.
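The arithmetic behind those numbers is easy to reproduce; the ~40 flops per pair interaction and the 2 fs time step are assumed figures for illustration.

```python
atoms = 100_000
neighbors = 500                        # neighbors per atom within the cutoff
interactions = atoms * neighbors       # ~50 million pair interactions per step

flops_per_pair = 40                    # assumed cost of one pair interaction
flops_per_step = interactions * flops_per_pair   # ~2 billion flops per step

# The latency side: a microsecond of sampled time per day, with 2 fs steps,
# is half a billion strictly sequential iterations...
steps_per_day = 1e-6 / 2e-15           # 5e8 steps
# ...which leaves well under a millisecond of wall time per step.
wall_budget = 24 * 3600 / steps_per_day    # ~173 microseconds per step
```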
So why don't we just scale up the system? Well, scaling up the system would certainly enable me to scale better, right, but I wouldn't necessarily reach longer timescales. And when it comes to the biology, the problem is that as long as I'm stuck on the femtosecond or picosecond scale, I'm doing physics. I might be able to get some sort of chemistry when I move to nanoseconds or so, but to really understand motions in these molecules, to understand what's happening when you're having that drink, we need to reach at least something like a microsecond, ideally a microsecond per day. We're not quite there yet. If you want to get further, if you want to have some sort of predictive power, we probably need at least another order of magnitude, and if biochemists are really going to trust computers to replace their experiments, we probably need another order of magnitude above that. And the green part down here, that's the pipe dream; that's likely not going to happen. Why is it not going to happen? It's an iterative algorithm, and the problem with these iterative algorithms is this: four hundred microseconds per step, that's doable. It's hard, but it's doable. Forty microseconds per step: I think what I'm going to show you today will take us to 40 microseconds per step. But here it starts to become really difficult. At 400 nanoseconds per step, that's probably just a number to you, but in 400 nanoseconds light barely travels a hundred meters, and if you're then going to communicate like 20 or 30 times per step, just start counting the lengths of those InfiniBand cables and everything. Even if my code took no time whatsoever to run, even if the hardware were perfect, there is simply not enough time for the light to travel as far as we would need to communicate. So no matter how good we are here, no matter how much work we put in over the next few orders of magnitude, we hit some very hard walls in terms of physics that are going to make it impossible to scale further, and if we hoped to scale up to a zettascale machine, that's simply not going to work.
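That hard wall can be checked with two lines of arithmetic; 400 ns per step is the pipe-dream regime above, and 25 communication phases per step is my pick from the 20-30 range mentioned.

```python
c = 299_792_458.0            # speed of light in vacuum, m/s
step_time = 400e-9           # the 'pipe dream' regime: 400 ns per MD step

light_distance = c * step_time          # ~120 m of light travel per step
per_comm = light_distance / 25          # under 5 m per communication phase
# Signals in fiber or copper propagate at roughly 2/3 of c, so the real
# cable budget per communication is even tighter than these few meters.
```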
If you look at what these algorithms do, and don't expect to follow the details here, it turns out that simple flowchart is way too simplistic. To optimize things you tend to separate things that are independent and everything, and you end up with a ton of small communications. Every red item here is a communication, and in this flowchart every single small green or grey arrow is a communication, so we might have 10 or 20 communication steps in every step. And remember, those steps should take less than a millisecond, so we're way down at hundreds, well, tens of microseconds per communication event, which is getting tough because networks are not improving anywhere near as fast as computers. So what do we do about this? Well: GPU acceleration.
This is actually a real GPU that we have run GROMACS on. There was a poor PhD student at Stanford, Ian, in Pat Hanrahan's lab, and when I was a postdoc at Stanford at the time, we came up with this idea that maybe, just maybe, we could run number-crunching codes on GPUs. This poor student spent nine months porting all my kernels to these GPUs. It was very much a cross-disciplinary project: Pat Hanrahan is a professor of computer science, and I at the time hadn't even touched computer science, I had no idea. Ian eventually got all this code working, and we were sitting in a room discussing where to send it, and I think Pat said we should send it to SIGGRAPH or something, and I said, but this is good work, can't we send it to a real journal? Since then I've kind of learned that conferences are more important than journals in computing science. But that's one of the fun aspects of being cross-disciplinary. This poor student's kernels never appeared in GROMACS, because they worked, but they were a factor of 20 slower than running on the CPU. He got a decent job anyway, though, because he's now a vice president at NVIDIA, responsible for all their GPU-accelerated computing. The reason GPUs are tempting is that, while CPUs are parallel, with every small green dot here being a functional unit, this is an old GPU. Actually, any GPU slide is an old one, because they keep updating them so frequently it's pointless to update the slide; you can't even see the individual green dots here anymore. There are something like 6,000 computing units, or whatever you call them, processing elements, on a single GPU, and you can buy these for $499, which is pretty tough competition. As a scientist, I like that price. The only problem is that you can't use them; they're completely useless. There was a reason why they were 20 times slower than CPUs, right: these only work if you have a data-parallel algorithm, where all the units can ideally operate on exactly the same kind of data. And that's not how algorithms work in molecular dynamics. To tell the truth, that's not how algorithms work in most scientific applications, which, at least in Europe, is one reason why the number of large supercomputers that have been turned over to accelerators is probably still less than 10%. Scientists are stubborn, or stupid, because we don't do our homework porting things.
There are a couple of different ways you can approach acceleration. If you're really lucky, you can use something like GPU libraries; if you just happen to have a code that fits, it's amazing, because you can get the vendors to do the homework for you. Unfortunately, we don't have an algorithm that's present in libraries. You could use something like directives, OpenACC, or OpenMP today; that's probably what I should have tried, but when we started doing this OpenACC was not mature enough, and even when it started to appear, we couldn't access the deep dungeons of the hardware the way we wanted. You can do pure CUDA; we do a whole lot of pure CUDA, and it's okay. The only problem is that you end up with huge amounts of CUDA code: thousands of lines, tens of thousands of lines, hundreds of thousands of lines, and at some point you're at a million lines of CUDA code. That's great if you only ever use NVIDIA hardware, but the US machine I showed in the first slide is made by AMD, and there are now a lot of research groups crying blood because they spent 10 years optimizing everything for CUDA and are now going to have to port everything again. So what we have increasingly moved to is heterogeneous CPU-GPU acceleration, where we use both the CPU and the GPU. That will look really stupid in some ways, but I would argue it's where we all should head. The big problem here is the initial effort and expertise required; this hurts a lot to learn, and again, I'm not a computer scientist, so it's very much been on-the-job training for me. But I think once you pay that initial price, everything else looks really beautiful, and it produces performant code that is quite maintainable.
The reason this code is maintainable is mainly that we don't run everything on the GPU, as a whole lot of codes do today. That works, but it also limits you a bit. If you're only going to run things on one GPU, and can just use parameter scanning, or some machine-learning setup where you have lots of GPUs doing training or inference in parallel, then it works great. But remember, in my algorithm one step depends on the previous step, so I can't run hundreds of thousands of GPUs in parallel, and that means the GPUs will have to talk to each other. There are architectures, such as NVLink, where you can have a few GPUs talk to each other inside a node, but the second the GPUs are in different nodes, you're going to need to either go out over the CPU or at least try to do MPI on the GPU, and this very quickly becomes very complicated, in particular because we also have a whole lot of bookkeeping to do for neighbor searching and everything. So what we have increasingly done is to try to stay a bit on the CPU. We use something like OpenMP on the CPU, and we let the CPU do all the complicated bookkeeping, anything that has to do with random memory accesses, where we can do every single trick we want. Trick is a dangerous word in climate science nowadays, at least, but for us these tricks are usually smart algorithms that let us do fewer floating-point operations. And then we just offload the most expensive part to a GPU. That works very well: we vectorize on the CPU, and then we run either CUDA or OpenCL on the accelerator devices. We can have a node with two GPUs, that works perfectly fine, since they typically have two sockets; or I can of course decide to overload the GPUs, sorry, overload the CPUs, or start to parallelize not just between three or nine nodes but say even a hundred nodes this way. This is going to be a challenge, because for everything I need to do, this GPU will have to communicate to that CPU, that CPU will have to send a message to the next node, and the next node will have to communicate to its GPU. But if the hardware is fast enough, this just might work. This is somehow called the "easy" programming model when NVIDIA presents it; if this is the easy one, I'm very happy they didn't take the difficult one. So what we started to do first, together with some in-depth analysis, was to just look at these algorithms. I'm going to show you a
bunch of these flowcharts; don't worry about the details, but typically we have one CPU thread here and one GPU here. Part of the work we do on the CPU, and then we do a bunch of communication steps, but here in the middle we have the traditional compute-bound part (the laser is wearing out; say, in the middle, you have the traditional compute-bound part). We know our algorithms so well I could address this without any profiling: it's all these Coulomb and van der Waals interactions, where we're calculating interactions between atoms that are close in space, and those are reasonably easy to offload to the GPU. We send things over to the GPU as quickly as we can, and the GPU does its job; in the meantime the CPU doesn't necessarily have to be idle, because we can let it do some of the cheaper computations, and then we just hope to get the GPU work back in time. When we first started doing this, we were so disappointed, because the GPUs were so slow that we always ended up waiting for the GPU. The other problem, though, and I think this is the real challenge both for us and a whole lot of other codes, and why we haven't moved more to accelerators, is that scientists have been smart. I'm not talking about me, but the previous generation: there is a wealth of amazing algorithms developed and optimized for CPUs, and those are great algorithms, but remember, they are optimized for CPUs. In any kind of molecular simulation, for the non-bonded interactions, the pairwise interactions, we tend to create lists of neighbors, so that for each atom here, say particle number three and particle number four, there is a list of which atoms are close in space to atom three or atom four. In general those lists will of course be different, and there will be different numbers of atoms close in space, but atoms eight and nine are in all three lists here. This works great on a CPU, because you look up an entry in the list, you compute the force, and you write it back. On GPUs, not so much, because the problem on GPUs is that they require many elements doing the same operation, just on different pieces of data, and this gets complicated because they can't all write data to the same location at the same time: atom nine here is going to have updates from all three of them. There are some tricks we could do. We could of course create a gigantic matrix: if I have N atoms, this would be an N-squared matrix, every element of this matrix would be, say, a four-by-four or eight-by-eight tile, and then I would literally calculate every single interaction against every other atom in the system. This is what you see in almost every beginner book on CUDA showing how to accelerate things. It's an amazing way to get high floating-point rates, and it's an absolutely horrible way to do simulations, because it scales as N squared; nobody would have been that stupid.
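In spirit, that beginner-book kernel is nothing more than this scalar sketch; on a GPU the same i/j loop would be spread across tiles of threads, but the O(N^2) pair count is the same.

```python
import math

def all_pairs_energy(pos):
    # Every atom against every other atom: perfectly regular work for a
    # GPU, but the pair count grows as n*(n-1)/2, i.e. O(N^2).
    e = 0.0
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(pos[i], pos[j])
            e += 1.0 / r          # Coulomb-like 1/r with unit charges
    return e

e = all_pairs_energy([(0, 0, 0), (1, 0, 0), (2, 0, 0)])   # 1 + 1/2 + 1
```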
That's why Verlet came up with those algorithms in the 1960s. And what has struck us over and over again is that we have to go back and revisit these fundamental algorithms. Had you asked me 20 years ago if there was one thing that would never change in my field, I would have said it's this algorithm: the Verlet list algorithm will never change. And we've had to revisit it, because it does not work on modern hardware. So what we ended up doing on modern hardware instead is to come up with algorithms that were both fast and achieved high floating-point rates on the GPU, but I don't want to throw out the baby with the bathwater: I don't care about floating-point rates, I care about simulation performance. Take the traditional approach here, where you just grid things (the reason for gridding things is to get an n log n algorithm): you end up with grid cells where some cells might have five atoms, some might have three and some might have four, and that is exactly the problem, because I don't end up with tiles that are four-by-four or eight-by-eight. But you can adjust these algorithms ever so slightly by only gridding things in two dimensions, and in the third dimension we instead bin, so that I always have exactly four atoms per cell. These cells are no longer going to be uniform and beautiful in space, but I don't really care; it works great anyway. So from my original molecules on the top left I end up with some sort of completely artificial tetrahedra, clusters of four atoms, and if one atom in one such cluster interacts with another atom in another cluster, the entire two clusters interact. This means I recover the 4x4 interaction pattern, but I only compute it for the cluster pairs that are actually within range, which gets important when it's a hundred thousand or a million atoms. This again scales as n log n, but it will still work on GPUs. And when we first did this, you want to cry.
I spent the vast majority of my career trying to remove floating-point operations. There's an entire generation of scientists, at least those of you who are old enough know what I'm talking about, raised on the idea that the fastest floating-point operation is the one you don't calculate. But with these algorithms we end up with a huge number of wasted floating-point operations: everything that's purple there is a floating-point operation that it would have been even better to avoid. So I'm wasting 50% of everything I calculate, and this is where I would normally be so embarrassed that I would just put a brown paper bag over my head and resign, but the amazing thing is that GPUs are so fast that it's a factor of three faster anyway. And this is the other hard lesson here: again, I spent my entire career thinking that everything is limited by floating-point operations. Forget about that on GPUs. Everything is memory; to a first approximation you have an infinite amount of floating-point operations. That is very hard to learn, and I keep making the mistake again and again, thinking we're floating-point limited. But this worked quite well for us, and the other neat thing is that when you do things this way, with heterogeneous acceleration, we have very little architecture-specific code. I think it has grown a little since I made this slide, but we have something like 10 files that contain CUDA, roughly 3,500 lines of raw CUDA out of 3 million lines. And this worked so well that there was even a company that contacted us and helped us write OpenCL kernels; they did it as kind of a pro bono thing, probably to show that they're good at OpenCL. So now we have OpenCL acceleration too, and we're going to be able to work on that AMD Frontier system when it comes online, and any other system that comes online in the next year or two. So this makes me
reasonably happy. There are a couple of reasons why I very much like heterogeneous acceleration, and I'm going to try to convince you why this is the way to go. You could of course say you should just go entirely CPU, but that would be the same as the 1980s, when you had these Intel processors that did not have a floating-point unit. You could emulate floating point if you absolutely needed it, but most apps didn't. Or you could say you're only doing number crunching, that you only care about the floating-point unit: that would be the 287 coprocessor, the old chip you could physically plug in. But that's equally stupid; of course nobody would keep a loop variable in floating point. And for the last 20-30 years, every single modern processor has had a floating-point unit built in, of course, and not just one, probably two or even four units. The way we see this is that GPUs, accelerators, whatever you call them, are very much merging with processors. You're going to have processors that contain sets of latency-focused units, kind of like traditional CPUs, and other sets of units that are throughput-focused; those would be the accelerator-type units today. Give this ten years and they're going to be sitting on the same chip. There will be no PCI Express bus between them and everything, and you simply decide: you send the integer work to the integer unit and the floating-point work to the floating-point unit, and the same thing with latency-bound versus throughput-bound
jobs. What this gives you in scientific applications is that, rather than focusing entirely on floating point, you use the CPUs for all this complicated bookkeeping, these fancy neighbor-searching algorithms I showed you and everything. Or take fancy parallelization: if you can limit the amount of data you send over, you're going to do better, but that frequently means complicated gridding algorithms and everything that the GPU is not at all good at, while the CPU, on the other hand, is. So use the CPU for the communication, and ideally, in the meantime, you should be able to use the GPU too. The take-home message here is very much the bottom line: rather than seeing things as a GPU or a CPU, a modern computer is a collection of devices. This actually holds for the CPU too. Think of it this way: you have a network device, maybe several network devices, you might have I/O devices, you have traditional CPU devices, you have accelerator devices, and ideally you should try to use all of these to solve your problem as efficiently as possible. Initially that's difficult, but long term it's frequently a help. The problem
is that when we do it this way, we very quickly run out of work. Remember, this is the opposite of what I said held historically: we were always short of floating-point capability, and what's happening now is that these accelerators are so darn fast that they run out of work, because I send my work over and it's not even enough to keep them busy. These are complicated plots, but what they basically mean is that I need something like 100,000 atoms per GPU before the GPU is saturated; otherwise the amount of execution time I use per particle goes up. So this whole regime here to the left basically shows that I'm not saturating the GPUs. And a hundred thousand atoms doesn't sound so bad until you realize that was the entire size of the system I wanted to simulate, so I can only use one GPU for a system instead of hundreds of them, and this is still limiting us entirely. We want to get strong scaling, and we need to get this down to 1,500 atoms or something, and we're nowhere near there yet. So we've become latency bound, in the sense that the time the driver takes to issue the kernels on your accelerators is too long. Not for one GPU, mind you; one GPU works fine. But if I want to use a thousand GPUs, I simply run out of work, and that's again the curse of having spent 30 years removing work, removing floating-point operations. I have some ideas I'm going to show you for how to attack this: we can come up with smarter ways of trying to hide the latencies.
So again, we have these flowcharts, and if you run in parallel, one thing you notice is that there are different types of work. I have some local work that's already present on my node; I don't need to wait for communication to do that. So I can give the GPU two streams, so that it first starts to chug along and do computation on locally available data while the CPU is doing communication with other nodes. The second the remote data is present (we have the non-local bonded interactions here), a high-priority signal basically gets sent, and I immediately calculate all that data so I can send it back to where it came from; and then, while I'm waiting for communication, I can chew along on my local data. This works quite well: we gain a factor of two from it, and this is what's required to get below one millisecond per step. With this we're down to scaling to maybe only 10,000 atoms per accelerator device or something, but that still just means we can go to four or eight GPUs, which is not good enough, sorry.
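The overlap trick can be mimicked on the CPU, with threads standing in for the two GPU streams. This is only an analogy of the scheduling idea, with made-up timings and illustrative function names, not actual CUDA stream code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel(name, seconds):
    time.sleep(seconds)           # stands in for a GPU force kernel
    return name

def halo_exchange(seconds):
    time.sleep(seconds)           # stands in for MPI communication
    return "remote coordinates"

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as streams:
    # "Local stream" starts crunching locally available data at once.
    local = streams.submit(kernel, "local forces", 0.10)
    # Meanwhile the CPU does the communication with other nodes.
    halo_exchange(0.05)
    # Remote data arrived: launch the high-priority non-local work.
    remote = streams.submit(kernel, "non-local forces", 0.03)
    done = {remote.result(), local.result()}
elapsed = time.perf_counter() - t0
# Overlapped, this takes ~0.10 s instead of the ~0.18 s a purely
# sequential schedule (0.10 + 0.05 + 0.03) would need.
```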
Yes, the remote data. And the problem here is really that we're bumping into Amdahl's law. I had heard of Amdahl's law, but it's one of these embarrassing things: you've heard of it, and then you keep forgetting it. So for ten years we've just been banging our heads against Amdahl's law, and you're not going to win that battle. The one way out is that we need to push more work onto the accelerator, because those same accelerators that were so slow that we were cursing them when we started? Something happened, and now suddenly the accelerator is 10 times faster than the CPU, so the accelerator is always waiting for the CPU instead. That means we increasingly have to move more work to the accelerator, and more work to the accelerator, and more work to the accelerator. This does work, and you can offload other algorithms too.
is saying algorithm called Keamy that used to do long range
interactions I won’t go into details and how we did it but it works quite well to
offload that into a third stream and you can even do four streams here and one
advantage of having multiple streams is that I avoid doing them sequentially it
takes a while for each stream to start executing but this way they can all
start in parallel and with this we get another factor two or three speed-up
which, again, doesn't sound amazing, but I used to say in my lab, whenever a new student came in, that if you can speed up my code by ten percent, I will buy you a bottle of champagne. I only did that once, for the people doing GPU work, because they came back and we realized they hadn't sped it up by ten percent, they had sped it up by a factor of two. And then you wait a year, and they sped it up by another factor of two. This is both horrible and beautiful in a way, because this is of course how science is supposed to work: we thought that we were at the limits of what the hardware could do, and then these students turn out to be a factor of four smarter than we were. But then comes the really embarrassing part. If there was one thing I knew well, it was how to program a CPU; I think I wrote on the order of seven hundred thousand lines of x86 assembly when I was a PhD student. If you haven't done that, do it at some point, because it's a great way of learning what a computer looks like. Once you've done that, don't do it again, because it's completely horrible code that is unmaintainable. For a long time there was a point in my life when I was better at scheduling and register allocation than the compiler; then the compiler became equally good; then the compiler became better than me at scheduling registers, and I'm not going to look through seven hundred thousand lines of code and try to reschedule my registers just because there are now more registers available, so that was that, and I went back to the compiler. But still, wasn't the CPU just about avoiding floating-point calculations?
right well it’s not so easy because a modern CPU does not look like a CPU used
to do when I was young even a single core on a modern CPU it’s full of all
these complicated devices and units but even any normal amount of my laptop
devices they will have at least two if not three heavy floating-point units
these floating-point units can handle just one instruction but they have this
single instruction multiple data so they can issue one instruction that operates
on for eight or sixteen floating-point numbers that’s a lot of floating-point
operations for one like 64 flops a single precision maybe but that’s one
core even a lost generation skylake they have saved like maybe 32 cores per chip
and then you have two such chips per note so that would mean roughly 4000
floating-point operations per cycle per note that’s lost generation the new tips
Intel just released they improving this by a factor of two so you’re having 8000
hold parallelism on one node on the CPU side of a modern computer if you sitting
and thinking you’re considering whether your code is parallel enough for that I
can can’t even it’s very unlikely that your code expresses eight thousand fold
parallelism and that’s before you’re even considering paralyzing over two
notes so CP u s– in a way they’re grown very much like look like a mini cluster
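The arithmetic behind those figures can be checked with a quick sketch. The assumed numbers below are for an AVX-512, Skylake-class server part as described above; exact lane counts and port counts vary by SKU.

```python
# Back-of-the-envelope check of the per-node parallelism quoted above.
simd_width = 16  # single-precision lanes in one 512-bit vector
fma_flops  = 2   # a fused multiply-add counts as two flops
fpu_ports  = 2   # heavy floating-point units issuing per cycle, per core

flops_per_core = simd_width * fma_flops * fpu_ports
cores_per_chip = 32
chips_per_node = 2
flops_per_node = flops_per_core * cores_per_chip * chips_per_node

print(flops_per_core)  # 64
print(flops_per_node)  # 4096, i.e. the "roughly 4000 per cycle per node"
```

Doubling any one of the factors (as the newer chips do) gives the 8000-fold figure.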
Again, this is not even a high-end CPU; this is, I think, an 18-core Intel part, and you have ring topologies, you have complicated networks, you have multiple dies. These are starting to look like really complicated maps; I think my first cluster was simpler than the die of a modern CPU. So internally, a modern CPU looks much more like an accelerator than CPUs did when I was a student, and in theory we might be able to take all those algorithms that we developed for GPUs back to the CPU, and it might not hurt performance that much, which would of course simplify code maintainability. Did I say not hurt performance that much? Well, it turns out that if you optimize this entirely for the exact type of SIMD units you have, it ends up being about 30% faster than the CPU code I had spent 10 years developing. There are complications here, because you suddenly need to alter entire algorithms; you have to change your buffer sizes depending on how wide these SIMD units are, and so on. In the interest of time I'm not going to go through that, but again, it's the same lesson: to really reap the benefits of all this hardware, you need to take a step back and accept that you have to change your algorithms a bit, because if you insist that your algorithm stays fixed, you're not going to get any of these advantages. The
other thing we came up with was this: if floating-point operations are so cheap that we have an effectively infinite amount of them, couldn't we try to use that? Normally we would try to minimize them. The only reason for having the buffers you see there on the left is that in principle I only want to calculate the interactions inside the solid red circle, but if I only update this list of neighbors every 10 steps or so, some particles might diffuse in, so I need a bit of an extra buffer zone to account for particles that might move in over the next 10 steps. But if I had an infinite amount of floating-point operations, I could use a gigantic buffer zone and make it 2 or 3 times larger. That would of course mean lots of wasted floating-point operations, but that's not so bad, and what it gains me in return is that I don't have to update the neighbor list as often. It turns out that on modern hardware, calculating a neighbor list means random access in memory, which is much more costly than just calculating the interactions. The other thing we can do, given that long neighbor list, is that every few steps we can stop, check what's in the neighbor list, and create a smaller neighbor list from it: we prune the neighbor list. This actually works great even on a CPU, where we gain 20% or so; on a GPU it's a factor of two. And what of course would be
really amazing: imagine that I had an idle device in my system, a device that was really good at floating-point numbers and that, for part of the step, wasn't doing anything else. I could just tell this device: if you don't have anything else to do, here is a unit of work; take this neighbor list and start pruning it. If you're not done, it doesn't matter, stop immediately, because then I have higher-priority work for you; but while you're idle, you can try to optimize things for the next step. And of course we have such devices, GPUs in particular. So what we now do with GPUs is that any time the GPU is idle, because the CPU is integrating or communicating, we use the GPU to optimize our data structures ahead of the next step, so that we'll be able to do better at the next step. This gave us another 50%. So although when we started out with GPUs we were a factor of 20 slower the second I added an accelerator, we're now roughly one order of magnitude faster. Those accelerators are not cheap, particularly the professional ones, but for us there is no question now: we do everything based on accelerators. The problem, though, is that no matter how good these accelerators are, you end up with problems at some point: when we get down to these hundred microseconds per step, we don't have enough work for the accelerators to do, we end up wasting all our time on the latencies, and things simply will not scale anymore.
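The buffered neighbor list and the pruning pass described above can be sketched as follows. This is a plain O(N^2) illustration with invented parameters, not the production algorithm, which uses cluster pair lists tuned to the SIMD width; the point is only the two radii: build rarely out to the padded cutoff, prune cheaply back toward the real one.

```python
# rc is the real interaction cutoff; the list is built out to rc + buffer
# so it stays valid while particles diffuse between rebuilds, and pruned
# back toward rc in between. Purely illustrative parameters.
import random

def build_list(pos, radius):
    # Full rebuild: expensive, random-access style pair search.
    pairs = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
            if d2 < radius ** 2:
                pairs.append((i, j))
    return pairs

def prune(pairs, pos, radius):
    # Cheap pass over an existing list: keep only pairs currently inside
    # `radius`. Streams through memory instead of searching it.
    keep = []
    for i, j in pairs:
        d2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
        if d2 < radius ** 2:
            keep.append((i, j))
    return keep

random.seed(1)
pos = [(random.random(), random.random(), random.random()) for _ in range(200)]
rc, buf = 0.3, 0.1
full = build_list(pos, rc + buf)           # padded list, rebuilt rarely
inner = prune(full, pos, rc + 0.5 * buf)   # pruned list used most steps
print(len(inner) <= len(full))             # True: pruning only removes pairs
```

In the scheme described in the talk, the `prune` pass is exactly the kind of low-priority work handed to the GPU whenever it would otherwise sit idle.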
These are some of the most expensive networks you can get, on the Crays and everything, but even there we have to give up. Or do we have to give up? Can we find other things to do? Well, maybe. Instead of those one-dimensional flowcharts, we can start to write things as task graphs, where every single square here is a small task to be calculated, a non-bonded interaction or an integration, and every arrow is a data dependency. This is very popular now, in CUDA in particular, and I would imagine a whole lot of younger computer scientists are far better at this than I am, but by reformulating our entire algorithm as a graph, we remove a whole lot of those global synchronization points. You could imagine doing this once per step and then doing a second step, but doing it once per step still means that we have this nasty synchronization point in the middle, and that's what we would like to get rid of.
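The task-graph formulation can be sketched with the standard library's topological sorter. The task names below are invented stand-ins for the squares in the flowchart; the key property is that only real data dependencies serialize anything, so independent tasks fall into batches that could run concurrently instead of waiting at a global barrier.

```python
# Each key is a small piece of work; each value is the set of tasks whose
# data it depends on. Names are illustrative, not the real kernel names.
from graphlib import TopologicalSorter

graph = {
    "forces_local":    set(),
    "halo_exchange":   set(),
    "forces_nonlocal": {"halo_exchange"},
    "reduce_forces":   {"forces_local", "forces_nonlocal"},
    "integrate":       {"reduce_forces"},
}

ts = TopologicalSorter(graph)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all dependencies satisfied
    batches.append(ready)           # everything in a batch may run at once
    ts.done(*ready)

print(batches)
# [['forces_local', 'halo_exchange'], ['forces_nonlocal'],
#  ['reduce_forces'], ['integrate']]
```

The scheduling itself is trivial here; the hard part the talk points to is doing this dispatch at microsecond latencies, which is where existing graph frameworks fall short.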
So the obvious way to do this: if I have my atoms, and this might be a hundred thousand atoms on a node, I group them into smaller parts, say the first, second, third, and last quarter of the atoms. If I already know the forces on the first quarter of the atoms, I can send them on to the next step and integrate and update them, because if I know all the forces on those atoms, I can start processing them while I'm still calculating the forces on the other atoms. What this enables us to do is break the entirely sequential approach, so that sub-parts of the system can be working on step two, or even step three, while other parts of the system fall one step behind. And since we're at very high levels of parallelism, and there is constantly a bit of jitter in the system from other jobs and everything, this turns out to do miracles for the scaling. What we would now like to do is combine this with accelerators and CPUs and everything. In theory this is great; there is only one problem: there are still no frameworks whatsoever available that can do this at the latency levels we need. There are beautiful graph libraries out there, but we need these graph libraries to schedule each task at microsecond-level latency when we execute it. If any of you have any idea how to solve this, I would be so interested, because then we would start using it; for now we have to code all of this manually, and I bet we're not going to be the only application hitting this brick wall. But think about what this has done compared to when
I was a student. To tell the truth, halfway through my PhD I was thinking of quitting or changing subject, because there was no way these algorithms were ever going to be able to tell us anything real about chemistry and biology. Thank God I was wrong. And this is not just due to us; some of this is my students' work, but let me show you some examples from the field. These are examples where researchers used molecular simulations to study the stratum corneum of the skin and determine the entire structure of the lipid layers that are responsible for the skin's barrier function. Once you have modeled this, based on cryo-electron microscopy data, you have an entire approximate but reasonable model of what the skin looks like, and you can create systems with different modifiers, test different molecules, and see how expensive it is for a particular molecule to go through skin. Now, of course, the whole point of skin is that it should not let molecules through, but some molecules will have a greater ability than others to get through, and what you would like to use this for is to develop better skin patches. If you're in drug design, by far the best drug is one you can deliver through a transdermal patch, because you don't need needles, it's cheap, the patient won't be bothered, and it's a low continuous dose; it is better in every single way. The only problem is that less than one in a million chemicals will do that spontaneously. These are the two students of mine behind this work; they went on to start a small company where they're now selling this as a consultancy service, optimizing the compositions of these patches so that they can predict which chemicals will be suitable for transdermal patches, and I'm in shock and awe that it works.
People are using simulations to study things like cellulose and lignin, which is super important for, say, biofuels, and for understanding microcrystalline cellulose. In Sweden in particular we have a whole lot of forest industry, and this is super important for them. We love these systems because, given their size, they scale amazingly well; we run them on up to four or five hundred nodes of Piz Daint, which is still the sixth-largest computer in the world. And for my particular pet subject, the ion channels that I talked about today: it's now virtually impossible for a group to publish an experimental structure of an ion channel unless they've also included a simulation. I'm not sure whether I entirely agree with that, but looking at it in hindsight, it's a completely different world, because when I started, simulations were treated sort of like something the cat dragged in. So on the one hand this makes us happy, but we still have the problem that there is no way this is going to scale to the largest systems in the world. This is an
old slide and I’ve I’ve updated it but I haven’t changed their thing is this is a
Jaguar and once upon a time was a nice computer it’s this long retired roughly
at the same time you had Eugene which is a three hundred thousand core machine in
Europe and if you just take these machines and apply Moore’s law we had
three hundred thousand cores around 2010 almost a decade and then we just scaled
this up 2014 we had sequoia three million course 2016 we would have
something like 10 million cores and we actually had pit science not quite
course there was all sort of processor elements on GPUs but the point is
Moore’s law appears to hold you get more functional units 20 18 30 million course
I think we just missed that benchmark in 2020 we’re supposed to have hundred
million Corsair elements we beat that summit ten hundred twenty-five million
elements and if you see the rumors there were these rumors saying that the
Chinese actually have a machine that’s twice as fast to sum it but they didn’t
they’re to put it on top five due to the trade negotiations and if you
just keep doing this extrapolation I’m fairly confident this will hold so
somewhere around 2024 the largest machines in the world will contain 1
billion functional elements and I see quite a few if you were young here and
that’s five years from now on you’re not gonna have tenure by 2024 so if you’re
going to be in computer science and if you argue that your work is going to be
relevant there’s pretty much only one question you should ask yourself and
that is how you’re gonna use a billion force if you you only scale to 1 million
force it will actually not even let you on these systems because you’re only
using 0.1% of it and then you’re not gonna be considered to scale
sufficiently well and this is a bit humbling I have to confess I’m not sure
how my jobs will be able to use a billion course but I see again if
anything that we’re ahead of this curve rather than behind it
I suspect that what’s gonna happen that somebody will come up with the idea that
we can just scale up we can have a gigantic system let’s simulate an entire
cell that’s gonna be a trillion atoms or something I had developed a special type
of time and machine Technology in Stockholm so I’m actually going to
predict what the result of that simulation is if you look very carefully
here this is gonna be the end result because we can certainly assimilate a
trillion atoms for one not a second and if you know anything about cells
absolutely nothing happens that involves an entire cell in one on a second so the
end result is going to look exactly the same as the starting conformation but I
bet it’s going to be on the cover of a glossy journal in five or ten years but
if you’re actually interested in science that’s pointless we need to take these
small systems and get them to cover longer scales in time instead and at the
same time that that’s the speed of light we can’t get them to do that well what
we are increasing look at I think a whole lot of other people do is
ensembles and there are a whole range of algorithms here but I’m in particular
gonna focus on one technique that we have come to love and it’s called Markov
state models because it’s not just brute force embarrassing parallelism that we
certainly could do to in many cases so if you look at a small protein moving
this is the villin headpiece. The largest simulation in 1998, the first microsecond simulation of a protein, was done on this system, using the largest supercomputer in the US for three months, the entire machine; a tour de force result. But even this so-called native state you see is moving around a bit and adopting slightly different conformations. That's natural: it's at room temperature, it should do that. But maybe we could classify those conformations. So if you took this protein and just started lots and lots of completely independent simulations, at first fairly short ones, and then stopped the computer after a few hours to see whether some of them are ending up in the same place, maybe we could form some sort of clusters, and based on those clusters, maybe there are three macroscopic states that correspond to slightly different conformations. That works really well, actually, and we can do it slightly more mathematically. The whole idea of a Markov state model is that a Markov process, in physics, is just the keyword for the simplest process you can imagine: it has no memory, and the only thing that matters is where you are now. If you are in the red state, there is a certain probability you will become blue and another probability you will become yellow, but there is no influence of history. So if that holds, and you are in a given state, we can ask: what is the probability that I stay in that state, and what is the probability that I go over to another state, and so on. Now, if we continue, at some point I have lots of transitions between red and yellow, and if I continue these simulations, at some point I might identify a new state, the green one, and then I will be able to calculate the transition probabilities to green, because I can now choose to start lots of simulations from the green and red states. At some point I will find a white state, and then I can choose to emphasize white, green, and yellow in other simulations. So it's no longer brute force: I can choose to bias my simulations toward the outskirts of what I have sampled so far. As I continue here, and I'm not going to draw ten thousand of these, this works remarkably well.
And here the scaling is not 100%: the scaling we get for these small systems is typically two to three hundred percent, because I'm not doing brute force; I'm focusing my sampling on the parts of phase space that I have not yet sampled. And what this really gets me is not just a beautiful movie of the protein: I literally get the entire landscape, the entire kinetics. Again, given that the molecule is in a specific state, what is the probability that it will end up in another state a short while later? We have done that, and remember, this is the same protein, the villin headpiece, that scaled to, I think, 256 cores in 1998. You can take a super small protein like this, but because I run many simulations, say two or three hundred independent ones, I can easily use a few thousand cores, five or six thousand. These simulations are short; they will complete in four hours. And when I do this with Markov state modeling, I can let the computer optimize everything automatically; I don't need to pick the parts of phase space to sample manually. So this small system of roughly eight thousand atoms, which I can run effectively on six thousand cores, means that I have scaling down to 1.3 atoms per core, which is a factor of a hundred better than I could ever imagine doing with my regular parallel algorithms. The other cool thing is what
happens after less than half a day. The white state is the native one from the experiment, and after less than half a day, not only did we reach the white state; the computer can say that the red state, which is not quite identical, is the one predicted to be most stable, without any information whatsoever about what the experimental state was. That was not true twenty years ago, because twenty years ago we cheated and stopped the computation at the white state. So molecular simulations are actually good enough now to literally fold proteins. It's probably not the way you would predict protein structure, but it means that we can predict what's happening; for instance, if there is a mutation here, how does that influence the protein? And there was a beautiful paper on this by a colleague in eLife just a few weeks
ago. This is John Chodera, who has used Markov state models to simulate methyltransferases. These are small proteins involved in several cancers, including rare forms of cancer, and they modify other proteins. The hypothesis is that certain mutations in this protein will preferentially stabilize some conformations but not others. Using, I think, ten thousand different simulations in parallel, started from a ton of different mutations, not only could they show movies, they could actually show with statistical certainty that some of these states have become more stable. And you see in the lower one that when there is a small molecule bound, it tends to stabilize some of the states that are less visited in the so-called apo form, the one without the ligand bound. The amazing thing is that they did this for hundreds of mutations, all the mutations that cancer genome databases have found to be related to cancer, and rather than just saying that some of these mutations are related to cancer, we can now explain why: because they tend to stabilize certain conformations. And then you can use that to ask: can we find a drug that somehow destabilizes that conformation? So rather than just using simulations as some sort of postdiction, to understand things already known from experiments, we're actually having simulations predict not just the structures of molecules but their entire kinetics, how they move, and I'm so super impressed with this.
The other cool thing is that they're using machine learning to identify how they should build these models, and the timescales they cover here are several milliseconds, even though there is not a single individual run longer than one microsecond: by using Markov state modeling, we effectively extend the timescales we can cover by at least three orders of magnitude. I'm almost done here, but I'm going to end with a few examples, because I think this approach has legs, and it's being used not just in molecular simulations but in virtually all scientific problems. This dawned on me when we wrote the scientific case for computing in Europe about a year ago. The most obvious one has to do with any type of prediction, in particular hurricane prediction. What you do in hurricane prediction, since we know that, just like molecular dynamics, it's a chaotic process, is to alter your initial conditions ever so slightly and run many, many trajectories, and then you end up with this whole cloud of potential different scenarios. Short term, we know exactly where it's going to end up, but long term, Sandy here might end up in New York or go north or south. You can certainly do this to get better predictions too: you run a lot of parallel simulations and then try to determine what the uncertainty is, so rather than just having a single number, we can actually attach a standard error, just as any experimentalist would do.
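The ensemble idea above can be sketched with a toy chaotic system. This is not a weather model: the logistic map below is a stand-in for any chaotic dynamics, and all parameters are invented. The structure is the point: perturb the initial condition slightly, run many independent copies, and report a mean with a standard error instead of a single trajectory's answer.

```python
# Ensemble forecasting in miniature: many runs from slightly perturbed
# initial conditions, summarized as mean +/- standard error.
import random
import statistics

def logistic_run(x0, steps=50, r=3.9):
    # Logistic map in its chaotic regime: a cheap stand-in for a
    # chaotic forecast model.
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

random.seed(0)
ensemble = [logistic_run(0.2 + random.gauss(0.0, 1e-6)) for _ in range(200)]

mean = statistics.fmean(ensemble)
sem = statistics.stdev(ensemble) / len(ensemble) ** 0.5
print(0.0 <= mean <= 1.0, sem >= 0.0)  # True True
```

Despite the tiny (1e-6) perturbations, the chaotic dynamics spread the ensemble out, and the spread itself is the useful output: it quantifies how far ahead the single number can be trusted.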
This works remarkably well, but the other thing it lets you do is start comparing scientific codes, because really, not all scientific codes are equal. With Irma and Jose in particular, it was very close that they had to order an evacuation of Miami, and of course, if it's a life-and-death matter, you evacuate when you're uncertain; but for evacuating Miami, I think $500,000,000 is probably a low estimate, and you can't afford to do that every time there is a potential hurricane headed for Florida a week out. If you start looking at these codes four or five days out, the best code here, which happens to be the European one, has an average error below 200 kilometers, while the worst one has an error of 700 kilometers; that's roughly the distance between Miami and Jacksonville. Computers are certainly large and expensive and everything, but if you compare running these computers for 24 or even 72 hours with the cost of evacuating Miami, the computers are dirt cheap. The other lesson here is that, as much as we want those computers, we should invest more in code development and in comparing codes. I'm not sure exactly why this one code is so much better, but it is of course something all the codes should learn from, so that all the codes become as good as the best prediction. We're seeing this in weather prediction too. This is an
initiative that has been started very much in Switzerland, where predicting the weather is horribly difficult, in contrast to Phoenix, because you have all these microclimates in the valleys. Of course, accurate 20-day predictions would be awesome, but in Switzerland, even if you could predict accurately for 10 days, that would have major implications for agriculture, tourism, and production, if you can make sure to harvest the grapes before the rain. What's happening in Europe that I'm so enthusiastic about is that, as part of the EuroHPC effort, there are now almost 10 topical centres of excellence, where we're getting the scientists involved together with computing experts and code experts, focusing for instance on fluid dynamics, life sciences, personalized medicine, weather prediction, and materials research. I think this works really well, because the scientists are the ones who own the problem; the scientists must be there to keep all the programmers focused on what the real goal is, but this is a way for them to get help from outstanding experts, and a few of these cases have resulted in tremendous impact, both scientific and industrial. And the final part of this, of course, has to do with energy:
perovskite solar cells. Solar cells are obviously something we would like to use, in Arizona in particular. The problem with these solar cells is that we know perovskites are great, they have outstanding efficiency, but they wear out very quickly, and they need lead to be reasonably stable; lead is good for many things, but it's not environmentally friendly. If you look at the efficiency improvements here, give this five or ten years, and based on quantum chemistry, we will have efficient perovskite solar cells. What you have up here is a Swiss-Swedish company, ABB, designing high-voltage cables. It might seem remarkably unsexy, but people are using molecular simulations to design better insulators, and with better insulators you can increase the voltage from a hundred kilovolts to five hundred kilovolts, or even to over one megavolt, and the higher the voltage, the lower the losses. That means that in the future you might be able to take energy from Arizona and send it to, say, Alaska; basically, to be able to transport energy. Today those losses are one of the main reasons that we can't take energy from, say, the Sahara and use it in Sweden. So with that, I'm almost done. On the one hand, many of
the things I’ve said are positive I think we are getting way better at using
computers but there are also some very tough messages here the free lunch here
is completely over you’re not going to get faster computers and if worse is not
going to become faster they’re going to become slower so people like me I only
told you about my good codes sadly I also have a large fraction of very bad
codes that are single threaded and these single threaded codes they now run
slower for every generation of supercomputers that come out because the
core performance is going down the whole frequency is going down you don’t really
see that now because we still shrink the the architectures are still shrinking we
had bit of hiccup Iran 40 nanometer and no
idea how well 1000 metres gonna work but the point is that we’re seeing the end
of this we will see 700 metres potentially 5 and I would guess that 5
nanometer is going to be it at some point we’re gonna bump into the loss of
physics but earlier that we’re likely gonna bump into the loss of finance that
Intel will not it’s gonna be so costly to develop the next generation
lithography that it’s simply not worth it for the small performance
improvements you’re gonna be getting and this is worse because even GPUs as
amazing that’s the accelerators have been that development has also come from
having a more and more transistors on the chip and we’re not we’re not really
going to get those transistors so as amazing as the accelerators are you’re
gonna need to move two completely different architects just likely within
ten years and that scares me a bit because we nowhere near being able to do
good work even with accelerators the other final thing I want to comment on
is openness. For us it was mostly that we like being open, and, to tell the truth, we didn't expect any licensing income from our code, so we made these codes open source 20 years ago. But the more I see of this development, and of how difficult all of this is, the more I believe in the importance of open codes. There was a very sad story involving David Chandler, who unfortunately died a few years ago: there was a bit of a war over supercooled water, where two groups fought about this for over seven years, and after seven years it eventually turned out that there was a small bug in the Chandler group's code, a code they had not been willing to share. So as codes become more and more complicated, with accelerators and everything, we need to get better at looking into each other's codes and testing them, because the hardware out there is amazing; the sad part is that the scientific software is nowhere near there yet. On the other hand, if you're a young person in this field, that means there is a whole lot of potential to improve, because my generation didn't really do our homework here. So with that,
to sum things up: if there's one thing I would like to emphasize here, spend the time on algorithms. Love your algorithms, but also don't be married to your algorithms. Throw out the algorithms that my generation produced, because most of them are not good for modern hardware. Think accelerators, not so much because they are accelerators, but because that's how all modern hardware is going to look, and you will actually have to move beyond accelerators within ten years. The other part I stress is really this: if you're a young person in this field, you can do miracles, even if it's not your own code, but you can't do it in an afternoon. Find an application problem, work with an application group or pick one yourself, and you will likely be able to have a tremendous impact. I actually did the math a few years ago: I think my code gets 20 citations per day, so once per hour there's a paper published with my code. It's long since stopped improving my index and everything, but to me this gives a bit of a sense of purpose; what I do on a day-to-day basis is meaningful to some other people. And as dead as traditional scaling is, I think we have lots of exciting things coming up once we simply accept that we can't scale the traditional way; we have to go all in on ensembles. The one final thing to remember is that the chemistry lab
doesn't look like this anymore, not even mine; I don't even have a sink. I have a large experimental lab, but I don't have a single person who is in the lab 100% of the time; even our experimentalists increasingly sit in front of computers, and we're seeing an entire new generation of experimentalists who use computers and maybe robots, but don't necessarily stand in the lab themselves. That's certainly going to be a change for chemistry, possibly more so than for computer science. With that, I'd like to thank you very much for your attention, and I think we have time for one or two
questions from the audience. Please come to the mic and say your name.

Vivek Sarkar, Georgia Tech. Erik, thanks, a great talk. I know in molecular dynamics people have also, over the decades, tried to build special-purpose hardware: GRAPE, Anton. And now we're entering an era where mainstream hardware vendors are talking about having dozens of different kinds of accelerators. Do you see some opportunities for domain-specific hardware, maybe even something far out like analog computing, or something that optimizes Markov state models?
Five years ago I would have said no way. Even GROMACS itself was originally a machine, the GROningen MAchine for Chemical Simulations; we gave up on the hardware 20 years ago because we realized how fast general-purpose processors were developing. But as I said, the problem is that we're coming up against this brick wall of physics, right, and we pay a lot for programmable hardware. Anton in particular is very expensive to develop, but at some point dedicated application-specific circuits will be able to go beyond the physical limits of programmable hardware. I don't think it's going to be something you have in every lab; it's too expensive and too limited, because it also means your algorithm is encoded in the silicon, right? You can't change your algorithm if you have a smart idea. But using it for some of the initial sampling and then augmenting that, well, seeding your Markov state models for instance, or something like that, I definitely see a role for it. But it's going to be very expensive to develop. Thank you. Let's go ahead and
thank Erik. Excellent talk! And now general chair Vivek Sarkar will make a few closing announcements.

Good afternoon once again. I'm Vivek Sarkar from Georgia Tech, and I would like to share some very brief closing remarks. It's hard to believe that it was only five days ago that we had the Turing Lecture in this space by Geoff Hinton and Yann LeCun, because so much has happened since then, and there was a generally high energy level throughout the conference that your participation has contributed to. I, for one, have taken away many topics from the talks, even from Erik's talk today, that I want to share with my colleagues when I get back, and I hope your experience has been the same. So I would just really like to conclude by acknowledging the hard work of everyone, all the people who contributed to the success of FCRC. On Sunday I was able to thank Donna Cappo, who led the entire administration team, and today I would like to specially thank the members of the executive events team who have been working on site and pre-event on all aspects of all our plenary talks, conferences, tutorials, and workshops. They are Ashley Maroon, Brenda Ramirez, Regan Robertson, Christian Rottweiler, Rose Shapiro, Jill Scuba, Morgan Wick, and of course their fearless leader who's running things offstage, Shannon Killian. So let's give them a round of thanks. And then all of you on the conference committees put in several months of effort leading up to the FCRC programs, which is absolutely critical for the success and quality of the papers here, and your input also played a key role in shaping our plenary speaker program. On that note, I would also like to offer a big round of thanks to Mary Hall, the FCRC plenary speaker chair, for putting together an amazing line-up of these plenary talks. So thank you, Mary, and thanks to all the plenary speakers who accepted our invitations, and also to Geoff Hinton and Yann LeCun for picking FCRC as the venue for their Turing Lecture. And finally, thanks to all of you for participating; your participation is ultimately what makes a conference a success. I hope you really enjoy the last set of sessions this afternoon. Safe travels back, and we look forward to a great FCRC get-together again in 2023. Thank you.
