I didn't even know what that meant, and I still don't, really.
They were used for floating-point arithmetic. As you can imagine, computers only deal in 1s and 0s, and the processors of the time didn't have dedicated silicon for advanced arithmetic. That meant they essentially had to use software algorithms to do any complex mathematical operation, and software is much slower than dedicated hardware inside the chip for these operations.
So if you didn't have a math coprocessor (FPU), your computer could still do all those calculations entirely in software, but that wasn't fast enough for those games. If you had the FPU, the processor would simply route all those instructions to it, and you had plenty of performance for games.
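To give a feel for what "doing it in software" means, here's a rough sketch in C of the kind of integer-only bookkeeping a software floating-point routine has to do just to multiply two numbers. The `soft_float` format and `soft_mul` function are made up for illustration (nothing like real IEEE 754, and rounding and special cases are skipped); the point is only to show how many plain integer steps stand in for what an FPU does with one instruction.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Toy "soft float": value = mantissa * 2^exponent, with the mantissa kept
   normalized so its top bit (bit 31) is set. This is NOT IEEE 754 and skips
   rounding, zero, infinity, etc.; it's just a sketch of the integer-only
   bookkeeping a software float library does for every single multiply. */
typedef struct {
    int      sign;      /* 0 = positive, 1 = negative */
    int      exponent;  /* power-of-two scale factor */
    uint32_t mantissa;  /* normalized: bit 31 set for nonzero values */
} soft_float;

static soft_float soft_mul(soft_float a, soft_float b)
{
    soft_float r;

    /* Multiply the two 32-bit mantissas into a 64-bit product. */
    uint64_t product = (uint64_t)a.mantissa * (uint64_t)b.mantissa;

    r.sign = a.sign ^ b.sign;
    /* +32 compensates for keeping only the top 32 bits of the product. */
    r.exponent = a.exponent + b.exponent + 32;

    /* Renormalize so the top bit of the 32-bit result is set again. */
    if (product >> 63) {
        r.mantissa = (uint32_t)(product >> 32);
    } else {
        r.mantissa = (uint32_t)(product >> 31);
        r.exponent -= 1;
    }
    return r;
}

int main(void)
{
    /* 1.5 = 0xC0000000 * 2^-31, 2.5 = 0xA0000000 * 2^-30 */
    soft_float a = { 0, -31, 0xC0000000u };
    soft_float b = { 0, -30, 0xA0000000u };
    soft_float r = soft_mul(a, b);

    /* Convert back to a double just to check the answer (prints 3.75). */
    printf("%f\n", (r.sign ? -1.0 : 1.0) * ldexp((double)r.mantissa, r.exponent));
    return 0;
}
```

Every one of those shifts, compares, and renormalization steps (plus all the ones skipped here) costs clock cycles on the main CPU, which is why having the FPU do it all in hardware made such a difference.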
Now, as you mention, any modern processor will have that function built in.
For modern computers, the analogy would be the graphics card (or GPU) needed for games. Essentially the same thing--complex graphics rendering takes an extraordinary amount of computing power to emulate in software. But if you have dedicated silicon that does the necessary functions in hardware, you not only get it done quickly but leave the main processor (CPU) free to use its resources on other things.
---------------
But... Fun story. In 2001, in my first job out of school, I worked for a company that produced programmable logic (FPGA) chips. These were "general" chips full of configurable logic that you could use to map out complex functions and still have them run "in hardware", which was important for MANY applications where you needed the speed of hardware but what you were doing didn't lend itself to actually having dedicated chips designed and fabricated for it.
Well, one of the things it offered at the time was an embedded processor implemented in the programmable logic itself (known as a soft core), meaning you could emulate a processor in the FPGA fabric and run software on it as opposed to building dedicated complex logic. As it was new, the company had an internal design competition to show off ways to use the processor. The group I was in... designed an FPU to go along with it, since the soft core didn't have one natively. We then tested software processing of floating-point arithmetic against our "coprocessor", and our FPU showed a 100-fold reduction in the number of clock cycles needed to perform the calculations.