I'm going to ignore the MDA part of this, because MDA/Hercules is potentially a huge rabbit hole. The long and short of it is that Hercules was *never* a "BIOS supported" video standard; the PC BIOS supports *MDA* in text-only mode, and there's zero BIOS support for the graphics extensions. Complicating matters further, there were apparently sometimes significant implementation differences between clone Hercules cards, so honestly I'm not comfortable going 100% on the record saying it's impossible that this person you knew had some bad brand-X semi-compatible Hercules card that was MDA compatible enough to work fine for text but had problems in graphics mode.
I always thought that the only difference between MDA and graphical mode was that in MDA mode the pixels were generated by ROM and in graphical mode by RAM. Same for CGA text and graphical mode. But having a look at a BIOS for an XT, I noticed there were differences in the values that have to be written to the 6845 registers for the various modes. I'm not familiar with the 6845, so I have no idea why there are these differences.
Aaaanyway. On the CGA side of the fence, the main reason you need to reprogram the 6845 when you go from text to graphics mode is that the 6845's main job, other than generating the video sync and timing signals, is acting as a video RAM address generator, i.e., it's the thing that dictates which location in RAM is accessed to get the block of pixels that's next in line to get shoved out through the pixel generation hardware as the refresh scan progresses down the screen.
The memory layout is different between the graphics and text modes, so the CRTC has to be reprogrammed to generate the correct progression of memory addresses.
I don't have time to lay it out in complete detail, but the short version is:
A: In (80 column) text mode the CRTC is programmed to count from 0 to 1999 in blocks of 80 character addresses, each of which is repeated 8 times (for the 8 scan lines of each character). These addresses are fed into the pixel generation hardware, which fetches 2 consecutive bytes for each character (the glyph code and an attribute); those then dictate the state and colors used for the next 8 pixels to be shifted out to the monitor.
B: In graphics mode the CRTC is programmed to count from 0 to 3999, in blocks of 40 addresses. (Why 40 when a 640 pixel B/W screen is 80 bytes wide? Like in text mode, the hardware fetches 2 successive bytes for each address.) Each one of these blocks of addresses is repeated *twice*, and for every other line the hardware adds an 8192 byte offset to the address. This is why CGA graphics modes have a 2:1 logical interlace: there aren't enough address lines wired between the CRTC and the video RAM for it to create a completely linear framebuffer, so one of the CRTC's character row lines (the outputs that count scan lines within a character row) is used as the highest address bit.
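To make A and B a bit more concrete, here is a rough Python sketch of that address progression. It's a simplified model, not a description of the real circuitry: the function names are made up for illustration, and it only looks at three CRTC values (addresses per row, number of rows, scan lines per row), ignoring all of the sync and timing registers.

```python
def crtc_scan(chars_per_row, rows, lines_per_row):
    """Yield (scan_line, row_line, address) in the order a frame is scanned.

    Simplified model: the same block of addresses is repeated once for
    every scan line of a character row before moving on to the next block.
    """
    for row in range(rows):
        for row_line in range(lines_per_row):
            for col in range(chars_per_row):
                scan_line = row * lines_per_row + row_line
                yield scan_line, row_line, row * chars_per_row + col


def text_byte_offset(address):
    # Two bytes per address: glyph code first, attribute second.
    return address * 2


def graphics_byte_offset(row_line, address):
    # Two bytes per address; the row line (0 or 1) acts as the top
    # address bit, which is the 8192 byte offset on every other line.
    return row_line * 8192 + address * 2


# A: 80 column text, 25 rows of 8 scan lines, addresses 0..1999
text80 = list(crtc_scan(80, 25, 8))

# B: graphics, 100 rows of 2 scan lines, addresses 0..3999
gfx = list(crtc_scan(40, 100, 2))
```

Both settings come out to 200 scan lines. In the graphics case the first bytes of scan lines 0, 1, 2 and 3 land at offsets 0, 8192, 80 and 8272, which is the familiar even/odd split of the CGA framebuffer.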
It would in theory at least have been possible for IBM to design the system so the graphics and text modes would use the same CRTC settings; the CRTC settings other than character height and total number of characters per screen are already the same between 40 column text mode and graphics; if IBM had opted to make people deal with 8-way interlacing instead of 2-way you'd be able to flip between the 40 column text mode and the two true graphics modes just by flipping the (not part of the CRTC) hardware bit that selects the appropriate pixel output hardware. But that's not what they opted to do.
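For illustration, here is roughly what the two layouts look like side by side, as a small Python sketch. The 2-way function is the real CGA arrangement described above. The 8-way one is purely hypothetical: it assumes graphics kept the 40 column text settings (25 rows of 8 scan lines) and that three row-line bits became the top address bits, giving eight 2048 byte banks. Nothing like that was built into the CGA, and the function names are made up.

```python
BYTES_PER_LINE = 80  # 40 CRTC addresses x 2 bytes each


def offset_2way(y, x_byte):
    """Real CGA graphics layout: odd scan lines sit 8192 bytes up."""
    return (y % 2) * 8192 + (y // 2) * BYTES_PER_LINE + x_byte


def offset_8way(y, x_byte):
    """HYPOTHETICAL layout using the 40 column text CRTC settings:
    each of the 8 scan lines in a row gets its own 2048 byte bank."""
    return (y % 8) * 2048 + (y // 8) * BYTES_PER_LINE + x_byte


# Every byte of a 200 line screen, in both layouts
all_2way = {offset_2way(y, x) for y in range(200) for x in range(BYTES_PER_LINE)}
all_8way = {offset_8way(y, x) for y in range(200) for x in range(BYTES_PER_LINE)}
```

Both layouts map the 16000 screen bytes to unique offsets inside the card's 16K, so the 8-way scheme would have fit; the cost is that software drawing a vertical line would have to hop between eight banks instead of two.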