Another way to scale computing
I was pondering the future of scaling computation, since we’re pretty much out of hertz now… okay, I was reading this paper: http://research.cs.wisc.edu/multifacet/papers/tr2011-1coherencestays.pdf. It’s great to hear that cache coherence has a bright future, but what about when you need to move a lot of data around? Caches don’t help so much then.
That got me thinking about DMA, and how useful it would be to do DMA transfers… with inline computation. If you pipeline it just right, you could maintain full transfer bandwidth and pay only a small fixed latency. The engines could use a feedback approach (or the stream cache idea below) with a local cache to do multiple computations over the same data and across neighboring data.
One possible application: an engine sitting between the CPU and the network interface could handle VPN encryption and the like.
You could do memory-to-memory copy operations with computation folded in. Take an entire array in main memory, send it through this programmable data-processing engine, and it lands right back in main memory with no CPU involvement at all, at full memory bandwidth (plus the computation delay). I imagine scientific computing could benefit from that.
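To make that concrete, here’s a rough sketch of what the programming model might feel like from the software side. Everything here is hypothetical: dpe_transform and double_it are made-up names, and the loop is just a software stand-in for what the engine would do in hardware during the copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Software stand-in for the hypothetical engine: stream n words
 * from src to dst, applying kernel to each word in flight. On real
 * hardware this loop is what the DMA pipeline would do for free,
 * at memory bandwidth plus a fixed pipeline delay. */
static void dpe_transform(uint32_t *dst, const uint32_t *src, size_t n,
                          uint32_t (*kernel)(uint32_t))
{
    for (size_t i = 0; i < n; i++)
        dst[i] = kernel(src[i]);
}

/* Example kernel: scale each element, the sort of pass a
 * scientific-computing job might fold into the copy. */
static uint32_t double_it(uint32_t x) { return 2 * x; }

int main(void)
{
    uint32_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint32_t dst[8];

    dpe_transform(dst, src, 8, double_it);  /* "copy with computation" */

    for (size_t i = 0; i < 8; i++)
        printf("%u ", dst[i]);
    printf("\n");
    return 0;
}
```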
These could be “stream”-like processors whose cache exists just to hold a bounded amount of recent data. You’d have a moving window over the stream available via the local cache: from the programmer’s perspective, each clock you’d see an array representing up to a fixed amount of data back in the stream.
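A minimal sketch of that moving-window model, again purely hypothetical (dpe_stream, window_sum, and the WINDOW size are invented for illustration): the kernel sees the current element plus the last few behind it, and emits one output per input, so bandwidth is preserved.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define WINDOW 4  /* how far "back in the stream" the kernel can see */

/* Software model of the stream cache: the kernel gets the last
 * WINDOW elements (win[WINDOW-1] is the newest) and produces one
 * output per input element. */
static void dpe_stream(uint32_t *out, const uint32_t *in, size_t n,
                       uint32_t (*kernel)(const uint32_t win[WINDOW]))
{
    uint32_t win[WINDOW] = {0};

    for (size_t i = 0; i < n; i++) {
        /* shift the window -- real hardware would keep a circular
         * buffer in the local cache instead of copying */
        for (int j = 0; j < WINDOW - 1; j++)
            win[j] = win[j + 1];
        win[WINDOW - 1] = in[i];
        out[i] = kernel(win);
    }
}

/* Example kernel: running sum over the window, i.e. a box filter. */
static uint32_t window_sum(const uint32_t win[WINDOW])
{
    uint32_t s = 0;
    for (int j = 0; j < WINDOW; j++)
        s += win[j];
    return s;
}

int main(void)
{
    uint32_t in[8] = {1, 1, 1, 1, 2, 2, 2, 2};
    uint32_t out[8];

    dpe_stream(out, in, 8, window_sum);
    for (size_t i = 0; i < 8; i++)
        printf("%u ", out[i]);   /* prints 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}
```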
I’m working on an application right now where it would be very useful to route network traffic through one of these processors and then on to both main memory and disk. Logging and auditing could be offloaded to these processors so the CPU would never have to consider it.
It would be interesting to be in a place where the “CPU” was just a manager of arrays of other processors, directing traffic but not doing that much number crunching itself.