Abstract: The design space for edge AI hardware supporting large language model (LLM) inference and continual learning is underexplored. We present 3D-CIMlet, a thermal-aware modeling and co-design ...
Abstract: Processing-In-Memory (PIM) architectures alleviate the memory bottleneck in the decode phase of large language model (LLM) inference by performing operations like GEMV and Softmax in memory.