Inside Apple's M4 Neural Engine: Reverse Engineering Reveals Graph Execution Architecture
Original: Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering
Reverse Engineering the M4 Neural Engine
A detailed reverse engineering investigation of Apple's M4 Neural Engine (codename H16G) has uncovered fundamental architectural insights that challenge common assumptions about Apple's AI hardware. The research garnered significant attention on Hacker News, reflecting the AI community's deep interest in understanding these increasingly important chips.
A Graph Execution Engine, Not a Traditional Processor
The most significant finding: the M4 ANE is not a traditional GPU or CPU. It's a graph execution engine — rather than processing individual instructions, it accepts pre-compiled neural network graphs and executes them atomically. The system features 16 cores, a queue depth supporting 127 simultaneous evaluation requests, independent dynamic voltage/frequency scaling, and power gating that reduces consumption to zero when idle.
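The execution model described above can be sketched as a toy queue: clients submit whole pre-compiled graphs, each of which runs to completion atomically, with a bounded number of pending requests. This is a conceptual illustration only; the class and field names are invented for the sketch, and the real hardware interface is undocumented.

```python
from collections import deque

QUEUE_DEPTH = 127  # from the article: up to 127 simultaneous evaluation requests


class GraphEngine:
    """Toy model of a graph execution engine: whole pre-compiled graphs
    are queued and executed atomically, never individual instructions."""

    def __init__(self, depth=QUEUE_DEPTH):
        self.queue = deque()
        self.depth = depth

    def submit(self, graph):
        # The hardware bounds how many evaluations may be in flight.
        if len(self.queue) >= self.depth:
            raise RuntimeError("evaluation queue full")
        self.queue.append(graph)

    def run_one(self):
        # A graph here is just an input plus an ordered list of ops;
        # the whole thing runs to completion in one call.
        graph = self.queue.popleft()
        x = graph["input"]
        for op in graph["ops"]:
            x = op(x)
        return x


engine = GraphEngine()
engine.submit({"input": 3, "ops": [lambda v: v * 2, lambda v: v + 1]})
result = engine.run_one()  # → 7
```

The key contrast with a CPU or GPU is that the unit of submission is the entire graph, which is why per-instruction control flow is absent from the programming model.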
Hidden APIs Bypassing CoreML
A major breakthrough was discovering that CoreML is not the only access path to the ANE. The private _ANEClient class in AppleNeuralEngine.framework provides direct compilation, loading, and evaluation capabilities. Researchers identified over 40 undocumented private classes and implemented in-memory compilation using _ANEInMemoryModelDescriptor, which accepts MIL (Machine Learning Intermediate Language) text directly without filesystem round-trips — critical for training applications.
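The private class names above come from the article, but their real method signatures are undocumented, so they cannot be shown faithfully here. The pure-Python toy below only illustrates the difference between the two compilation paths: a file-based round-trip versus handing MIL text over in memory, which is what makes the in-memory path attractive in a training loop. `fake_compile` is a stand-in, not the real compiler.

```python
import os
import tempfile

def fake_compile(mil_text: str) -> bytes:
    """Stand-in for the ANE compiler; the real output would be an E5 binary."""
    return mil_text.encode()

def compile_via_file(mil_text: str) -> bytes:
    # File-based path: MIL text is written to disk, then read back
    # before compilation — one filesystem round-trip per model.
    with tempfile.NamedTemporaryFile("w", suffix=".mil", delete=False) as f:
        f.write(mil_text)
        path = f.name
    try:
        with open(path) as f:
            return fake_compile(f.read())
    finally:
        os.unlink(path)

def compile_in_memory(mil_text: str) -> bytes:
    # _ANEInMemoryModelDescriptor-style path: MIL text is consumed
    # directly, with no filesystem round-trip. In a training loop that
    # recompiles models frequently, this per-iteration cost matters.
    return fake_compile(mil_text)
```

Both paths produce the same result; the in-memory path simply removes the per-compilation I/O that the article identifies as critical for training workloads.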
Apple's '38 TOPS' Claim Is Misleading
Testing revealed that Apple's published 38 TOPS specification is misleading: expressing matrix multiplication as 1x1 convolution achieves significantly higher throughput than native matmul operations, which suggests convolution is the ANE's primary compute primitive. The E5 binary format held another surprise: the compiled output describes parameterized configurations of compute primitives rather than traditional machine code.
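The matmul-as-convolution trick works because the two operations are mathematically identical: an (N, K) x (K, M) matrix product is exactly a 1x1 convolution over an N-by-1 "image" with K input channels and M output filters. A NumPy sketch verifies the equivalence (the tensor layouts here are illustrative; the ANE's internal layout is not public):

```python
import numpy as np

def conv1x1(x, w):
    """Naive 1x1 convolution. x: (C_in, H, W), w: (C_out, C_in, 1, 1)."""
    c_out, c_in = w.shape[:2]
    _, h, wd = x.shape
    y = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for c in range(c_in):
            # A 1x1 kernel mixes channels at each pixel independently —
            # exactly the inner product structure of a matmul.
            y[o] += w[o, c, 0, 0] * x[c]
    return y

rng = np.random.default_rng(0)
N, K, M = 5, 8, 3
A = rng.standard_normal((N, K))
B = rng.standard_normal((K, M))

# Recast A @ B: A becomes a K-channel (N x 1) image, B becomes M filters.
x = A.T.reshape(K, N, 1)
w = B.T.reshape(M, K, 1, 1)
assert np.allclose(conv1x1(x, w), (A @ B).T.reshape(M, N, 1))
```

Since the recast is exact, a convolution-first engine can serve matmul workloads at full rate, which is consistent with the throughput asymmetry the researchers observed.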
Unexplored Territory
Several discovered classes hint at untapped capabilities including model chaining support, GPU-ANE synchronization primitives, and potentially accessible hardware performance counters — promising areas for future investigation.