Google Cloud A4X Max scales AI clusters to 50,000 GPUs with double the network bandwidth

What the tweet revealed

Google Cloud Tech put hard numbers on the scale of its new AI infrastructure. The key sentence is: "A4X Max bare-metal instances support clusters of up to 50,000 GPUs with double the network bandwidth." This matters because frontier model training and high-throughput inference are not determined by GPU count alone. In practice, bottlenecks often arise in the network fabric, placement, quota, the storage path, and the ability to keep data flowing to thousands of accelerators.

The Google Cloud Tech account is Google Cloud's developer-facing channel, covering mainly how-tos, demos, product updates, and technical docs. This post does not stop at a short social claim either: it links to the Compute Engine documentation for the A4X Max and A4X machine series. That makes the tweet less a marketing slogan and more an item whose infrastructure spec is worth following.

Background from the docs

The linked documentation places A4X Max and A4X in the accelerator-optimized family for GPU-accelerated AI, ML, and HPC workloads. According to the Google Cloud docs, A4X Max is an exascale platform built on the NVIDIA GB300 Ultra Superchip and B300 GPUs, while A4X uses the GB200 Superchip and B200 GPUs. Both series are based on the NVIDIA NVL72 rack-scale architecture. The docs describe a single NVL72 domain as 18 instances and 72 GPUs, with 1,800 GBps of bidirectional NVLink bandwidth per GPU.

The A4X Max section targets foundation model training and serving directly. The docs present the a4x-maxgpu-4g-metal bare-metal machine type and state that it has 4 B300 GPUs attached. They also say A4X Max provides up to 20 TB of total GPU memory per NVL72 domain, or about 279 GB per GPU. That combination is a meaningful signal for teams comparing large-context models, mixture-of-experts routing, multimodal training, and dense inference fleets on cloud clusters.
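The figures quoted from the docs can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the constant and function names are hypothetical, the inputs are the numbers cited above, and it assumes (the docs cited here do not confirm this) that a cluster at the 50,000 GPU ceiling would be assembled from whole NVL72 domains.

```python
import math

# Figures as quoted in this article from the Google Cloud docs.
GPUS_PER_INSTANCE = 4         # a4x-maxgpu-4g-metal
INSTANCES_PER_DOMAIN = 18     # one NVL72 domain
MEMORY_GB_PER_GPU = 279       # approximate, per the docs
CLUSTER_GPU_CEILING = 50_000  # from the tweet


def domain_summary():
    """GPU count and approximate GPU memory for one NVL72 domain."""
    gpus = GPUS_PER_INSTANCE * INSTANCES_PER_DOMAIN
    return {
        "gpus_per_domain": gpus,
        # Decimal TB (1 TB = 1000 GB); an assumption about the docs' units.
        "memory_tb_per_domain": gpus * MEMORY_GB_PER_GPU / 1000,
    }


def cluster_summary(total_gpus=CLUSTER_GPU_CEILING):
    """How a GPU count splits into instances and NVL72 domains.

    Assumption: clusters are built from whole NVL72 domains.
    """
    gpus_per_domain = GPUS_PER_INSTANCE * INSTANCES_PER_DOMAIN
    return {
        "instances": math.ceil(total_gpus / GPUS_PER_INSTANCE),
        "full_domains": total_gpus // gpus_per_domain,
        "leftover_gpus": total_gpus % gpus_per_domain,
    }


if __name__ == "__main__":
    print(domain_summary())
    print(cluster_summary())
```

Under these assumptions, 18 instances of 4 GPUs each give the 72 GPUs of one domain, and 72 GPUs at roughly 279 GB each land close to the quoted 20 TB. The 50,000 figure does not divide evenly by 72, which suggests it is a rounded ceiling rather than an exact domain count.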

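The constraints matter as much as the headline number, though. The docs table shows that A4X Max and A4X are not offered as ordinary on-demand, Spot, or Flex-start resources; they are provided through the Future Reservations path in AI Hypercomputer. In other words, the 50,000 GPU figure is less about capacity you can grab instantly through self-service and more about reserved infrastructure for customers who plan large runs in advance.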

The next things to watch are per-region availability, reservation lead time, pricing, and how much of the 50,000 GPU ceiling a single job can actually use. Reliability data on large domains, NCCL behavior, and integration with GKE or Vertex AI will determine whether training throughput is repeatable.

Sources: Google Cloud Tech source tweet · Google Cloud A4X Max docs
