NVIDIA Nemotron-Personas-Korea: Localizing Agents with 7M Synthetic Users

NVIDIA's new Korean dataset shows that agent localization is far harder than translating English prompts into Korean. Nemotron-Personas-Korea, published on Hugging Face on April 21, provides structured synthetic Korean personas that can be used for training, evaluation, and system-prompt grounding.

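The source post points out that most of today's AI agents are trained mainly on English web data, so they tend to miss Korean honorifics, regional occupation patterns, and local institutional context. This matters in high-stakes workflows. An assistant that has to work with the Korean public health system but applies U.S. healthcare scheduling conventions, or addresses a 60-year-old patient in banmal (informal speech), is not merely awkward; it is not production-ready.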

The dataset table lists 7 million total personas: 1 million records, each carrying 7 persona variants. There are 26 fields, covering persona fields, attributes, demographic and geographic context, and a unique identifier. Coverage spans Korea's 17 provinces and metropolitan cities (si/do) and 25 districts. NVIDIA also lists about 209,000 unique names, 118 surnames, about 21,400 given names, and more than 2,000 occupation categories reflecting sectors such as technology, manufacturing, and the public sector. The license is CC BY 4.0.
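The record-to-persona arithmetic can be illustrated with a small sketch. The column names used here (`uuid`, `age`, `persona_1` through `persona_7`) are hypothetical placeholders, not the dataset's actual schema; the sketch only assumes that each record carries seven persona variants alongside shared attributes.

```python
from typing import Iterator

VARIANTS_PER_RECORD = 7    # from the dataset table: 7 persona variants per record
TOTAL_RECORDS = 1_000_000  # from the dataset table: 1 million records

# Hypothetical column names, used only for illustration.
PERSONA_FIELDS = [f"persona_{i}" for i in range(1, VARIANTS_PER_RECORD + 1)]


def expand_record(record: dict) -> Iterator[dict]:
    """Yield one row per persona variant, copying the shared attributes."""
    shared = {k: v for k, v in record.items() if k not in PERSONA_FIELDS}
    for field in PERSONA_FIELDS:
        yield {**shared, "variant": field, "text": record[field]}


# A made-up record; the values are placeholders, not real dataset content.
example = {
    "uuid": "example-0001",
    "age": 60,
    **{field: f"text for {field}" for field in PERSONA_FIELDS},
}

rows = list(expand_record(example))
total_personas = TOTAL_RECORDS * VARIANTS_PER_RECORD
```

Expanding one record this way gives seven rows that share the same identifier and attributes, which is how 1 million records add up to 7 million personas.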

The technically important part is how the data was generated. NVIDIA says Nemotron-Personas-Korea was built with NeMo Data Designer, an open-source compound AI system for synthetic data. The pipeline combines an Apache-2.0 probabilistic graphical model for statistical grounding with Gemma-4-31B for Korean-language narrative generation. Population data comes from KOSIS releases covering 2020-2026, and name distributions come from Supreme Court of Korea records. NAVER Cloud contributed seed data and domain expertise during the design phase.
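The two-stage idea (sample attributes from a statistical model, then hand them to a language model for narrative text) can be sketched in a few lines. This is not NeMo Data Designer's API and not NVIDIA's pipeline: the probability tables are invented toy numbers, the function names are hypothetical, and the narrative step stops at building a prompt instead of calling a model.

```python
import random

# Toy marginal and conditional tables. All numbers are invented for
# illustration; they do not come from KOSIS or from the dataset.
REGION_PROBS = {"Seoul": 0.5, "Busan": 0.3, "Jeju": 0.2}
OCCUPATION_GIVEN_REGION = {
    "Seoul": {"software engineer": 0.6, "civil servant": 0.4},
    "Busan": {"port logistics worker": 0.7, "civil servant": 0.3},
    "Jeju": {"tourism guide": 0.8, "civil servant": 0.2},
}


def sample_from(table: dict, rng: random.Random) -> str:
    """Draw one key from a {value: probability} table."""
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_attributes(rng: random.Random) -> dict:
    """Stage 1: sample a region, then an occupation conditioned on it."""
    region = sample_from(REGION_PROBS, rng)
    occupation = sample_from(OCCUPATION_GIVEN_REGION[region], rng)
    age = rng.randint(19, 89)  # uniform only to keep the sketch short
    return {"region": region, "occupation": occupation, "age": age}


def build_narrative_prompt(attrs: dict) -> str:
    """Stage 2 input: the request a narrative model would be asked to expand."""
    return (
        "Write a short Korean-language persona for a "
        f"{attrs['age']}-year-old {attrs['occupation']} living in {attrs['region']}."
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
attrs = sample_attributes(rng)
prompt = build_narrative_prompt(attrs)
```

The point of the split is that the sampled attributes stay tied to the statistical tables, while the language model only controls how the persona is worded.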

There is also something agent builders can use right away. Developers can filter personas by occupation, region, age, and life stage, then insert the selected persona into a system prompt to steer agent behavior. NVIDIA's example turns a Korean public-health persona into an assistant that uses formal Korean, follows local public-health policy, and understands the context of a 보건소 (public health center) instead of a generic clinic.
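The filter-then-ground workflow might look like the following minimal sketch. The persona rows, their field names (`occupation`, `province`, `age`, `persona`), and the prompt wording are all assumptions made for illustration; they are not taken from the dataset or from NVIDIA's example code.

```python
from typing import Optional

# Made-up persona rows with hypothetical field names.
PERSONAS = [
    {"occupation": "public health nurse", "province": "Gyeonggi-do", "age": 41,
     "persona": "Works at a community health center and handles vaccination scheduling."},
    {"occupation": "software engineer", "province": "Seoul", "age": 29,
     "persona": "Builds backend services at a startup."},
    {"occupation": "public health officer", "province": "Busan", "age": 55,
     "persona": "Coordinates local health campaigns."},
]


def filter_personas(
    personas: list,
    occupation_keyword: Optional[str] = None,
    province: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
) -> list:
    """Keep personas that satisfy every filter that was supplied."""
    selected = []
    for p in personas:
        if occupation_keyword is not None and occupation_keyword not in p["occupation"]:
            continue
        if province is not None and p["province"] != province:
            continue
        if min_age is not None and p["age"] < min_age:
            continue
        if max_age is not None and p["age"] > max_age:
            continue
        selected.append(p)
    return selected


def build_system_prompt(persona: dict) -> str:
    """Embed one persona in a system prompt with the grounding rules."""
    return (
        "You are an assistant for the Korean public health system.\n"
        f"Persona: {persona['age']}-year-old {persona['occupation']} "
        f"in {persona['province']}. {persona['persona']}\n"
        "Always respond in formal Korean, follow local public-health policy, "
        "and refer to the 보건소 (public health center), not a generic clinic."
    )


matches = filter_personas(PERSONAS, occupation_keyword="public health", province="Gyeonggi-do")
system_prompt = build_system_prompt(matches[0])
```

Keeping the filter and the prompt builder separate makes it easy to run the same agent against several personas and compare its behavior.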

The bigger signal is that sovereign AI is no longer only a question of model weights; it is moving down into datasets and evaluation scaffolds. A model can produce fluent Korean and still miss how people actually live, work, and ask for help. Synthetic personas will not solve every problem, but a Korean corpus at the 7 million scale gives builders a concrete layer from which to start auditing, adapting, and comparing. Source: NVIDIA on Hugging Face.
