r/MachineLearning: preflight, a PyTorch pre-training validator that catches label leakage and NaNs before training starts

An approach that checks the dataset and pipeline before burning GPU time

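On March 15, 2026, a post introducing preflight on r/MachineLearning had 56 points and 13 comments. The author wrote that a training run finished without crashing, yet the model learned nothing, and tracking down the cause took three days. The culprit was label leakage. The core of the post is the CLI the author built in response: a short check of the basic health of the dataset and model wiring, run before a long training job starts.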

According to the GitHub README, preflight is invoked in a form such as preflight run --dataloader my_dataloader.py. There are 10 checks in total, and severity comes in three levels: FATAL, WARN, and INFO. If even one FATAL check fails, the tool returns exit code 1 and blocks CI. The checks are nan_inf_detection, label_leakage, shape_mismatch, gradient_check, normalisation_sanity, channel_ordering, vram_estimation, class_imbalance, split_sizes, and duplicate_samples. The README example targets a pre-flight check of around 30 seconds, and the tool also provides JSON output and GitHub Actions integration.
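The severity gate described above can be sketched in a few lines of plain Python. This is a minimal illustration, not preflight's implementation: the Severity, CheckResult, run_checks, and exit_code names are hypothetical, the severity assigned to each check is an assumption, and train_val_overlap is an illustrative stand-in for one common source of leakage (identical samples appearing in both splits) rather than a description of what preflight's label_leakage check measures.

```python
import math
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple

# A sample is (features, label). Hypothetical in-memory format for this sketch.
Sample = Tuple[Sequence[float], int]


class Severity(Enum):
    FATAL = "FATAL"
    WARN = "WARN"
    INFO = "INFO"


@dataclass
class CheckResult:
    name: str
    severity: Severity
    passed: bool
    message: str


def check_nan_inf(train: Sequence[Sample]) -> CheckResult:
    # Count samples with at least one non-finite feature value.
    bad = sum(1 for feats, _ in train if not all(math.isfinite(v) for v in feats))
    return CheckResult("nan_inf_detection", Severity.FATAL, bad == 0,
                       f"{bad} sample(s) contain NaN or Inf")


def check_train_val_overlap(train: Sequence[Sample], val: Sequence[Sample]) -> CheckResult:
    # Assumption: treating identical feature vectors in both splits as leakage.
    train_keys = {tuple(feats) for feats, _ in train}
    overlap = sum(1 for feats, _ in val if tuple(feats) in train_keys)
    return CheckResult("train_val_overlap", Severity.FATAL, overlap == 0,
                       f"{overlap} validation sample(s) also appear in train")


def check_class_imbalance(train: Sequence[Sample], max_ratio: float = 10.0) -> CheckResult:
    # Assumption: WARN severity and a 10:1 default threshold are illustrative.
    counts = Counter(label for _, label in train)
    if not counts:
        return CheckResult("class_imbalance", Severity.WARN, False, "empty training set")
    ratio = max(counts.values()) / min(counts.values())
    return CheckResult("class_imbalance", Severity.WARN, ratio <= max_ratio,
                       f"largest/smallest class ratio is {ratio:.1f}")


def run_checks(train: Sequence[Sample], val: Sequence[Sample]) -> List[CheckResult]:
    return [
        check_nan_inf(train),
        check_train_val_overlap(train, val),
        check_class_imbalance(train),
    ]


def exit_code(results: Sequence[CheckResult]) -> int:
    # Only a failed FATAL check produces a non-zero exit code.
    failed_fatal = any(r.severity is Severity.FATAL and not r.passed for r in results)
    return 1 if failed_fatal else 0
```

In a CI step, a script built this way would typically end with sys.exit(exit_code(results)), so a non-zero status fails the job while WARN-level findings are only reported.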

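The project's positioning is also realistic. The author states clearly that it is not meant to replace pytest. Code logic tests belong to pytest, comprehensive ML validation platforms are the job of tools like Deepchecks, and experiment tracking is handled by WandB or MLflow. preflight aims at the gap between them: the most expensive failure zone, where the code runs but the model is broken. The stated goal of catching silent failures that never raise a Python exception, such as label leakage, NaNs, wrong channel ordering, and dead gradients, will resonate immediately with many practitioners.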
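To see why such failures stay silent, consider a dead ReLU. The sketch below is an illustration under stated assumptions, not preflight's gradient_check: it uses a hypothetical three-parameter toy model and central finite differences so that it runs without dependencies. In PyTorch, the analogous inspection would read each parameter's .grad after loss.backward().

```python
from typing import Callable, List, Sequence, Tuple

# Each data point is (input x, target t). Hypothetical toy data format.
Data = Sequence[Tuple[float, float]]


def relu(z: float) -> float:
    return max(0.0, z)


def loss(params: Sequence[float], data: Data) -> float:
    # Toy model (assumption for illustration): y = w2 * relu(w1 * x + b1),
    # trained with mean squared error.
    w1, b1, w2 = params
    return sum((w2 * relu(w1 * x + b1) - t) ** 2 for x, t in data) / len(data)


def numerical_grad(f: Callable[[Sequence[float]], float],
                   params: Sequence[float], h: float = 1e-5) -> List[float]:
    # Central finite differences, one parameter at a time.
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads


def dead_params(grads: Sequence[float], tol: float = 1e-12) -> List[int]:
    # Indices of parameters whose gradient is effectively zero.
    return [i for i, g in enumerate(grads) if abs(g) < tol]


data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]

# Healthy initialisation: the ReLU is active for every input.
healthy = [1.0, 0.0, 1.0]
# Broken initialisation: a large negative bias keeps the ReLU at zero,
# so the loss is constant and no parameter receives a gradient.
broken = [1.0, -10.0, 1.0]

healthy_dead = dead_params(numerical_grad(lambda p: loss(p, data), healthy))
broken_dead = dead_params(numerical_grad(lambda p: loss(p, data), broken))
```

Both configurations evaluate without raising, which is the point: the broken one simply reports every parameter as dead, and only an explicit check surfaces that before a long run.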

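Setup is not overly heavy either. Passing a model and loss in addition enables the shape, gradient, and VRAM checks, and .preflight.toml can be used to adjust thresholds or disable specific checks. The roadmap lists --fix automatic repair, drift detection, a dry-run mode, and plugin-style extensions. The tool is still an early v0.1.x release, but the reason the community responded is clear: in a long training run, the most expensive cost is often not compute but a silent bug that is discovered late.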

Primary source: preflight GitHub repository. Community discussion: r/MachineLearning.
