NVIDIA GPU Development Environment: A Complete Setup Guide
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. It lets developers use NVIDIA GPUs for general-purpose computation and underpins deep learning, scientific computing, and large-scale data processing.
Before installing, confirm that the component versions are compatible with one another:
| CUDA Version | Minimum Driver Version | Supported GPU Architectures | Recommended Use |
|---|---|---|---|
| CUDA 12.x | >= 525.60.13 | Ampere, Hopper, Ada | First choice for new projects |
| CUDA 11.8 | >= 450.80.02 | Turing, Ampere, Hopper | Stable release |
| CUDA 11.3 | >= 450.80.02 | Pascal, Turing, Ampere | Legacy compatibility |
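The minimum-driver column above can also be checked programmatically, which is handy in provisioning scripts. A minimal sketch in Python; the version map is hard-coded from the table above, and the helper names are illustrative:

```python
# Minimum driver versions per CUDA series, copied from the table above.
MIN_DRIVER = {
    "12": "525.60.13",
    "11.8": "450.80.02",
    "11.3": "450.80.02",
}

def version_tuple(v: str) -> tuple:
    """Split a dotted version string like '535.104.05' into integers
    so versions compare numerically rather than lexically."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(cuda_series: str, driver_version: str) -> bool:
    """Return True if driver_version meets the table minimum for cuda_series."""
    minimum = MIN_DRIVER[cuda_series]
    return version_tuple(driver_version) >= version_tuple(minimum)
```

Numeric comparison matters here: string comparison would rank `"99.0"` above `"525.60.13"`.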
# Check the GPU model
lspci | grep -i nvidia
# Check the NVIDIA driver status
nvidia-smi
# Show detailed GPU information
nvidia-smi -q
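For automation, `nvidia-smi` also offers machine-readable output (e.g. `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`). A minimal Python sketch that parses that CSV form; the sample line mirrors the assumed output format rather than a live query:

```python
import csv
import io

def parse_gpu_query(output: str) -> list:
    """Parse `nvidia-smi --query-gpu=name,driver_version,memory.total
    --format=csv,noheader` output into a list of dicts, one per GPU."""
    rows = []
    for fields in csv.reader(io.StringIO(output), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, driver, memory = fields
        rows.append({"name": name, "driver": driver, "memory": memory})
    return rows

# Sample line in the format nvidia-smi emits (one line per GPU):
sample = "NVIDIA A100 80GB PCIe, 535.104.05, 81920 MiB\n"
```

In a real script the `output` string would come from running `nvidia-smi` via `subprocess.run(..., capture_output=True)`.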
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y build-essential dkms
# CentOS/RHEL
sudo yum groupinstall -y "Development Tools"
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# Create the blacklist file
echo "blacklist nouveau
options nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
# Update the initramfs
sudo update-initramfs -u
# Reboot
sudo reboot
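After the reboot, `lsmod | grep nouveau` should print nothing. The same check can be done from Python by inspecting `/proc/modules`-style text; a small sketch (the function name and the inline module lines are illustrative):

```python
def nouveau_loaded(proc_modules_text: str) -> bool:
    """Return True if the nouveau module appears in /proc/modules content.
    Each /proc/modules line starts with the module name."""
    return any(
        line.split(maxsplit=1)[0] == "nouveau"
        for line in proc_modules_text.splitlines()
        if line.strip()
    )
```

On a live system you would pass `open("/proc/modules").read()` to this function.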
# Ubuntu - add the graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
ubuntu-drivers devices          # list recommended drivers
sudo ubuntu-drivers autoinstall # install the recommended driver automatically
# Or install a specific version
sudo apt-get install -y nvidia-driver-535
# Download the driver (using 535.104.05 as an example)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run
# Switch to text mode for the installation
sudo systemctl isolate multi-user.target
sudo chmod +x NVIDIA-Linux-x86_64-535.104.05.run
sudo ./NVIDIA-Linux-x86_64-535.104.05.run
# Return to the graphical session
sudo systemctl isolate graphical.target
# Check the driver version
nvidia-smi
# Expected output (example)
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB On | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 45W / 300W | 0MiB / 81920MiB | 0% Default |
+-----------------------------------------+----------------------+----------------------+
# Download CUDA Toolkit 12.2
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
# Run the installer
sudo sh cuda_12.2.0_535.54.03_linux.run
# Installer options:
# - Deselect "Install NVIDIA Driver" (the driver was installed separately above)
# - Keep the other defaults
# Add the NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install the CUDA Toolkit
sudo apt-get install -y cuda-toolkit-12-2
# After installation, add the environment variables
echo 'export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc
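A quick way to confirm the two exports took effect is to check that the CUDA directories actually appear in `PATH` and `LD_LIBRARY_PATH`. A minimal Python sketch that inspects an environment mapping (the function name is illustrative; pass `os.environ` on a real system):

```python
import os

def cuda_on_path(env: dict, cuda_home: str = "/usr/local/cuda-12.2") -> dict:
    """Check whether the CUDA bin and lib64 directories appear in the
    PATH and LD_LIBRARY_PATH variables of the given environment mapping."""
    path_dirs = env.get("PATH", "").split(os.pathsep)
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return {
        "bin": os.path.join(cuda_home, "bin") in path_dirs,
        "lib64": os.path.join(cuda_home, "lib64") in lib_dirs,
    }
```

Checking whole path components (rather than substrings) avoids false positives from entries like `/usr/local/cuda-12.2-backup/bin`.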
# Check the CUDA version
nvcc --version
# Compile and run a sample program
# (since CUDA 11.6 the samples are no longer bundled with the toolkit;
#  clone them from GitHub instead)
git clone https://github.com/NVIDIA/cuda-samples.git ~/cuda-samples
cd ~/cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery
# The end of the output should read: Result = PASS
Visit the NVIDIA cuDNN download page and download the cuDNN build that matches your CUDA version.
# Extract the downloaded archive
tar -xvf cudnn-linux-x86_64-8.9.5.30_cuda12-archive.tar.xz
# Copy the files into the CUDA directory
sudo cp cudnn-linux-x86_64-8.9.5.30_cuda12-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp cudnn-linux-x86_64-8.9.5.30_cuda12-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
# Verify the installation
grep CUDNN_MAJOR -A 2 /usr/local/cuda/include/cudnn_version.h
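The `grep` above prints raw `#define` lines; a setup script may want the version as a single string instead. A small Python sketch that extracts it from the header text (the inline `sample` mimics the relevant lines of `cudnn_version.h`):

```python
import re

def cudnn_version(header_text: str) -> str:
    """Extract 'major.minor.patch' from cudnn_version.h contents."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(rf"#define\s+{key}\s+(\d+)", header_text)
        if m is None:
            raise ValueError(f"{key} not found in header")
        parts.append(m.group(1))
    return ".".join(parts)

# Sample of the relevant defines in cudnn_version.h:
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
"""
```

On an installed system, pass `open("/usr/local/cuda/include/cudnn_version.h").read()`.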
# Download Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Create a virtual environment
conda create -n pytorch python=3.10
conda activate pytorch
# CUDA 12.1 build
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Or with pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify that PyTorch can see the GPU
python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
# TensorFlow 2.x with GPU support
pip install tensorflow
# Verify TensorFlow sees the GPU
python -c "import tensorflow as tf; print(f'TF version: {tf.__version__}'); print(f'GPU list: {tf.config.list_physical_devices(\"GPU\")}')"
# Add the NVIDIA Container Toolkit repository
# (the old nvidia-docker repository and apt-key are both deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test a GPU container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Run a PyTorch container
docker run --gpus all -it --rm \
-v $(pwd):/workspace \
pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
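When GPU containers are launched from a wrapper script, it helps to build the `docker run` argument list programmatically instead of splicing strings. A minimal sketch mirroring the command above (the function name is illustrative):

```python
def gpu_docker_cmd(image: str, workdir: str, gpus: str = "all") -> list:
    """Assemble the argument list for a GPU-enabled `docker run`:
    interactive, auto-removed, with workdir mounted at /workspace."""
    return [
        "docker", "run",
        "--gpus", gpus,
        "-it", "--rm",
        "-v", f"{workdir}:/workspace",
        image,
    ]
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell-quoting pitfalls with paths containing spaces.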
| Problem | Cause | Fix |
|---|---|---|
| Nouveau driver conflict | Open-source driver not disabled | Disable it per the steps in section 4.1 and reboot |
| Kernel version mismatch | kernel-devel does not match the running kernel | Install the matching kernel development packages |
| Blocked by Secure Boot | UEFI Secure Boot is enabled | Disable Secure Boot in the BIOS/UEFI settings |
# Problem: CUDA out of memory
# Fix: limit allocator fragmentation via an environment variable
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Problem: CUDA version mismatch
# Fix: check LD_LIBRARY_PATH and which CUDA libraries the loader resolves
ldconfig -p | grep cuda
# Problem: switching between multiple installed CUDA versions
# Fix: use update-alternatives
sudo update-alternatives --config cuda
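A common variant of this mismatch is a framework built against one CUDA series while another is on the library path. A quick sketch comparing the version a framework reports (e.g. the `torch.version.cuda` string, like `'12.1'`) with the release number from `nvcc --version`; the major-version rule below assumes the minor-version compatibility NVIDIA provides within the 11.x and 12.x series:

```python
def cuda_versions_match(framework_cuda: str, nvcc_release: str) -> bool:
    """Compare the CUDA version a framework was built against (e.g.
    torch.version.cuda, '12.1') with the toolkit release from nvcc
    ('12.2'). Within a CUDA major series, matching the major version
    is usually sufficient thanks to minor-version compatibility."""
    return framework_cuda.split(".")[0] == nvcc_release.split(".")[0]
```

If the majors differ, reinstall the framework wheel built for your toolkit (e.g. the `cu121` index shown earlier) or adjust `LD_LIBRARY_PATH`.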
sudo nvidia-smi -pm 1   # enable persistence mode
sudo nvidia-smi -pl 250 # set a 250 W power limit