Windows 2022

llama.cpp : Install 2024/02/22

 

Install [llama.cpp], an interface for running Meta's Llama (Large Language Model Meta AI) models.

[1]

Install Python beforehand, referring to here.

[2]

Install CUDA beforehand, referring to here.
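Optionally, both prerequisites can be verified from PowerShell before continuing (a quick check; exact versions will differ per environment):

PS C:\Users\Administrator> python --version 
PS C:\Users\Administrator> nvcc --version 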

[3]

Download and install cuDNN (CUDA Deep Neural Network library).
⇒ https://developer.nvidia.com/rdp/cudnn-download

The support matrix for CUDA and cuDNN versions is documented at the following page.
⇒ https://docs.nvidia.com/deeplearning/cudnn/reference/support-matrix.html
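Optionally, the installed GPU and driver version can be displayed with [nvidia-smi] and compared against that matrix before choosing a cuDNN build:

PS C:\Users\Administrator> nvidia-smi 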
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn_9.0.0_windows.exe -OutFile "cudnn_9.0.0_windows.exe" 

# install in silent mode
PS C:\Users\Administrator> ./cudnn_9.0.0_windows.exe -s 

# the installer processes start
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*" 

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    329      19     3144      15580      67.41   1252   0 cudnn_9.0.0_windows
    425      22    10604      24400       2.31   2136   0 setup

# installation is complete once the processes above have exited
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*" 


# additionally, download jq.exe, which makes JSON data easier to read
PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-windows-amd64.exe -OutFile "C:\WINDOWS\system32\jq.exe" 
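
# (optional) confirm jq.exe is on PATH and runs
PS C:\Users\Administrator> jq.exe --version 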
[4] Install [llama.cpp].
# copy the CUDA MSBuild extensions into Visual Studio Build Tools
PS C:\Users\Administrator> Copy-Item "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\" 

# add cmake to the PATH
PS C:\Users\Administrator> $currentPath = [Environment]::GetEnvironmentVariable("Path", "Machine") 
PS C:\Users\Administrator> $currentPath += ";C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin" 
PS C:\Users\Administrator> [Environment]::SetEnvironmentVariable("Path", $currentPath, "Machine") 
PS C:\Users\Administrator> $env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User") 
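
# (optional) confirm cmake is now resolvable in this session
PS C:\Users\Administrator> cmake --version 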

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip -OutFile "llama.cpp-master.zip" 

# build
PS C:\Users\Administrator> Expand-Archive -Path ./llama.cpp-master.zip 
PS C:\Users\Administrator> cd llama.cpp-master/llama.cpp-master 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake ./ -DLLAMA_CUBLAS=ON 
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.20348.
-- The C compiler identification is MSVC 19.39.33519.0
-- The CXX compiler identification is MSVC 19.39.33519.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
CMake Warning at scripts/build-info.cmake:14 (message):
  Git not found.  Build info will not be accurate.
Call Stack (most recent call first):
  CMakeLists.txt:129 (include)


-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/include (found version "12.3.107")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
CMake Warning at common/CMakeLists.txt:24 (message):
  Git repository not found; to enable automatic generation of build info,
  make sure Git is installed and the project is a Git repository.


-- Configuring done (82.2s)
-- Generating done (1.2s)
-- Build files have been written to: C:/Users/Administrator/llama.cpp-master/llama.cpp-master

PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake --build ./ --config Release 

.....
.....

  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/pocs/vdot/CMakeLists.txt
  vdot.cpp
  vdot.vcxproj -> C:\Users\Administrator\llama.cpp-master\llama.cpp-master\bin\Release\vdot.exe
  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/CMakeLists.txt
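If the build completes without errors, the binaries are placed under [bin\Release]; for example, the presence of [server.exe] used below can be checked with:

PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Get-ChildItem ./bin/Release/server.exe 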
[5]

Download a GGML-format model, convert it to GGUF format, and start [llama.cpp].
Models can be downloaded from the sites below. This example uses [llama-2-13b-chat.ggmlv3.q8_0.bin].

⇒ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main
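Alternatively, if the [huggingface_hub] Python package is installed, the same file could be fetched with the Hugging Face CLI instead of [Invoke-WebRequest] (a sketch only; not required for the steps below):

PS C:\Users\Administrator> pip install huggingface_hub 
PS C:\Users\Administrator> huggingface-cli download TheBloke/Llama-2-13B-chat-GGML llama-2-13b-chat.ggmlv3.q8_0.bin --local-dir . 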
PS C:\Users\Administrator> cd ~\llama.cpp-master\llama.cpp-master 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Invoke-WebRequest -Uri https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q8_0.bin?download=true -OutFile "llama-2-13b-chat.ggmlv3.q8_0.bin" 

# convert to GGUF
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> pip install numpy 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> python ./convert-llama-ggml-to-gguf.py --input ./llama-2-13b-chat.ggmlv3.q8_0.bin --output ./llama-2-13b-chat.ggmlv3.q8_0.gguf 
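
# (optional) confirm the converted GGUF file was written
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Get-Item ./llama-2-13b-chat.ggmlv3.q8_0.gguf 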

# add a firewall rule
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> New-NetFirewallRule `
-Name "Llama-cpp Server Port" `
-DisplayName "Llama-cpp Server Port" `
-Description 'Allow Llama-cpp Server Port' `
-Profile Any `
-Direction Inbound `
-Action Allow `
-Protocol TCP `
-Program Any `
-LocalAddress Any `
-LocalPort 8000 
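
# (optional) confirm the firewall rule was created
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Get-NetFirewallRule -Name "Llama-cpp Server Port" 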

# [--n-gpu-layers] : number of layers to offload to the GPU
# -- if unsure, specify [-1]
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> ./bin/Release/server.exe --model ./llama-2-13b-chat.ggmlv3.q8_0.gguf --n-gpu-layers -1 --host 0.0.0.0 --port 8000 
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
{"timestamp":1708579945,"level":"INFO","function":"main","line":2574,"message":"build info","build":0,"commit":"unknown"}
{"timestamp":1708579945,"level":"INFO","function":"main","line":2581,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://0.0.0.0:8000

{"timestamp":1708579945,"level":"INFO","function":"main","line":2731,"message":"HTTP server listening","hostname":"0.0.0.0","port":"8000"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.ggmlv3.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_loader: - kv   2:                        general.description str              = converted from legacy GGJTv3 MOSTLY_Q...
llama_model_loader: - kv   3:                          general.file_type u32              = 7
llama_model_loader: - kv   4:                       llama.context_length u32              = 2048
llama_model_loader: - kv   5:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   6:                          llama.block_count u32              = 40
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000005
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 5.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = llama-2-13b-chat.ggmlv3.q8_0.bin
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 13189.86 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1708579973,"level":"INFO","function":"main","line":2752,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
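From a second PowerShell session, it is possible to confirm that the server is listening on the configured port before sending any requests, for example:

PS C:\Users\Administrator> Test-NetConnection -ComputerName localhost -Port 8000 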
[6] Verify the server works by submitting a simple question.
Response time and content will vary depending on the question and the model in use.
For reference, this example runs on a machine with 8 vCPUs, 16 GB of memory, and a GeForce RTX 3060 (12 GB).
PS C:\Users\Administrator> curl.exe -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions `
-d '{\"messages\": [{\"role\": \"user\", \"content\": \"Who r you?\"}]}' | jq.exe 

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm just an AI, I don't have a name. I'm here to help answer questions and provide information. What can I assist you with today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1708581594,
  "id": "chatcmpl-q3So7lVzkRuWaqZ0yxnaeVf9oNXVwgyo",
  "model": "unknown",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 43,
    "prompt_tokens": 32,
    "total_tokens": 75
  }
}
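In addition to the OpenAI-compatible API used above, the server exposes llama.cpp's native [/completion] endpoint; a minimal example (field names per the llama.cpp server documentation; response content will vary by model):

PS C:\Users\Administrator> curl.exe -s -XPOST -H 'Content-Type: application/json' localhost:8000/completion `
-d '{\"prompt\": \"Building a website can be done in 10 simple steps:\", \"n_predict\": 64}' | jq.exe .content 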