llama.cpp : Install (2024/02/22)
Install [llama.cpp], an inference interface for Meta's Llama (Large Language Model Meta AI) models.
|
[1] | Install the CUDA Toolkit (v12.3 in this example) and the Visual Studio 2022 Build Tools beforehand; the build steps below rely on both.
[2] | Install Python 3 beforehand; it is used later to convert the downloaded model to GGUF format.
[3] | Download and install cuDNN (CUDA Deep Neural Network library).
⇒ https://docs.nvidia.com/deeplearning/cudnn/reference/support-matrix.html
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn_9.0.0_windows.exe -OutFile "cudnn_9.0.0_windows.exe"

# install in silent mode
PS C:\Users\Administrator> ./cudnn_9.0.0_windows.exe -s

# the installer processes start
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*"

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    329      19     3144      15580      67.41   1252   0 cudnn_9.0.0_windows
    425      22    10604      24400       2.31   2136   0 setup

# installation is complete once the processes above have exited
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*"

# additionally download jq.exe, which makes JSON output easier to read
PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-windows-amd64.exe -OutFile "C:\WINDOWS\system32\jq.exe"
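As an alternative to polling Get-Process until the installer exits, the silent install can also be launched so the prompt only returns when installation has finished. The snippet below is a minimal sketch, not part of the original procedure; it reuses the installer file downloaded above and the same -s silent switch.

# optional sketch: run the cuDNN installer silently and block until it exits
PS C:\Users\Administrator> Start-Process -FilePath ".\cudnn_9.0.0_windows.exe" -ArgumentList "-s" -Wait
# when Start-Process returns, the silent installation has completed
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*" -ErrorAction SilentlyContinue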
[4] | Install [llama.cpp].
# copy the Visual Studio integration extensions
PS C:\Users\Administrator> Copy-Item "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\"

# add cmake to the PATH
PS C:\Users\Administrator> $currentPath = [Environment]::GetEnvironmentVariable("Path", "Machine")
PS C:\Users\Administrator> $currentPath += ";C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin"
PS C:\Users\Administrator> [Environment]::SetEnvironmentVariable("Path", $currentPath, "Machine")
PS C:\Users\Administrator> $env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User")

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip -OutFile "llama.cpp-master.zip"

# build
PS C:\Users\Administrator> Expand-Archive -Path ./llama.cpp-master.zip
PS C:\Users\Administrator> cd llama.cpp-master/llama.cpp-master
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake ./ -DLLAMA_CUBLAS=ON
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.20348.
-- The C compiler identification is MSVC 19.39.33519.0
-- The CXX compiler identification is MSVC 19.39.33519.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
CMake Warning at scripts/build-info.cmake:14 (message):
  Git not found. Build info will not be accurate.
Call Stack (most recent call first):
  CMakeLists.txt:129 (include)
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/include (found version "12.3.107")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
CMake Warning at common/CMakeLists.txt:24 (message):
  Git repository not found; to enable automatic generation of build info,
  make sure Git is installed and the project is a Git repository.
-- Configuring done (82.2s)
-- Generating done (1.2s)
-- Build files have been written to: C:/Users/Administrator/llama.cpp-master/llama.cpp-master

PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake --build ./ --config Release
.....
.....
  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/pocs/vdot/CMakeLists.txt
  vdot.cpp
  vdot.vcxproj -> C:\Users\Administrator\llama.cpp-master\llama.cpp-master\bin\Release\vdot.exe
  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/CMakeLists.txt
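Once the build finishes, the binaries are written under bin\Release in the build directory, as the vdot.exe line above shows. A quick way to confirm that server.exe, used in the next step, was produced is to list that directory; this check is only a suggestion and not part of the original procedure.

# list the executables produced by the Release build (server.exe is used in step [5])
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Get-ChildItem .\bin\Release\*.exe | Select-Object Name, Length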
[5] | Download a GGML-format model, convert it to GGUF format, and start the [llama.cpp] server.
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main
PS C:\Users\Administrator> cd ~\llama.cpp-master\llama.cpp-master
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Invoke-WebRequest -Uri https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q8_0.bin?download=true -OutFile "llama-2-13b-chat.ggmlv3.q8_0.bin"

# convert to GGUF
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> pip install numpy
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> python ./convert-llama-ggml-to-gguf.py --input ./llama-2-13b-chat.ggmlv3.q8_0.bin --output ./llama-2-13b-chat.ggmlv3.q8_0.gguf

# add a firewall rule
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> New-NetFirewallRule `
-Name "Llama-cpp Server Port" `
-DisplayName "Llama-cpp Server Port" `
-Description 'Allow Llama-cpp Server Port' `
-Profile Any `
-Direction Inbound `
-Action Allow `
-Protocol TCP `
-Program Any `
-LocalAddress Any `
-LocalPort 8000

# [--n-gpu-layers] : number of layers to offload to the GPU
# -- specify [-1] if unsure
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> ./bin/Release/server.exe --model ./llama-2-13b-chat.ggmlv3.q8_0.gguf --n-gpu-layers -1 --host 0.0.0.0 --port 8000
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
{"timestamp":1708579945,"level":"INFO","function":"main","line":2574,"message":"build info","build":0,"commit":"unknown"}
{"timestamp":1708579945,"level":"INFO","function":"main","line":2581,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://0.0.0.0:8000

{"timestamp":1708579945,"level":"INFO","function":"main","line":2731,"message":"HTTP server listening","hostname":"0.0.0.0","port":"8000"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.ggmlv3.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.name str = llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_loader: - kv   2: general.description str = converted from legacy GGJTv3 MOSTLY_Q...
llama_model_loader: - kv   3: general.file_type u32 = 7
llama_model_loader: - kv   4: llama.context_length u32 = 2048
llama_model_loader: - kv   5: llama.embedding_length u32 = 5120
llama_model_loader: - kv   6: llama.block_count u32 = 40
llama_model_loader: - kv   7: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv   8: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv   9: llama.attention.head_count u32 = 40
llama_model_loader: - kv  10: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv  11: llama.attention.layer_norm_rms_epsilon f32 = 0.000005
llama_model_loader: - kv  12: tokenizer.ggml.model str = llama
llama_model_loader: - kv  13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv  17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv  18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 5.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = llama-2-13b-chat.ggmlv3.q8_0.bin
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors: CPU buffer size = 13189.86 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 12.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1708579973,"level":"INFO","function":"main","line":2752,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
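Note that the log above reports [offloaded 0/41 layers to GPU], so in this run the model is kept entirely in system memory. If GPU offload is desired, specifying an explicit layer count that fits in VRAM and watching GPU memory usage from a second PowerShell window is a quick sanity check; the command below is only a suggested verification, not part of the original procedure.

# check GPU memory usage while the server is running (run in a separate window)
PS C:\Users\Administrator> nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv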
[6] | Verify operation by submitting a simple question. Response time and content will vary depending on the question and the model in use. For reference, this example runs on a machine with 8 vCPUs, 16 GB of memory, and a GeForce RTX 3060 (12 GB).
PS C:\Users\Administrator> curl.exe -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions `
-d '{\"messages\": [{\"role\": \"user\", \"content\": \"Who r you?\"}]}' | jq.exe
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "I'm just an AI, I don't have a name. I'm here to help answer questions and provide information. What can I assist you with today?",
"role": "assistant"
}
}
],
"created": 1708581594,
"id": "chatcmpl-q3So7lVzkRuWaqZ0yxnaeVf9oNXVwgyo",
"model": "unknown",
"object": "chat.completion",
"usage": {
"completion_tokens": 43,
"prompt_tokens": 32,
"total_tokens": 75
}
}
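The same OpenAI-compatible endpoint can also be called natively from PowerShell without curl.exe or jq.exe. The snippet below is a minimal sketch: it builds the request body as a hashtable, lets Invoke-RestMethod handle the JSON, and prints only the generated reply.

# minimal sketch: call the chat completions endpoint with Invoke-RestMethod
PS C:\Users\Administrator> $body = @{ messages = @(@{ role = "user"; content = "Who are you?" }) } | ConvertTo-Json -Depth 5
PS C:\Users\Administrator> $res = Invoke-RestMethod -Method Post -Uri "http://localhost:8000/v1/chat/completions" -ContentType "application/json" -Body $body
# show only the assistant's reply text
PS C:\Users\Administrator> $res.choices[0].message.content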