llama.cpp : Install
2024/02/22

	Install [llama.cpp], which provides an interface for running Meta's Llama (Large Language Model Meta AI) models. The example below uses a GPU.
[1]	Install Python 3, refer to here.
[2]	Install CUDA, refer to here.
[3]	Download and install cuDNN (CUDA Deep Neural Network library) from the official NVIDIA site.
⇒ https://developer.nvidia.com/rdp/cudnn-download
The CUDA and cuDNN support matrix is here.
⇒ https://docs.nvidia.com/deeplearning/cudnn/reference/support-matrix.html

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn_9.0.0_windows.exe -OutFile "cudnn_9.0.0_windows.exe" 

# install in silent mode
PS C:\Users\Administrator> ./cudnn_9.0.0_windows.exe -s 

# confirm that the installation processes are running
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*" 

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    329      19     3144      15580      67.41   1252   0 cudnn_9.0.0_windows
    425      22    10604      24400       2.31   2136   0 setup

# once the installation completes, the processes above no longer appear
PS C:\Users\Administrator> Get-Process -Name "cud*", "setup*" 


# additionally, download jq.exe to format JSON output for easy viewing
PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-windows-amd64.exe -OutFile "C:\WINDOWS\system32\jq.exe"

[4]	Install [llama.cpp].

# copy the CUDA MSBuild extensions into the Visual Studio Build Tools
PS C:\Users\Administrator> Copy-Item "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\" 

# add cmake to the system PATH and reload PATH in the current session
PS C:\Users\Administrator> $currentPath = [Environment]::GetEnvironmentVariable("Path", "Machine") 
PS C:\Users\Administrator> $currentPath += ";C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin" 
PS C:\Users\Administrator> [Environment]::SetEnvironmentVariable("Path", $currentPath, "Machine") 
PS C:\Users\Administrator> $env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User") 
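
Before starting the build, it may help to confirm that the tools used in the following steps are actually reachable through PATH. The sketch below uses Python (installed in step [1]); the list of tool names and the helper name missing_tools are assumptions made for illustration. Git is left out because, as the cmake output in this step shows, a missing Git only produces a warning.

```python
import shutil

# Assumption: these are the command names the later steps rely on.
REQUIRED_TOOLS = ["cmake", "python"]


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(REQUIRED_TOOLS)
print("missing:", missing if missing else "none")
```

If cmake is listed as missing, open a new PowerShell session or re-run the PATH commands above.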

PS C:\Users\Administrator> Invoke-WebRequest -Uri https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip -OutFile "llama.cpp-master.zip" 

# extract the archive and build
PS C:\Users\Administrator> Expand-Archive -Path ./llama.cpp-master.zip 
PS C:\Users\Administrator> cd llama.cpp-master/llama.cpp-master 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake ./ -DLLAMA_CUBLAS=ON 
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.20348.
-- The C compiler identification is MSVC 19.39.33519.0
-- The CXX compiler identification is MSVC 19.39.33519.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
CMake Warning at scripts/build-info.cmake:14 (message):
  Git not found.  Build info will not be accurate.
Call Stack (most recent call first):
  CMakeLists.txt:129 (include)


-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/include (found version "12.3.107")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
CMake Warning at common/CMakeLists.txt:24 (message):
  Git repository not found; to enable automatic generation of build info,
  make sure Git is installed and the project is a Git repository.


-- Configuring done (82.2s)
-- Generating done (1.2s)
-- Build files have been written to: C:/Users/Administrator/llama.cpp-master/llama.cpp-master

PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> cmake --build ./ --config Release 

.....
.....

  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/pocs/vdot/CMakeLists.txt
  vdot.cpp
  vdot.vcxproj -> C:\Users\Administrator\llama.cpp-master\llama.cpp-master\bin\Release\vdot.exe
  Building Custom Rule C:/Users/Administrator/llama.cpp-master/llama.cpp-master/CMakeLists.txt

[5]	Download a GGML format model and convert it to GGUF format.
Models can be downloaded from the following sites. This example uses [llama-2-13b-chat.ggmlv3.q8_0.bin].

⇒ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main
⇒ https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/tree/main

PS C:\Users\Administrator> cd ~\llama.cpp-master\llama.cpp-master 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> Invoke-WebRequest -Uri https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q8_0.bin?download=true -OutFile "llama-2-13b-chat.ggmlv3.q8_0.bin" 

# convert to GGUF format
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> pip install numpy 
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> python ./convert-llama-ggml-to-gguf.py --input ./llama-2-13b-chat.ggmlv3.q8_0.bin --output ./llama-2-13b-chat.ggmlv3.q8_0.gguf 
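
As a quick sanity check on the converted file, the first bytes can be inspected: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian 32-bit version number. The following is a minimal sketch; the function name read_gguf_version is hypothetical and is not part of llama.cpp.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # The 4 bytes after the magic hold the format version (little-endian uint32).
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Example: read_gguf_version("./llama-2-13b-chat.ggmlv3.q8_0.gguf")
```

For the file converted above, the result should match the version that the server prints when it loads the model (GGUF V3 in the log later in this step).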

# add a firewall rule that allows inbound access to the server port
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> New-NetFirewallRule `
-Name "Llama-cpp Server Port" `
-DisplayName "Llama-cpp Server Port" `
-Description 'Allow Llama-cpp Server Port' `
-Profile Any `
-Direction Inbound `
-Action Allow `
-Protocol TCP `
-Program Any `
-LocalAddress Any `
-LocalPort 8000 

# [--n-gpu-layers] : number of layers to offload to the GPU
# -- specify [-1] to use all if you are unsure of the number
PS C:\Users\Administrator\llama.cpp-master\llama.cpp-master> ./bin/Release/server.exe --model ./llama-2-13b-chat.ggmlv3.q8_0.gguf --n-gpu-layers -1 --host 0.0.0.0 --port 8000 
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
{"timestamp":1708579945,"level":"INFO","function":"main","line":2574,"message":"build info","build":0,"commit":"unknown"}
{"timestamp":1708579945,"level":"INFO","function":"main","line":2581,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://0.0.0.0:8000

{"timestamp":1708579945,"level":"INFO","function":"main","line":2731,"message":"HTTP server listening","hostname":"0.0.0.0","port":"8000"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ./llama-2-13b-chat.ggmlv3.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-2-13b-chat.ggmlv3.q8_0.bin
llama_model_loader: - kv   2:                        general.description str              = converted from legacy GGJTv3 MOSTLY_Q...
llama_model_loader: - kv   3:                          general.file_type u32              = 7
llama_model_loader: - kv   4:                       llama.context_length u32              = 2048
llama_model_loader: - kv   5:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   6:                          llama.block_count u32              = 40
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   8:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000005
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 5.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = llama-2-13b-chat.ggmlv3.q8_0.bin
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 13189.86 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    12.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
 -> Slot 0 - max context: 512
{"timestamp":1708579973,"level":"INFO","function":"main","line":2752,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
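
The sizes reported in the log above can be reproduced roughly from the other values it prints. The sketch below is only back-of-the-envelope arithmetic with hypothetical helper names: the model size comes from the parameter count and bits per weight (13.02 B, 8.50 BPW), and the KV cache size from n_ctx (512), n_layer (40), n_embd_k_gqa (5120) and 2 bytes per f16 value. Because the log values are rounded, the model size only matches approximately.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def model_size_gib(n_params, bits_per_weight):
    """Approximate size of the weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_value=2):
    """Size of the K and V caches together in MiB (f16 = 2 bytes per value)."""
    one_cache = n_ctx * n_layer * n_embd_kv * bytes_per_value
    return 2 * one_cache / MIB


print(round(model_size_gib(13.02e9, 8.50), 2))  # 12.88, as in "model size"
print(kv_cache_mib(512, 40, 5120))              # 400.0, as in "KV self size"
```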

[6]	Post a question as follows and verify that the server responds normally. Response time and content vary depending on the question and the model used. For reference, this example runs on a machine with 8 vCPU + 16G memory + GeForce RTX 3060 (12G).

PS C:\Users\Administrator> curl.exe -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions `
-d '{\"messages\": [{\"role\": \"user\", \"content\": \"Who r you?\"}]}' | jq.exe 

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm just an AI, I don't have a name. I'm here to help answer questions and provide information. What can I assist you with today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1708581594,
  "id": "chatcmpl-q3So7lVzkRuWaqZ0yxnaeVf9oNXVwgyo",
  "model": "unknown",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 43,
    "prompt_tokens": 32,
    "total_tokens": 75
  }
}
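
The same request can also be sent from Python using only the standard library. This is a minimal sketch that assumes the server from step [5] is listening on localhost:8000 and returns a response shaped like the JSON above; the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server started in step [5] is reachable at this address.
BASE_URL = "http://localhost:8000"


def build_request(question, base_url=BASE_URL):
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = json.dumps({"messages": [{"role": "user", "content": question}]})
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response):
    """Return the assistant's text from a decoded chat.completion response."""
    return response["choices"][0]["message"]["content"]


def ask(question, base_url=BASE_URL, timeout=300):
    """Send one question to the server and return the assistant's reply."""
    request = build_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as res:
        return extract_reply(json.load(res))


# Example (requires the running server): print(ask("Who are you?"))
```

Keeping build_request and extract_reply separate from ask makes it possible to check the payload and the response parsing without a running server.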
