happyme531 committed (verified)
Commit 99615b9 · Parent(s): 28a4acd

Update README.md

Files changed (1): README.md (+26 -40)

README.md CHANGED
@@ -18,7 +18,7 @@ Deploy the Florence-2 vision multimodal large model with ONNX/RKNN2!
 
 ## Usage
 
-1. Clone the project to your local machine
 
 2. Install dependencies
 
@@ -26,20 +26,14 @@ Deploy the Florence-2 vision multimodal large model with ONNX/RKNN2!
 pip install transformers onnxruntime pillow numpy<2 rknn-toolkit-lite2
 ```
 
-3. Modify the project path
-The tokenizer and preprocessing configuration still need to use the files from the original project. Change the corresponding path in onnx/onnxrun.py or onnx/rknnrun.py to the path where the project is located.
-```python
-AutoProcessor.from_pretrained(
-    "path/to/Florence-2-base-ft-ONNX-RKNN2",
-    trust_remote_code=True
-)
-```
-
-4. Run
 ```bash
-python onnx/onnxrun.py  # or python onnx/rknnrun.py
 ```
 
 ## RKNN Model Conversion
 
 You need to install rknn-toolkit2 v2.3.2 or later in advance.
@@ -56,6 +50,7 @@ python convert.py all
 - ~~At the same resolution, inference accuracy drops significantly compared with onnxruntime; the problem was traced to the vision encoder. If you need high accuracy, that part can be swapped for onnxruntime~~ (Solved, though it seems a little precision is still lost)
 - ~~Vision encoder inference takes 1.5 s, the largest share of the time, but most of that is spent on Transpose, so it could perhaps be optimized further~~ (Optimized)
 - In the decode stage, because the kvcache length keeps changing, it seems NPU inference can't be used straightforwardly. But onnxruntime should be sufficient
 
 ## References
 - [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
@@ -65,57 +60,48 @@ python convert.py all
 
 # English README
 
-# Florence-2-base-ft-ONNX-RKNN2
-
-ONNX/RKNN2 deployment for Florence-2 visual-language multimodal large model!
 
-- Inference speed (RKNN2): RK3588 inference with a 768x768 image, using the `<MORE_DETAILED_CAPTION>` instruction, takes ~4 seconds in total.
-- Memory usage (RKNN2): Approximately 2GB
 
 ## Usage
 
-1. Clone the project locally
-
-2. Install dependencies
 
 ```bash
 pip install transformers onnxruntime pillow numpy<2 rknn-toolkit-lite2
 ```
 
-3. Modify project paths
-The tokenizer and preprocessing configurations still need to use files from the original project. Modify the corresponding paths in onnx/onnxrun.py or onnx/rknnrun.py to the project's location.
-```python
-AutoProcessor.from_pretrained(
-    "path/to/Florence-2-base-ft-ONNX-RKNN2",
-    trust_remote_code=True
-)
-```
-
-4. Run
 ```bash
-python onnx/onnxrun.py # or python onnx/rknnrun.py
 ```
 
 ## RKNN Model Conversion
 
-You need to install rknn-toolkit2 v2.3.2 or higher in advance.
 
 ```bash
 cd onnx
 python convert.py all
 ```
 
-Note: The RKNN model does not support arbitrary input lengths at runtime, so you need to determine the input shape in advance. You can modify `vision_size`, `vision_tokens`, and `prompt_tokens` in convert.py to change the input shape.
 
-## Existing Issues (rknn)
 
-- ~~It is known that converting the vision encoder with an input resolution >= 640x640 will result in a `Buffer overflow!` error and conversion failure. Therefore, the current input is 512x512, which will reduce inference quality.~~ (Solved.)
-- ~~Inference accuracy is significantly lower compared to onnxruntime at the same resolution. The problem has been identified in the vision encoder part. If high accuracy is required, this part can be replaced with onnxruntime.~~ (Solved, although there is still some precision loss present.)
-- ~~Vision encoder inference takes 1.5 seconds, accounting for the largest proportion, but most of this time is spent on the Transpose op, which may be further optimized.~~ (Solved.)
-- The decode phase seems unable to simply use NPU inference because the length of kvcache keeps changing. However, using onnxruntime should be sufficient.
 
 ## References
-
 - [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
 - [onnx-community/Florence-2-base-ft](https://huggingface.co/onnx-community/Florence-2-base-ft)
-- [florence2-webgpu](https://huggingface.co/spaces/Xenova/florence2-webgpu)
 
 
 ## Usage
 
+1. Clone the project to your local machine (the development board)
 
 2. Install dependencies
 
 pip install transformers onnxruntime pillow numpy<2 rknn-toolkit-lite2
 ```
 
+3. Run
 ```bash
+cd onnx
+python ./run.py ./test.jpg "<MORE_DETAILED_CAPTION>"
 ```
 
+You can modify the import at the top of `run.py` to switch between ONNX and RKNN inference.
+
 ## RKNN Model Conversion
 
 You need to install rknn-toolkit2 v2.3.2 or later in advance.
 
 - ~~At the same resolution, inference accuracy drops significantly compared with onnxruntime; the problem was traced to the vision encoder. If you need high accuracy, that part can be swapped for onnxruntime~~ (Solved, though it seems a little precision is still lost)
 - ~~Vision encoder inference takes 1.5 s, the largest share of the time, but most of that is spent on Transpose, so it could perhaps be optimized further~~ (Optimized)
 - In the decode stage, because the kvcache length keeps changing, it seems NPU inference can't be used straightforwardly. But onnxruntime should be sufficient
+- In theory the decoder could be run with rkllm, but this is not yet possible because rkllm lacks support for fixed positional encoding. See: https://github.com/airockchip/rknn-llm/issues/296
 
 ## References
 - [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
 
 
 # English README
 
+Deploy the Florence-2 vision multi-modal large model with ONNX/RKNN2!
 
+- **Inference Speed (RKNN2):** On RK3588, inferring a 768x768 image with the `<MORE_DETAILED_CAPTION>` instruction takes ~4 seconds in total.
+- **Memory Usage (RKNN2):** About 2GB.
 
 ## Usage
 
+1. Clone the project locally (on your development board).
 
+2. Install dependencies:
 ```bash
 pip install transformers onnxruntime pillow numpy<2 rknn-toolkit-lite2
 ```
 
+3. Run:
 ```bash
+cd onnx
+python ./run.py ./test.jpg "<MORE_DETAILED_CAPTION>"
 ```
 
+You can modify the import at the top of `run.py` to switch between ONNX and RKNN inference.
+
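The one-line backend switch described above can be sketched as follows. This is an illustration only: the `onnx_infer` / `rknn_infer` names and signatures are assumptions, not the actual contents of `run.py`.

```python
# Minimal sketch of the ONNX/RKNN switch described above. run.py
# reportedly selects the backend via its topmost import; the function
# names below are illustrative assumptions, not the real module API.

def onnx_infer(image_path: str, prompt: str) -> str:
    # would run the vision encoder + decoder with onnxruntime sessions
    return f"onnx:{prompt}"

def rknn_infer(image_path: str, prompt: str) -> str:
    # would run the vision encoder on the NPU via rknn-toolkit-lite2
    return f"rknn:{prompt}"

# "Modify the import at the top" amounts to choosing which callable the
# rest of the script is bound to:
infer = onnx_infer  # swap to rknn_infer on the development board

print(infer("./test.jpg", "<MORE_DETAILED_CAPTION>"))
```

Binding the chosen backend to one name keeps the rest of the script identical on the host PC and on the board.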
 ## RKNN Model Conversion
 
+You need to install rknn-toolkit2 v2.3.2 or a higher version beforehand.
 
 ```bash
 cd onnx
 python convert.py all
 ```
 
+Note: RKNN models do not support dynamic input shapes at runtime, so you need to define the input shape in advance. You can modify `vision_size`, `vision_tokens`, and `prompt_tokens` in `convert.py` to change the input shape.
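Since the shapes are compiled into the RKNN graph, changing them amounts to editing constants before conversion. A sketch under assumed values — the concrete numbers below are illustrative, not the repository's actual defaults:

```python
# Illustrative fixed-shape knobs in the spirit of convert.py's
# vision_size / vision_tokens / prompt_tokens. The concrete values are
# assumptions for illustration, not the repository's real settings.

vision_size = 768      # image input is vision_size x vision_size pixels
vision_tokens = 576    # vision feature tokens emitted by the encoder
prompt_tokens = 16     # maximum text-prompt length

# An RKNN graph is compiled for one static shape, so every tensor size
# must be pinned before conversion, e.g.:
image_shape = (1, 3, vision_size, vision_size)   # NCHW image tensor
encoder_seq_len = vision_tokens + prompt_tokens  # combined sequence length

print(image_shape, encoder_seq_len)
```

To run at a different resolution or prompt length, the model must be re-converted with new values.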
 
+## Known Issues (rknn)
 
+- ~~When converting the vision encoder, an input resolution of >=640x640 would cause a `Buffer overflow!` error and fail the conversion. Therefore, a 512x512 input was used, which degraded inference quality.~~ (Resolved)
+- ~~Inference precision was significantly lower compared to onnxruntime at the same resolution. The issue was identified in the vision encoder part. For higher precision requirements, onnxruntime could be used for this part.~~ (Resolved, though there might still be a slight loss of precision)
+- ~~Vision encoder inference took 1.5 seconds, accounting for the largest portion of time, with most of it spent on Transpose operations. There might be room for optimization.~~ (Optimized)
+- In the decode stage, due to the constantly changing length of the kvcache, it seems that NPU inference cannot be simply used. However, onnxruntime should be sufficient.
+- Theoretically, rkllm could be used for decoder inference, but it's not currently feasible due to the lack of support for fixed positional encoding in rkllm. See: https://github.com/airockchip/rknn-llm/issues/296
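The kvcache point above can be illustrated with a toy decode loop: the cache gains one position per generated token, so the decoder's input shapes differ on every call, which clashes with RKNN's static-shape graphs. The `decoder_step` below is a stand-in, not this repository's actual decoder code.

```python
# Toy greedy-decode loop showing why the growing kvcache is awkward for
# a static-shape NPU graph. decoder_step stands in for an onnxruntime
# decoder session (onnxruntime handles the dynamic shapes natively).

def decoder_step(token: int, kv_cache: list) -> tuple:
    # a real step would pass the accumulated cache back to the session
    new_cache = kv_cache + [token]   # cache grows by one each step
    next_token = (token + 1) % 50    # stand-in for argmax over logits
    return next_token, new_cache

token, cache = 0, []
cache_lengths = []
for _ in range(4):
    token, cache = decoder_step(token, cache)
    cache_lengths.append(len(cache))  # grows every iteration -- never a fixed shape

print(cache_lengths)
```

Each iteration would need a differently-shaped compiled graph on the NPU, which is why the README keeps the decoder on onnxruntime.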
 
 ## References
 - [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
 - [onnx-community/Florence-2-base-ft](https://huggingface.co/onnx-community/Florence-2-base-ft)
+- [florence2-webgpu](https://huggingface.co/spaces/Xenova/florence2-webgpu)