lightislost 2023-09-28 10:58:58 +08:00
commit 3e171509b8
158 changed files with 2472830 additions and 0 deletions

6
.gitignore vendored Normal file

@ -0,0 +1,6 @@
**/__pycache__
knowledge_base
logs
jupyter_work
model_config.py
server_config.py

11
Dockerfile Normal file

@ -0,0 +1,11 @@
FROM python:3.9-bookworm
WORKDIR /home/user
COPY ./docker_requirements.txt /home/user/docker_requirements.txt
COPY ./jupyter_start.sh /home/user/jupyter_start.sh
RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install -r /home/user/docker_requirements.txt
CMD ["bash"]

7
LEGAL.md Normal file

@ -0,0 +1,7 @@
Legal Disclaimer
Within this source code, the comments in Chinese shall be the original, governing version. Any comment in other languages are for reference only. In the event of any conflict between the Chinese language version comments and other language version comments, the Chinese language version shall prevail.
法律免责声明
关于代码注释部分,中文注释为官方版本,其它语言注释仅做参考。中文注释可能与其它语言注释存在不一致,当中文注释与其它语言注释存在不一致时,请以中文注释为准。

214
LICENSE.md Normal file

@ -0,0 +1,214 @@
Copyright [2023] [Ant Group]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

189
README.md Normal file

@ -0,0 +1,189 @@
# <p align="center">Codefuse-ChatBot: Development by Private Knowledge Augmentation</p>
<p align="center">
<a href="README.md"><img src="https://img.shields.io/badge/文档-中文版-yellow.svg" alt="ZH doc"></a>
<a href="README_EN.md"><img src="https://img.shields.io/badge/document-英文版-yellow.svg" alt="EN doc"></a>
<img src="https://img.shields.io/github/license/codefuse-ai/codefuse-chatbot" alt="License">
<a href="https://github.com/codefuse-ai/codefuse-chatbot/issues">
<img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/codefuse-chatbot" />
</a>
<br><br>
</p>
This project is an open-source AI assistant designed for the entire software development lifecycle, covering design, coding, testing, deployment, and operations. Through knowledge retrieval, tool use, and sandboxed execution, Codefuse-ChatBot can answer the professional questions that come up during development and, via Q&A, operate the surrounding standalone, scattered platforms.
## 🔔 Updates
- [2023.09.15] The sandbox feature for local/isolated environments is now available, and knowledge retrieval from specified URLs is implemented via a web crawler.
## 📜 Contents
- [🤝 Introduction](#-introduction)
- [🎥 Demo Video](#-demo-video)
- [🧭 Technical Route](#-technical-route)
- [🌐 Model Integration](#-model-integration)
- [🚀 Quick Start](#-quick-start)
- [🤗 Acknowledgements](#-acknowledgements)
## 🤝 Introduction
💡 This project aims to build an AI assistant for the entire software development lifecycle, covering design, coding, testing, deployment, and operations, through Retrieval Augmented Generation (RAG), Tool Learning, and sandbox environments. It gradually shifts the traditional mode of development and operations, where information is looked up everywhere and standalone, scattered platforms are operated by hand, toward an intelligent mode driven by large-model Q&A, changing people's development and operations habits.
- 📚 Knowledge Base Management: a professional, high-quality DevOps knowledge base + self-service construction of enterprise knowledge bases + dialogue-based fast retrieval of open-source/private technical documents
- 🐳 Isolated Sandbox Environment: fast compilation, execution, and testing of code
- 🔄 React Paradigm: supports self-iteration and automatic execution of code
- 🛠️ Prompt Management: manages prompts for various development and operations tasks
- 🔌 Rich Domain Plugins: execute various customized development tasks
- 🚀 Conversation Driven: automates requirement design, system analysis design, code generation, development testing, deployment, and operations
🌍 Relying on open-source LLM and Embedding models, this project supports offline private deployment based on open-source models. Calling the OpenAI API is also supported.
👥 The core development team has long focused on research in the AIOps + NLP domain. We initiated the Codefuse-ai project in the hope that everyone will contribute high-quality development and operations documents and jointly improve this solution, toward the goal of "Making Development Seamless for Everyone."
<div align=center>
<img src="sources/docs_imgs/objective_v4.png" alt="图片" width="600" height="333">
</div>
## 🎥 Demo Video
To give you a more intuitive sense of what Codefuse-ChatBot can do and how to use it, we have recorded a demo video. It offers a quick overview of the main features and workflow.
[Demo video](https://www.youtube.com/watch?v=UGJdTGaVnNY&t=2s&ab_channel=HaotianZhu)
## 🧭 Technical Route
<div align=center>
<img src="sources/docs_imgs/devops-chatbot-module.png" alt="图片" width="600" height="503">
</div>
- 🕷️ **Web Crawl**: periodically crawls web documents to keep data current, relying on continuous contributions from the open-source community.
- 🗂️ **DocLoader & TextSplitter**: cleans, deduplicates, and categorizes data crawled from various sources, and supports importing private documents.
- 🗄️ **Vector Database**: embeds documents with a Text Embedding model and stores them in Milvus.
- 🔌 **Connector**: acts as the scheduling center, coordinating the interaction between the LLM and the Vector Database, implemented with Langchain.
- 📝 **Prompt Control**: designed from a development and operations perspective, categorizes questions and adds background to prompts to keep answers controllable and complete.
- 💬 **LLM**: uses GPT-3.5-turbo by default and offers proprietary model options for private deployment and other privacy-sensitive scenarios.
- 🔤 **Text Embedding**: uses OpenAI's Text Embedding model by default, supports private deployment and other privacy-sensitive scenarios, and offers proprietary model options.
- 🚧 **SandBox**: provides an interactive verification environment (based on FaaS) for generated output such as code, helping users judge its correctness and make adjustments.
For implementation details, see: [Technical Route Details](sources/readme_docs/roadmap.md)
## 🌐 Model Integration
If you need a model integrated, please open an issue.
| model_name | model_size | gpu_memory | quantize | HFhub | ModelScope |
| ------------------ | ---------- | ---------- | -------- | ----- | ---------- |
| chatgpt | - | - | - | - | - |
| codellama-34b-int4 | 34b | 20GB | int4 | coming soon| [link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary) |
## 🚀 Quick Start
Please install the Nvidia driver yourself. This project has been tested with Python 3.9.18 and CUDA 11.7 on Windows and on x86-architecture macOS systems.
1. Prepare the Python environment
- It is recommended to use conda to manage the Python environment (optional)
```bash
# Prepare the conda environment
conda create --name devopsgpt python=3.9
conda activate devopsgpt
```
- Install the dependencies
```bash
cd DevOps-ChatBot
# For python=3.9 use the latest notebook; for python=3.8 use notebook==6.5.6
pip install -r requirements.txt
```
2. Prepare the sandbox environment
- Windows Docker installation:
[Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) supports 64-bit Windows 10 Pro with Hyper-V enabled (Hyper-V is not required for v1903 and later), or 64-bit Windows 10 Home v1903 and later.
- [A detailed Windows 10 Docker installation tutorial (Chinese)](https://zhuanlan.zhihu.com/p/441965046)
- [Docker: From Beginner to Practitioner (Chinese)](https://yeasy.gitbook.io/docker_practice/install/windows)
- [Fixing "Docker Desktop requires the Server service to be enabled"](https://blog.csdn.net/sunhy_csdn/article/details/106526991)
- [Install WSL (or follow the error prompt)](https://learn.microsoft.com/zh-cn/windows/wsl/install)
- Linux Docker installation:
Linux installation is relatively simple; please search Baidu/Google for instructions.
- Mac Docker installation
- [Docker: From Beginner to Practitioner (Chinese)](https://yeasy.gitbook.io/docker_practice/install/mac)
```bash
# Build the sandbox image (see the note above about the notebook version)
bash docker_build.sh
```
3. Download models (optional)
To use open-source LLM and Embedding models, you can download them from HuggingFace.
Here we take THUDM/chatglm2-6b and text2vec-base-chinese as examples:
```
# install git-lfs
git lfs install
# install LLM-model
git lfs clone https://huggingface.co/THUDM/chatglm2-6b
# install Embedding-model
git lfs clone https://huggingface.co/shibing624/text2vec-base-chinese
```
4. Basic configuration
```bash
# Modify the basic configuration used at service startup
cd configs
cp model_config.py.example model_config.py
cp server_config.py.example server_config.py
# model_config#11~12: set your OpenAI API key if you want to use the OpenAI API
os.environ["OPENAI_API_KEY"] = "sk-xxx"
# Replace api_base_url with your own if needed
os.environ["API_BASE_URL"] = "https://api.openai.com/v1"
# vi model_config#95: the language model you want to use
LLM_MODEL = "gpt-3.5-turbo"
# vi model_config#33: the embedding model you want to use
EMBEDDING_MODEL = "text2vec-base"
# vi model_config#19: change to your local path; no change is needed if huggingface is directly reachable
"text2vec-base": "/home/user/xx/text2vec-base-chinese",
# Whether to start a local notebook for code interpretation; the docker notebook is started by default
# vi server_config#35: True starts the docker notebook, False starts the local notebook
"do_remote": False, / "do_remote": True,
```
5. Start the services
By default only the webui-related services are started; fastchat is not started (optional).
```bash
# To support the codellama-34b-int4 model, fastchat needs a patch
# cp examples/gptq.py ~/site-packages/fastchat/modules/gptq.py
# dev_opsgpt/service/llm_api.py#258: change to kwargs={"gptq_wbits": 4},
# start llm-service (optional)
python dev_opsgpt/service/llm_api.py
```
```bash
cd examples
# To use a local large language model, run: python ../dev_opsgpt/service/llm_api.py
bash start_webui.sh
```
## 🤗 Acknowledgements
This project builds on [langchain-chatchat](https://github.com/chatchat-space/Langchain-Chatchat) and [codebox-api](https://github.com/shroominic/codebox-api). Many thanks for their open-source contributions!

181
README_en.md Normal file

@ -0,0 +1,181 @@
# <p align="center">Codefuse-ChatBot: Development by Private Knowledge Augmentation</p>
<p align="center">
<a href="README.md"><img src="https://img.shields.io/badge/文档-中文版-yellow.svg" alt="ZH doc"></a>
<a href="README_EN.md"><img src="https://img.shields.io/badge/document-英文版-yellow.svg" alt="EN doc"></a>
<img src="https://img.shields.io/github/license/codefuse-ai/codefuse-chatbot" alt="License">
<a href="https://github.com/codefuse-ai/codefuse-chatbot/issues">
<img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/codefuse-chatbot" />
</a>
<br><br>
</p>
This project is an open-source AI intelligent assistant, specifically designed for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations. Through knowledge retrieval, tool utilization, and sandbox execution, Codefuse-ChatBot can answer various professional questions during your development process and perform question-answering operations on standalone, disparate platforms.
## 🔔 Updates
- [2023.09.15] Sandbox features for local/isolated environments are now available, implementing specified URL knowledge retrieval based on web crawling.
## 📜 Contents
- [🤝 Introduction](#-introduction)
- [🧭 Technical Route](#-technical-route)
- [🌐 Model Integration](#-model-integration)
- [🚀 Quick Start](#-quick-start)
- [🤗 Acknowledgements](#-acknowledgements)
## 🤝 Introduction
💡 The aim of this project is to construct an AI intelligent assistant for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations, through Retrieval Augmented Generation (RAG), Tool Learning, and sandbox environments. It transitions gradually from the traditional development and operations mode of querying information from various sources and operating on standalone, disparate platforms to an intelligent development and operations mode based on large-model Q&A, changing people's development and operations habits.
- 📚 Knowledge Base Management: Professional high-quality Codefuse knowledge base + enterprise-level knowledge base self-construction + dialogue-based fast retrieval of open-source/private technical documents.
- 🐳 Isolated Sandbox Environment: Enables quick compilation, execution, and testing of code.
- 🔄 React Paradigm: Supports code self-iteration and automatic execution.
- 🛠️ Prompt Management: Manages prompts for various development and operations tasks.
- 🚀 Conversation Driven: Automates requirement design, system analysis design, code generation, development testing, deployment, and operations.
🌍 Relying on open-source LLM and Embedding models, this project can achieve offline private deployments based on open-source models. Additionally, this project also supports the use of the OpenAI API.
👥 The core development team has been long-term focused on research in the AIOps + NLP domain. We initiated the CodefuseGPT project, hoping that everyone could contribute high-quality development and operations documents widely, jointly perfecting this solution to achieve the goal of "Making Development Seamless for Everyone."
<div align=center>
<img src="sources/docs_imgs/objective_v4.png" alt="Image" width="600" height="333">
</div>
## 🧭 Technical Route
<div align=center>
<img src="sources/docs_imgs/devops-chatbot-module.png" alt="Image" width="600" height="503">
</div>
- 🕷️ **Web Crawl**: Implements periodic web document crawling to ensure data timeliness and relies on continuous supplementation from the open-source community.
- 🗂️ **DocLoader & TextSplitter**: Cleans, deduplicates, and categorizes data crawled from various sources and supports the import of private documents.
- 🗄️ **Vector Database**: Integrates Text Embedding models to embed documents and store them in Milvus.
- 🔌 **Connector**: Acts as the scheduling center, responsible for coordinating interactions between LLM and Vector Database, implemented based on Langchain technology.
- 📝 **Prompt Control**: Designs from development and operations perspectives, categorizes different problems, and adds backgrounds to prompts to ensure the controllability and completeness of answers.
- 💬 **LLM**: Uses GPT-3.5-turbo by default and provides proprietary model options for private deployments and other privacy-related scenarios.
- 🔤 **Text Embedding**: Uses OpenAI's Text Embedding model by default, supports private deployments and other privacy-related scenarios, and provides proprietary model options.
- 🚧 **SandBox**: For generated outputs, like code, to help users judge their authenticity, an interactive verification environment is provided (based on FaaS), allowing user adjustments.
For implementation details, see: [Technical Route Details](sources/readme_docs/roadmap.md)
## 🌐 Model Integration
If you need a model integrated, please open an issue.
| model_name | model_size | gpu_memory | quantize | HFhub | ModelScope |
| ------------------ | ---------- | ---------- | -------- | ----- | ---------- |
| chatgpt | - | - | - | - | - |
| codellama-34b-int4 | 34b | 20GB | int4 | coming soon| [link](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary) |
## 🚀 Quick Start
Please install the Nvidia driver yourself; this project has been tested on Python 3.9.18, CUDA 11.7, Windows, and X86 architecture macOS systems.
1. Preparation of Python environment
- It is recommended to use conda to manage the python environment (optional)
```bash
# Prepare conda environment
conda create --name Codefusegpt python=3.9
conda activate Codefusegpt
```
- Install related dependencies
```bash
cd Codefuse-ChatBot
# For python=3.9 use the latest notebook; for python=3.8 use notebook==6.5.5
pip install -r requirements.txt
```
2. Preparation of Sandbox Environment
- Windows Docker installation:
[Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) supports 64-bit versions of Windows 10 Pro, with Hyper-V enabled (not required for versions v1903 and above), or 64-bit versions of Windows 10 Home v1903 and above.
- [Comprehensive Detailed Windows 10 Docker Installation Tutorial](https://zhuanlan.zhihu.com/p/441965046)
- [Docker: From Beginner to Practitioner](https://yeasy.gitbook.io/docker_practice/install/windows)
- [Handling Docker Desktop requires the Server service to be enabled](https://blog.csdn.net/sunhy_csdn/article/details/106526991)
- [Install wsl or wait for error prompt](https://learn.microsoft.com/en-us/windows/wsl/install)
- Linux Docker Installation:
Linux installation is relatively simple, please search Baidu/Google for installation instructions.
- Mac Docker Installation
- [Docker: From Beginner to Practitioner](https://yeasy.gitbook.io/docker_practice/install/mac)
```bash
# Build images for the sandbox environment, see above for notebook version issues
bash docker_build.sh
```
3. Model Download (Optional)
If you need to use open-source LLM and Embedding models, you can download them from HuggingFace.
Here, we use THUDM/chatglm2-6b and text2vec-base-chinese as examples:
```
# install git-lfs
git lfs install
# install LLM-model
git lfs clone https://huggingface.co/THUDM/chatglm2-6b
# install Embedding-model
git lfs clone https://huggingface.co/shibing624/text2vec-base-chinese
```
4. Basic Configuration
```bash
# Modify the basic configuration for service startup
cd configs
cp model_config.py.example model_config.py
cp server_config.py.example server_config.py
# model_config#11~12: set your OpenAI API key if you want to use the OpenAI API
os.environ["OPENAI_API_KEY"] = "sk-xxx"
# You can replace the api_base_url yourself
os.environ["API_BASE_URL"] = "https://api.openai.com/v1"
# vi model_config#95 You need to choose the language model
LLM_MODEL = "gpt-3.5-turbo"
# vi model_config#33 You need to choose the vector model
EMBEDDING_MODEL = "text2vec-base"
# vi model_config#19 Modify to your local path, if you can directly connect to huggingface, no modification is needed
"text2vec-base": "/home/user/xx/text2vec-base-chinese",
# Whether to start the local notebook for code interpretation, start the docker notebook by default
# vi server_config#35, True to start the docker notebook, false to start the local notebook
"do_remote": False, / "do_remote": True,
```
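As a quick sanity check (a hypothetical snippet, not part of the repository; run it from the repository root after copying the example files above), you can confirm the configuration imports cleanly:
```python
# check_config.py -- hypothetical helper to confirm the configuration imports cleanly
from configs.model_config import LLM_MODEL, EMBEDDING_MODEL, llm_model_dict

print("LLM:", LLM_MODEL)                                       # e.g. gpt-3.5-turbo
print("Embedding:", EMBEDDING_MODEL)                           # e.g. text2vec-base
print("API base:", llm_model_dict[LLM_MODEL]["api_base_url"])  # taken from API_BASE_URL
```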
5. Start the Service
By default, only webui related services are started, and fastchat is not started (optional).
```bash
# If you use codellama-34b-int4, you need to patch fastchat's gptq.py
# cp examples/gptq.py ~/site-packages/fastchat/modules/gptq.py
# dev_opsgpt/service/llm_api.py#258 => kwargs={"gptq_wbits": 4},
# start llm-service (optional)
python dev_opsgpt/service/llm_api.py
```
```bash
cd examples
# To use a local large language model, run: python ../dev_opsgpt/service/llm_api.py
bash start_webui.sh
```
## 🤗 Acknowledgements
This project is based on [langchain-chatchat](https://github.com/chatchat-space/Langchain-Chatchat) and [codebox-api](https://github.com/shroominic/codebox-api). We deeply appreciate their contributions to open source!

4
configs/__init__.py Normal file

@ -0,0 +1,4 @@
from .model_config import *
from .server_config import *
VERSION = "v0.0.1"

View File

@ -0,0 +1,199 @@
import os
import logging
import torch
# 日志格式
LOG_FORMAT = "%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s"
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.basicConfig(format=LOG_FORMAT)
# os.environ["OPENAI_PROXY"] = "socks5h://127.0.0.1:13659"
os.environ["OPENAI_API_KEY"] = ""
os.environ["DUCKDUCKGO_PROXY"] = "socks5://127.0.0.1:13659"
import platform
system_name = platform.system()
# 在以下字典中修改属性值以指定本地embedding模型存储位置
# 如将 "text2vec": "GanymedeNil/text2vec-large-chinese" 修改为 "text2vec": "User/Downloads/text2vec-large-chinese"
# 此处请写绝对路径
embedding_model_dict = {
"ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
"ernie-base": "nghuyong/ernie-3.0-base-zh",
"text2vec-base": "shibing624/text2vec-base-chinese",
"text2vec": "GanymedeNil/text2vec-large-chinese",
"text2vec-paraphrase": "shibing624/text2vec-base-chinese-paraphrase",
"text2vec-sentence": "shibing624/text2vec-base-chinese-sentence",
"text2vec-multilingual": "shibing624/text2vec-base-multilingual",
"m3e-small": "moka-ai/m3e-small",
"m3e-base": "moka-ai/m3e-base",
"m3e-large": "moka-ai/m3e-large",
"bge-small-zh": "BAAI/bge-small-zh",
"bge-base-zh": "BAAI/bge-base-zh",
"bge-large-zh": "BAAI/bge-large-zh"
}
# 选用的 Embedding 名称
EMBEDDING_MODEL = "text2vec-base"
# Embedding 模型运行设备
EMBEDDING_DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
llm_model_dict = {
"chatglm-6b": {
"local_model_path": "THUDM/chatglm-6b",
"api_base_url": "http://localhost:8888/v1", # "name"修改为fastchat服务中的"api_base_url"
"api_key": "EMPTY"
},
"chatglm-6b-int4": {
"local_model_path": "THUDM/chatglm2-6b-int4/",
"api_base_url": "http://localhost:8888/v1", # "name"修改为fastchat服务中的"api_base_url"
"api_key": "EMPTY"
},
"chatglm2-6b": {
"local_model_path": "THUDM/chatglm2-6b",
"api_base_url": "http://localhost:8888/v1", # URL需要与运行fastchat服务端的server_config.FSCHAT_OPENAI_API一致
"api_key": "EMPTY"
},
"chatglm2-6b-int4": {
"local_model_path": "THUDM/chatglm2-6b-int4",
"api_base_url": "http://localhost:8888/v1", # URL需要与运行fastchat服务端的server_config.FSCHAT_OPENAI_API一致
"api_key": "EMPTY"
},
"chatglm2-6b-32k": {
"local_model_path": "THUDM/chatglm2-6b-32k", # "THUDM/chatglm2-6b-32k",
"api_base_url": "http://localhost:8888/v1", # "URL需要与运行fastchat服务端的server_config.FSCHAT_OPENAI_API一致
"api_key": "EMPTY"
},
"vicuna-13b-hf": {
"local_model_path": "",
"api_base_url": "http://localhost:8888/v1", # "name"修改为fastchat服务中的"api_base_url"
"api_key": "EMPTY"
},
# 调用chatgpt时如果报出 urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.openai.com', port=443):
# Max retries exceeded with url: /v1/chat/completions
# 则需要将urllib3版本修改为1.25.11
# 如果依然报urllib3.exceptions.MaxRetryError: HTTPSConnectionPool则将https改为http
# 参考https://zhuanlan.zhihu.com/p/350015032
# 如果报出raise NewConnectionError(
# urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001FE4BDB85E0>:
# Failed to establish a new connection: [WinError 10060]
# 则是因为内地和香港的IP都被OPENAI封了需要切换为日本、新加坡等地
"gpt-3.5-turbo": {
"local_model_path": "gpt-3.5-turbo",
"api_base_url": os.environ.get("API_BASE_URL"),
"api_key": os.environ.get("OPENAI_API_KEY")
},
}
# LLM 名称
LLM_MODEL = "gpt-3.5-turbo"
# LLM 运行设备
LLM_DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
# 日志存储路径
LOG_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "logs")
if not os.path.exists(LOG_PATH):
os.mkdir(LOG_PATH)
# 资料源文件默认存储路径
SOURCE_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "sources")
# 知识库默认存储路径
KB_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "knowledge_base")
# nltk 模型存储路径
NLTK_DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "nltk_data")
# 代码存储路径
JUPYTER_WORK_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "jupyter_work")
# WEB_CRAWL存储路径
WEB_CRAWL_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "sources/docs")
for _path in [LOG_PATH, SOURCE_PATH, KB_ROOT_PATH, NLTK_DATA_PATH, JUPYTER_WORK_PATH, WEB_CRAWL_PATH]:
if not os.path.exists(_path):
os.mkdir(_path)
# 数据库默认存储路径。
# 如果使用sqlite可以直接修改DB_ROOT_PATH如果使用其它数据库请直接修改SQLALCHEMY_DATABASE_URI。
DB_ROOT_PATH = os.path.join(KB_ROOT_PATH, "info.db")
SQLALCHEMY_DATABASE_URI = f"sqlite:///{DB_ROOT_PATH}"
# 可选向量库类型及对应配置
kbs_config = {
"faiss": {
},
# "milvus": {
# "host": "127.0.0.1",
# "port": "19530",
# "user": "",
# "password": "",
# "secure": False,
# },
# "pg": {
# "connection_uri": "postgresql://postgres:postgres@127.0.0.1:5432/langchain_chatchat",
# }
}
# 默认向量库类型。可选faiss, milvus, pg.
DEFAULT_VS_TYPE = "faiss"
# 缓存向量库数量
CACHED_VS_NUM = 1
# 知识库中单段文本长度
CHUNK_SIZE = 250
# 知识库中相邻文本重合长度
OVERLAP_SIZE = 50
# 知识库匹配向量数量
VECTOR_SEARCH_TOP_K = 5
# 知识库匹配相关度阈值取值范围在0-1之间SCORE越小相关度越高取到1相当于不筛选建议设置在0.5左右
# Mac 可能存在无法使用normalized_L2的问题因此调整SCORE_THRESHOLD至 0~1100
FAISS_NORMALIZE_L2 = True if system_name in ["Linux", "Windows"] else False
SCORE_THRESHOLD = 1 if system_name in ["Linux", "Windows"] else 1100
# 搜索引擎匹配结果数量
SEARCH_ENGINE_TOP_K = 5
# 基于本地知识问答的提示词模版
PROMPT_TEMPLATE = """【指令】根据已知信息,简洁和专业的来回答问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题”,不允许在答案中添加编造成分,答案请使用中文。
【已知信息】{context}
【问题】{question}"""
# API 是否开启跨域默认为False如果需要开启请设置为True
# is open cross domain
OPEN_CROSS_DOMAIN = False
# Bing 搜索必备变量
# 使用 Bing 搜索需要使用 Bing Subscription Key,需要在azure port中申请试用bing search
# 具体申请方式请见
# https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/create-bing-search-service-resource
# 使用python创建bing api 搜索实例详见:
# https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/quickstarts/rest/python
BING_SEARCH_URL = "https://api.bing.microsoft.com/v7.0/search"
# 注意不是bing Webmaster Tools的api key
# 此外如果是在服务器上报Failed to establish a new connection: [Errno 110] Connection timed out
# 是因为服务器加了防火墙需要联系管理员加白名单如果公司的服务器的话就别想了GG
BING_SUBSCRIPTION_KEY = ""
# 是否开启中文标题加强,以及标题增强的相关配置
# 通过增加标题判断判断哪些文本为标题并在metadata中进行标记
# 然后将文本与往上一级的标题进行拼合,实现文本信息的增强。
ZH_TITLE_ENHANCE = False

View File

@ -0,0 +1,111 @@
from .model_config import LLM_MODEL, LLM_DEVICE
# API 是否开启跨域默认为False如果需要开启请设置为True
# is open cross domain
OPEN_CROSS_DOMAIN = False
# 各服务器默认绑定host
DEFAULT_BIND_HOST = "127.0.0.1"
# webui.py server
WEBUI_SERVER = {
"host": DEFAULT_BIND_HOST,
"port": 8501,
}
# api.py server
API_SERVER = {
"host": DEFAULT_BIND_HOST,
"port": 7861,
}
# fastchat openai_api server
FSCHAT_OPENAI_API = {
"host": DEFAULT_BIND_HOST,
"port": 8888, # model_config.llm_model_dict中模型配置的api_base_url需要与这里一致。
}
# sandbox api server
CONTRAINER_NAME = "devopsgt_default"
IMAGE_NAME = "devopsgpt:pypy38"
SANDBOX_SERVER = {
"host": DEFAULT_BIND_HOST,
"port": 5050,
"url": "http://localhost:5050",
"do_remote": True,
}
# fastchat model_worker server
# 这些模型必须是在model_config.llm_model_dict中正确配置的。
# 在启动startup.py时可用通过`--model-worker --model-name xxxx`指定模型不指定则为LLM_MODEL
FSCHAT_MODEL_WORKERS = {
LLM_MODEL: {
"host": DEFAULT_BIND_HOST,
"port": 20002,
"device": LLM_DEVICE,
# todo: 多卡加载需要配置的参数
"gpus": None,
"numgpus": 1,
# 以下为非常用参数,可根据需要配置
# "max_gpu_memory": "20GiB",
# "load_8bit": False,
# "cpu_offloading": None,
# "gptq_ckpt": None,
# "gptq_wbits": 16,
# "gptq_groupsize": -1,
# "gptq_act_order": False,
# "awq_ckpt": None,
# "awq_wbits": 16,
# "awq_groupsize": -1,
# "model_names": [LLM_MODEL],
# "conv_template": None,
# "limit_worker_concurrency": 5,
# "stream_interval": 2,
# "no_register": False,
},
}
# fastchat multi model worker server
FSCHAT_MULTI_MODEL_WORKERS = {
# todo
}
# fastchat controller server
FSCHAT_CONTROLLER = {
"host": DEFAULT_BIND_HOST,
"port": 20001,
"dispatch_method": "shortest_queue",
}
# 以下不要更改
def fschat_controller_address() -> str:
host = FSCHAT_CONTROLLER["host"]
port = FSCHAT_CONTROLLER["port"]
return f"http://{host}:{port}"
def fschat_model_worker_address(model_name: str = LLM_MODEL) -> str:
if model := FSCHAT_MODEL_WORKERS.get(model_name):
host = model["host"]
port = model["port"]
return f"http://{host}:{port}"
def fschat_openai_api_address() -> str:
host = FSCHAT_OPENAI_API["host"]
port = FSCHAT_OPENAI_API["port"]
return f"http://{host}:{port}"
def api_address() -> str:
host = API_SERVER["host"]
port = API_SERVER["port"]
return f"http://{host}:{port}"
def webui_address() -> str:
host = WEBUI_SERVER["host"]
port = WEBUI_SERVER["port"]
return f"http://{host}:{port}"

View File

@ -0,0 +1,8 @@
from .base_chat import Chat
from .knowledge_chat import KnowledgeChat
from .llm_chat import LLMChat
from .search_chat import SearchChat
__all__ = [
"Chat", "KnowledgeChat", "LLMChat", "SearchChat"
]

View File

@ -0,0 +1,164 @@
from fastapi import Body, Request
from fastapi.responses import StreamingResponse
import asyncio, json
from typing import List, AsyncIterable
from langchain.chat_models import ChatOpenAI
from langchain import LLMChain
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.prompts.chat import ChatPromptTemplate
from dev_opsgpt.chat.utils import History, wrap_done
from configs.model_config import (llm_model_dict, LLM_MODEL, VECTOR_SEARCH_TOP_K, SCORE_THRESHOLD)
from dev_opsgpt.utils import BaseResponse
from loguru import logger
def getChatModel(callBack: AsyncIteratorCallbackHandler = None):
if callBack is None:
model = ChatOpenAI(
streaming=True,
verbose=True,
openai_api_key=llm_model_dict[LLM_MODEL]["api_key"],
openai_api_base=llm_model_dict[LLM_MODEL]["api_base_url"],
model_name=LLM_MODEL
)
else:
model = ChatOpenAI(
streaming=True,
verbose=True,
callbacks=[callBack],
openai_api_key=llm_model_dict[LLM_MODEL]["api_key"],
openai_api_base=llm_model_dict[LLM_MODEL]["api_base_url"],
model_name=LLM_MODEL
)
return model
class Chat:
def __init__(
self,
engine_name: str = "",
top_k: int = 1,
stream: bool = False,
) -> None:
self.engine_name = engine_name
self.top_k = top_k
self.stream = stream
def check_service_status(self, ) -> BaseResponse:
return BaseResponse(code=200, msg=f"okok")
def chat(
self,
query: str = Body(..., description="用户输入", examples=["hello"]),
history: List[History] = Body(
[], description="历史对话",
examples=[[{"role": "user", "content": "我们来玩成语接龙,我先来,生龙活虎"}]]
),
engine_name: str = Body(..., description="知识库名称", examples=["samples"]),
top_k: int = Body(VECTOR_SEARCH_TOP_K, description="匹配向量数"),
score_threshold: float = Body(SCORE_THRESHOLD, description="知识库匹配相关度阈值取值范围在0-1之间SCORE越小相关度越高取到1相当于不筛选建议设置在0.5左右", ge=0, le=1),
stream: bool = Body(False, description="流式输出"),
local_doc_url: bool = Body(False, description="知识文件返回本地路径(true)或URL(false)"),
request: Request = None,
):
self.engine_name = engine_name if isinstance(engine_name, str) else engine_name.default
self.top_k = top_k if isinstance(top_k, int) else top_k.default
self.score_threshold = score_threshold if isinstance(score_threshold, float) else score_threshold.default
self.stream = stream if isinstance(stream, bool) else stream.default
self.local_doc_url = local_doc_url if isinstance(local_doc_url, bool) else local_doc_url.default
self.request = request
return self._chat(query, history)
def _chat(self, query: str, history: List[History]):
history = [History(**h) if isinstance(h, dict) else h for h in history]
## check service dependency is ok
service_status = self.check_service_status()
if service_status.code!=200: return service_status
def chat_iterator(query: str, history: List[History]):
model = getChatModel()
result ,content = self.create_task(query, history, model)
if self.stream:
for token in content["text"]:
result["answer"] = token
yield json.dumps(result, ensure_ascii=False)
else:
for token in content["text"]:
result["answer"] += token
yield json.dumps(result, ensure_ascii=False)
return StreamingResponse(chat_iterator(query, history),
media_type="text/event-stream")
def achat(
self,
query: str = Body(..., description="用户输入", examples=["hello"]),
history: List[History] = Body(
[], description="历史对话",
examples=[[{"role": "user", "content": "我们来玩成语接龙,我先来,生龙活虎"}]]
),
engine_name: str = Body(..., description="知识库名称", examples=["samples"]),
top_k: int = Body(VECTOR_SEARCH_TOP_K, description="匹配向量数"),
score_threshold: float = Body(SCORE_THRESHOLD, description="知识库匹配相关度阈值取值范围在0-1之间SCORE越小相关度越高取到1相当于不筛选建议设置在0.5左右", ge=0, le=1),
stream: bool = Body(False, description="流式输出"),
local_doc_url: bool = Body(False, description="知识文件返回本地路径(true)或URL(false)"),
request: Request = None,
):
self.engine_name = engine_name if isinstance(engine_name, str) else engine_name.default
self.top_k = top_k if isinstance(top_k, int) else top_k.default
self.score_threshold = score_threshold if isinstance(score_threshold, float) else score_threshold.default
self.stream = stream if isinstance(stream, bool) else stream.default
self.local_doc_url = local_doc_url if isinstance(local_doc_url, bool) else local_doc_url.default
self.request = request
return self._achat(query, history)
def _achat(self, query: str, history: List[History]):
history = [History(**h) if isinstance(h, dict) else h for h in history]
## check service dependency is ok
service_status = self.check_service_status()
if service_status.code!=200: return service_status
async def chat_iterator(query, history):
callback = AsyncIteratorCallbackHandler()
model = getChatModel(callback)
task, result = self.create_atask(query, history, model, callback)
if self.stream:
async for token in callback.aiter():
result["answer"] = token
yield json.dumps(result, ensure_ascii=False)
else:
async for token in callback.aiter():
result["answer"] += token
yield json.dumps(result, ensure_ascii=False)
await task
return StreamingResponse(chat_iterator(query, history),
media_type="text/event-stream")
def create_task(self, query: str, history: List[History], model):
'''构建 llm 生成任务'''
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", "{input}")]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
content = chain({"input": query})
return {"answer": "", "docs": ""}, content
def create_atask(self, query, history, model, callback: AsyncIteratorCallbackHandler):
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", "{input}")]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
task = asyncio.create_task(wrap_done(
chain.acall({"input": query}), callback.done
))
return task, {"answer": "", "docs": ""}
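The `Chat` base class above leaves `create_task` and `create_atask` as the extension points consumed by `_chat` and `_achat`. As a hypothetical illustration of that contract (this subclass is not part of the commit), a subclass can skip the LLM entirely:
```python
# Hypothetical subclass sketch; it follows the (result, content) contract used by _chat above.
from typing import List

from dev_opsgpt.chat.base_chat import Chat
from dev_opsgpt.chat.utils import History


class EchoChat(Chat):
    """Ignores the model and simply echoes the query back."""

    def create_task(self, query: str, history: List[History], model):
        # _chat iterates content["text"], so returning the query there echoes it character by character.
        return {"answer": "", "docs": ""}, {"text": query}
```
KnowledgeChat, LLMChat, and SearchChat below follow the same pattern, differing only in how they build the prompt and the result dict.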

View File

@ -0,0 +1,79 @@
from fastapi import Request
import os, asyncio
from urllib.parse import urlencode
from typing import List
from langchain import LLMChain
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.prompts.chat import ChatPromptTemplate
from configs.model_config import (
llm_model_dict, LLM_MODEL, PROMPT_TEMPLATE,
VECTOR_SEARCH_TOP_K, SCORE_THRESHOLD)
from dev_opsgpt.chat.utils import History, wrap_done
from dev_opsgpt.utils import BaseResponse
from .base_chat import Chat
from dev_opsgpt.service.kb_api import search_docs, KBServiceFactory
from loguru import logger
class KnowledgeChat(Chat):
def __init__(
self,
engine_name: str = "",
top_k: int = VECTOR_SEARCH_TOP_K,
stream: bool = False,
score_threshold: float = SCORE_THRESHOLD,
local_doc_url: bool = False,
request: Request = None,
) -> None:
super().__init__(engine_name, top_k, stream)
self.score_threshold = score_threshold
self.local_doc_url = local_doc_url
self.request = request
def check_service_status(self) -> BaseResponse:
kb = KBServiceFactory.get_service_by_name(self.engine_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {self.engine_name}")
return BaseResponse(code=200, msg=f"找到知识库 {self.engine_name}")
def _process(self, query: str, history: List[History], model):
'''process'''
docs = search_docs(query, self.engine_name, self.top_k, self.score_threshold)
context = "\n".join([doc.page_content for doc in docs])
source_documents = []
for inum, doc in enumerate(docs):
filename = os.path.split(doc.metadata["source"])[-1]
if self.local_doc_url:
url = "file://" + doc.metadata["source"]
else:
parameters = urlencode({"knowledge_base_name": self.engine_name, "file_name":filename})
url = f"{self.request.base_url}knowledge_base/download_doc?" + parameters
text = f"""出处 [{inum + 1}] [{filename}]({url}) \n\n{doc.page_content}\n\n"""
source_documents.append(text)
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", PROMPT_TEMPLATE)]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
result = {"answer": "", "docs": source_documents}
return chain, context, result
def create_task(self, query: str, history: List[History], model):
'''构建 llm 生成任务'''
logger.debug(f"query: {query}, history: {history}")
chain, context, result = self._process(query, history, model)
try:
content = chain({"context": context, "question": query})
except Exception as e:
content = {"text": str(e)}
return result, content
def create_atask(self, query, history, model, callback: AsyncIteratorCallbackHandler):
chain, context, result = self._process(query, history, model)
task = asyncio.create_task(wrap_done(
chain.acall({"context": context, "question": query}), callback.done
))
return task, result

View File

@ -0,0 +1,41 @@
import asyncio
from typing import List
from langchain import LLMChain
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.prompts.chat import ChatPromptTemplate
from dev_opsgpt.chat.utils import History, wrap_done
from .base_chat import Chat
from loguru import logger
class LLMChat(Chat):
def __init__(
self,
engine_name: str = "",
top_k: int = 1,
stream: bool = False,
) -> None:
super().__init__(engine_name, top_k, stream)
def create_task(self, query: str, history: List[History], model):
'''构建 llm 生成任务'''
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", "{input}")]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
content = chain({"input": query})
return {"answer": "", "docs": ""}, content
def create_atask(self, query, history, model, callback: AsyncIteratorCallbackHandler):
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", "{input}")]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
task = asyncio.create_task(wrap_done(
chain.acall({"input": query}), callback.done
))
return task, {"answer": "", "docs": ""}

View File

@ -0,0 +1,151 @@
from fastapi import Request
import os, asyncio
from urllib.parse import urlencode
from typing import List, Optional, Dict
from langchain import LLMChain
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.utilities import BingSearchAPIWrapper, DuckDuckGoSearchAPIWrapper
from langchain.prompts.chat import ChatPromptTemplate
from langchain.docstore.document import Document
from configs.model_config import (
PROMPT_TEMPLATE, SEARCH_ENGINE_TOP_K, BING_SUBSCRIPTION_KEY, BING_SEARCH_URL,
VECTOR_SEARCH_TOP_K, SCORE_THRESHOLD)
from dev_opsgpt.chat.utils import History, wrap_done
from dev_opsgpt.utils import BaseResponse
from .base_chat import Chat
from loguru import logger
from duckduckgo_search import DDGS
def bing_search(text, result_len=SEARCH_ENGINE_TOP_K):
if not (BING_SEARCH_URL and BING_SUBSCRIPTION_KEY):
return [{"snippet": "please set BING_SUBSCRIPTION_KEY and BING_SEARCH_URL in os ENV",
"title": "env info is not found",
"link": "https://python.langchain.com/en/latest/modules/agents/tools/examples/bing_search.html"}]
search = BingSearchAPIWrapper(bing_subscription_key=BING_SUBSCRIPTION_KEY,
bing_search_url=BING_SEARCH_URL)
return search.results(text, result_len)
def duckduckgo_search(
query: str,
result_len: int = SEARCH_ENGINE_TOP_K,
region: Optional[str] = "wt-wt",
safesearch: str = "moderate",
time: Optional[str] = "y",
backend: str = "api",
):
with DDGS(proxies=os.environ.get("DUCKDUCKGO_PROXY")) as ddgs:
results = ddgs.text(
query,
region=region,
safesearch=safesearch,
timelimit=time,
backend=backend,
)
if results is None:
return [{"Result": "No good DuckDuckGo Search Result was found"}]
def to_metadata(result: Dict) -> Dict[str, str]:
if backend == "news":
return {
"date": result["date"],
"title": result["title"],
"snippet": result["body"],
"source": result["source"],
"link": result["url"],
}
return {
"snippet": result["body"],
"title": result["title"],
"link": result["href"],
}
formatted_results = []
for i, res in enumerate(results, 1):
if res is not None:
formatted_results.append(to_metadata(res))
if len(formatted_results) == result_len:
break
return formatted_results
# def duckduckgo_search(text, result_len=SEARCH_ENGINE_TOP_K):
# search = DuckDuckGoSearchAPIWrapper()
# return search.results(text, result_len)
SEARCH_ENGINES = {"duckduckgo": duckduckgo_search,
"bing": bing_search,
}
def search_result2docs(search_results):
docs = []
for result in search_results:
doc = Document(page_content=result["snippet"] if "snippet" in result.keys() else "",
metadata={"source": result["link"] if "link" in result.keys() else "",
"filename": result["title"] if "title" in result.keys() else ""})
docs.append(doc)
return docs
def lookup_search_engine(
query: str,
search_engine_name: str,
top_k: int = SEARCH_ENGINE_TOP_K,
):
results = SEARCH_ENGINES[search_engine_name](query, result_len=top_k)
docs = search_result2docs(results)
return docs
class SearchChat(Chat):
def __init__(
self,
engine_name: str = "",
top_k: int = VECTOR_SEARCH_TOP_K,
stream: bool = False,
) -> None:
super().__init__(engine_name, top_k, stream)
def check_service_status(self) -> BaseResponse:
if self.engine_name not in SEARCH_ENGINES.keys():
return BaseResponse(code=404, msg=f"未支持搜索引擎 {self.engine_name}")
return BaseResponse(code=200, msg=f"支持搜索引擎 {self.engine_name}")
def _process(self, query: str, history: List[History], model):
'''process'''
docs = lookup_search_engine(query, self.engine_name, self.top_k)
context = "\n".join([doc.page_content for doc in docs])
source_documents = [
f"""出处 [{inum + 1}] [{doc.metadata["source"]}]({doc.metadata["source"]}) \n\n{doc.page_content}\n\n"""
for inum, doc in enumerate(docs)
]
chat_prompt = ChatPromptTemplate.from_messages(
[i.to_msg_tuple() for i in history] + [("human", PROMPT_TEMPLATE)]
)
chain = LLMChain(prompt=chat_prompt, llm=model)
result = {"answer": "", "docs": source_documents}
return chain, context, result
def create_task(self, query: str, history: List[History], model):
'''构建 llm 生成任务'''
chain, context, result = self._process(query, history, model)
content = chain({"context": context, "question": query})
return result, content
def create_atask(self, query, history, model, callback: AsyncIteratorCallbackHandler):
chain, context, result = self._process(query, history, model)
task = asyncio.create_task(wrap_done(
chain.acall({"context": context, "question": query}), callback.done
))
return task, result
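A minimal sketch of calling the search helpers above directly (it assumes network access and the `duckduckgo_search` dependency; outside the `SearchChat` class the helpers are plain functions):
```python
from dev_opsgpt.chat.search_chat import lookup_search_engine

# Returns langchain Documents built by search_result2docs from the raw snippets.
docs = lookup_search_engine("retrieval augmented generation", "duckduckgo", top_k=3)
for doc in docs:
    print(doc.metadata.get("source"), "-", doc.page_content[:80])
```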

30
dev_opsgpt/chat/utils.py Normal file

@ -0,0 +1,30 @@
import asyncio
from typing import Awaitable
from pydantic import BaseModel, Field
async def wrap_done(fn: Awaitable, event: asyncio.Event):
"""Wrap an awaitable with a event to signal when it's done or an exception is raised."""
try:
await fn
except Exception as e:
# TODO: handle exception
print(f"Caught exception: {e}")
finally:
# Signal the aiter to stop.
event.set()
class History(BaseModel):
"""
对话历史
可从dict生成
h = History(**{"role":"user","content":"你好"})
也可转换为tuple
h.to_msg_tuple() -> ("human", "你好")
"""
role: str = Field(...)
content: str = Field(...)
def to_msg_tuple(self):
return "ai" if self.role=="assistant" else "human", self.content

View File

@ -0,0 +1,6 @@
from .json_loader import JSONLoader
from .jsonl_loader import JSONLLoader
__all__ = [
"JSONLoader", "JSONLLoader"
]

View File

@ -0,0 +1,41 @@
import json
from pathlib import Path
from typing import AnyStr, Callable, Dict, List, Optional, Union
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from dev_opsgpt.utils.common_utils import read_json_file
class JSONLoader(BaseLoader):
def __init__(
self,
file_path: Union[str, Path],
schema_key: str = "all_text",
content_key: Optional[str] = None,
metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
text_content: bool = True,
):
self.file_path = Path(file_path).resolve()
self.schema_key = schema_key
self._content_key = content_key
self._metadata_func = metadata_func
self._text_content = text_content
def load(self, ) -> List[Document]:
"""Load and return documents from the JSON file."""
docs: List[Document] = []
datas = read_json_file(self.file_path)
self._parse(datas, docs)
return docs
def _parse(self, datas: List, docs: List[Document]) -> None:
for idx, sample in enumerate(datas):
metadata = dict(
source=str(self.file_path),
seq_num=idx,
)
text = sample.get(self.schema_key, "")
docs.append(Document(page_content=text, metadata=metadata))
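A minimal usage sketch for the loader above (hypothetical: the file name and the import path are assumptions; since `schema_key` defaults to `"all_text"`, the input is expected to be a JSON list of objects with an `all_text` field):
```python
from dev_opsgpt.document_loaders import JSONLoader  # assumed package path, re-exported by the __init__ above

loader = JSONLoader("sample.json")  # e.g. [{"all_text": "first doc"}, {"all_text": "second doc"}]
docs = loader.load()
for doc in docs:
    print(doc.metadata["seq_num"], doc.page_content)
```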

View File

@ -0,0 +1,41 @@
import json
from pathlib import Path
from typing import AnyStr, Callable, Dict, List, Optional, Union
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from dev_opsgpt.utils.common_utils import read_jsonl_file
class JSONLLoader(BaseLoader):
def __init__(
self,
file_path: Union[str, Path],
schema_key: str = "all_text",
content_key: Optional[str] = None,
metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
text_content: bool = True,
):
self.file_path = Path(file_path).resolve()
self.schema_key = schema_key
self._content_key = content_key
self._metadata_func = metadata_func
self._text_content = text_content
def load(self, ) -> List[Document]:
"""Load and return documents from the JSON file."""
docs: List[Document] = []
datas = read_jsonl_file(self.file_path)
self._parse(datas, docs)
return docs
def _parse(self, datas: List, docs: List[Document]) -> None:
for idx, sample in enumerate(datas):
metadata = dict(
source=str(self.file_path),
seq_num=idx,
)
text = sample.get(self.schema_key, "")
docs.append(Document(page_content=text, metadata=metadata))

View File

@ -0,0 +1,37 @@
from typing import List
from langchain.embeddings.base import Embeddings
from langchain.schema import Document
class BaseVSCService:
def do_create_kb(self):
pass
def do_drop_kb(self):
pass
def do_add_doc(self, docs: List[Document], embeddings: Embeddings):
pass
def do_clear_vs(self):
pass
def vs_type(self) -> str:
return "default"
def do_init(self):
pass
def do_search(self):
pass
def do_insert_multi_knowledge(self):
pass
def do_insert_one_knowledge(self):
pass
def do_delete_doc(self):
pass

View File

@ -0,0 +1,12 @@
from functools import lru_cache
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from configs.model_config import embedding_model_dict
from loguru import logger
@lru_cache(1)
def load_embeddings(model: str, device: str):
logger.info("load embedding model: {}, {}".format(model, embedding_model_dict[model]))
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_dict[model],
model_kwargs={'device': device})
return embeddings
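A minimal sketch of using the cached loader above (a hypothetical snippet; the exact module path of `load_embeddings` is not shown in this commit, so the import below is an assumption):
```python
from configs.model_config import EMBEDDING_MODEL, EMBEDDING_DEVICE
from dev_opsgpt.embeddings.utils import load_embeddings  # assumed module path

# The helper is wrapped with lru_cache(1), so repeated calls with the same arguments reuse the model.
embeddings = load_embeddings(EMBEDDING_MODEL, EMBEDDING_DEVICE)
vector = embeddings.embed_query("hello world")
print(len(vector))  # embedding dimension, e.g. 768 for text2vec-base-chinese
```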

View File

@ -0,0 +1,22 @@
from .db import _engine, Base
from loguru import logger
__all__ = [
]
def create_tables():
Base.metadata.create_all(bind=_engine)
def reset_tables():
Base.metadata.drop_all(bind=_engine)
create_tables()
def check_tables_exist(table_name) -> bool:
table_exist = _engine.dialect.has_table(_engine.connect(), table_name, schema=None)
return table_exist
def table_init():
if (not check_tables_exist("knowledge_base")) or (not check_tables_exist ("knowledge_file")):
create_tables()
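A small, hypothetical startup snippet showing how the helpers above would typically be called before any knowledge-base operation (`table_init` only creates the tables that are missing):
```python
# Hypothetical startup snippet; table_init() and create_tables() are defined in the ORM package __init__ above.
from dev_opsgpt.orm import table_init  # assumed import path

table_init()  # creates the "knowledge_base" / "knowledge_file" tables if either is missing
```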

View File

@ -0,0 +1,10 @@
from .document_file_cds import *
from .document_base_cds import *
__all__ = [
"add_kb_to_db", "list_kbs_from_db", "kb_exists",
"load_kb_from_db", "delete_kb_from_db", "get_kb_detail",
"list_docs_from_db", "add_doc_to_db", "delete_file_from_db",
"delete_files_from_db", "doc_exists", "get_file_detail",
]

View File

@ -0,0 +1,89 @@
from dev_opsgpt.orm.db import with_session, _engine
from dev_opsgpt.orm.schemas.base_schema import KnowledgeBaseSchema
# @with_session
# def _query_by_condition(session, schema, query_kargs, query_type="first"):
# if len(query_kargs) >0:
# if query_type == "first":
# return session.query(schema).filter_by(query_kargs).first()
# elif query_type == "all":
# return session.query(schema).filter_by(query_kargs).first()
# @with_session
# def _add_to_db(session, schema, query_kargs):
# kb = schema(**query_kargs)
# session.add(kb)
# return True
# @with_session
# def add_to_db(session, schema, query_kargs):
# kb = _query_by_condition(session, schema, query_kargs, query_type="first")
# if not kb:
# _add_to_db(session, schema, query_kargs)
# else: # update kb with new vs_type and embed_model
# for k, v in query_kargs.items():
# if k in kb:
# kb[k] = v
# return True
@with_session
def add_kb_to_db(session, kb_name, vs_type, embed_model):
# 创建知识库实例
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_name).first()
if not kb:
kb = KnowledgeBaseSchema(kb_name=kb_name, vs_type=vs_type, embed_model=embed_model)
session.add(kb)
else: # update kb with new vs_type and embed_model
kb.vs_type = vs_type
kb.embed_model = embed_model
return True
@with_session
def list_kbs_from_db(session, min_file_count: int = -1):
kbs = session.query(KnowledgeBaseSchema.kb_name).filter(KnowledgeBaseSchema.file_count > min_file_count).all()
kbs = [kb[0] for kb in kbs]
return kbs
@with_session
def kb_exists(session, kb_name):
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_name).first()
status = True if kb else False
return status
@with_session
def load_kb_from_db(session, kb_name):
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_name).first()
if kb:
kb_name, vs_type, embed_model = kb.kb_name, kb.vs_type, kb.embed_model
else:
kb_name, vs_type, embed_model = None, None, None
return kb_name, vs_type, embed_model
@with_session
def delete_kb_from_db(session, kb_name):
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_name).first()
if kb:
session.delete(kb)
return True
@with_session
def get_kb_detail(session, kb_name: str) -> dict:
kb: KnowledgeBaseSchema = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_name).first()
if kb:
return {
"kb_name": kb.kb_name,
"vs_type": kb.vs_type,
"embed_model": kb.embed_model,
"file_count": kb.file_count,
"create_time": kb.create_time,
}
else:
return {}
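Because every function above is wrapped by with_session, callers never pass a session; the decorator creates one, commits, and closes it. A minimal sketch (the knowledge base name and embedding model are placeholders, and table_init() must have created the tables first):

from dev_opsgpt.orm.commands import add_kb_to_db, list_kbs_from_db, get_kb_detail

add_kb_to_db(kb_name="samples", vs_type="faiss", embed_model="text2vec-base")
print(list_kbs_from_db())        # e.g. ["samples"]
print(get_kb_detail("samples"))  # dict with vs_type, embed_model, file_count, create_time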

View File

@ -0,0 +1,87 @@
from dev_opsgpt.orm.db import with_session, _engine
from dev_opsgpt.orm.schemas.base_schema import KnowledgeFileSchema, KnowledgeBaseSchema
from dev_opsgpt.orm.utils import DocumentFile
@with_session
def list_docs_from_db(session, kb_name):
files = session.query(KnowledgeFileSchema).filter_by(kb_name=kb_name).all()
docs = [f.file_name for f in files]
return docs
@with_session
def add_doc_to_db(session, kb_file: DocumentFile):
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_file.kb_name).first()
if kb:
# 如果已经存在该文件,则更新文件版本号
existing_file = session.query(KnowledgeFileSchema).filter_by(file_name=kb_file.filename,
kb_name=kb_file.kb_name).first()
if existing_file:
existing_file.file_version += 1
# 否则,添加新文件
else:
new_file = KnowledgeFileSchema(
file_name=kb_file.filename,
file_ext=kb_file.ext,
kb_name=kb_file.kb_name,
document_loader_name=kb_file.document_loader_name,
text_splitter_name=kb_file.text_splitter_name or "SpacyTextSplitter",
)
kb.file_count += 1
session.add(new_file)
return True
@with_session
def delete_file_from_db(session, kb_file: DocumentFile):
existing_file = session.query(KnowledgeFileSchema).filter_by(file_name=kb_file.filename,
kb_name=kb_file.kb_name).first()
if existing_file:
session.delete(existing_file)
session.commit()
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=kb_file.kb_name).first()
if kb:
kb.file_count -= 1
session.commit()
return True
@with_session
def delete_files_from_db(session, knowledge_base_name: str):
session.query(KnowledgeFileSchema).filter_by(kb_name=knowledge_base_name).delete()
kb = session.query(KnowledgeBaseSchema).filter_by(kb_name=knowledge_base_name).first()
if kb:
kb.file_count = 0
session.commit()
return True
@with_session
def doc_exists(session, kb_file: DocumentFile):
existing_file = session.query(KnowledgeFileSchema).filter_by(file_name=kb_file.filename,
kb_name=kb_file.kb_name).first()
return True if existing_file else False
@with_session
def get_file_detail(session, kb_name: str, filename: str) -> dict:
file: KnowledgeFileSchema = (session.query(KnowledgeFileSchema)
.filter_by(file_name=filename,
kb_name=kb_name).first())
if file:
return {
"kb_name": file.kb_name,
"file_name": file.file_name,
"file_ext": file.file_ext,
"file_version": file.file_version,
"document_loader": file.document_loader_name,
"text_splitter": file.text_splitter_name,
"create_time": file.create_time,
}
else:
return {}

59
dev_opsgpt/orm/db.py Normal file
View File

@ -0,0 +1,59 @@
from contextlib import contextmanager
from sqlalchemy.engine import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from configs.model_config import SQLALCHEMY_DATABASE_URI
_engine = create_engine(SQLALCHEMY_DATABASE_URI)
session_factory = sessionmaker(bind=_engine)
Base = declarative_base()
def init_session():
session = session_factory()
try:
yield session
finally:
try:
session.commit()
except Exception as e:
session.rollback()
raise e
finally:
session.close()
def with_session(func):
def wrapper(*args, **kwargs):
session = session_factory()
try:
return func(session, *args, **kwargs)
finally:
try:
session.commit()
except Exception as e:
session.rollback()
raise e
finally:
session.close()
return wrapper
@contextmanager
def session_scope():
    """Context manager that yields a session and commits it, rolling back on error."""
    session = session_factory(autoflush=True)
    try:
        yield session
        session.commit()
    except Exception as e:
        session.rollback()
        raise e
    finally:
        session.close()
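A short sketch of how with_session is meant to be consumed; the decorated function receives the injected session as its first argument and the wrapper handles commit, rollback, and close (the counting helper is hypothetical):

from dev_opsgpt.orm.db import with_session
from dev_opsgpt.orm.schemas.base_schema import KnowledgeBaseSchema

@with_session
def count_kbs(session):
    # `session` is supplied by the decorator; the caller passes no arguments.
    return session.query(KnowledgeBaseSchema).count()

# print(count_kbs())  # works once table_init() has created the tables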

View File

@ -0,0 +1,48 @@
from sqlalchemy import Column, Integer, String, DateTime, func
from dev_opsgpt.orm.db import Base
class KnowledgeBaseSchema(Base):
"""
知识库模型
"""
__tablename__ = 'knowledge_base'
id = Column(Integer, primary_key=True, autoincrement=True, comment='知识库ID')
kb_name = Column(String, comment='知识库名称')
vs_type = Column(String, comment='嵌入模型类型')
embed_model = Column(String, comment='嵌入模型名称')
file_count = Column(Integer, default=0, comment='文件数量')
create_time = Column(DateTime, default=func.now(), comment='创建时间')
def __repr__(self):
return f"""<KnowledgeBase(id='{self.id}',
kb_name='{self.kb_name}',
vs_type='{self.vs_type}',
embed_model='{self.embed_model}',
file_count='{self.file_count}',
create_time='{self.create_time}')>"""
class KnowledgeFileSchema(Base):
"""
知识文件模型
"""
__tablename__ = 'knowledge_file'
id = Column(Integer, primary_key=True, autoincrement=True, comment='知识文件ID')
file_name = Column(String, comment='文件名')
file_ext = Column(String, comment='文件扩展名')
kb_name = Column(String, comment='所属知识库名称')
document_loader_name = Column(String, comment='文档加载器名称')
text_splitter_name = Column(String, comment='文本分割器名称')
file_version = Column(Integer, default=1, comment='文件版本')
create_time = Column(DateTime, default=func.now(), comment='创建时间')
def __repr__(self):
return f"""<KnowledgeFile(id='{self.id}',
file_name='{self.file_name}',
file_ext='{self.file_ext}',
kb_name='{self.kb_name}',
document_loader_name='{self.document_loader_name}',
text_splitter_name='{self.text_splitter_name}',
file_version='{self.file_version}',
create_time='{self.create_time}')>"""

18
dev_opsgpt/orm/utils.py Normal file
View File

@ -0,0 +1,18 @@
import os
from dev_opsgpt.utils.path_utils import get_file_path, get_LoaderClass, SUPPORTED_EXTS
class DocumentFile:
def __init__(
self, filename: str, knowledge_base_name: str) -> None:
self.kb_name = knowledge_base_name
self.filename = filename
self.ext = os.path.splitext(filename)[-1].lower()
if self.ext not in SUPPORTED_EXTS:
raise ValueError(f"暂未支持的文件格式 {self.ext}")
self.filepath = get_file_path(knowledge_base_name, filename)
self.docs = None
self.document_loader_name = get_LoaderClass(self.ext)
# TODO: 增加依据文件格式匹配text_splitter
self.text_splitter_name = None
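A brief sketch of constructing a DocumentFile (names are placeholders; the extension must be listed in SUPPORTED_EXTS, and ".md" is assumed to be supported):

from dev_opsgpt.orm.utils import DocumentFile

kb_file = DocumentFile(filename="intro.md", knowledge_base_name="samples")
print(kb_file.ext)                   # ".md"
print(kb_file.document_loader_name)  # loader class resolved by get_LoaderClass
print(kb_file.filepath)              # path under the knowledge base's content folder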

View File

@ -0,0 +1,6 @@
from .basebox import CodeBoxResponse
from .pycodebox import PyCodeBox
__all__ = [
"CodeBoxResponse", "PyCodeBox"
]

View File

@ -0,0 +1,142 @@
from pydantic import BaseModel
from typing import Optional
from pathlib import Path
import sys
from abc import ABC, abstractclassmethod
from configs.server_config import SANDBOX_SERVER
class CodeBoxResponse(BaseModel):
code_text: str = ""
code_exe_response: str = ""
code_exe_type: str = ""
code_exe_status: int
do_code_exe: bool
def __str__(self,):
return f"""status: {self.code_exe_status}, type: {self.code_exe_type}, response: {self.code_exe_response}"""
class CodeBoxStatus(BaseModel):
status: str
class CodeBoxFile(BaseModel):
"""
Represents a file returned from a CodeBox instance.
"""
name: str
content: Optional[bytes] = None
def __str__(self):
return self.name
def __repr__(self):
return f"File({self.name})"
class BaseBox(ABC):
enter_status = False
def __init__(
self,
remote_url: str = "",
remote_ip: str = SANDBOX_SERVER["host"],
remote_port: str = SANDBOX_SERVER["port"],
token: str = "mytoken",
do_code_exe: bool = False,
do_remote: bool = False
):
self.token = token
self.remote_url = remote_url or remote_ip + ":" + str(remote_port)
self.remote_ip = remote_ip
self.remote_port = remote_port
self.do_code_exe = do_code_exe
self.do_remote = do_remote
self.local_pyenv = Path(sys.executable).absolute()
self.ws = None
self.aiohttp_session = None
self.kernel_url = ""
def chat(self, text: str, file_path: str = None, do_code_exe: bool = None) -> CodeBoxResponse:
'''执行流'''
do_code_exe = self.do_code_exe if do_code_exe is None else do_code_exe
if not do_code_exe:
return CodeBoxResponse(
code_exe_response=text, code_text=text, code_exe_type="text", code_exe_status=200,
do_code_exe=do_code_exe
)
try:
code_text = self.decode_code_from_text(text)
return self.run(code_text, file_path)
except Exception as e:
return CodeBoxResponse(
code_exe_response=str(e), code_text=text, code_exe_type="error", code_exe_status=500,
do_code_exe=do_code_exe
)
async def achat(self, text: str, file_path: str = None, do_code_exe: bool = None) -> CodeBoxResponse:
do_code_exe = self.do_code_exe if do_code_exe is None else do_code_exe
if not do_code_exe:
return CodeBoxResponse(
code_exe_response=text, code_text=text, code_exe_type="text", code_exe_status=200,
do_code_exe=do_code_exe
)
try:
code_text = self.decode_code_from_text(text)
return await self.arun(code_text, file_path)
except Exception as e:
return CodeBoxResponse(
code_exe_response=str(e), code_text=text, code_exe_type="error", code_exe_status=500,
do_code_exe=do_code_exe
)
def run(
self,
code_text: str = None,
file_path: str = None,
retry=3,
) -> CodeBoxResponse:
'''执行代码'''
pass
async def arun(
self,
code_text: str = None,
file_path: str = None,
retry=3,
) -> CodeBoxResponse:
'''执行代码'''
pass
def decode_code_from_text(self, text):
pass
def start(self,):
pass
async def astart(self, ):
pass
@abstractclassmethod
def stop(self) -> CodeBoxStatus:
"""Terminate the CodeBox instance"""
def __enter__(self, ) -> "BaseBox":
if not self.enter_status:
self.start()
return self
async def __aenter__(self, ) -> "BaseBox":
if not self.enter_status:
await self.astart()
return self
def __exit__(self, exc_type, exc_value, traceback) -> None:
self.stop()
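A small sketch of how a CodeBoxResponse is typically built and inspected; the status codes follow the conventions used above (200 success, 500 error), and the dev_opsgpt.sandbox import path is an assumption based on this package's __init__:

from dev_opsgpt.sandbox import CodeBoxResponse  # assumed package path

resp = CodeBoxResponse(
    code_text="print(1 + 1)",
    code_exe_response="2",
    code_exe_type="text",
    code_exe_status=200,
    do_code_exe=True,
)
print(resp)  # "status: 200, type: text, response: 2"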

View File

@ -0,0 +1,411 @@
import time, os, docker, requests, json, uuid, subprocess, asyncio, aiohttp, re, traceback
import psutil
from typing import List, Optional, Union
from pathlib import Path
from loguru import logger
from websockets.sync.client import connect as ws_connect_sync
from websockets.client import connect as ws_connect
from websocket import create_connection
from websockets.client import WebSocketClientProtocol, ClientConnection
from websockets.exceptions import ConnectionClosedError
from configs.server_config import SANDBOX_SERVER
from .basebox import BaseBox, CodeBoxResponse, CodeBoxStatus, CodeBoxFile
class PyCodeBox(BaseBox):
enter_status: bool = False
def __init__(
self,
remote_url: str = "",
remote_ip: str = SANDBOX_SERVER["host"],
remote_port: str = SANDBOX_SERVER["port"],
token: str = "mytoken",
do_code_exe: bool = False,
do_remote: bool = False
):
super().__init__(remote_url, remote_ip, remote_port, token, do_code_exe, do_remote)
self.enter_status = True
asyncio.run(self.astart())
def decode_code_from_text(self, text: str) -> str:
pattern = r'```.*?```'
code_blocks = re.findall(pattern, text, re.DOTALL)
code_text: str = "\n".join([block.strip('`') for block in code_blocks])
code_text = code_text[6:] if code_text.startswith("python") else code_text
code_text = code_text.replace("python\n", "").replace("code", "")
return code_text
def run(
self, code_text: Optional[str] = None,
file_path: Optional[os.PathLike] = None,
retry = 3,
) -> CodeBoxResponse:
if not code_text and not file_path:
return CodeBoxResponse(
code_exe_response="Code or file_path must be specifieds!",
code_text=code_text,
code_exe_type="text",
code_exe_status=502,
do_code_exe=self.do_code_exe,
)
if code_text and file_path:
return CodeBoxResponse(
code_exe_response="Can only specify code or the file to read_from!",
code_text=code_text,
code_exe_type="text",
code_exe_status=502,
do_code_exe=self.do_code_exe,
)
if file_path:
with open(file_path, "r", encoding="utf-8") as f:
code_text = f.read()
# run code in jupyter kernel
if retry <= 0:
raise RuntimeError("Could not connect to kernel")
if not self.ws:
raise RuntimeError("Jupyter not running. Make sure to start it first")
logger.debug(f"code_text: {json.dumps(code_text, ensure_ascii=False)}")
self.ws.send(
json.dumps(
{
"header": {
"msg_id": (msg_id := uuid.uuid4().hex),
"msg_type": "execute_request",
},
"parent_header": {},
"metadata": {},
"content": {
"code": code_text,
"silent": True,
"store_history": True,
"user_expressions": {},
"allow_stdin": False,
"stop_on_error": True,
},
"channel": "shell",
"buffers": [],
}
)
)
result = ""
while True:
try:
if isinstance(self.ws, WebSocketClientProtocol):
raise RuntimeError("Mixing asyncio and sync code is not supported")
received_msg = json.loads(self.ws.recv())
except ConnectionClosedError:
logger.debug("box start, ConnectionClosedError!!!")
self.start()
return self.run(code_text, file_path, retry - 1)
if (
received_msg["header"]["msg_type"] == "stream"
and received_msg["parent_header"]["msg_id"] == msg_id
):
msg = received_msg["content"]["text"].strip()
if "Requirement already satisfied:" in msg:
continue
result += msg + "\n"
elif (
received_msg["header"]["msg_type"] == "execute_result"
and received_msg["parent_header"]["msg_id"] == msg_id
):
result += received_msg["content"]["data"]["text/plain"].strip() + "\n"
elif received_msg["header"]["msg_type"] == "display_data":
if "image/png" in received_msg["content"]["data"]:
return CodeBoxResponse(
code_exe_type="image/png",
code_text=code_text,
code_exe_response=received_msg["content"]["data"]["image/png"],
code_exe_status=200,
do_code_exe=self.do_code_exe
)
if "text/plain" in received_msg["content"]["data"]:
return CodeBoxResponse(
code_exe_type="text",
code_text=code_text,
code_exe_response=received_msg["content"]["data"]["text/plain"],
code_exe_status=200,
do_code_exe=self.do_code_exe
)
return CodeBoxResponse(
code_exe_type="error",
code_text=code_text,
code_exe_response=received_msg["content"]["data"]["text/plain"],
code_exe_status=420,
do_code_exe=self.do_code_exe
)
elif (
received_msg["header"]["msg_type"] == "status"
and received_msg["parent_header"]["msg_id"] == msg_id
and received_msg["content"]["execution_state"] == "idle"
):
if len(result) > 500:
result = "[...]\n" + result[-500:]
return CodeBoxResponse(
code_exe_type="text",
code_text=code_text,
code_exe_response=result or "Code run successfully (no output)",
code_exe_status=200,
do_code_exe=self.do_code_exe
)
elif (
received_msg["header"]["msg_type"] == "error"
and received_msg["parent_header"]["msg_id"] == msg_id
):
error = (
f"{received_msg['content']['ename']}: "
f"{received_msg['content']['evalue']}"
)
return CodeBoxResponse(
code_exe_type="error",
code_text=code_text,
code_exe_response=error,
code_exe_status=500,
do_code_exe=self.do_code_exe
)
def _get_kernelid(self, ) -> None:
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
response = requests.get(f"{self.kernel_url}?token={self.token}", headers=headers)
if len(response.json()) > 0:
self.kernel_id = response.json()[0]["id"]
else:
response = requests.post(f"{self.kernel_url}?token={self.token}", headers=headers)
self.kernel_id = response.json()["id"]
if self.kernel_id is None:
raise Exception("Could not start kernel")
async def _aget_kernelid(self, ) -> None:
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
async with aiohttp.ClientSession() as session:
async with session.get(f"{self.kernel_url}?token={self.token}", headers=headers) as resp:
if len(await resp.json()) > 0:
self.kernel_id = (await resp.json())[0]["id"]
else:
async with session.post(f"{self.kernel_url}?token={self.token}", headers=headers) as response:
self.kernel_id = (await response.json())["id"]
# if len(response.json()) > 0:
# self.kernel_id = response.json()[0]["id"]
# else:
# response = requests.post(f"{self.kernel_url}?token={self.token}", headers=headers)
# self.kernel_id = response.json()["id"]
# if self.kernel_id is None:
# raise Exception("Could not start kernel")
def _check_connect(self, ) -> bool:
if self.kernel_url == "":
return False
try:
response = requests.get(f"{self.kernel_url}?token={self.token}", timeout=270)
return response.status_code == 200
except requests.exceptions.ConnectionError:
return False
async def _acheck_connect(self, ) -> bool:
if self.kernel_url == "":
return False
try:
async with aiohttp.ClientSession() as session:
async with session.get(f"{self.kernel_url}?token={self.token}", timeout=270) as resp:
return resp.status == 200
except aiohttp.ClientConnectorError:
    return False
except aiohttp.ServerDisconnectedError:
    return False
def _check_port(self, ) -> bool:
try:
response = requests.get(f"http://localhost:{self.remote_port}", timeout=270)
logger.warning(f"Port is conflict, please check your codebox's port {self.remote_port}")
return response.status_code == 200
except requests.exceptions.ConnectionError:
return False
async def _acheck_port(self, ) -> bool:
try:
async with aiohttp.ClientSession() as session:
async with session.get(f"http://localhost:{self.remote_port}", timeout=270) as resp:
logger.warning(f"Port is conflict, please check your codebox's port {self.remote_port}")
return resp.status == 200
except aiohttp.ClientConnectorError:
    return False
except aiohttp.ServerDisconnectedError:
    return False
def _check_connect_success(self, retry_nums: int = 5) -> bool:
while retry_nums > 0:
try:
connect_status = self._check_connect()
if connect_status:
logger.info(f"{self.remote_url} connection success")
return True
except requests.exceptions.ConnectionError:
logger.info(f"{self.remote_url} connection fail")
retry_nums -= 1
time.sleep(5)
raise BaseException(f"can't connect to {self.remote_url}")
async def _acheck_connect_success(self, retry_nums: int = 5) -> bool:
while retry_nums > 0:
try:
connect_status = await self._acheck_connect()
if connect_status:
logger.info(f"{self.remote_url} connection success")
return True
except requests.exceptions.ConnectionError:
logger.info(f"{self.remote_url} connection fail")
retry_nums -= 1
await asyncio.sleep(5)
raise BaseException(f"can't connect to {self.remote_url}")
def start(self, ):
'''判断是从外部service执行还是内部启动notebook执行'''
self.jupyter = None
if self.do_remote:
# TODO自动检测日期,并重启容器
self.kernel_url = self.remote_url + "/api/kernels"
self._check_connect_success()
self._get_kernelid()
logger.debug(self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}")
self.wc_url = self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}"
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
self.ws = create_connection(self.wc_url, headers=headers)
else:
# TODO 自动检测本地接口
port_status = self._check_port()
connect_status = self._check_connect()
logger.debug(f"port_status: {port_status}, connect_status: {connect_status}")
if port_status and not connect_status:
raise BaseException(f"Port is conflict, please check your codebox's port {self.remote_port}")
if not connect_status:
self.jupyter = subprocess.Popen(
[
"jupyer", "notebnook",
f"--NotebookApp.token={self.token}",
f"--port={self.remote_port}",
"--no-browser",
"--ServerApp.disable_check_xsrf=True",
],
stderr=subprocess.PIPE,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
)
self.kernel_url = self.remote_url + "/api/kernels"
self._check_connect_success()
self._get_kernelid()
logger.debug(self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}")
self.wc_url = self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}"
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
self.ws = create_connection(self.wc_url, headers=headers)
async def astart(self, ):
'''判断是从外部service执行还是内部启动notebook执行'''
self.jupyter = None
if self.do_remote:
# TODO自动检测日期,并重启容器
self.kernel_url = self.remote_url + "/api/kernels"
await self._acheck_connect_success()
await self._aget_kernelid()
self.wc_url = self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}"
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
self.ws = create_connection(self.wc_url, headers=headers)
else:
# TODO 自动检测本地接口
port_status = await self._acheck_port()
self.kernel_url = self.remote_url + "/api/kernels"
connect_status = await self._acheck_connect()
logger.debug(f"port_status: {port_status}, connect_status: {connect_status}")
if port_status and not connect_status:
raise BaseException(f"Port is conflict, please check your codebox's port {self.remote_port}")
if not connect_status:
self.jupyter = subprocess.Popen(
[
"jupyter", "notebook",
f"--NotebookApp.token={self.token}",
f"--port={self.remote_port}",
"--no-browser",
"--ServerApp.disable_check_xsrf=True",
],
stderr=subprocess.PIPE,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
)
self.kernel_url = self.remote_url + "/api/kernels"
await self._acheck_connect_success()
await self._aget_kernelid()
self.wc_url = self.kernel_url.replace("http", "ws") + f"/{self.kernel_id}/channels?token={self.token}"
headers = {"Authorization": f'Token {self.token}', 'token': self.token}
self.ws = create_connection(self.wc_url, headers=headers)
def status(self,) -> CodeBoxStatus:
if not self.kernel_id:
self._get_kernelid()
return CodeBoxStatus(
status="running" if self.kernel_id
and requests.get(self.kernel_url, timeout=270).status_code == 200
else "stopped"
)
async def astatus(self,) -> CodeBoxStatus:
if not self.kernel_id:
await self._aget_kernelid()
return CodeBoxStatus(
status="running" if self.kernel_id
and requests.get(self.kernel_url, timeout=270).status_code == 200
else "stopped"
)
def restart(self, ) -> CodeBoxStatus:
return CodeBoxStatus(status="restared")
def stop(self, ) -> CodeBoxStatus:
try:
if self.jupyter is not None:
for process in psutil.process_iter(["pid", "name", "cmdline"]):
# 检查进程名是否包含"jupyter"
if f'port={self.remote_port}' in str(process.info["cmdline"]).lower() and \
"jupyter" in process.info['name'].lower():
logger.warning(f'port={self.remote_port}, {process.info}')
# 关闭进程
process.terminate()
self.jupyter = None
except Exception as e:
logger.error(traceback.format_exc())
if self.ws is not None:
    try:
        # close the websocket connection to the kernel
        self.ws.close()
    except Exception as e:
        logger.error(traceback.format_exc())
    self.ws = None
return CodeBoxStatus(status="stopped")
def __del__(self):
self.stop()
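A hedged usage sketch of the sandbox (the dev_opsgpt.sandbox import path is assumed; with do_remote=False the class tries to launch a local Jupyter notebook server, so jupyter must be installed, otherwise point SANDBOX_SERVER at a running instance and set do_remote=True):

from dev_opsgpt.sandbox import PyCodeBox  # assumed package path

box = PyCodeBox(do_code_exe=True, do_remote=False)   # attaches to / starts a Jupyter kernel
resp = box.chat("```python\nprint(1 + 1)\n```")
print(resp.code_exe_status, resp.code_exe_response)  # 200 2
box.stop()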

View File

181
dev_opsgpt/service/api.py Normal file
View File

@ -0,0 +1,181 @@
import nltk
import argparse
import uvicorn, os, sys
from fastapi.middleware.cors import CORSMiddleware
from starlette.responses import RedirectResponse
from typing import List
src_dir = os.path.join(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
sys.path.append(src_dir)
sys.path.append(os.path.dirname(os.path.dirname(__file__)))
from configs import VERSION
from configs.model_config import NLTK_DATA_PATH
from configs.server_config import OPEN_CROSS_DOMAIN
from dev_opsgpt.chat import LLMChat, SearchChat, KnowledgeChat
from dev_opsgpt.service.kb_api import *
from dev_opsgpt.utils.server_utils import BaseResponse, ListResponse, FastAPI, MakeFastAPIOffline
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
llmChat = LLMChat()
searchChat = SearchChat()
knowledgeChat = KnowledgeChat()
async def document():
return RedirectResponse(url="/docs")
def create_app():
app = FastAPI(
title="DevOps-ChatBot API Server",
version=VERSION
)
MakeFastAPIOffline(app)
# Add CORS middleware to allow all origins
# 在config.py中设置OPEN_DOMAIN=True允许跨域
# set OPEN_DOMAIN=True in config.py to allow cross-domain
if OPEN_CROSS_DOMAIN:
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
app.get("/",
response_model=BaseResponse,
summary="swagger 文档")(document)
# Tag: Chat
# app.post("/chat/fastchat",
# tags=["Chat"],
# summary="与llm模型对话(直接与fastchat api对话)")(openai_chat)
app.post("/chat/chat",
tags=["Chat"],
summary="与llm模型对话(通过LLMChain)")(llmChat.chat)
app.post("/chat/knowledge_base_chat",
tags=["Chat"],
summary="与知识库对话")(knowledgeChat.chat)
app.post("/chat/search_engine_chat",
tags=["Chat"],
summary="与搜索引擎对话")(searchChat.chat)
# Tag: Knowledge Base Management
app.get("/knowledge_base/list_knowledge_bases",
tags=["Knowledge Base Management"],
response_model=ListResponse,
summary="获取知识库列表")(list_kbs)
app.post("/knowledge_base/create_knowledge_base",
tags=["Knowledge Base Management"],
response_model=BaseResponse,
summary="创建知识库"
)(create_kb)
app.post("/knowledge_base/delete_knowledge_base",
tags=["Knowledge Base Management"],
response_model=BaseResponse,
summary="删除知识库"
)(delete_kb)
app.get("/knowledge_base/list_files",
tags=["Knowledge Base Management"],
response_model=ListResponse,
summary="获取知识库内的文件列表"
)(list_docs)
app.post("/knowledge_base/search_docs",
tags=["Knowledge Base Management"],
response_model=List[DocumentWithScore],
summary="搜索知识库"
)(search_docs)
app.post("/knowledge_base/upload_docs",
tags=["Knowledge Base Management"],
response_model=BaseResponse,
summary="上传文件到知识库,并/或进行向量化"
)(upload_doc)
app.post("/knowledge_base/delete_docs",
tags=["Knowledge Base Management"],
response_model=BaseResponse,
summary="删除知识库内指定文件"
)(delete_doc)
app.post("/knowledge_base/update_docs",
tags=["Knowledge Base Management"],
response_model=BaseResponse,
summary="更新现有文件到知识库"
)(update_doc)
app.get("/knowledge_base/download_doc",
tags=["Knowledge Base Management"],
summary="下载对应的知识文件")(download_doc)
app.post("/knowledge_base/recreate_vector_store",
tags=["Knowledge Base Management"],
summary="根据content中文档重建向量库流式输出处理进度。"
)(recreate_vector_store)
# # LLM模型相关接口
# app.post("/llm_model/list_models",
# tags=["LLM Model Management"],
# summary="列出当前已加载的模型",
# )(list_llm_models)
# app.post("/llm_model/stop",
# tags=["LLM Model Management"],
# summary="停止指定的LLM模型Model Worker)",
# )(stop_llm_model)
# app.post("/llm_model/change",
# tags=["LLM Model Management"],
# summary="切换指定的LLM模型Model Worker)",
# )(change_llm_model)
return app
app = create_app()
def run_api(host, port, **kwargs):
if kwargs.get("ssl_keyfile") and kwargs.get("ssl_certfile"):
uvicorn.run(app,
host=host,
port=port,
ssl_keyfile=kwargs.get("ssl_keyfile"),
ssl_certfile=kwargs.get("ssl_certfile"),
)
else:
uvicorn.run(app, host=host, port=port)
if __name__ == "__main__":
parser = argparse.ArgumentParser(prog='DevOps-ChatBot',
description='About DevOps-ChatBot, local knowledge based LLM with langchain'
' 基于本地知识库的 LLM 问答')
parser.add_argument("--host", type=str, default="0.0.0.0")
parser.add_argument("--port", type=int, default=7861)
parser.add_argument("--ssl_keyfile", type=str)
parser.add_argument("--ssl_certfile", type=str)
# 初始化消息
args = parser.parse_args()
args_dict = vars(args)
run_api(host=args.host,
port=args.port,
ssl_keyfile=args.ssl_keyfile,
ssl_certfile=args.ssl_certfile,
)
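Once the server is running (python dev_opsgpt/service/api.py), the routes registered above can be exercised over plain HTTP; a sketch with requests, using the default host and port from the argument parser:

import requests

base = "http://127.0.0.1:7861"
print(requests.get(f"{base}/knowledge_base/list_knowledge_bases").json())
r = requests.post(f"{base}/knowledge_base/create_knowledge_base",
                  json={"knowledge_base_name": "samples"})  # vector_store_type and embed_model use their defaults
print(r.json())  # {"code": 200, "msg": "已新增知识库 samples"} on success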

View File

@ -0,0 +1,185 @@
from abc import ABC, abstractmethod
from typing import List
import os
from langchain.embeddings.base import Embeddings
from langchain.docstore.document import Document
from configs.model_config import (
kbs_config, VECTOR_SEARCH_TOP_K, SCORE_THRESHOLD,
EMBEDDING_MODEL, EMBEDDING_DEVICE
)
from dev_opsgpt.orm.commands import *
from dev_opsgpt.utils.path_utils import *
from dev_opsgpt.orm.utils import DocumentFile
from dev_opsgpt.embeddings.utils import load_embeddings
from dev_opsgpt.text_splitter import LCTextSplitter
class SupportedVSType:
FAISS = 'faiss'
# MILVUS = 'milvus'
# DEFAULT = 'default'
# PG = 'pg'
class KBService(ABC):
def __init__(self,
knowledge_base_name: str,
embed_model: str = EMBEDDING_MODEL,
):
self.kb_name = knowledge_base_name
self.embed_model = embed_model
self.kb_path = get_kb_path(self.kb_name)
self.doc_path = get_doc_path(self.kb_name)
self.do_init()
def _load_embeddings(self, embed_device: str = EMBEDDING_DEVICE) -> Embeddings:
return load_embeddings(self.embed_model, embed_device)
def create_kb(self):
"""
创建知识库
"""
if not os.path.exists(self.doc_path):
os.makedirs(self.doc_path)
self.do_create_kb()
status = add_kb_to_db(self.kb_name, self.vs_type(), self.embed_model)
return status
def clear_vs(self):
"""
删除向量库中所有内容
"""
self.do_clear_vs()
status = delete_files_from_db(self.kb_name)
return status
def drop_kb(self):
"""
删除知识库
"""
self.do_drop_kb()
status = delete_kb_from_db(self.kb_name)
return status
def add_doc(self, kb_file: DocumentFile, **kwargs):
"""
向知识库添加文件
"""
lctTextSplitter = LCTextSplitter(kb_file.filepath)
docs = lctTextSplitter.file2text()
if docs:
self.delete_doc(kb_file)
embeddings = self._load_embeddings()
self.do_add_doc(docs, embeddings, **kwargs)
status = add_doc_to_db(kb_file)
else:
status = False
return status
def delete_doc(self, kb_file: DocumentFile, delete_content: bool = False, **kwargs):
"""
从知识库删除文件
"""
self.do_delete_doc(kb_file, **kwargs)
status = delete_file_from_db(kb_file)
if delete_content and os.path.exists(kb_file.filepath):
os.remove(kb_file.filepath)
return status
def update_doc(self, kb_file: DocumentFile, **kwargs):
"""
使用content中的文件更新向量库
"""
if os.path.exists(kb_file.filepath):
self.delete_doc(kb_file, **kwargs)
return self.add_doc(kb_file, **kwargs)
def exist_doc(self, file_name: str):
return doc_exists(DocumentFile(knowledge_base_name=self.kb_name,
filename=file_name))
def list_docs(self):
return list_docs_from_db(self.kb_name)
def search_docs(self,
query: str,
top_k: int = VECTOR_SEARCH_TOP_K,
score_threshold: float = SCORE_THRESHOLD,
):
embeddings = self._load_embeddings()
docs = self.do_search(query, top_k, score_threshold, embeddings)
return docs
@abstractmethod
def do_create_kb(self):
"""
创建知识库子类实自己逻辑
"""
pass
@staticmethod
def list_kbs_type():
return list(kbs_config.keys())
@classmethod
def list_kbs(cls):
return list_kbs_from_db()
def exists(self, kb_name: str = None):
kb_name = kb_name or self.kb_name
return kb_exists(kb_name)
@abstractmethod
def vs_type(self) -> str:
pass
@abstractmethod
def do_init(self):
pass
@abstractmethod
def do_drop_kb(self):
"""
删除知识库子类实自己逻辑
"""
pass
@abstractmethod
def do_search(self,
              query: str,
              top_k: int,
              score_threshold: float,
              embeddings: Embeddings,
              ) -> List[Document]:
    """
    Search the knowledge base; subclasses implement their own logic.
    """
    pass
@abstractmethod
def do_add_doc(self,
docs: List[Document],
embeddings: Embeddings,
):
"""
向知识库添加文档子类实自己逻辑
"""
pass
@abstractmethod
def do_delete_doc(self,
kb_file: DocumentFile):
"""
从知识库删除文档子类实自己逻辑
"""
pass
@abstractmethod
def do_clear_vs(self):
"""
从知识库删除全部向量子类实自己逻辑
"""
pass
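A skeletal subclass illustrating the abstract contract above, purely for illustration (the real FAISS-backed implementation follows in the next file); it keeps Documents in a plain list and does no real similarity scoring:

class InMemoryKBService(KBService):
    def vs_type(self) -> str:
        return "in_memory"                      # toy value, not a registered SupportedVSType
    def do_init(self):
        self._docs = []
    def do_create_kb(self):
        self._docs = []
    def do_drop_kb(self):
        self._docs = []
    def do_search(self, query, top_k, score_threshold, embeddings):
        return self._docs[:top_k]               # no embedding-based ranking
    def do_add_doc(self, docs, embeddings, **kwargs):
        self._docs.extend(docs)
    def do_delete_doc(self, kb_file, **kwargs):
        self._docs = [d for d in self._docs if d.metadata.get("source") != kb_file.filepath]
    def do_clear_vs(self):
        self._docs = []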

View File

@ -0,0 +1,166 @@
import os
import shutil
from typing import List
from functools import lru_cache
from loguru import logger
from langchain.vectorstores import FAISS
from langchain.embeddings.base import Embeddings
from langchain.docstore.document import Document
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from configs.model_config import (
KB_ROOT_PATH,
CACHED_VS_NUM,
EMBEDDING_MODEL,
EMBEDDING_DEVICE,
SCORE_THRESHOLD,
FAISS_NORMALIZE_L2
)
from .base_service import KBService, SupportedVSType
from dev_opsgpt.utils.path_utils import *
from dev_opsgpt.orm.utils import DocumentFile
from dev_opsgpt.utils.server_utils import torch_gc
from dev_opsgpt.embeddings.utils import load_embeddings
# make HuggingFaceEmbeddings hashable
def _embeddings_hash(self):
return hash(self.model_name)
HuggingFaceEmbeddings.__hash__ = _embeddings_hash
_VECTOR_STORE_TICKS = {}
@lru_cache(CACHED_VS_NUM)
def load_vector_store(
knowledge_base_name: str,
embed_model: str = EMBEDDING_MODEL,
embed_device: str = EMBEDDING_DEVICE,
embeddings: Embeddings = None,
tick: int = 0, # tick will be changed by upload_doc etc. and make cache refreshed.
):
print(f"loading vector store in '{knowledge_base_name}'.")
vs_path = get_vs_path(knowledge_base_name)
if embeddings is None:
logger.info("load embedmodel: {}".format(embed_model))
embeddings = load_embeddings(embed_model, embed_device)
if not os.path.exists(vs_path):
os.makedirs(vs_path)
if "index.faiss" in os.listdir(vs_path):
search_index = FAISS.load_local(vs_path, embeddings, normalize_L2=FAISS_NORMALIZE_L2)
else:
# create an empty vector store
doc = Document(page_content="init", metadata={})
search_index = FAISS.from_documents([doc], embeddings, normalize_L2=FAISS_NORMALIZE_L2)
ids = [k for k, v in search_index.docstore._dict.items()]
search_index.delete(ids)
search_index.save_local(vs_path)
if tick == 0: # vector store is loaded first time
_VECTOR_STORE_TICKS[knowledge_base_name] = 0
# search_index.embedding_function = embeddings.embed_documents
return search_index
def refresh_vs_cache(kb_name: str):
"""
Invalidate the cached vector store so it is reloaded on the next access.
"""
_VECTOR_STORE_TICKS[kb_name] = _VECTOR_STORE_TICKS.get(kb_name, 0) + 1
print(f"知识库 {kb_name} 缓存刷新:{_VECTOR_STORE_TICKS[kb_name]}")
class FaissKBService(KBService):
vs_path: str
kb_path: str
def vs_type(self) -> str:
return SupportedVSType.FAISS
@staticmethod
def get_vs_path(knowledge_base_name: str):
return os.path.join(FaissKBService.get_kb_path(knowledge_base_name), "vector_store")
@staticmethod
def get_kb_path(knowledge_base_name: str):
return os.path.join(KB_ROOT_PATH, knowledge_base_name)
def do_init(self):
self.kb_path = FaissKBService.get_kb_path(self.kb_name)
self.vs_path = FaissKBService.get_vs_path(self.kb_name)
def do_create_kb(self):
if not os.path.exists(self.vs_path):
os.makedirs(self.vs_path)
load_vector_store(self.kb_name, self.embed_model)
def do_drop_kb(self):
self.clear_vs()
shutil.rmtree(self.kb_path)
def do_search(self,
query: str,
top_k: int,
score_threshold: float = SCORE_THRESHOLD,
embeddings: Embeddings = None,
) -> List[Document]:
search_index = load_vector_store(self.kb_name,
embeddings=embeddings,
tick=_VECTOR_STORE_TICKS.get(self.kb_name))
docs = search_index.similarity_search_with_score(query, k=top_k, score_threshold=score_threshold)
return docs
def do_add_doc(self,
docs: List[Document],
embeddings: Embeddings,
**kwargs,
):
vector_store = load_vector_store(self.kb_name,
embeddings=embeddings,
tick=_VECTOR_STORE_TICKS.get(self.kb_name, 0))
logger.info("docs.lens: {}".format(len(docs)))
vector_store.add_documents(docs)
torch_gc()
if not kwargs.get("not_refresh_vs_cache"):
vector_store.save_local(self.vs_path)
refresh_vs_cache(self.kb_name)
def do_delete_doc(self,
kb_file: DocumentFile,
**kwargs):
embeddings = self._load_embeddings()
vector_store = load_vector_store(self.kb_name,
embeddings=embeddings,
tick=_VECTOR_STORE_TICKS.get(self.kb_name, 0))
ids = [k for k, v in vector_store.docstore._dict.items() if v.metadata["source"] == kb_file.filepath]
if len(ids) == 0:
return None
vector_store.delete(ids)
if not kwargs.get("not_refresh_vs_cache"):
vector_store.save_local(self.vs_path)
refresh_vs_cache(self.kb_name)
return True
def do_clear_vs(self):
shutil.rmtree(self.vs_path)
os.makedirs(self.vs_path)
refresh_vs_cache(self.kb_name)
def exist_doc(self, file_name: str):
if super().exist_doc(file_name):
return "in_db"
content_path = os.path.join(self.kb_path, "content")
if os.path.isfile(os.path.join(content_path, file_name)):
return "in_folder"
else:
return False
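An end-to-end sketch of the FAISS service (the import path follows this commit's dev_opsgpt/service layout; names are placeholders and intro.md must already exist under the knowledge base's content folder):

from dev_opsgpt.service.faiss_db_service import FaissKBService
from dev_opsgpt.orm.utils import DocumentFile

kb = FaissKBService("samples")                   # uses EMBEDDING_MODEL by default
kb.create_kb()
kb.add_doc(DocumentFile("intro.md", "samples"))
for doc, score in kb.search_docs("how to create a knowledge base", top_k=3):
    print(score, doc.page_content[:80])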

View File

@ -0,0 +1,277 @@
import urllib, os, json, traceback
from typing import List, Dict
from loguru import logger
from fastapi.responses import StreamingResponse, FileResponse
from fastapi import Body, File, Form, Body, Query, UploadFile
from langchain.docstore.document import Document
from .service_factory import KBServiceFactory
from dev_opsgpt.utils.server_utils import BaseResponse, ListResponse
from dev_opsgpt.utils.path_utils import *
from dev_opsgpt.orm.commands import *
from dev_opsgpt.orm.utils import DocumentFile
from configs.model_config import (
DEFAULT_VS_TYPE, EMBEDDING_MODEL, VECTOR_SEARCH_TOP_K, SCORE_THRESHOLD
)
async def list_kbs():
# Get List of Knowledge Base
return ListResponse(data=list_kbs_from_db())
async def create_kb(knowledge_base_name: str = Body(..., examples=["samples"]),
vector_store_type: str = Body("faiss"),
embed_model: str = Body(EMBEDDING_MODEL),
) -> BaseResponse:
# Create selected knowledge base
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
if knowledge_base_name is None or knowledge_base_name.strip() == "":
return BaseResponse(code=404, msg="知识库名称不能为空,请重新填写知识库名称")
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is not None:
return BaseResponse(code=404, msg=f"已存在同名知识库 {knowledge_base_name}")
kb = KBServiceFactory.get_service(knowledge_base_name, vector_store_type, embed_model)
try:
kb.create_kb()
except Exception as e:
print(e)
return BaseResponse(code=500, msg=f"创建知识库出错: {e}")
return BaseResponse(code=200, msg=f"已新增知识库 {knowledge_base_name}")
async def delete_kb(
knowledge_base_name: str = Body(..., examples=["samples"])
) -> BaseResponse:
# Delete selected knowledge base
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
knowledge_base_name = urllib.parse.unquote(knowledge_base_name)
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
try:
status = kb.clear_vs()
status = kb.drop_kb()
if status:
return BaseResponse(code=200, msg=f"成功删除知识库 {knowledge_base_name}")
except Exception as e:
print(e)
return BaseResponse(code=500, msg=f"删除知识库时出现意外: {e}")
return BaseResponse(code=500, msg=f"删除知识库失败 {knowledge_base_name}")
class DocumentWithScore(Document):
score: float = None
def search_docs(query: str = Body(..., description="用户输入", examples=["你好"]),
knowledge_base_name: str = Body(..., description="知识库名称", examples=["samples"]),
top_k: int = Body(VECTOR_SEARCH_TOP_K, description="匹配向量数"),
score_threshold: float = Body(SCORE_THRESHOLD, description="知识库匹配相关度阈值取值范围在0-1之间SCORE越小相关度越高取到1相当于不筛选建议设置在0.5左右", ge=0, le=1),
) -> List[DocumentWithScore]:
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return []
docs = kb.search_docs(query, top_k, score_threshold)
data = [DocumentWithScore(**x[0].dict(), score=x[1]) for x in docs]
return data
async def list_docs(
knowledge_base_name: str
) -> ListResponse:
if not validate_kb_name(knowledge_base_name):
return ListResponse(code=403, msg="Don't attack me", data=[])
knowledge_base_name = urllib.parse.unquote(knowledge_base_name)
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return ListResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}", data=[])
else:
all_doc_names = kb.list_docs()
return ListResponse(data=all_doc_names)
async def upload_doc(file: UploadFile = File(..., description="上传文件"),
knowledge_base_name: str = Form(..., description="知识库名称", examples=["kb1"]),
override: bool = Form(False, description="覆盖已有文件"),
not_refresh_vs_cache: bool = Form(False, description="暂不保存向量库用于FAISS"),
) -> BaseResponse:
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
file_content = await file.read() # 读取上传文件的内容
try:
kb_file = DocumentFile(filename=file.filename,
knowledge_base_name=knowledge_base_name)
if (os.path.exists(kb_file.filepath)
and not override
and os.path.getsize(kb_file.filepath) == len(file_content)
):
# TODO: filesize 不同后的处理
file_status = f"文件 {kb_file.filename} 已存在。"
return BaseResponse(code=404, msg=file_status)
with open(kb_file.filepath, "wb") as f:
f.write(file_content)
except Exception as e:
logger.error(traceback.format_exc())
return BaseResponse(code=500, msg=f"{kb_file.filename} 文件上传失败,报错信息为: {e}")
try:
kb.add_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
except Exception as e:
logger.error(traceback.format_exc())
return BaseResponse(code=500, msg=f"{kb_file.filename} 文件向量化失败,报错信息为: {e}")
return BaseResponse(code=200, msg=f"成功上传文件 {kb_file.filename}")
async def delete_doc(knowledge_base_name: str = Body(..., examples=["samples"]),
doc_name: str = Body(..., examples=["file_name.md"]),
delete_content: bool = Body(False),
not_refresh_vs_cache: bool = Body(False, description="暂不保存向量库用于FAISS"),
) -> BaseResponse:
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
knowledge_base_name = urllib.parse.unquote(knowledge_base_name)
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
if not kb.exist_doc(doc_name):
return BaseResponse(code=404, msg=f"未找到文件 {doc_name}")
try:
kb_file = DocumentFile(filename=doc_name,
knowledge_base_name=knowledge_base_name)
kb.delete_doc(kb_file, delete_content, not_refresh_vs_cache=not_refresh_vs_cache)
except Exception as e:
print(e)
return BaseResponse(code=500, msg=f"{kb_file.filename} 文件删除失败,错误信息:{e}")
return BaseResponse(code=200, msg=f"{kb_file.filename} 文件删除成功")
async def update_doc(
knowledge_base_name: str = Body(..., examples=["samples"]),
file_name: str = Body(..., examples=["file_name"]),
not_refresh_vs_cache: bool = Body(False, description="暂不保存向量库用于FAISS"),
) -> BaseResponse:
'''
更新知识库文档
'''
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
try:
kb_file = DocumentFile(filename=file_name,
knowledge_base_name=knowledge_base_name)
if os.path.exists(kb_file.filepath):
kb.update_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
return BaseResponse(code=200, msg=f"成功更新文件 {kb_file.filename}")
except Exception as e:
logger.error(traceback.format_exc())
return BaseResponse(code=500, msg=f"{kb_file.filename} 文件更新失败,错误信息是:{e}")
return BaseResponse(code=500, msg=f"{kb_file.filename} 文件更新失败")
async def download_doc(
knowledge_base_name: str = Query(..., examples=["samples"]),
file_name: str = Query(..., examples=["test.txt"]),
):
'''
下载知识库文档
'''
if not validate_kb_name(knowledge_base_name):
return BaseResponse(code=403, msg="Don't attack me")
kb = KBServiceFactory.get_service_by_name(knowledge_base_name)
if kb is None:
return BaseResponse(code=404, msg=f"未找到知识库 {knowledge_base_name}")
try:
kb_file = DocumentFile(filename=file_name,
knowledge_base_name=knowledge_base_name)
if os.path.exists(kb_file.filepath):
return FileResponse(
path=kb_file.filepath,
filename=kb_file.filename,
media_type="multipart/form-data")
except Exception as e:
print(e)
return BaseResponse(code=500, msg=f"{kb_file.filename} 读取文件失败,错误信息是:{e}")
return BaseResponse(code=500, msg=f"{kb_file.filename} 读取文件失败")
async def recreate_vector_store(
knowledge_base_name: str = Body(..., examples=["samples"]),
allow_empty_kb: bool = Body(True),
vs_type: str = Body(DEFAULT_VS_TYPE),
embed_model: str = Body(EMBEDDING_MODEL),
):
'''
Recreate the vector store from the knowledge base's content folder.
This is useful when users copy files into the content folder directly instead of uploading them through the network.
By default, get_service_by_name only returns knowledge bases that are recorded in info.db and contain document files.
Set allow_empty_kb to True to also apply it to empty knowledge bases that are not in info.db or contain no documents.
'''
async def output():
kb = KBServiceFactory.get_service(knowledge_base_name, vs_type, embed_model)
if not kb.exists() and not allow_empty_kb:
yield {"code": 404, "msg": f"未找到知识库 {knowledge_base_name}"}
else:
kb.create_kb()
kb.clear_vs()
docs = list_docs_from_folder(knowledge_base_name)
for i, doc in enumerate(docs):
try:
kb_file = DocumentFile(doc, knowledge_base_name)
yield json.dumps({
"code": 200,
"msg": f"({i + 1} / {len(docs)}): {doc}",
"total": len(docs),
"finished": i,
"doc": doc,
}, ensure_ascii=False)
if i == len(docs) - 1:
not_refresh_vs_cache = False
else:
not_refresh_vs_cache = True
kb.add_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
except Exception as e:
print(e)
yield json.dumps({
"code": 500,
"msg": f"添加文件‘{doc}’到知识库‘{knowledge_base_name}’时出错:{e}。已跳过。",
})
return StreamingResponse(output(), media_type="text/event-stream")
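A sketch of driving the endpoints above over HTTP once api.py has mounted them (host and port are the defaults used elsewhere in this commit; intro.md is a placeholder):

import requests

base = "http://127.0.0.1:7861"
with open("intro.md", "rb") as f:
    r = requests.post(f"{base}/knowledge_base/upload_docs",
                      data={"knowledge_base_name": "samples"},
                      files={"file": ("intro.md", f)})
print(r.json())
r = requests.post(f"{base}/knowledge_base/search_docs",
                  json={"query": "hello", "knowledge_base_name": "samples"})
print(r.json())  # list of documents with scores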

View File

@ -0,0 +1,293 @@
from multiprocessing import Process, Queue
import multiprocessing as mp
import sys
import os
src_dir = os.path.join(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
print(src_dir)
sys.path.append(src_dir)
sys.path.append(os.path.dirname(os.path.dirname(__file__)))
from configs.model_config import llm_model_dict, LLM_MODEL, LLM_DEVICE, LOG_PATH, logger
from dev_opsgpt.utils.server_utils import MakeFastAPIOffline
host_ip = "0.0.0.0"
controller_port = 20001
model_worker_port = 20002
openai_api_port = 8888
base_url = "http://127.0.0.1:{}"
os.environ['PATH'] = os.environ.get("PATH", "") + os.pathsep + r'/d/env_utils/miniconda3/envs/devopsgpt/Lib/site-packages/torch/lib'
def set_httpx_timeout(timeout=60.0):
import httpx
httpx._config.DEFAULT_TIMEOUT_CONFIG.connect = timeout
httpx._config.DEFAULT_TIMEOUT_CONFIG.read = timeout
httpx._config.DEFAULT_TIMEOUT_CONFIG.write = timeout
def create_controller_app(
dispatch_method="shortest_queue",
):
import fastchat.constants
fastchat.constants.LOGDIR = LOG_PATH
from fastchat.serve.controller import app, Controller
controller = Controller(dispatch_method)
sys.modules["fastchat.serve.controller"].controller = controller
MakeFastAPIOffline(app)
app.title = "FastChat Controller"
return app
def create_model_worker_app(
worker_address=base_url.format(model_worker_port),
controller_address=base_url.format(controller_port),
model_path=llm_model_dict[LLM_MODEL].get("local_model_path"),
device=LLM_DEVICE,
gpus=None,
max_gpu_memory="8GiB",
load_8bit=False,
cpu_offloading=None,
gptq_ckpt=None,
gptq_wbits=16,
gptq_groupsize=-1,
gptq_act_order=False,
awq_ckpt=None,
awq_wbits=16,
awq_groupsize=-1,
model_names=[LLM_MODEL],
num_gpus=1, # not in fastchat
conv_template=None,
limit_worker_concurrency=5,
stream_interval=2,
no_register=False,
):
import fastchat.constants
fastchat.constants.LOGDIR = LOG_PATH
from fastchat.serve.model_worker import app, GptqConfig, AWQConfig, ModelWorker, worker_id
import argparse
import threading
import fastchat.serve.model_worker
# workaround to make program exit with Ctrl+c
# it should be deleted after pr is merged by fastchat
def _new_init_heart_beat(self):
self.register_to_controller()
self.heart_beat_thread = threading.Thread(
target=fastchat.serve.model_worker.heart_beat_worker, args=(self,), daemon=True,
)
self.heart_beat_thread.start()
ModelWorker.init_heart_beat = _new_init_heart_beat
parser = argparse.ArgumentParser()
args = parser.parse_args()
args.model_path = model_path
args.model_names = model_names
args.device = device
args.load_8bit = load_8bit
args.gptq_ckpt = gptq_ckpt
args.gptq_wbits = gptq_wbits
args.gptq_groupsize = gptq_groupsize
args.gptq_act_order = gptq_act_order
args.awq_ckpt = awq_ckpt
args.awq_wbits = awq_wbits
args.awq_groupsize = awq_groupsize
args.gpus = gpus
args.num_gpus = num_gpus
args.max_gpu_memory = max_gpu_memory
args.cpu_offloading = cpu_offloading
args.worker_address = worker_address
args.controller_address = controller_address
args.conv_template = conv_template
args.limit_worker_concurrency = limit_worker_concurrency
args.stream_interval = stream_interval
args.no_register = no_register
if args.gpus:
if len(args.gpus.split(",")) < args.num_gpus:
raise ValueError(
f"Larger --num-gpus ({args.num_gpus}) than --gpus {args.gpus}!"
)
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus
if gpus and num_gpus is None:
num_gpus = len(gpus.split(','))
args.num_gpus = num_gpus
gptq_config = GptqConfig(
ckpt=gptq_ckpt or model_path,
wbits=args.gptq_wbits,
groupsize=args.gptq_groupsize,
act_order=args.gptq_act_order,
)
awq_config = AWQConfig(
ckpt=args.awq_ckpt or args.model_path,
wbits=args.awq_wbits,
groupsize=args.awq_groupsize,
)
# torch.multiprocessing.set_start_method('spawn')
worker = ModelWorker(
controller_addr=args.controller_address,
worker_addr=args.worker_address,
worker_id=worker_id,
model_path=args.model_path,
model_names=args.model_names,
limit_worker_concurrency=args.limit_worker_concurrency,
no_register=args.no_register,
device=args.device,
num_gpus=args.num_gpus,
max_gpu_memory=args.max_gpu_memory,
load_8bit=args.load_8bit,
cpu_offloading=args.cpu_offloading,
gptq_config=gptq_config,
awq_config=awq_config,
stream_interval=args.stream_interval,
conv_template=args.conv_template,
)
sys.modules["fastchat.serve.model_worker"].worker = worker
sys.modules["fastchat.serve.model_worker"].args = args
sys.modules["fastchat.serve.model_worker"].gptq_config = gptq_config
MakeFastAPIOffline(app)
app.title = f"FastChat LLM Server ({LLM_MODEL})"
return app
def create_openai_api_app(
controller_address=base_url.format(controller_port),
api_keys=[],
):
import fastchat.constants
fastchat.constants.LOGDIR = LOG_PATH
from fastchat.serve.openai_api_server import app, CORSMiddleware, app_settings
app.add_middleware(
CORSMiddleware,
allow_credentials=True,
allow_origins=["*"],
allow_methods=["*"],
allow_headers=["*"],
)
app_settings.controller_address = controller_address
app_settings.api_keys = api_keys
MakeFastAPIOffline(app)
app.title = "FastChat OpeanAI API Server"
return app
def run_controller(q):
import uvicorn
app = create_controller_app()
@app.on_event("startup")
async def on_startup():
set_httpx_timeout()
q.put(1)
uvicorn.run(app, host=host_ip, port=controller_port)
def run_model_worker(q, *args, **kwargs):
import uvicorn
app = create_model_worker_app(*args, **kwargs)
@app.on_event("startup")
async def on_startup():
set_httpx_timeout()
while True:
no = q.get()
if no != 1:
q.put(no)
else:
break
q.put(2)
uvicorn.run(app, host=host_ip, port=model_worker_port)
def run_openai_api(q):
import uvicorn
app = create_openai_api_app()
@app.on_event("startup")
async def on_startup():
set_httpx_timeout()
while True:
no = q.get()
if no != 2:
q.put(no)
else:
break
q.put(3)
uvicorn.run(app, host=host_ip, port=openai_api_port)
if __name__ == "__main__":
mp.set_start_method("spawn")
queue = Queue()
logger.info(llm_model_dict[LLM_MODEL])
model_path = llm_model_dict[LLM_MODEL]["local_model_path"]
logger.info(f"如需查看 llm_api 日志,请前往 {LOG_PATH}")
if not model_path:
logger.error("local_model_path 不能为空")
else:
controller_process = Process(
target=run_controller,
name=f"controller({os.getpid()})",
args=(queue,),
daemon=True,
)
controller_process.start()
model_worker_process = Process(
target=run_model_worker,
name=f"model_worker({os.getpid()})",
args=(queue,),
# kwargs={"load_8bit": True},
daemon=True,
)
model_worker_process.start()
openai_api_process = Process(
target=run_openai_api,
name=f"openai_api({os.getpid()})",
args=(queue,),
daemon=True,
)
openai_api_process.start()
try:
model_worker_process.join()
controller_process.join()
openai_api_process.join()
except KeyboardInterrupt:
model_worker_process.terminate()
controller_process.terminate()
openai_api_process.terminate()
# 服务启动后接口调用示例:
# import openai
# openai.api_key = "EMPTY" # Not support yet
# openai.api_base = "http://localhost:8888/v1"
# model = "chatglm2-6b"
# # create a chat completion
# completion = openai.ChatCompletion.create(
# model=model,
# messages=[{"role": "user", "content": "Hello! What is your name?"}]
# )
# # print the completion
# print(completion.choices[0].message.content)

View File

@ -0,0 +1,136 @@
import os
from typing import Literal, Callable, Any
from configs.model_config import EMBEDDING_MODEL, DEFAULT_VS_TYPE
from dev_opsgpt.orm.utils import DocumentFile
from dev_opsgpt.orm.commands import add_doc_to_db
from dev_opsgpt.utils.path_utils import *
from .service_factory import KBServiceFactory
def folder2db(
kb_name: str,
mode: Literal["recreate_vs", "fill_info_only", "update_in_db", "increament"],
vs_type: Literal["faiss", "milvus", "pg", "chromadb"] = DEFAULT_VS_TYPE,
embed_model: str = EMBEDDING_MODEL,
callback_before: Callable = None,
callback_after: Callable = None,
):
'''
Use existing files in the local folder to populate the database and/or vector store.
Set parameter `mode` to:
    recreate_vs: recreate the whole vector store and fill the database using existing local files
    fill_info_only: do not touch the vector store, only fill the database from existing local files
    update_in_db: update the vector store and database info only for local files already recorded in the database
    increament: create vector store and database info only for local files not yet in the database
'''
kb = KBServiceFactory.get_service(kb_name, vs_type, embed_model)
kb.create_kb()
if mode == "recreate_vs":
kb.clear_vs()
docs = list_docs_from_folder(kb_name)
for i, doc in enumerate(docs):
try:
kb_file = DocumentFile(doc, kb_name)
if callable(callback_before):
callback_before(kb_file, i, docs)
if i == len(docs) - 1:
not_refresh_vs_cache = False
else:
not_refresh_vs_cache = True
kb.add_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
if callable(callback_after):
callback_after(kb_file, i, docs)
except Exception as e:
print(e)
elif mode == "fill_info_only":
docs = list_docs_from_folder(kb_name)
for i, doc in enumerate(docs):
try:
kb_file = DocumentFile(doc, kb_name)
if callable(callback_before):
callback_before(kb_file, i, docs)
add_doc_to_db(kb_file)
if callable(callback_after):
callback_after(kb_file, i, docs)
except Exception as e:
print(e)
elif mode == "update_in_db":
docs = kb.list_docs()
for i, doc in enumerate(docs):
try:
kb_file = DocumentFile(doc, kb_name)
if callable(callback_before):
callback_before(kb_file, i, docs)
if i == len(docs) - 1:
not_refresh_vs_cache = False
else:
not_refresh_vs_cache = True
kb.update_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
if callable(callback_after):
callback_after(kb_file, i, docs)
except Exception as e:
print(e)
elif mode == "increament":
db_docs = kb.list_docs()
folder_docs = list_docs_from_folder(kb_name)
docs = list(set(folder_docs) - set(db_docs))
for i, doc in enumerate(docs):
try:
kb_file = DocumentFile(doc, kb_name)
if callable(callback_before):
callback_before(kb_file, i, docs)
if i == len(docs) - 1:
not_refresh_vs_cache = False
else:
not_refresh_vs_cache = True
kb.add_doc(kb_file, not_refresh_vs_cache=not_refresh_vs_cache)
if callable(callback_after):
callback_after(kb_file, i, docs)
except Exception as e:
print(e)
else:
raise ValueError(f"unspported migrate mode: {mode}")
def recreate_all_vs(
vs_type: Literal["faiss", "milvus", "pg", "chromadb"] = DEFAULT_VS_TYPE,
embed_mode: str = EMBEDDING_MODEL,
**kwargs: Any,
):
'''
used to recreate a vector store or change current vector store to another type or embed_model
'''
for kb_name in list_kbs_from_folder():
folder2db(kb_name, "recreate_vs", vs_type, embed_mode, **kwargs)
def prune_db_docs(kb_name: str):
'''
Delete database records for docs that no longer exist in the local folder.
Used to clean up the database after a user deletes doc files in the file browser.
'''
kb = KBServiceFactory.get_service_by_name(kb_name)
if kb.exists():
docs_in_db = kb.list_docs()
docs_in_folder = list_docs_from_folder(kb_name)
docs = list(set(docs_in_db) - set(docs_in_folder))
for doc in docs:
kb.delete_doc(DocumentFile(doc, kb_name))
return docs
def prune_folder_docs(kb_name: str):
'''
Delete doc files in the local folder that are not recorded in the database.
Used to free local disk space by removing unused doc files.
'''
kb = KBServiceFactory.get_service_by_name(kb_name)
if kb.exists():
docs_in_db = kb.list_docs()
docs_in_folder = list_docs_from_folder(kb_name)
docs = list(set(docs_in_folder) - set(docs_in_db))
for doc in docs:
os.remove(get_file_path(kb_name, doc))
return docs
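A minimal sketch of the migration helpers (the knowledge base name is a placeholder and its files must already sit in the local content folder):

# Rebuild the vector store and database records for one knowledge base from local files.
folder2db("samples", mode="recreate_vs")

# Only add files that are not yet in the database, reporting progress via the callback.
folder2db(
    "samples",
    mode="increament",   # spelling follows the mode literal defined above
    callback_before=lambda kb_file, i, docs: print(f"[{i + 1}/{len(docs)}] {kb_file.filename}"),
)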

View File

@ -0,0 +1,114 @@
from typing import List, Union, Dict
import os
from configs.model_config import EMBEDDING_MODEL
from .faiss_db_service import FaissKBService
from .base_service import KBService, SupportedVSType
from dev_opsgpt.orm.commands import *
from dev_opsgpt.utils.path_utils import *
class KBServiceFactory:
@staticmethod
def get_service(kb_name: str,
vector_store_type: Union[str, SupportedVSType],
embed_model: str = EMBEDDING_MODEL,
) -> KBService:
if isinstance(vector_store_type, str):
vector_store_type = getattr(SupportedVSType, vector_store_type.upper())
if SupportedVSType.FAISS == vector_store_type:
return FaissKBService(kb_name, embed_model=embed_model)
# if SupportedVSType.PG == vector_store_type:
# from server.knowledge_base.kb_service.pg_kb_service import PGKBService
# return PGKBService(kb_name, embed_model=embed_model)
# elif SupportedVSType.MILVUS == vector_store_type:
# from server.knowledge_base.kb_service.milvus_kb_service import MilvusKBService
# return MilvusKBService(kb_name, embed_model=embed_model) # other milvus parameters are set in model_config.kbs_config
# elif SupportedVSType.DEFAULT == vector_store_type: # kb_exists of default kbservice is False, to make validation easier.
# from server.knowledge_base.kb_service.default_kb_service import DefaultKBService
# return DefaultKBService(kb_name)
@staticmethod
def get_service_by_name(kb_name: str
) -> KBService:
_, vs_type, embed_model = load_kb_from_db(kb_name)
if vs_type is None and os.path.isdir(get_kb_path(kb_name)): # faiss knowledge base not in db
vs_type = "faiss"
return KBServiceFactory.get_service(kb_name, vs_type, embed_model)
@staticmethod
def get_default():
return KBServiceFactory.get_service("default", SupportedVSType.DEFAULT)
def get_kb_details() -> List[Dict]:
kbs_in_folder = list_kbs_from_folder()
kbs_in_db = KBService.list_kbs()
result = {}
for kb in kbs_in_folder:
result[kb] = {
"kb_name": kb,
"vs_type": "",
"embed_model": "",
"file_count": 0,
"create_time": None,
"in_folder": True,
"in_db": False,
}
for kb in kbs_in_db:
kb_detail = get_kb_detail(kb)
if kb_detail:
kb_detail["in_db"] = True
if kb in result:
result[kb].update(kb_detail)
else:
kb_detail["in_folder"] = False
result[kb] = kb_detail
data = []
for i, v in enumerate(result.values()):
v['No'] = i + 1
data.append(v)
return data
def get_kb_doc_details(kb_name: str) -> List[Dict]:
kb = KBServiceFactory.get_service_by_name(kb_name)
docs_in_folder = list_docs_from_folder(kb_name)
docs_in_db = kb.list_docs()
result = {}
for doc in docs_in_folder:
result[doc] = {
"kb_name": kb_name,
"file_name": doc,
"file_ext": os.path.splitext(doc)[-1],
"file_version": 0,
"document_loader": "",
"text_splitter": "",
"create_time": None,
"in_folder": True,
"in_db": False,
}
for doc in docs_in_db:
doc_detail = get_file_detail(kb_name, doc)
if doc_detail:
doc_detail["in_db"] = True
if doc in result:
result[doc].update(doc_detail)
else:
doc_detail["in_folder"] = False
result[doc] = doc_detail
data = []
for i, v in enumerate(result.values()):
v['No'] = i + 1
data.append(v)
return data
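
A hedged sketch of how the factory and the detail helpers above might be called from other modules; `my_kb` is a placeholder and the knowledge base is assumed to have been created beforehand:

```python
# Hypothetical caller; assumes a FAISS knowledge base named "my_kb" already exists.
kb = KBServiceFactory.get_service_by_name("my_kb")
if kb is not None and kb.exists():
    print(kb.list_docs())

# Tabular overviews as consumed by the web UI pages.
for row in get_kb_details():
    print(row["No"], row["kb_name"], row["in_folder"], row["in_db"])

for row in get_kb_doc_details("my_kb"):
    print(row["No"], row["file_name"], row["in_folder"], row["in_db"])
```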

View File

@ -0,0 +1,3 @@
from .langchain_splitter import LCTextSplitter
__all__ = ["LCTextSplitter"]

View File

@ -0,0 +1,71 @@
import os
import importlib
from loguru import logger
from langchain.document_loaders.base import BaseLoader
from langchain.text_splitter import (
SpacyTextSplitter, RecursiveCharacterTextSplitter
)
from configs.model_config import (
CHUNK_SIZE,
OVERLAP_SIZE,
ZH_TITLE_ENHANCE
)
from dev_opsgpt.utils.path_utils import *
class LCTextSplitter:
'''langchain textsplitter 执行file2text'''
def __init__(
self, filepath: str, text_splitter_name: str = None
):
self.filepath = filepath
self.ext = os.path.splitext(filepath)[-1].lower()
self.text_splitter_name = text_splitter_name
if self.ext not in SUPPORTED_EXTS:
raise ValueError(f"暂未支持的文件格式 {self.ext}")
self.document_loader_name = get_LoaderClass(self.ext)
def file2text(self, ):
loader = self._load_document()
text_splitter = self._load_text_splitter()
if self.document_loader_name in ["JSONLoader", "JSONLLoader"]:
docs = loader.load()
else:
docs = loader.load_and_split(text_splitter)
if docs:
logger.info(docs[0])
return docs
def _load_document(self, ) -> BaseLoader:
DocumentLoader = EXT2LOADER_DICT[self.ext]
if self.document_loader_name == "UnstructuredFileLoader":
loader = DocumentLoader(self.filepath, autodetect_encoding=True)
else:
loader = DocumentLoader(self.filepath)
return loader
def _load_text_splitter(self, ):
try:
if self.text_splitter_name is None:
text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CHUNK_SIZE,
chunk_overlap=OVERLAP_SIZE,
)
self.text_splitter_name = "SpacyTextSplitter"
elif self.document_loader_name in ["JSONLoader", "JSONLLoader"]:
text_splitter = None
else:
text_splitter_module = importlib.import_module('langchain.text_splitter')
TextSplitter = getattr(text_splitter_module, self.text_splitter_name)
text_splitter = TextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=OVERLAP_SIZE)
except Exception as e:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=OVERLAP_SIZE,
)
return text_splitter
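
A minimal sketch of driving `LCTextSplitter` for a single local file; the path is a placeholder and the defaults from model_config (Spacy splitter, CHUNK_SIZE, OVERLAP_SIZE) are assumed:

```python
# Hypothetical example; the path is a placeholder with a supported extension.
splitter = LCTextSplitter("/path/to/some_doc.md")
docs = splitter.file2text()
print(len(docs), splitter.document_loader_name, splitter.text_splitter_name)
```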

View File

View File

@ -0,0 +1,6 @@
from .server_utils import BaseResponse, ListResponse
from .common_utils import func_timer
__all__ = [
"BaseResponse", "ListResponse", "func_timer"
]

View File

@ -0,0 +1,67 @@
import textwrap, time, copy, random, hashlib, json, os
from datetime import datetime, timedelta
from functools import wraps
from loguru import logger
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
def timestampToDateformat(ts, interval=1000, dateformat=DATE_FORMAT):
'''将标准时间戳转换为指定的标准时间格式'''
return datetime.fromtimestamp(ts//interval).strftime(dateformat)
def datefromatToTimestamp(dt, interval=1000, dateformat=DATE_FORMAT):
'''将标准时间格式转换为标准时间戳'''
return datetime.strptime(dt, dateformat).timestamp()*interval
def func_timer(function):
'''
用装饰器实现函数计时
:param function: 需要计时的函数
:return: 包装后的函数
'''
@wraps(function)
def function_timer(*args, **kwargs):
t0 = time.time()
result = function(*args, **kwargs)
t1 = time.time()
logger.info('[Function: {name} finished, spent time: {time:.3f}s]'.format(
name=function.__name__,
time=t1 - t0
))
return result
return function_timer
def read_jsonl_file(filename):
data = []
with open(filename, "r", encoding="utf-8") as f:
for line in f:
data.append(json.loads(line))
return data
def save_to_jsonl_file(data, filename):
dir_name = os.path.dirname(filename)
if dir_name and not os.path.exists(dir_name): os.makedirs(dir_name)
with open(filename, "w", encoding="utf-8") as f:
for item in data:
f.write(json.dumps(item, ensure_ascii=False) + "\n")
def read_json_file(filename):
with open(filename, "r", encoding="utf-8") as f:
return json.load(f)
def save_to_json_file(data, filename):
dir_name = os.path.dirname(filename)
if dir_name and not os.path.exists(dir_name): os.makedirs(dir_name)
with open(filename, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
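
A short, hypothetical sketch of the helpers above, assuming `func_timer` receives the wrapped function as its argument (as in the corrected signature) and that the output path is a placeholder:

```python
# Hypothetical usage of the timing decorator and the JSONL helpers.
@func_timer
def build_records(n: int):
    return [{"idx": i, "ts": timestampToDateformat(int(time.time() * 1000))} for i in range(n)]

records = build_records(3)                                  # logs the elapsed time via loguru
save_to_jsonl_file(records, "jupyter_work/records.jsonl")   # placeholder path
assert read_jsonl_file("jupyter_work/records.jsonl") == records
```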

View File

@ -0,0 +1,70 @@
import os
from langchain.document_loaders import CSVLoader, PyPDFLoader, UnstructuredFileLoader, TextLoader, PythonLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from dev_opsgpt.document_loaders import JSONLLoader, JSONLoader
from configs.model_config import (
embedding_model_dict,
KB_ROOT_PATH,
)
from loguru import logger
LOADERNAME2LOADER_DICT = {
"UnstructuredFileLoader": UnstructuredFileLoader,
"CSVLoader": CSVLoader,
"PyPDFLoader": PyPDFLoader,
"TextLoader": TextLoader,
"PythonLoader": PythonLoader,
"JSONLoader": JSONLoader,
"JSONLLoader": JSONLLoader
}
LOADER2EXT_DICT = {"UnstructuredFileLoader": ['.eml', '.html', '.md', '.msg', '.rst',
'.rtf', '.xml',
'.doc', '.docx', '.epub', '.odt',
'.ppt', '.pptx', '.tsv'],
"CSVLoader": [".csv"],
"PyPDFLoader": [".pdf"],
"TextLoader": ['.txt'],
"PythonLoader": ['.py'],
"JSONLoader": ['.json'],
"JSONLLoader": ['.jsonl'],
}
EXT2LOADER_DICT = {ext: LOADERNAME2LOADER_DICT[k] for k, exts in LOADER2EXT_DICT.items() for ext in exts}
SUPPORTED_EXTS = [ext for sublist in LOADER2EXT_DICT.values() for ext in sublist]
def validate_kb_name(knowledge_base_id: str) -> bool:
# 检查是否包含预期外的字符或路径攻击关键字
if "../" in knowledge_base_id:
return False
return True
def get_kb_path(knowledge_base_name: str):
return os.path.join(KB_ROOT_PATH, knowledge_base_name)
def get_doc_path(knowledge_base_name: str):
return os.path.join(get_kb_path(knowledge_base_name), "content")
def get_vs_path(knowledge_base_name: str):
return os.path.join(get_kb_path(knowledge_base_name), "vector_store")
def get_file_path(knowledge_base_name: str, doc_name: str):
return os.path.join(get_doc_path(knowledge_base_name), doc_name)
def list_kbs_from_folder():
return [f for f in os.listdir(KB_ROOT_PATH)
if os.path.isdir(os.path.join(KB_ROOT_PATH, f))]
def list_docs_from_folder(kb_name: str):
doc_path = get_doc_path(kb_name)
return [file for file in os.listdir(doc_path)
if os.path.isfile(os.path.join(doc_path, file))]
def get_LoaderClass(file_extension):
for LoaderClass, extensions in LOADER2EXT_DICT.items():
if file_extension in extensions:
return LoaderClass
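
A small sketch of the path helpers above; the knowledge base and file names are placeholders and KB_ROOT_PATH is assumed to point at an existing knowledge_base directory:

```python
# Hypothetical lookup of paths and loader class for an uploaded file.
kb_name, doc_name = "my_kb", "guide.md"       # placeholders
if validate_kb_name(kb_name):
    print(get_kb_path(kb_name))                # <KB_ROOT_PATH>/my_kb
    print(get_doc_path(kb_name))               # <KB_ROOT_PATH>/my_kb/content
    print(get_file_path(kb_name, doc_name))    # <KB_ROOT_PATH>/my_kb/content/guide.md
    print(get_LoaderClass(".md"))              # -> "UnstructuredFileLoader"
```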

View File

@ -0,0 +1,191 @@
import pydantic
from pydantic import BaseModel
from typing import List
import torch
from fastapi import FastAPI
from pathlib import Path
import asyncio
from typing import Any, Optional
from loguru import logger
class BaseResponse(BaseModel):
code: int = pydantic.Field(200, description="API status code")
msg: str = pydantic.Field("success", description="API status message")
class Config:
schema_extra = {
"example": {
"code": 200,
"msg": "success",
}
}
class ListResponse(BaseResponse):
data: List[str] = pydantic.Field(..., description="List of names")
class Config:
schema_extra = {
"example": {
"code": 200,
"msg": "success",
"data": ["doc1.docx", "doc2.pdf", "doc3.txt"],
}
}
class ChatMessage(BaseModel):
question: str = pydantic.Field(..., description="Question text")
response: str = pydantic.Field(..., description="Response text")
history: List[List[str]] = pydantic.Field(..., description="History text")
source_documents: List[str] = pydantic.Field(
..., description="List of source documents and their scores"
)
class Config:
schema_extra = {
"example": {
"question": "工伤保险如何办理?",
"response": "根据已知信息,可以总结如下:\n\n1. 参保单位为员工缴纳工伤保险费,以保障员工在发生工伤时能够获得相应的待遇。\n"
"2. 不同地区的工伤保险缴费规定可能有所不同,需要向当地社保部门咨询以了解具体的缴费标准和规定。\n"
"3. 工伤从业人员及其近亲属需要申请工伤认定,确认享受的待遇资格,并按时缴纳工伤保险费。\n"
"4. 工伤保险待遇包括工伤医疗、康复、辅助器具配置费用、伤残待遇、工亡待遇、一次性工亡补助金等。\n"
"5. 工伤保险待遇领取资格认证包括长期待遇领取人员认证和一次性待遇领取人员认证。\n"
"6. 工伤保险基金支付的待遇项目包括工伤医疗待遇、康复待遇、辅助器具配置费用、一次性工亡补助金、丧葬补助金等。",
"history": [
[
"工伤保险是什么?",
"工伤保险是指用人单位按照国家规定,为本单位的职工和用人单位的其他人员,缴纳工伤保险费,"
"由保险机构按照国家规定的标准,给予工伤保险待遇的社会保险制度。",
]
],
"source_documents": [
"出处 [1] 广州市单位从业的特定人员参加工伤保险办事指引.docx\n\n\t"
"( 一) 从业单位 (组织) 按“自愿参保”原则, 为未建 立劳动关系的特定从业人员单项参加工伤保险 、缴纳工伤保 险费。",
"出处 [2] ...",
"出处 [3] ...",
],
}
}
def torch_gc():
if torch.cuda.is_available():
# with torch.cuda.device(DEVICE):
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
elif torch.backends.mps.is_available():
try:
from torch.mps import empty_cache
empty_cache()
except Exception as e:
print(e)
print("如果您使用的是 macOS 建议将 pytorch 版本升级至 2.0.0 或更高版本,以支持及时清理 torch 产生的内存占用。")
def run_async(cor):
'''
在同步环境中运行异步代码.
'''
try:
loop = asyncio.get_event_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
return loop.run_until_complete(cor)
def iter_over_async(ait, loop):
'''
将异步生成器封装成同步生成器.
'''
ait = ait.__aiter__()
async def get_next():
try:
obj = await ait.__anext__()
return False, obj
except StopAsyncIteration:
return True, None
while True:
done, obj = loop.run_until_complete(get_next())
if done:
break
yield obj
def MakeFastAPIOffline(
app: FastAPI,
static_dir = Path(__file__).parent / "static",
static_url = "/static-offline-docs",
docs_url: Optional[str] = "/docs",
redoc_url: Optional[str] = "/redoc",
) -> None:
"""patch the FastAPI obj that doesn't rely on CDN for the documentation page"""
from fastapi import Request
from fastapi.openapi.docs import (
get_redoc_html,
get_swagger_ui_html,
get_swagger_ui_oauth2_redirect_html,
)
from fastapi.staticfiles import StaticFiles
from starlette.responses import HTMLResponse
openapi_url = app.openapi_url
swagger_ui_oauth2_redirect_url = app.swagger_ui_oauth2_redirect_url
def remove_route(url: str) -> None:
'''
remove original route from app
'''
index = None
for i, r in enumerate(app.routes):
if r.path.lower() == url.lower():
index = i
break
if isinstance(index, int):
app.routes.pop(index)
# Set up static file mount
app.mount(
static_url,
StaticFiles(directory=Path(static_dir).as_posix()),
name="static-offline-docs",
)
if docs_url is not None:
remove_route(docs_url)
remove_route(swagger_ui_oauth2_redirect_url)
# Define the doc and redoc pages, pointing at the right files
@app.get(docs_url, include_in_schema=False)
async def custom_swagger_ui_html(request: Request) -> HTMLResponse:
root = request.scope.get("root_path")
favicon = f"{root}{static_url}/favicon.png"
return get_swagger_ui_html(
openapi_url=f"{root}{openapi_url}",
title=app.title + " - Swagger UI",
oauth2_redirect_url=swagger_ui_oauth2_redirect_url,
swagger_js_url=f"{root}{static_url}/swagger-ui-bundle.js",
swagger_css_url=f"{root}{static_url}/swagger-ui.css",
swagger_favicon_url=favicon,
)
@app.get(swagger_ui_oauth2_redirect_url, include_in_schema=False)
async def swagger_ui_redirect() -> HTMLResponse:
return get_swagger_ui_oauth2_redirect_html()
if redoc_url is not None:
remove_route(redoc_url)
@app.get(redoc_url, include_in_schema=False)
async def redoc_html(request: Request) -> HTMLResponse:
root = request.scope.get("root_path")
favicon = f"{root}{static_url}/favicon.png"
return get_redoc_html(
openapi_url=f"{root}{openapi_url}",
title=app.title + " - ReDoc",
redoc_js_url=f"{root}{static_url}/redoc.standalone.js",
with_google_fonts=False,
redoc_favicon_url=favicon,
)
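
A hedged sketch of wiring `MakeFastAPIOffline` into a FastAPI app, assuming the bundled Swagger/ReDoc assets exist under the module's `static` directory:

```python
# Hypothetical API entry point using the offline-docs patch above.
from fastapi import FastAPI

app = FastAPI(title="DevOps-ChatBot API")
MakeFastAPIOffline(app)  # /docs and /redoc now load assets from /static-offline-docs

@app.get("/ping", response_model=BaseResponse)
def ping() -> BaseResponse:
    return BaseResponse(code=200, msg="pong")
```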

Binary file not shown.


File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,9 @@
from .dialogue import dialogue_page, chat_box
from .document import knowledge_page
from .prompt import prompt_page
from .utils import ApiRequest
__all__ = [
"dialogue_page", "chat_box", "prompt_page", "knowledge_page",
"ApiRequest"
]

View File

@ -0,0 +1,221 @@
import streamlit as st
from streamlit_chatbox import *
from typing import List, Dict
from datetime import datetime
from .utils import *
from dev_opsgpt.utils import *
from dev_opsgpt.chat.search_chat import SEARCH_ENGINES
chat_box = ChatBox(
assistant_avatar="../sources/imgs/devops-chatbot2.png"
)
GLOBAL_EXE_CODE_TEXT = ""
def get_messages_history(history_len: int) -> List[Dict]:
def filter(msg):
'''
针对当前简单文本对话只返回每条消息的第一个element的内容
'''
content = [x._content for x in msg["elements"] if x._output_method in ["markdown", "text"]]
return {
"role": msg["role"],
"content": content[0] if content else "",
}
history = chat_box.filter_history(100000, filter) # workaround before upgrading streamlit-chatbox.
user_count = 0
i = 1
for i in range(1, len(history) + 1):
if history[-i]["role"] == "user":
user_count += 1
if user_count >= history_len:
break
return history[-i:]
def dialogue_page(api: ApiRequest):
global GLOBAL_EXE_CODE_TEXT
chat_box.init_session()
with st.sidebar:
# TODO: 对话模型与会话绑定
def on_mode_change():
mode = st.session_state.dialogue_mode
text = f"已切换到 {mode} 模式。"
if mode == "知识库问答":
cur_kb = st.session_state.get("selected_kb")
if cur_kb:
text = f"{text} 当前知识库: `{cur_kb}`。"
st.toast(text)
# sac.alert(text, description="descp", type="success", closable=True, banner=True)
dialogue_mode = st.selectbox("请选择对话模式",
["LLM 对话",
"知识库问答",
"搜索引擎问答",
],
on_change=on_mode_change,
key="dialogue_mode",
)
history_len = st.number_input("历史对话轮数:", 0, 10, 3)
# todo: support history len
def on_kb_change():
st.toast(f"已加载知识库: {st.session_state.selected_kb}")
if dialogue_mode == "知识库问答":
with st.expander("知识库配置", True):
kb_list = api.list_knowledge_bases(no_remote_api=True)
selected_kb = st.selectbox(
"请选择知识库:",
kb_list,
on_change=on_kb_change,
key="selected_kb",
)
kb_top_k = st.number_input("匹配知识条数:", 1, 20, 3)
score_threshold = st.number_input("知识匹配分数阈值:", 0.0, float(SCORE_THRESHOLD), float(SCORE_THRESHOLD), float(SCORE_THRESHOLD//100))
# chunk_content = st.checkbox("关联上下文", False, disabled=True)
# chunk_size = st.slider("关联长度:", 0, 500, 250, disabled=True)
elif dialogue_mode == "搜索引擎问答":
with st.expander("搜索引擎配置", True):
search_engine = st.selectbox("请选择搜索引擎", SEARCH_ENGINES.keys(), 0)
se_top_k = st.number_input("匹配搜索结果条数:", 1, 20, 3)
code_interpreter_on = st.toggle("开启代码解释器")
code_exec_on = st.toggle("自动执行代码")
# Display chat messages from history on app rerun
chat_box.output_messages()
chat_input_placeholder = "请输入对话内容,换行请使用 Ctrl+Enter "
code_text = GLOBAL_EXE_CODE_TEXT
codebox_res = None
if prompt := st.chat_input(chat_input_placeholder, key="prompt"):
history = get_messages_history(history_len)
chat_box.user_say(prompt)
if dialogue_mode == "LLM 对话":
chat_box.ai_say("正在思考...")
text = ""
r = api.chat_chat(prompt, history)
for t in r:
if error_msg := check_error_msg(t): # check whether an error occurred
st.error(error_msg)
break
text += t["answer"]
chat_box.update_msg(text)
logger.debug(f"text: {text}")
chat_box.update_msg(text, streaming=False) # 更新最终的字符串,去除光标
# 判断是否存在代码,并提供编辑、执行功能
code_text = api.codebox.decode_code_from_text(text)
GLOBAL_EXE_CODE_TEXT = code_text
if code_text and code_exec_on:
codebox_res = api.codebox_chat("```"+code_text+"```", do_code_exe=True)
elif dialogue_mode == "知识库问答":
history = get_messages_history(history_len)
chat_box.ai_say([
f"正在查询知识库 `{selected_kb}` ...",
Markdown("...", in_expander=True, title="知识库匹配结果"),
])
text = ""
for idx_count, d in enumerate(api.knowledge_base_chat(prompt, selected_kb, kb_top_k, score_threshold, history)):
if error_msg := check_error_msg(d): # check whether an error occurred
st.error(error_msg)
text += d["answer"]
if idx_count%10 == 0:
chat_box.update_msg(text, element_index=0)
# chat_box.update_msg("知识库匹配结果: \n\n".join(d["docs"]), element_index=1, streaming=False, state="complete")
chat_box.update_msg(text, element_index=0, streaming=False) # 更新最终的字符串,去除光标
chat_box.update_msg("知识库匹配结果: \n\n".join(d["docs"]), element_index=1, streaming=False, state="complete")
# 判断是否存在代码,并提供编辑、执行功能
code_text = api.codebox.decode_code_from_text(text)
GLOBAL_EXE_CODE_TEXT = code_text
if code_text and code_exec_on:
codebox_res = api.codebox_chat("```"+code_text+"```", do_code_exe=True)
elif dialogue_mode == "搜索引擎问答":
chat_box.ai_say([
f"正在执行 `{search_engine}` 搜索...",
Markdown("...", in_expander=True, title="网络搜索结果"),
])
text = ""
d = {"docs": []}
for idx_count, d in enumerate(api.search_engine_chat(prompt, search_engine, se_top_k)):
if error_msg := check_error_msg(d): # check whether an error occurred
st.error(error_msg)
text += d["answer"]
if idx_count % 10 == 0:
chat_box.update_msg(text, element_index=0)
# chat_box.update_msg("搜索匹配结果: \n\n".join(d["docs"]), element_index=1, streaming=False)
chat_box.update_msg(text, element_index=0, streaming=False) # 更新最终的字符串,去除光标
chat_box.update_msg("搜索匹配结果: \n\n".join(d["docs"]), element_index=1, streaming=False, state="complete")
# 判断是否存在代码,并提供编辑、执行功能
code_text = api.codebox.decode_code_from_text(text)
GLOBAL_EXE_CODE_TEXT = code_text
if code_text and code_exec_on:
codebox_res = api.codebox_chat("```"+code_text+"```", do_code_exe=True)
if code_interpreter_on:
with st.expander("代码编辑执行器", False):
code_part = st.text_area("代码片段", code_text, key="code_text")
cols = st.columns(2)
if cols[0].button(
"修改对话",
use_container_width=True,
):
code_text = code_part
GLOBAL_EXE_CODE_TEXT = code_text
st.toast("修改对话成功")
if cols[1].button(
"执行代码",
use_container_width=True
):
if code_text:
codebox_res = api.codebox_chat("```"+code_text+"```", do_code_exe=True)
st.toast("正在执行代码")
else:
st.toast("code 不能为空")
#TODO 这段信息会被记录到history里
if codebox_res is not None and codebox_res.code_exe_status != 200:
st.toast(f"{codebox_res.code_exe_response}")
if codebox_res is not None and codebox_res.code_exe_status == 200:
st.toast(f"codebox_chat {codebox_res}")
chat_box.ai_say(Markdown(code_text, in_expander=True, title="code interpreter", unsafe_allow_html=True), )
if codebox_res.code_exe_type == "image/png":
base_text = f"```\n{code_text}\n```\n\n"
img_html = "<img src='data:image/png;base64,{}' class='img-fluid'>".format(
codebox_res.code_exe_response
)
chat_box.update_msg(base_text + img_html, streaming=False, state="complete")
else:
chat_box.update_msg('```\n'+code_text+'\n```'+"\n\n"+'```\n'+codebox_res.code_exe_response+'\n```',
streaming=False, state="complete")
now = datetime.now()
with st.sidebar:
cols = st.columns(2)
export_btn = cols[0]
if cols[1].button(
"清空对话",
use_container_width=True,
):
chat_box.reset_history()
GLOBAL_EXE_CODE_TEXT = ""
st.experimental_rerun()
export_btn.download_button(
"导出记录",
"".join(chat_box.export2md()),
file_name=f"{now:%Y-%m-%d %H.%M}_对话记录.md",
mime="text/markdown",
use_container_width=True,
)

View File

@ -0,0 +1,15 @@
# 容器安装
## windows docker install
### 系统要求
[Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) 支持 64 位版本的 Windows 10 Pro且必须开启 Hyper-V若版本为 v1903 及以上则无需开启 Hyper-V或者 64 位版本的 Windows 10 Home v1903 及以上版本。
- [【全面详细】Windows10 Docker安装详细教程](https://zhuanlan.zhihu.com/p/441965046)
- [Docker 从入门到实践](https://yeasy.gitbook.io/docker_practice/install/windows)
- [Docker Desktop requires the Server service to be enabled 处理](https://blog.csdn.net/sunhy_csdn/article/details/106526991)
- [安装wsl或者等报错提示](https://learn.microsoft.com/zh-cn/windows/wsl/install)
## linux docker install
linux 安装相对比较简单,请自行 baidu/google 相关安装

View File

@ -0,0 +1,326 @@
import streamlit as st
import os
import time
import traceback
from typing import Literal, Dict, Tuple
from st_aggrid import AgGrid, JsCode
from st_aggrid.grid_options_builder import GridOptionsBuilder
import pandas as pd
from configs.model_config import embedding_model_dict, kbs_config, EMBEDDING_MODEL, DEFAULT_VS_TYPE, WEB_CRAWL_PATH
from .utils import *
from dev_opsgpt.utils.path_utils import *
from dev_opsgpt.service.service_factory import get_kb_details, get_kb_doc_details
from dev_opsgpt.orm import table_init
# SENTENCE_SIZE = 100
cell_renderer = JsCode("""function(params) {if(params.value==true){return ''}else{return '×'}}""")
def config_aggrid(
df: pd.DataFrame,
columns: Dict[Tuple[str, str], Dict] = {},
selection_mode: Literal["single", "multiple", "disabled"] = "single",
use_checkbox: bool = False,
) -> GridOptionsBuilder:
gb = GridOptionsBuilder.from_dataframe(df)
gb.configure_column("No", width=40)
for (col, header), kw in columns.items():
gb.configure_column(col, header, wrapHeaderText=True, **kw)
gb.configure_selection(
selection_mode=selection_mode,
use_checkbox=use_checkbox,
# pre_selected_rows=st.session_state.get("selected_rows", [0]),
)
return gb
def file_exists(kb: str, selected_rows: List) -> Tuple[str, str]:
'''
check whether a doc file exists in local knowledge base folder.
return the file's name and path if it exists.
'''
if selected_rows:
file_name = selected_rows[0]["file_name"]
file_path = get_file_path(kb, file_name)
if os.path.isfile(file_path):
return file_name, file_path
return "", ""
def knowledge_page(api: ApiRequest):
# 判断表是否存在并进行初始化
table_init()
try:
kb_list = {x["kb_name"]: x for x in get_kb_details()}
except Exception as e:
st.error("获取知识库信息错误,请检查是否已按照 `README.md` 中 `4 知识库初始化与迁移` 步骤完成初始化或迁移,或是否为数据库连接错误。")
st.stop()
kb_names = list(kb_list.keys())
if "selected_kb_name" in st.session_state and st.session_state["selected_kb_name"] in kb_names:
selected_kb_index = kb_names.index(st.session_state["selected_kb_name"])
else:
selected_kb_index = 0
def format_selected_kb(kb_name: str) -> str:
if kb := kb_list.get(kb_name):
return f"{kb_name} ({kb['vs_type']} @ {kb['embed_model']})"
else:
return kb_name
selected_kb = st.selectbox(
"请选择或新建知识库:",
kb_names + ["新建知识库"],
format_func=format_selected_kb,
index=selected_kb_index
)
if selected_kb == "新建知识库":
with st.form("新建知识库"):
kb_name = st.text_input(
"新建知识库名称",
placeholder="新知识库名称,不支持中文命名",
key="kb_name",
)
cols = st.columns(2)
vs_types = list(kbs_config.keys())
vs_type = cols[0].selectbox(
"向量库类型",
vs_types,
index=vs_types.index(DEFAULT_VS_TYPE),
key="vs_type",
)
embed_models = list(embedding_model_dict.keys())
embed_model = cols[1].selectbox(
"Embedding 模型",
embed_models,
index=embed_models.index(EMBEDDING_MODEL),
key="embed_model",
)
submit_create_kb = st.form_submit_button(
"新建",
# disabled=not bool(kb_name),
use_container_width=True,
)
if submit_create_kb:
if not kb_name or not kb_name.strip():
st.error(f"知识库名称不能为空!")
elif kb_name in kb_list:
st.error(f"名为 {kb_name} 的知识库已经存在!")
else:
ret = api.create_knowledge_base(
knowledge_base_name=kb_name,
vector_store_type=vs_type,
embed_model=embed_model,
)
st.toast(ret.get("msg", " "))
st.session_state["selected_kb_name"] = kb_name
st.experimental_rerun()
elif selected_kb:
kb = selected_kb
# 上传文件
# sentence_size = st.slider("文本入库分句长度限制", 1, 1000, SENTENCE_SIZE, disabled=True)
files = st.file_uploader("上传知识文件",
[i for ls in LOADER2EXT_DICT.values() for i in ls],
accept_multiple_files=True,
)
base_url = st.text_input(
"待获取内容的URL地址",
placeholder="请填写正确可打开的URL地址",
key="base_url",
)
if st.button(
"添加URL内容到知识库",
disabled= base_url is None or base_url=="",
):
filename = base_url.replace("https://", " ").\
replace("http://", " ").replace("/", " ").\
replace("?", " ").replace("=", " ").replace(".", " ").strip()
html_name = "_".join(filename.split(" ",) + ["html.jsonl"])
text_name = "_".join(filename.split(" ",) + ["text.jsonl"])
html_path = os.path.join(WEB_CRAWL_PATH, html_name,)
text_path = os.path.join(WEB_CRAWL_PATH, text_name,)
# if not os.path.exists(text_dir) or :
st.toast(base_url)
st.toast(html_path)
st.toast(text_path)
res = api.web_crawl(
base_url=base_url,
html_dir=html_path,
text_dir=text_path,
do_dfs = False,
reptile_lib="requests",
method="get",
time_sleep=2,
)
if res["status"] == 200:
st.toast(res["response"], icon="")
data = [{"file": text_path, "filename": text_name, "knowledge_base_name": kb, "not_refresh_vs_cache": False}]
for k in data:
ret = api.upload_kb_doc(**k)
logger.info(ret)
if msg := check_success_msg(ret):
st.toast(msg, icon="")
elif msg := check_error_msg(ret):
st.toast(msg, icon="")
st.session_state.files = []
else:
st.toast(res["response"], icon="")
if os.path.exists(html_path):
os.remove(html_path)
if st.button(
"添加文件到知识库",
# help="请先上传文件,再点击添加",
# use_container_width=True,
disabled=len(files) == 0,
):
data = [{"file": f, "knowledge_base_name": kb, "not_refresh_vs_cache": True} for f in files]
data[-1]["not_refresh_vs_cache"]=False
for k in data:
ret = api.upload_kb_doc(**k)
if msg := check_success_msg(ret):
st.toast(msg, icon="")
elif msg := check_error_msg(ret):
st.toast(msg, icon="")
st.session_state.files = []
st.divider()
# 知识库详情
# st.info("请选择文件,点击按钮进行操作。")
doc_details = pd.DataFrame(get_kb_doc_details(kb))
if not len(doc_details):
st.info(f"知识库 `{kb}` 中暂无文件")
else:
st.write(f"知识库 `{kb}` 中已有文件:")
st.info("知识库中包含源文件与向量库,请从下表中选择文件后操作")
doc_details.drop(columns=["kb_name"], inplace=True)
doc_details = doc_details[[
"No", "file_name", "document_loader", "text_splitter", "in_folder", "in_db",
]]
# doc_details["in_folder"] = doc_details["in_folder"].replace(True, "✓").replace(False, "×")
# doc_details["in_db"] = doc_details["in_db"].replace(True, "✓").replace(False, "×")
gb = config_aggrid(
doc_details,
{
("No", "序号"): {},
("file_name", "文档名称"): {},
# ("file_ext", "文档类型"): {},
# ("file_version", "文档版本"): {},
("document_loader", "文档加载器"): {},
("text_splitter", "分词器"): {},
# ("create_time", "创建时间"): {},
("in_folder", "源文件"): {"cellRenderer": cell_renderer},
("in_db", "向量库"): {"cellRenderer": cell_renderer},
},
"multiple",
)
doc_grid = AgGrid(
doc_details,
gb.build(),
columns_auto_size_mode="FIT_CONTENTS",
theme="alpine",
custom_css={
"#gridToolBar": {"display": "none"},
},
allow_unsafe_jscode=True
)
selected_rows = doc_grid.get("selected_rows", [])
cols = st.columns(4)
file_name, file_path = file_exists(kb, selected_rows)
if file_path:
with open(file_path, "rb") as fp:
cols[0].download_button(
"下载选中文档",
fp,
file_name=file_name,
use_container_width=True, )
else:
cols[0].download_button(
"下载选中文档",
"",
disabled=True,
use_container_width=True, )
st.write()
# 将文件分词并加载到向量库中
if cols[1].button(
"重新添加至向量库" if selected_rows and (pd.DataFrame(selected_rows)["in_db"]).any() else "添加至向量库",
disabled=not file_exists(kb, selected_rows)[0],
use_container_width=True,
):
for row in selected_rows:
api.update_kb_doc(kb, row["file_name"])
st.experimental_rerun()
# 将文件从向量库中删除,但不删除文件本身。
if cols[2].button(
"从向量库删除",
disabled=not (selected_rows and selected_rows[0]["in_db"]),
use_container_width=True,
):
for row in selected_rows:
api.delete_kb_doc(kb, row["file_name"])
st.experimental_rerun()
if cols[3].button(
"从知识库中删除",
type="primary",
use_container_width=True,
):
for row in selected_rows:
ret = api.delete_kb_doc(kb, row["file_name"], True)
st.toast(ret.get("msg", " "))
st.experimental_rerun()
st.divider()
cols = st.columns(3)
# todo: freezed
if cols[0].button(
"依据源文件重建向量库",
# help="无需上传文件通过其它方式将文档拷贝到对应知识库content目录下点击本按钮即可重建知识库。",
use_container_width=True,
type="primary",
):
with st.spinner("向量库重构中,请耐心等待,勿刷新或关闭页面。"):
empty = st.empty()
empty.progress(0.0, "")
for d in api.recreate_vector_store(kb):
if msg := check_error_msg(d):
st.toast(msg)
else:
empty.progress(d["finished"] / d["total"], f"正在处理: {d['doc']}")
st.experimental_rerun()
if cols[2].button(
"删除知识库",
use_container_width=True,
):
ret = api.delete_knowledge_base(kb)
st.toast(ret.get("msg", " "))
time.sleep(1)
st.experimental_rerun()

View File

@ -0,0 +1,40 @@
import streamlit as st
import os
import time
from datetime import datetime
import traceback
from typing import Literal, Dict, Tuple
from st_aggrid import AgGrid, JsCode
from st_aggrid.grid_options_builder import GridOptionsBuilder
import pandas as pd
from configs.model_config import embedding_model_dict, kbs_config, EMBEDDING_MODEL, DEFAULT_VS_TYPE
from .utils import *
from dev_opsgpt.utils.path_utils import *
from dev_opsgpt.service.service_factory import get_kb_details, get_kb_doc_details
from dev_opsgpt.orm import table_init
def prompt_page(api: ApiRequest):
# 判断表是否存在并进行初始化
table_init()
now = datetime.now()
with st.sidebar:
cols = st.columns(2)
export_btn = cols[0]
if cols[1].button(
"清空prompt",
use_container_width=True,
):
st.experimental_rerun()
export_btn.download_button(
"导出记录",
"测试prompt",
file_name=f"{now:%Y-%m-%d %H.%M}_对话记录.md",
mime="text/markdown",
use_container_width=True,
)

709
dev_opsgpt/webui/utils.py Normal file
View File

@ -0,0 +1,709 @@
# 该文件包含webui通用工具,可以被不同的webui使用
from typing import *
from pathlib import Path
from io import BytesIO
import httpx
import asyncio
from fastapi.responses import StreamingResponse
import contextlib
import json
import nltk
import traceback
from loguru import logger
from configs.model_config import (
EMBEDDING_MODEL,
DEFAULT_VS_TYPE,
KB_ROOT_PATH,
LLM_MODEL,
SCORE_THRESHOLD,
VECTOR_SEARCH_TOP_K,
SEARCH_ENGINE_TOP_K,
NLTK_DATA_PATH,
logger,
)
from configs.server_config import SANDBOX_SERVER
from dev_opsgpt.utils.server_utils import run_async, iter_over_async
from dev_opsgpt.service.kb_api import *
from dev_opsgpt.chat import LLMChat, SearchChat, KnowledgeChat
from dev_opsgpt.sandbox import PyCodeBox, CodeBoxResponse
from web_crawler.utils.WebCrawler import WebCrawler
nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
def set_httpx_timeout(timeout=60.0):
'''
设置httpx默认timeout到60秒
httpx默认timeout是5秒,在请求LLM回答时不够用
'''
httpx._config.DEFAULT_TIMEOUT_CONFIG.connect = timeout
httpx._config.DEFAULT_TIMEOUT_CONFIG.read = timeout
httpx._config.DEFAULT_TIMEOUT_CONFIG.write = timeout
KB_ROOT_PATH = Path(KB_ROOT_PATH)
set_httpx_timeout()
class ApiRequest:
'''
api.py调用的封装,主要实现:
1. 简化api调用方式
2. 实现无api调用(直接运行server.chat.*中的视图函数获取结果),无需启动api.py
'''
def __init__(
self,
base_url: str = "http://127.0.0.1:7861",
timeout: float = 60.0,
no_remote_api: bool = False, # call api view function directly
):
self.base_url = base_url
self.timeout = timeout
self.no_remote_api = no_remote_api
self.llmChat = LLMChat()
self.searchChat = SearchChat()
self.knowledgeChat = KnowledgeChat()
self.codebox = PyCodeBox(
remote_url=SANDBOX_SERVER["url"],
remote_ip=SANDBOX_SERVER["host"], # "http://localhost",
remote_port=SANDBOX_SERVER["port"],
token="mytoken",
do_code_exe=True,
do_remote=SANDBOX_SERVER["do_remote"]
)
def codebox_chat(self, text: str, file_path: str = None, do_code_exe: bool = None) -> CodeBoxResponse:
return self.codebox.chat(text, file_path, do_code_exe=do_code_exe)
def _parse_url(self, url: str) -> str:
if (not url.startswith("http")
and self.base_url
):
part1 = self.base_url.strip(" /")
part2 = url.strip(" /")
return f"{part1}/{part2}"
else:
return url
def get(
self,
url: str,
params: Union[Dict, List[Tuple], bytes] = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any,
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
while retry > 0:
try:
if stream:
return httpx.stream("GET", url, params=params, **kwargs)
else:
return httpx.get(url, params=params, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
async def aget(
self,
url: str,
params: Union[Dict, List[Tuple], bytes] = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any,
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
async with httpx.AsyncClient() as client:
while retry > 0:
try:
if stream:
return await client.stream("GET", url, params=params, **kwargs)
else:
return await client.get(url, params=params, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
def post(
self,
url: str,
data: Dict = None,
json: Dict = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
while retry > 0:
try:
# return requests.post(url, data=data, json=json, stream=stream, **kwargs)
if stream:
return httpx.stream("POST", url, data=data, json=json, **kwargs)
else:
return httpx.post(url, data=data, json=json, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
async def apost(
self,
url: str,
data: Dict = None,
json: Dict = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
async with httpx.AsyncClient() as client:
while retry > 0:
try:
if stream:
return await client.stream("POST", url, data=data, json=json, **kwargs)
else:
return await client.post(url, data=data, json=json, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
def delete(
self,
url: str,
data: Dict = None,
json: Dict = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
while retry > 0:
try:
if stream:
return httpx.stream("DELETE", url, data=data, json=json, **kwargs)
else:
return httpx.delete(url, data=data, json=json, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
async def adelete(
self,
url: str,
data: Dict = None,
json: Dict = None,
retry: int = 3,
stream: bool = False,
**kwargs: Any
) -> Union[httpx.Response, None]:
url = self._parse_url(url)
kwargs.setdefault("timeout", self.timeout)
async with httpx.AsyncClient() as client:
while retry > 0:
try:
if stream:
return await client.stream("DELETE", url, data=data, json=json, **kwargs)
else:
return await client.delete(url, data=data, json=json, **kwargs)
except Exception as e:
logger.error(e)
retry -= 1
def _fastapi_stream2generator(self, response: StreamingResponse, as_json: bool =False):
'''
将api.py中视图函数返回的StreamingResponse转化为同步生成器
'''
try:
loop = asyncio.get_event_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
try:
for chunk in iter_over_async(response.body_iterator, loop):
if as_json and chunk:
yield json.loads(chunk)
elif chunk.strip():
yield chunk
except Exception as e:
logger.error(traceback.format_exc())
def _httpx_stream2generator(
self,
response: contextlib._GeneratorContextManager,
as_json: bool = False,
):
'''
将httpx.stream返回的GeneratorContextManager转化为普通生成器
'''
try:
with response as r:
for chunk in r.iter_text(None):
if as_json and chunk:
yield json.loads(chunk)
elif chunk.strip():
yield chunk
except httpx.ConnectError as e:
msg = "无法连接API服务器,请确认 api.py 已正常启动。"
logger.error(msg)
logger.error(e)
yield {"code": 500, "msg": msg}
except httpx.ReadTimeout as e:
msg = "API通信超时,请确认已启动FastChat与API服务,详见README '5. 启动 API 服务或 Web UI'"
logger.error(msg)
logger.error(e)
yield {"code": 500, "msg": msg}
except Exception as e:
logger.error(e)
yield {"code": 500, "msg": str(e)}
def chat_chat(
self,
query: str,
history: List[Dict] = [],
stream: bool = True,
no_remote_api: bool = None,
):
'''
对应api.py/chat/chat接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"query": query,
"history": history,
"stream": stream,
}
if no_remote_api:
response = self.llmChat.chat(**data)
return self._fastapi_stream2generator(response, as_json=True)
else:
response = self.post("/chat/chat", json=data, stream=True)
return self._httpx_stream2generator(response)
def knowledge_base_chat(
self,
query: str,
knowledge_base_name: str,
top_k: int = VECTOR_SEARCH_TOP_K,
score_threshold: float = SCORE_THRESHOLD,
history: List[Dict] = [],
stream: bool = True,
no_remote_api: bool = None,
):
'''
对应api.py/chat/knowledge_base_chat接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"query": query,
"engine_name": knowledge_base_name,
"top_k": top_k,
"score_threshold": score_threshold,
"history": history,
"stream": stream,
"local_doc_url": no_remote_api,
}
if no_remote_api:
response = self.knowledgeChat.chat(**data)
return self._fastapi_stream2generator(response, as_json=True)
else:
response = self.post(
"/chat/knowledge_base_chat",
json=data,
stream=True,
)
return self._httpx_stream2generator(response, as_json=True)
def search_engine_chat(
self,
query: str,
search_engine_name: str,
top_k: int = SEARCH_ENGINE_TOP_K,
stream: bool = True,
no_remote_api: bool = None,
):
'''
对应api.py/chat/search_engine_chat接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"query": query,
"engine_name": search_engine_name,
"top_k": top_k,
"history": [],
"stream": stream,
}
if no_remote_api:
response = self.searchChat.chat(**data)
return self._fastapi_stream2generator(response, as_json=True)
else:
response = self.post(
"/chat/search_engine_chat",
json=data,
stream=True,
)
return self._httpx_stream2generator(response, as_json=True)
# 知识库相关操作
def _check_httpx_json_response(
self,
response: httpx.Response,
errorMsg: str = "无法连接API服务器,请确认已执行 python server\\api.py",
) -> Dict:
'''
Check whether httpx returned correct data in a normal Response.
Errors in APIs with streaming support are checked in _httpx_stream2generator.
'''
try:
return response.json()
except Exception as e:
logger.error(e)
return {"code": 500, "msg": errorMsg or str(e)}
def list_knowledge_bases(
self,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/list_knowledge_bases接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
if no_remote_api:
response = run_async(list_kbs())
return response.data
else:
response = self.get("/knowledge_base/list_knowledge_bases")
data = self._check_httpx_json_response(response)
return data.get("data", [])
def create_knowledge_base(
self,
knowledge_base_name: str,
vector_store_type: str = "faiss",
embed_model: str = EMBEDDING_MODEL,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/create_knowledge_base接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"knowledge_base_name": knowledge_base_name,
"vector_store_type": vector_store_type,
"embed_model": embed_model,
}
if no_remote_api:
response = run_async(create_kb(**data))
return response.dict()
else:
response = self.post(
"/knowledge_base/create_knowledge_base",
json=data,
)
return self._check_httpx_json_response(response)
def delete_knowledge_base(
self,
knowledge_base_name: str,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/delete_knowledge_base接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
if no_remote_api:
response = run_async(delete_kb(knowledge_base_name))
return response.dict()
else:
response = self.post(
"/knowledge_base/delete_knowledge_base",
json=f"{knowledge_base_name}",
)
return self._check_httpx_json_response(response)
def list_kb_docs(
self,
knowledge_base_name: str,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/list_docs接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
if no_remote_api:
response = run_async(list_docs(knowledge_base_name))
return response.data
else:
response = self.get(
"/knowledge_base/list_docs",
params={"knowledge_base_name": knowledge_base_name}
)
data = self._check_httpx_json_response(response)
return data.get("data", [])
def upload_kb_doc(
self,
file: Union[str, Path, bytes],
knowledge_base_name: str,
filename: str = None,
override: bool = False,
not_refresh_vs_cache: bool = False,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/upload_docs接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
if isinstance(file, bytes): # raw bytes
file = BytesIO(file)
elif hasattr(file, "read"): # a file io like object
filename = filename or file.name
else: # a local path
file = Path(file).absolute().open("rb")
filename = filename or file.name
if no_remote_api:
from fastapi import UploadFile
from tempfile import SpooledTemporaryFile
temp_file = SpooledTemporaryFile(max_size=10 * 1024 * 1024)
temp_file.write(file.read())
temp_file.seek(0)
response = run_async(upload_doc(
UploadFile(file=temp_file, filename=filename),
knowledge_base_name,
override,
not_refresh_vs_cache
))
return response.dict()
else:
response = self.post(
"/knowledge_base/upload_doc",
data={
"knowledge_base_name": knowledge_base_name,
"override": override,
"not_refresh_vs_cache": not_refresh_vs_cache,
},
files={"file": (filename, file)},
)
return self._check_httpx_json_response(response)
def delete_kb_doc(
self,
knowledge_base_name: str,
doc_name: str,
delete_content: bool = False,
not_refresh_vs_cache: bool = False,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/delete_doc接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"knowledge_base_name": knowledge_base_name,
"doc_name": doc_name,
"delete_content": delete_content,
"not_refresh_vs_cache": not_refresh_vs_cache,
}
if no_remote_api:
response = run_async(delete_doc(**data))
return response.dict()
else:
response = self.post(
"/knowledge_base/delete_doc",
json=data,
)
return self._check_httpx_json_response(response)
def update_kb_doc(
self,
knowledge_base_name: str,
file_name: str,
not_refresh_vs_cache: bool = False,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/update_doc接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
if no_remote_api:
response = run_async(update_doc(knowledge_base_name, file_name, not_refresh_vs_cache))
return response.dict()
else:
response = self.post(
"/knowledge_base/update_doc",
json={
"knowledge_base_name": knowledge_base_name,
"file_name": file_name,
"not_refresh_vs_cache": not_refresh_vs_cache,
},
)
return self._check_httpx_json_response(response)
def recreate_vector_store(
self,
knowledge_base_name: str,
allow_empty_kb: bool = True,
vs_type: str = DEFAULT_VS_TYPE,
embed_model: str = EMBEDDING_MODEL,
no_remote_api: bool = None,
):
'''
对应api.py/knowledge_base/recreate_vector_store接口
'''
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"knowledge_base_name": knowledge_base_name,
"allow_empty_kb": allow_empty_kb,
"vs_type": vs_type,
"embed_model": embed_model,
}
if no_remote_api:
response = run_async(recreate_vector_store(**data))
return self._fastapi_stream2generator(response, as_json=True)
else:
response = self.post(
"/knowledge_base/recreate_vector_store",
json=data,
stream=True,
timeout=None,
)
return self._httpx_stream2generator(response, as_json=True)
def web_crawl(
self,
base_url: str,
html_dir: str,
text_dir: str,
do_dfs: bool = False,
reptile_lib: str = "requests",
method: str = "get",
time_sleep: float = 2,
no_remote_api: bool = None
):
'''
根据url来检索
'''
async def _web_crawl(html_dir, text_dir, base_url, reptile_lib, method, time_sleep, do_dfs):
wc = WebCrawler()
try:
if not do_dfs:
wc.webcrawler_single(html_dir=html_dir,
text_dir=text_dir,
base_url=base_url,
reptile_lib=reptile_lib,
method=method,
time_sleep=time_sleep
)
else:
wc.webcrawler_1_degree(html_dir=html_dir,
text_dir=text_dir,
base_url=base_url,
reptile_lib=reptile_lib,
method=method,
time_sleep=time_sleep
)
return {"status": 200, "response": "success"}
except Exception as e:
return {"status": 500, "response": str(e)}
if no_remote_api is None:
no_remote_api = self.no_remote_api
data = {
"base_url": base_url,
"html_dir": html_dir,
"text_dir": text_dir,
"do_dfs": do_dfs,
"reptile_lib": reptile_lib,
"method": method,
"time_sleep": time_sleep,
}
if no_remote_api:
response = run_async(_web_crawl(**data))
return response
else:
raise NotImplementedError("web_crawl is not implemented for the remote API yet")
def check_error_msg(data: Union[str, dict, list], key: str = "errorMsg") -> str:
'''
Return the error message if an error occurred when requesting the API.
'''
if isinstance(data, dict):
if key in data:
return data[key]
if "code" in data and data["code"] != 200:
return data["msg"]
return ""
def check_success_msg(data: Union[str, dict, list], key: str = "msg") -> str:
'''
Return the success message if the API request succeeded.
'''
if (isinstance(data, dict)
and key in data
and "code" in data
and data["code"] == 200):
return data[key]
return ""
if __name__ == "__main__":
api = ApiRequest(no_remote_api=True)
# print(api.chat_fastchat(
# messages=[{"role": "user", "content": "hello"}]
# ))
# with api.chat_chat("你好") as r:
# for t in r.iter_text(None):
# print(t)
# r = api.chat_chat("你好", no_remote_api=True)
# for t in r:
# print(t)
# r = api.duckduckgo_search_chat("室温超导最新研究进展", no_remote_api=True)
# for t in r:
# print(t)
# print(api.list_knowledge_bases())

3
docker_build.sh Normal file
View File

@ -0,0 +1,3 @@
#!/bin/bash
docker build -t devopsgpt:pypy38 .

32
docker_requirements.txt Normal file
View File

@ -0,0 +1,32 @@
langchain==0.0.266
openai
sentence_transformers
fschat==0.2.24
transformers>=4.31.0
# torch~=2.0.0
fastapi~=0.99.1
nltk~=3.8.1
uvicorn~=0.23.1
starlette~=0.27.0
pydantic~=1.10.11
SQLAlchemy==2.0.19
faiss-cpu
nltk
loguru
pypdf
duckduckgo-search
pysocks
accelerate
matplotlib
seaborn
jupyter
notebook
# uncomment libs if you want to use corresponding vector store
# pymilvus==2.1.3 # requires milvus==2.1.3
# psycopg2
# pgvector
numpy~=1.24.4
pandas~=2.0.3
httpx~=0.24.1

0
domain/code/README.md Normal file
View File

177
domain/devops/README.md Normal file
View File

@ -0,0 +1,177 @@
# <p align="center">DevOps-ChatBot: Development by Private Knowledge Augmentation</p>
<p align="center">
<a href="README.md"><img src="https://img.shields.io/badge/文档-中文版-yellow.svg" alt="ZH doc"></a>
<a href="README_EN.md"><img src="https://img.shields.io/badge/document-英文版-yellow.svg" alt="EN doc"></a>
<img src="https://img.shields.io/github/license/codefuse-ai/codefuse-chatbot" alt="License">
<a href="https://github.com/codefuse-ai/codefuse-chatbot/issues">
<img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/codefuse-chatbot" />
</a>
<br><br>
</p>
本项目是一个开源的 AI 智能助手,专为软件开发的全生命周期而设计,涵盖设计、编码、测试、部署和运维等阶段。通过知识检索、工具使用和沙箱执行,DevOps-ChatBot 能解答您开发过程中的各种专业问题,并支持以问答方式操作周边独立分散的平台。
<!-- ![Alt text](sources/docs_imgs/objective.png) -->
## 🔔 更新
- [2023.09.15] 本地/隔离环境的沙盒功能开放,基于爬虫实现指定 URL 的知识检索
## 📜 目录
- [🤝 介绍](#-介绍)
- [🎥 演示视频](#-演示视频)
- [🧭 技术路线](#-技术路线)
- [🚀 快速使用](#-快速使用)
- [🤗 致谢](#-致谢)
## 🤝 介绍
💡 本项目旨在通过检索增强生成Retrieval Augmented GenerationRAG、工具学习Tool Learning和沙盒环境来构建软件开发全生命周期的AI智能助手涵盖设计、编码、测试、部署和运维等阶段。 逐渐从各处资料查询、独立分散平台操作的传统开发运维模式转变到大模型问答的智能化开发运维模式,改变人们的开发运维习惯。
- 📚 知识库管理DevOps专业高质量知识库 + 企业级知识库自助构建 + 对话实现快速检索开源/私有技术文档
- 🐳 隔离沙盒环境:实现代码的快速编译执行测试
- 🔄 React范式支撑代码的自我迭代、自动执行
- 🛠️ Prompt管理实现各种开发、运维任务的prompt管理
- 🚀 对话驱动:需求设计、系分设计、代码生成、开发测试、部署运维自动化
<div align=center>
<img src="../../sources/docs_imgs/objective_v4.png" alt="图片" width="600" height="333">
</div>
🌍 依托于开源的 LLM 与 Embedding 模型,本项目可实现基于开源模型的离线私有部署。此外,本项目也支持 OpenAI API 的调用。
👥 核心研发团队长期专注于 AIOps + NLP 领域的研究。我们发起了 DevOpsGPT 项目,希望大家广泛贡献高质量的开发和运维文档,共同完善这套解决方案,以实现“让天下没有难做的开发”的目标。
## 🎥 演示视频
为了帮助您更直观地了解 DevOps-ChatBot 的功能和使用方法,我们录制了一个演示视频。您可以通过观看此视频,快速了解本项目的主要特性和操作流程。
[演示视频](https://www.youtube.com/watch?v=UGJdTGaVnNY&t=2s&ab_channel=HaotianZhu)
## 🧭 技术路线
<div align=center>
<img src="../../sources/docs_imgs/devops-chatbot-module.png" alt="图片" width="600" height="503">
</div>
- 🕷️ **Web Crawl**:实现定期网络文档爬取,确保数据的及时性,并依赖于开源社区的持续补充。
- 🗂️ **DocLoader & TextSplitter**:对从多种来源爬取的数据进行数据清洗、去重和分类,并支持私有文档的导入。
- 🗄️ **Vector Database**结合Text Embedding模型对文档进行Embedding并在Milvus中存储。
- 🔌 **Connector**作为调度中心负责LLM与Vector Database之间的交互调度基于Langchain技术实现。
- 📝 **Prompt Control**从开发和运维角度设计为不同问题分类并为Prompt添加背景确保答案的可控性和完整性。
- 💬 **LLM**默认使用GPT-3.5-turbo并为私有部署和其他涉及隐私的场景提供专有模型选择。
- 🔤 **Text Embedding**默认采用OpenAI的Text Embedding模型支持私有部署和其他隐私相关场景并提供专有模型选择。
- 🚧 **SandBox**对于生成的输出如代码为帮助用户判断其真实性提供了一个交互验证环境基于FaaS并支持用户进行调整。
具体实现明细见:[技术路线明细](../../sources/readme_docs/roadmap.md)
## 🚀 快速使用
请自行安装 nvidia 驱动程序,本项目已在 Python 3.9.18、CUDA 11.7 环境下,Windows、X86 架构的 macOS 系统中完成测试。
1、python 环境准备
- 推荐采用 conda 对 python 环境进行管理(可选)
```bash
# 准备 conda 环境
conda create --name devopsgpt python=3.9
conda activate devopsgpt
```
- 安装相关依赖
```bash
cd DevOps-ChatBot
pip install -r requirements.txt
# 安装完成后,确认电脑是否兼容 notebook=6.5.5 版本,若不兼容执行更新命令
pip install --upgrade notebook
# 修改 docker_requirements.txt 的 notebook 版本设定,用于后续构建新的隔离镜像
notebook=6.5.5 => notebook
```
2、沙盒环境准备
- windows Docker 安装:
[Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) 支持 64 位版本的 Windows 10 Pro且必须开启 Hyper-V若版本为 v1903 及以上则无需开启 Hyper-V或者 64 位版本的 Windows 10 Home v1903 及以上版本。
- [【全面详细】Windows10 Docker安装详细教程](https://zhuanlan.zhihu.com/p/441965046)
- [Docker 从入门到实践](https://yeasy.gitbook.io/docker_practice/install/windows)
- [Docker Desktop requires the Server service to be enabled 处理](https://blog.csdn.net/sunhy_csdn/article/details/106526991)
- [安装wsl或者等报错提示](https://learn.microsoft.com/zh-cn/windows/wsl/install)
- Linux Docker 安装:
Linux 安装相对比较简单,请自行 baidu/google 相关安装
- Mac Docker 安装
- [Docker 从入门到实践](https://yeasy.gitbook.io/docker_practice/install/mac)
```bash
# 构建沙盒环境的镜像notebook版本问题见上述
bash docker_build.sh
```
3、模型下载可选
如需使用开源 LLM 与 Embedding 模型可以从 HuggingFace 下载。
此处以 THUDM/chatglm2-6b 和 text2vec-base-chinese 为例:
```
# install git-lfs
git lfs install
# install LLM-model
git lfs clone https://huggingface.co/THUDM/chatglm2-6b
# install Embedding-model
git lfs clone https://huggingface.co/shibing624/text2vec-base-chinese
```
4、基础配置
```bash
# 修改服务启动的基础配置
cd configs
cp model_config.py.example model_config.py
cp server_config.py.example server_config.py
# model_config#11~12 若需要使用 openai 接口,请配置 openai 接口 key
os.environ["OPENAI_API_KEY"] = "sk-xxx"
# 可自行替换自己需要的api_base_url
os.environ["API_BASE_URL"] = "https://api.openai.com/v1"
# vi model_config#95 你需要选择的语言模型
LLM_MODEL = "gpt-3.5-turbo"
# vi model_config#33 你需要选择的向量模型
EMBEDDING_MODEL = "text2vec-base"
# vi model_config#19 修改成你的本地路径如果能直接连接huggingface则无需修改
"text2vec-base": "/home/user/xx/text2vec-base-chinese",
# 是否启动本地的notebook用于代码解释,默认启动docker的notebook
# vi server_config#35,True启动docker的notebook,False启动local的notebook
"do_remote": False, / "do_remote": True,
```
5、启动服务
默认只启动webui相关服务,未启动fastchat(可选)
```bash
# 若需要支撑codellama-34b-int4模型需要给fastchat打一个补丁
# cp examples/gptq.py ~/site-packages/fastchat/modules/gptq.py
# start llm-service(可选)
python dev_opsgpt/service/llm_api.py
```
```bash
cd examples
# python ../dev_opsgpt/service/llm_api.py 若需使用本地大语言模型,可执行该命令
bash start_webui.sh
```
## 🤗 致谢
本项目基于[langchain-chatchat](https://github.com/chatchat-space/Langchain-Chatchat)和[codebox-api](https://github.com/shroominic/codebox-api),在此深深感谢他们的开源贡献!

169
domain/devops/README_en.md Normal file
View File

@ -0,0 +1,169 @@
# <p align="center">DevOps-ChatBot: Development by Private Knowledge Augmentation</p>
<p align="center">
<a href="README.md"><img src="https://img.shields.io/badge/文档-中文版-yellow.svg" alt="ZH doc"></a>
<a href="README_EN.md"><img src="https://img.shields.io/badge/document-英文版-yellow.svg" alt="EN doc"></a>
<img src="https://img.shields.io/github/license/codefuse-ai/codefuse-chatbot" alt="License">
<a href="https://github.com/codefuse-ai/codefuse-chatbot/issues">
<img alt="Open Issues" src="https://img.shields.io/github/issues-raw/codefuse-ai/codefuse-chatbot" />
</a>
<br><br>
</p>
This project is an open-source AI intelligent assistant, specifically designed for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations. Through knowledge retrieval, tool utilization, and sandbox execution, DevOps-ChatBot can answer various professional questions during your development process and perform question-answering operations on standalone, disparate platforms.
<!-- ![Alt text](sources/docs_imgs/objective.png) -->
## 🔔 Updates
- [2023.09.15] Sandbox features for local/isolated environments are now available, implementing specified URL knowledge retrieval based on web crawling.
## 📜 Contents
- [🤝 Introduction](#-introduction)
- [🧭 Technical Route](#-technical-route)
- [🚀 Quick Start](#-quick-start)
- [🤗 Acknowledgements](#-acknowledgements)
## 🤝 Introduction
💡 The aim of this project is to construct an AI intelligent assistant for the entire lifecycle of software development, covering design, coding, testing, deployment, and operations, through Retrieval Augmented Generation (RAG), Tool Learning, and sandbox environments. It transitions gradually from the traditional development and operations mode of querying information from various sources and operating on standalone, disparate platforms to an intelligent development and operations mode based on large-model Q&A, changing people's development and operations habits.
- 📚 Knowledge Base Management: Professional high-quality DevOps knowledge base + enterprise-level knowledge base self-construction + dialogue-based fast retrieval of open-source/private technical documents.
- 🐳 Isolated Sandbox Environment: Enables quick compilation, execution, and testing of code.
- 🔄 React Paradigm: Supports code self-iteration and automatic execution.
- 🛠️ Prompt Management: Manages prompts for various development and operations tasks.
- 🚀 Conversation Driven: Automates requirement design, system analysis design, code generation, development testing, deployment, and operations.
<div align=center>
<img src="../../sources/docs_imgs/objective_v4.png" alt="Image" width="600" height="333">
</div>
🌍 Relying on open-source LLM and Embedding models, this project can achieve offline private deployments based on open-source models. Additionally, this project also supports the use of the OpenAI API.
👥 The core development team has been long-term focused on research in the AIOps + NLP domain. We initiated the DevOpsGPT project, hoping that everyone could contribute high-quality development and operations documents widely, jointly perfecting this solution to achieve the goal of "Making Development Seamless for Everyone."
## 🧭 Technical Route
<div align=center>
<img src="../../sources/docs_imgs/devops-chatbot-module.png" alt="Image" width="600" height="503">
</div>
- 🕷️ **Web Crawl**: Implements periodic web document crawling to ensure data timeliness and relies on continuous supplementation from the open-source community.
- 🗂️ **DocLoader & TextSplitter**: Cleans, deduplicates, and categorizes data crawled from various sources and supports the import of private documents.
- 🗄️ **Vector Database**: Integrates Text Embedding models to embed documents and store them in Milvus.
- 🔌 **Connector**: Acts as the scheduling center, responsible for coordinating interactions between LLM and Vector Database, implemented based on Langchain technology.
- 📝 **Prompt Control**: Designs from development and operations perspectives, categorizes different problems, and adds backgrounds to prompts to ensure the controllability and completeness of answers.
- 💬 **LLM**: Uses GPT-3.5-turbo by default and provides proprietary model options for private deployments and other privacy-related scenarios.
- 🔤 **Text Embedding**: Uses OpenAI's Text Embedding model by default, supports private deployments and other privacy-related scenarios, and provides proprietary model options.
- 🚧 **SandBox**: For generated outputs, like code, to help users judge their authenticity, an interactive verification environment is provided (based on FaaS), allowing user adjustments.
For implementation details, see: [Technical Route Details](../../sources/readme_docs/roadmap.md)
## 🚀 Quick Start
Please install the Nvidia driver yourself; this project has been tested on Python 3.9.18, CUDA 11.7, Windows, and X86 architecture macOS systems.
1. Preparation of Python environment
- It is recommended to use conda to manage the python environment (optional)
```bash
# Prepare conda environment
conda create --name devopsgpt python=3.9
conda activate devopsgpt
```
- Install related dependencies
```bash
cd DevOps-ChatBot
pip install -r requirements.txt
# After installation, confirm whether the computer is compatible with notebook=6.5.5 version; if incompatible, execute the update command
pip install --upgrade notebook
# Modify the notebook version setting in docker_requirements.txt for building new isolated images later
notebook=6.5.5 => notebook
```
2. Preparation of Sandbox Environment
- Windows Docker installation:
[Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) supports 64-bit versions of Windows 10 Pro, with Hyper-V enabled (not required for versions v1903 and above), or 64-bit versions of Windows 10 Home v1903 and above.
- [Comprehensive Detailed Windows 10 Docker Installation Tutorial](https://zhuanlan.zhihu.com/p/441965046)
- [Docker: From Beginner to Practitioner](https://yeasy.gitbook.io/docker_practice/install/windows)
- [Handling Docker Desktop requires the Server service to be enabled](https://blog.csdn.net/sunhy_csdn/article/details/106526991)
- [Install wsl or wait for error prompt](https://learn.microsoft.com/en-us/windows/wsl/install)
- Linux Docker Installation:
Linux installation is relatively simple, please search Baidu/Google for installation instructions.
- Mac Docker Installation
- [Docker: From Beginner to Practitioner](https://yeasy.gitbook.io/docker_practice/install/mac)
```bash
# Build images for the sandbox environment, see above for notebook version issues
bash docker_build.sh
```
3. Model Download (Optional)
If you need to use open-source LLM and Embedding models, you can download them from HuggingFace.
Here, we use THUDM/chatglm2-6b and text2vec-base-chinese as examples:
```
# install git-lfs
git lfs install
# install LLM-model
git lfs clone https://huggingface.co/THUDM/chatglm2-6b
# install Embedding-model
git lfs clone https://huggingface.co/shibing624/text2vec-base-chinese
```
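After downloading, a quick load test helps confirm the checkpoints are complete. The snippet below is only a sketch: the paths are placeholders for wherever you cloned the models, and note that chatglm2-6b requires `trust_remote_code=True`.

```python
# Smoke-test the downloaded checkpoints by loading only the tokenizers (paths are placeholders).
from transformers import AutoTokenizer

llm_tokenizer = AutoTokenizer.from_pretrained("./chatglm2-6b", trust_remote_code=True)
embed_tokenizer = AutoTokenizer.from_pretrained("./text2vec-base-chinese")
print("tokenizers loaded OK:", llm_tokenizer is not None and embed_tokenizer is not None)
```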
4. Basic Configuration
```bash
# Modify the basic configuration for service startup
cd configs
cp model_config.py.example model_config.py
cp server_config.py.example server_config.py
# model_config#11~12: if you want to use the OpenAI interface, set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-xxx"
# You can replace the api_base_url yourself
os.environ["API_BASE_URL"] = "https://api.openai.com/v1"
# vi model_config#95 You need to choose the language model
LLM_MODEL = "gpt-3.5-turbo"
# vi model_config#33 You need to choose the vector model
EMBEDDING_MODEL = "text2vec-base"
# vi model_config#19 Modify to your local path, if you can directly connect to huggingface, no modification is needed
"text2vec-base": "/home/user/xx/text2vec-base-chinese",
# Choose whether code interpretation runs in a local notebook or in the docker notebook (docker is the default)
# vi server_config#35: True starts the docker notebook, False starts the local notebook
"do_remote": False, / "do_remote": True,
```
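Putting these edits together, a minimally edited model_config.py / server_config.py might contain lines like the following. This is a hedged excerpt based only on the settings shown above; the surrounding file contents, the exact variable names (e.g. `embedding_model_dict`), and line numbers may differ in the real config files.

```python
# model_config.py (excerpt) -- values from the steps above, adapt to your environment
import os

os.environ["OPENAI_API_KEY"] = "sk-xxx"                    # your OpenAI key, if the OpenAI interface is used
os.environ["API_BASE_URL"] = "https://api.openai.com/v1"   # or a self-hosted compatible endpoint

LLM_MODEL = "gpt-3.5-turbo"        # language model to use
EMBEDDING_MODEL = "text2vec-base"  # vector model to use
embedding_model_dict = {           # assumed dict name; maps the model key to a local path
    "text2vec-base": "/home/user/xx/text2vec-base-chinese",
}

# server_config.py (excerpt) -- assumed structure of SANDBOX_SERVER
SANDBOX_SERVER = {
    "port": 5050,        # must match the port exposed by the sandbox container
    "do_remote": True,   # True: docker notebook, False: local notebook
}
```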
5. Start the Service
By default, only the webui-related services are started; starting fastchat is optional.
```bash
# if you use codellama-34b-int4, you should replace fastchat's gptq.py
# cp examples/gptq.py ~/site-packages/fastchat/modules/gptq.py
# start llm-service (optional)
python dev_opsgpt/service/llm_api.py
```
```bash
cd examples
# If you want to use a local large language model, run: python ../dev_opsgpt/service/llm_api.py
bash start_webui.sh
```
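Once the services are up, a quick check that the sandbox notebook is reachable can look like this. It is only a sketch and assumes the default port 5050 configured in server_config's SANDBOX_SERVER, as used by examples/start_sandbox.py.

```python
# Check that the sandbox notebook answers on the configured port (assumed default: 5050).
import requests

resp = requests.get("http://localhost:5050", timeout=10)
print("sandbox notebook reachable:", resp.status_code == 200)
```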
## 🤗 Acknowledgements
This project is based on [langchain-chatchat](https://github.com/chatchat-space/Langchain-Chatchat) and [codebox-api](https://github.com/shroominic/codebox-api). We deeply appreciate their contributions to open source!

9
env_start.sh Normal file
View File

@ -0,0 +1,9 @@
#!/bin/bash
pip install -r requirements.txt
# torch-gpu installation depends on your specific configuration
# pip install torch==2.0.1+cu118 cudatoolkit --index-url https://download.pytorch.org/whl/cu118
# pip3 uninstall crypto
# pip3 uninstall pycrypto
# pip3 install pycryptodome

17
examples/docker_start.sh Normal file
View File

@ -0,0 +1,17 @@
#!/bin/bash
set -x
CONTAINER_NAME=devopsgpt_default
IMAGES=devopsgpt:pypy3
WORK_DIR=$PWD
docker stop $CONTAINER_NAME
docker rm $CONTAINER_NAME
EXTERNAL_PORT=5050
# linux start
# docker run -it -p 5050:5050 --name $CONTAINER_NAME $IMAGES bash
# windows start
winpty docker run -it -d -p $EXTERNAL_PORT:5050 --name $CONTAINER_NAME $IMAGES bash

121
examples/gptq.py Normal file
View File

@ -0,0 +1,121 @@
from dataclasses import dataclass, field
import os
from os.path import isdir, isfile
from pathlib import Path
import sys
from transformers import AutoTokenizer
@dataclass
class GptqConfig:
ckpt: str = field(
default=None,
metadata={
"help": "Load quantized model. The path to the local GPTQ checkpoint."
},
)
wbits: int = field(default=16, metadata={"help": "#bits to use for quantization"})
groupsize: int = field(
default=-1,
metadata={"help": "Groupsize to use for quantization; default uses full row."},
)
act_order: bool = field(
default=True,
metadata={"help": "Whether to apply the activation order GPTQ heuristic"},
)
def load_quant_by_autogptq(model):
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
model = AutoGPTQForCausalLM.from_quantized(model,
inject_fused_attention=False,
inject_fused_mlp=False,
use_cuda_fp16=True,
disable_exllama=False,
device_map='auto'
)
return model
def load_gptq_quantized(model_name, gptq_config: GptqConfig):
print("Loading GPTQ quantized model...")
if gptq_config.act_order:
try:
script_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
module_path = os.path.join(script_path, "../repositories/GPTQ-for-LLaMa")
sys.path.insert(0, module_path)
from llama import load_quant
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# only `fastest-inference-4bit` branch cares about `act_order`
model = load_quant(
model_name,
find_gptq_ckpt(gptq_config),
gptq_config.wbits,
gptq_config.groupsize,
act_order=gptq_config.act_order,
)
except ImportError as e:
print(f"Error: Failed to load GPTQ-for-LLaMa. {e}")
print("See https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md")
sys.exit(-1)
else:
# other branches
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = load_quant_by_autogptq(model_name)
return model, tokenizer
# def load_gptq_quantized(model_name, gptq_config: GptqConfig):
# print("Loading GPTQ quantized model...")
# try:
# script_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
# module_path = os.path.join(script_path, "repositories/GPTQ-for-LLaMa")
# sys.path.insert(0, module_path)
# from llama import load_quant
# except ImportError as e:
# print(f"Error: Failed to load GPTQ-for-LLaMa. {e}")
# print("See https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md")
# sys.exit(-1)
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# # only `fastest-inference-4bit` branch cares about `act_order`
# if gptq_config.act_order:
# model = load_quant(
# model_name,
# find_gptq_ckpt(gptq_config),
# gptq_config.wbits,
# gptq_config.groupsize,
# act_order=gptq_config.act_order,
# )
# else:
# # other branches
# model = load_quant(
# model_name,
# find_gptq_ckpt(gptq_config),
# gptq_config.wbits,
# gptq_config.groupsize,
# )
# return model, tokenizer
def find_gptq_ckpt(gptq_config: GptqConfig):
if Path(gptq_config.ckpt).is_file():
return gptq_config.ckpt
# for ext in ["*.pt", "*.safetensors",]:
for ext in ["*.pt", "*.bin",]:
matched_result = sorted(Path(gptq_config.ckpt).glob(ext))
if len(matched_result) > 0:
return str(matched_result[-1])
print("Error: gptq checkpoint not found")
sys.exit(1)
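A hedged usage sketch for the helpers above: the checkpoint path and quantization parameters are placeholders, and with `act_order=False` the auto_gptq path is taken, which requires the auto-gptq package to be installed.

```python
# Illustrative call into the helpers defined above (path and parameters are placeholders).
config = GptqConfig(
    ckpt="/path/to/codellama-34b-gptq",  # local GPTQ checkpoint directory or file
    wbits=4,
    groupsize=128,
    act_order=False,                     # False -> load via auto_gptq instead of GPTQ-for-LLaMa
)
model, tokenizer = load_gptq_quantized("/path/to/codellama-34b-gptq", config)
```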

View File

@ -0,0 +1,35 @@
Stack trace:
Frame Function Args
065CA2341B0 00180064365 (001802A267B, 00180266FD1, 00180310720, 065CA234690)
065CA2346E0 001800499D2 (00000000000, 00000000000, 00000000000, 00000000000)
065CA2356F0 00180049A11 (00000000032, 000000018A1, 00180310720, 001803512E5)
065CA235720 0018017C34A (00000000000, 00000000000, 00000000000, 00000000000)
065CA2357C0 00180107A01 (065CA2369E8, 000000018A1, 00180310720, 000FFFFFFFF)
065CA236A50 0018016142F (7FFD86AEB0EB, 001803512E5, 065CA236A88, 00000000000)
065CA236BF0 00180142EBB (7FFD86AEB0EB, 001803512E5, 065CA236A88, 00000000000)
065CA236BF0 004B39318AE (24B346C45C0, 065CA236C98, 24B00A54270, 00000000000)
065CA236CD0 004B3936115 (00000000000, 24B36D861A0, FFC1F2061D5F707D, 065CA236DB0)
065CA236D80 7FFDB39F4541 (0000000000A, 065CA236FD0, 7FFDB39F4262, 00000000000)
065CA236DB0 7FFDB39F4332 (065CA237010, 00000000000, 24B594D6870, 00000000000)
065CA237010 7FFDB39F4212 (7FFD86AE24F7, 7FFD8696D680, 24B77D7BD60, 065CA236FB0)
065CA237010 7FFD8695CABF (004B39319D0, 065CA236FC0, 00000001101, 24B00000000)
065CA237010 7FFD8695D629 (24B00A4D6C0, 00000000000, 00000000000, 06500001101)
24B00A4DEE0 7FFD869577FD (24B00A573C0, 00000000000, 00000000000, 24B00A4FB20)
00000000000 7FFD86AE0256 (065CA237299, 24B00A4FB10, 24B2688BDF8, 00000000003)
065CA237299 7FFD86BC8D88 (24B594D6870, 065CA237299, 00000000000, 24B00A47F70)
065CA237299 7FFD86BC2DF8 (00000000000, 24B00A505B0, 00000000043, 24B00A47F70)
00000000001 7FFD86BC798A (00000000002, 24B00A4E898, 7FFD86B49A3A, 00000000001)
00000000000 7FFD86AE0AAF (065CA237599, 24B008E7BB8, 7FFD86B2B9DA, 24B008E7BA8)
065CA237599 7FFD86BC03F6 (24B75AE03C8, 24B752735CA, 000000000A0, 00000000062)
065CA237599 7FFD86BC8D88 (24B594D6870, 065CA237599, 24B75273340, 7FFD86E45B00)
065CA237599 7FFD86BC517F (00000000000, 24B00A50D40, 00000000040, 24B34AC8A08)
00000000000 7FFD86BC798A (00000000000, 24B00A4F180, 00000000000, 00000000000)
24B00A50D40 7FFD86BC17F4 (7FFD86D82364, 24B00A50D40, 00000000010, 24B00A20B30)
24B00A50D40 7FFD86BBCA4F (24B34AD0810, 24B34AD5478, 24B594D6870, 24B594D6870)
00000000000 7FFD86B26BF5 (24B34AD0810, 24B00A4F240, 7FFDCD9B5BA1, 065CA237840)
00000000000 7FFD86AE03A6 (24B00A52280, 24B00A4F240, 24B00A52298, 24B345B7B32)
24B00A4F240 7FFD86BC8EF0 (24B00A52280, 065CA237949, 24B34AD5470, 00000000038)
065CA237949 7FFD86BC52D8 (00000000000, 24B34AD99D0, 0000000004F, 00000000000)
00000000003 7FFD86BC798A (00000000001, 00000000000, 24B516F49B0, 00000000003)
00000000000 7FFD86AE0AAF (065CA237C49, 24B2687E5C8, 7FFD86B28DEE, 24B2687E5A8)
End of stack trace (more stack frames may be present)

49
examples/start_sandbox.py Normal file
View File

@ -0,0 +1,49 @@
import docker, sys, os, time, requests
from loguru import logger
src_dir = os.path.join(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(src_dir)
from configs.server_config import CONTRAINER_NAME, SANDBOX_SERVER, IMAGE_NAME
if SANDBOX_SERVER["do_remote"]:
client = docker.from_env()
for i in client.containers.list(all=True):
if i.name == CONTRAINER_NAME:
container = i
container.stop()
container.remove()
break
# start the sandbox container
logger.info("start to init container & notebook")
container = client.containers.run(
image=IMAGE_NAME,
command="bash",
name=CONTRAINER_NAME,
ports={"5050/tcp": SANDBOX_SERVER["port"]},
stdin_open=True,
detach=True,
tty=True,
)
# start the notebook server inside the container
exec_command = container.exec_run("bash jupyter_start.sh")
# check whether the notebook has started
retry_nums = 3
while retry_nums>0:
response = requests.get(f"http://localhost:{SANDBOX_SERVER['port']}", timeout=270)
if response.status_code == 200:
logger.info("container & notebook init success")
break
else:
retry_nums -= 1
logger.info(client.containers.list())
logger.info("wait container running ...")
time.sleep(5)
else:
logger.info("启动local的notebook环境支持代码执行")

12
examples/start_webui.sh Normal file
View File

@ -0,0 +1,12 @@
#!/bin/bash
set -e
# python ../dev_opsgpt/service/llm_api.py
# start the standalone sandbox environment
python start_sandbox.py
# python ../dev_opsgpt/service/llm_api.py
streamlit run webui.py

32
examples/stop_sandbox.py Normal file
View File

@ -0,0 +1,32 @@
import docker, sys, os, time, requests
from loguru import logger
src_dir = os.path.join(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(src_dir)
from configs.server_config import CONTRAINER_NAME, SANDBOX_SERVER
if SANDBOX_SERVER["do_remote"]:
# stop and remove the container
client = docker.from_env()
for i in client.containers.list(all=True):
if i.name == CONTRAINER_NAME:
container = i
container.stop()
container.remove()
break
else:
# stop local
import psutil
for process in psutil.process_iter(["pid", "name", "cmdline"]):
# check process name contains "jupyter" and port=xx
if f"port={SANDBOX_SERVER['port']}" in str(process.info["cmdline"]).lower() and \
"jupyter" in process.info['name'].lower():
logger.warning(f"port={SANDBOX_SERVER['port']}, {process.info}")
# terminate the jupyter process
process.terminate()

4
examples/stop_webui.sh Normal file
View File

@ -0,0 +1,4 @@
#!/bin/bash
# stop sandbox
python stop_sandbox.py

82
examples/webui.py Normal file
View File

@ -0,0 +1,82 @@
# How to run:
# 1. Install the required packages: pip install streamlit-option-menu streamlit-chatbox>=1.1.6
# 2. Start the local fastchat service: python server\llm_api.py, or run the corresponding sh script
# 3. Start the API server: python server/api.py. This step can be skipped when api = ApiRequest(no_remote_api=True) is used.
# 4. Start the web UI: streamlit run webui.py --server.port 7860
import os, sys
import streamlit as st
from streamlit_option_menu import option_menu
src_dir = os.path.join(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(src_dir)
from dev_opsgpt.webui import *
from configs import VERSION, LLM_MODEL
api = ApiRequest(base_url="http://127.0.0.1:7861", no_remote_api=True)
if __name__ == "__main__":
st.set_page_config(
"DevOpsGPT-Chat WebUI",
os.path.join("../sources/imgs", "devops-chatbot.png"),
initial_sidebar_state="expanded",
menu_items={
'Get Help': 'https://github.com/lightislost/devopsgpt',
'Report a bug': "https://github.com/lightislost/devopsgpt/issues",
'About': f"""欢迎使用 DevOpsGPT-Chat WebUI {VERSION}"""
}
)
if not chat_box.chat_inited:
st.toast(
f"欢迎使用 [`DevOpsGPT-Chat`](https://github.com/lightislost/devopsgpt) ! \n\n"
f"当前使用模型`{LLM_MODEL}`, 您可以开始提问了."
)
pages = {
"对话": {
"icon": "chat",
"func": dialogue_page,
},
"知识库管理": {
"icon": "hdd-stack",
"func": knowledge_page,
},
# "Prompt管理": {
# "icon": "hdd-stack",
# "func": prompt_page,
# },
}
with st.sidebar:
st.image(
os.path.join(
"../sources/imgs",
"devops-chatbot.png"
),
use_column_width=True
)
st.caption(
f"""<p align="right">当前版本:{VERSION}</p>""",
unsafe_allow_html=True,
)
options = list(pages)
icons = [x["icon"] for x in pages.values()]
default_index = 0
selected_page = option_menu(
"",
options=options,
icons=icons,
# menu_icon="chat-quote",
default_index=default_index,
)
if selected_page in pages:
pages[selected_page]["func"](api)

3
jupyter_start.sh Normal file
View File

@ -0,0 +1,3 @@
#!/bin/bash
nohup jupyter-notebook --NotebookApp.token=mytoken --port=5050 --allow-root --ip=0.0.0.0 --no-browser --ServerApp.disable_check_xsrf=True &

View File

@ -0,0 +1,76 @@
The Carnegie Mellon Pronouncing Dictionary [cmudict.0.7a]
ftp://ftp.cs.cmu.edu/project/speech/dict/
https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.7a
Copyright (C) 1993-2008 Carnegie Mellon University. All rights reserved.
File Format: Each line consists of an uppercased word,
a counter (for alternative pronunciations), and a transcription.
Vowels are marked for stress (1=primary, 2=secondary, 0=no stress).
E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
The dictionary contains 127069 entries. Of these, 119400 words are assigned
a unique pronunciation, 6830 words have two pronunciations, and 839 words have
three or more pronunciations. Many of these are fast-speech variants.
Phonemes: There are 39 phonemes, as shown below:
Phoneme Example Translation Phoneme Example Translation
------- ------- ----------- ------- ------- -----------
AA odd AA D AE at AE T
AH hut HH AH T AO ought AO T
AW cow K AW AY hide HH AY D
B be B IY CH cheese CH IY Z
D dee D IY DH thee DH IY
EH Ed EH D ER hurt HH ER T
EY ate EY T F fee F IY
G green G R IY N HH he HH IY
IH it IH T IY eat IY T
JH gee JH IY K key K IY
L lee L IY M me M IY
N knee N IY NG ping P IH NG
OW oat OW T OY toy T OY
P pee P IY R read R IY D
S sea S IY SH she SH IY
T tea T IY TH theta TH EY T AH
UH hood HH UH D UW two T UW
V vee V IY W we W IY
Y yield Y IY L D Z zee Z IY
ZH seizure S IY ZH ER
(For NLTK, entries have been sorted so that, e.g. FIRE 1 and FIRE 2
are contiguous, and not separated by FIRE'S 1.)
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
The contents of this file are deemed to be source code.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
This work was supported in part by funding from the Defense Advanced
Research Projects Agency, the Office of Naval Research and the National
Science Foundation of the United States of America, and by member
companies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge
the contributions of many volunteers to the expansion and improvement of
this dictionary.
THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND
ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY
NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.
For information about how to use these models, please confer the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
There are pretrained tokenizers for the following languages:
File Language Source Contents Size of training corpus(in tokens) Model contributed by
=======================================================================================================================================================================
czech.pickle Czech Multilingual Corpus 1 (ECI) Lidove Noviny ~345,000 Jan Strunk / Tibor Kiss
Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle Danish Avisdata CD-Rom Ver. 1.1. 1995 Berlingske Tidende ~550,000 Jan Strunk / Tibor Kiss
(Berlingske Avisdata, Copenhagen) Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle Dutch Multilingual Corpus 1 (ECI) De Limburger ~340,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle English Penn Treebank (LDC) Wall Street Journal ~469,000 Jan Strunk / Tibor Kiss
(American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle Estonian University of Tartu, Estonia Eesti Ekspress ~359,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle Finnish Finnish Parole Corpus, Finnish Books and major national ~364,000 Jan Strunk / Tibor Kiss
Text Bank (Suomen Kielen newspapers
Tekstipankki)
Finnish Center for IT Science
(CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle French Multilingual Corpus 1 (ECI) Le Monde ~370,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle German Neue Zürcher Zeitung AG Neue Zürcher Zeitung ~847,000 Jan Strunk / Tibor Kiss
(Switzerland) CD-ROM
(Uses "ss"
instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle Greek Efstathios Stamatatos To Vima (TO BHMA) ~227,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle Italian Multilingual Corpus 1 (ECI) La Stampa, Il Mattino ~312,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle Norwegian Centre for Humanities Bergens Tidende ~479,000 Jan Strunk / Tibor Kiss
(Bokmål and Information Technologies,
Nynorsk) Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle Polish Polish National Corpus Literature, newspapers, etc. ~1,000,000 Krzysztof Langner
(http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle Portuguese CETENFolha Corpus Folha de São Paulo ~321,000 Jan Strunk / Tibor Kiss
(Brazilian) (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle Slovene TRACTOR Delo ~354,000 Jan Strunk / Tibor Kiss
Slovene Academy for Arts
and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle Spanish Multilingual Corpus 1 (ECI) Sur ~353,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle Swedish Multilingual Corpus 1 (ECI) Dagens Nyheter ~339,000 Jan Strunk / Tibor Kiss
(and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle Turkish METU Turkish Corpus Milliyet ~333,000 Jan Strunk / Tibor Kiss
(Türkçe Derlem Projesi)
University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.
---- Training Code ----
# import punkt
import nltk.tokenize.punkt
# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
# Train tokenizer
tokenizer.train(text)
# Dump pickled tokenizer
import pickle
out = open("slovene.pickle","wb")
pickle.dump(tokenizer, out)
out.close()
---------

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.
For information about how to use these models, please confer the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
There are pretrained tokenizers for the following languages:
File Language Source Contents Size of training corpus(in tokens) Model contributed by
=======================================================================================================================================================================
czech.pickle Czech Multilingual Corpus 1 (ECI) Lidove Noviny ~345,000 Jan Strunk / Tibor Kiss
Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle Danish Avisdata CD-Rom Ver. 1.1. 1995 Berlingske Tidende ~550,000 Jan Strunk / Tibor Kiss
(Berlingske Avisdata, Copenhagen) Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle Dutch Multilingual Corpus 1 (ECI) De Limburger ~340,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle English Penn Treebank (LDC) Wall Street Journal ~469,000 Jan Strunk / Tibor Kiss
(American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle Estonian University of Tartu, Estonia Eesti Ekspress ~359,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle Finnish Finnish Parole Corpus, Finnish Books and major national ~364,000 Jan Strunk / Tibor Kiss
Text Bank (Suomen Kielen newspapers
Tekstipankki)
Finnish Center for IT Science
(CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle French Multilingual Corpus 1 (ECI) Le Monde ~370,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle German Neue Zürcher Zeitung AG Neue Zürcher Zeitung ~847,000 Jan Strunk / Tibor Kiss
(Switzerland) CD-ROM
(Uses "ss"
instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle Greek Efstathios Stamatatos To Vima (TO BHMA) ~227,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle Italian Multilingual Corpus 1 (ECI) La Stampa, Il Mattino ~312,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle Norwegian Centre for Humanities Bergens Tidende ~479,000 Jan Strunk / Tibor Kiss
(Bokmål and Information Technologies,
Nynorsk) Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle Polish Polish National Corpus Literature, newspapers, etc. ~1,000,000 Krzysztof Langner
(http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle Portuguese CETENFolha Corpus Folha de São Paulo ~321,000 Jan Strunk / Tibor Kiss
(Brazilian) (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle Slovene TRACTOR Delo ~354,000 Jan Strunk / Tibor Kiss
Slovene Academy for Arts
and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle Spanish Multilingual Corpus 1 (ECI) Sur ~353,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle Swedish Multilingual Corpus 1 (ECI) Dagens Nyheter ~339,000 Jan Strunk / Tibor Kiss
(and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle Turkish METU Turkish Corpus Milliyet ~333,000 Jan Strunk / Tibor Kiss
(Türkçe Derlem Projesi)
University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.
---- Training Code ----
# import punkt
import nltk.tokenize.punkt
# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
# Train tokenizer
tokenizer.train(text)
# Dump pickled tokenizer
import pickle
out = open("slovene.pickle","wb")
pickle.dump(tokenizer, out)
out.close()
---------

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Some files were not shown because too many files have changed in this diff Show More