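A complete, enterprise-grade rollout plan for an automatic entity/relation extraction system (LLM + rules) that plugs directly into the KG + HIC + Path control system described earlier.

There is only one goal:

Automate the "text → knowledge graph (Neo4j)" step while keeping it controllable, correctable, and able to evolve.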

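1. Overall System Architecture (Production-Grade)

Data sources (web pages / product listings / reviews / PDFs)

↓

Preprocessing (cleaning / segmentation)

↓

Candidate extraction (LLM) 🔥

↓

Rule validation (Rule Engine) 🔥

↓

Normalization

↓

Confidence scoring

↓

Write to Neo4j

↓

HIC human review (optional)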

2. Extraction Targets (Schema Alignment)

👉 These must match your KG schema exactly (otherwise everything downstream falls apart)

2.1 Entity types

Product

Brand

Feature

User

Scenario

Problem

Category

2.2 Relation types

HAS_FEATURE

SUITABLE_FOR

SOLVES

BELONGS_TO

OWNS
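
Since every later stage has to agree with this schema, it can help to define it once in code and check against that single definition. The following is a minimal sketch, assuming Python enums; the class and function names are illustrative and not from the original system. The article does not say which entity types BELONGS_TO and OWNS connect, so no endpoint constraints are encoded for them here.

```python
from enum import Enum


class EntityType(str, Enum):
    PRODUCT = "Product"
    BRAND = "Brand"
    FEATURE = "Feature"
    USER = "User"
    SCENARIO = "Scenario"
    PROBLEM = "Problem"
    CATEGORY = "Category"


class RelationType(str, Enum):
    HAS_FEATURE = "HAS_FEATURE"
    SUITABLE_FOR = "SUITABLE_FOR"
    SOLVES = "SOLVES"
    BELONGS_TO = "BELONGS_TO"
    OWNS = "OWNS"


def is_known_entity_type(value: str) -> bool:
    """Return True if `value` is one of the schema's entity labels."""
    return value in {t.value for t in EntityType}


def is_known_relation_type(value: str) -> bool:
    """Return True if `value` is one of the schema's relation types."""
    return value in {t.value for t in RelationType}
```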

3. LLM Extraction Layer (Core)

3.1 Input text (example)

K5 Electric Toothbrush is designed for college students with braces.

It features soft bristles and a 2-minute timer to protect sensitive gums.

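3.2 Prompt design (critical 🔥)

You are an information extraction system.

Task:
Extract from the text:
1. Entities (each must be assigned a type)
2. Relations between entities

Output JSON:

{
  "entities": [
    {"name": "", "type": ""}
  ],
  "relations": [
    {"source": "", "target": "", "type": ""}
  ]
}

Rules:
- Use only the following types:
  Product / Feature / User / Problem / Scenario
- Do not output anything unrelated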
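
Even with a strict prompt, a model reply may wrap the JSON in extra prose or a code fence, or omit fields. The sketch below shows one defensive way to parse such a reply before it reaches the rule engine. `parse_llm_output` is a hypothetical helper, not part of the original system, and the actual model call is left out on purpose.

```python
import json


def _as_list(value):
    """Return `value` if it is a list, otherwise an empty list."""
    return value if isinstance(value, list) else []


def parse_llm_output(raw_text: str) -> dict:
    """Pull the outermost JSON object out of a reply and keep only well-formed items."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        return {"entities": [], "relations": []}
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return {"entities": [], "relations": []}

    # Keep entities that carry both a non-empty name and a type.
    entities = [
        e for e in _as_list(data.get("entities"))
        if isinstance(e, dict) and e.get("name") and e.get("type")
    ]
    # Keep relations that carry all three required fields.
    relations = [
        r for r in _as_list(data.get("relations"))
        if isinstance(r, dict) and all(r.get(k) for k in ("source", "target", "type"))
    ]
    return {"entities": entities, "relations": relations}
```

Returning an empty result instead of raising keeps a single bad reply from stopping a batch run; whether that is the right trade-off depends on how failures are monitored.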

3.3 Example output

{
  "entities": [
    {"name": "K5 Electric Toothbrush", "type": "Product"},
    {"name": "college students", "type": "User"},
    {"name": "braces", "type": "Problem"},
    {"name": "soft bristles", "type": "Feature"},
    {"name": "2-minute timer", "type": "Feature"},
    {"name": "sensitive gums", "type": "Problem"}
  ],
  "relations": [
    {"source": "K5 Electric Toothbrush", "target": "college students", "type": "SUITABLE_FOR"},
    {"source": "K5 Electric Toothbrush", "target": "soft bristles", "type": "HAS_FEATURE"},
    {"source": "soft bristles", "target": "sensitive gums", "type": "SOLVES"}
  ]
}

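4. Rule Engine 🔥 (The Key Differentiator)

👉 The LLM will "make things up"; the rules are there to correct and control it.

4.1 Entity validation rules

ALLOWED_TYPES = {
    "Product", "Feature", "User", "Problem", "Scenario"
}


def validate_entity(e):
    # Keep only entities whose type is in the allowed set.
    return e.get("type") in ALLOWED_TYPES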

4.2 Relation legality constraints

# (source type, target type) -> the only relation type allowed between them
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES",
}

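4.3 Relation correction (automatic fixing 🔥)

def fix_relation(rel, entities_map):
    # entities_map maps entity name -> entity type.
    # .get() avoids a KeyError when an endpoint was filtered out during validation.
    src_type = entities_map.get(rel["source"])
    tgt_type = entities_map.get(rel["target"])

    # If the type pair has a known relation, overwrite whatever the LLM produced.
    expected = VALID_RELATIONS.get((src_type, tgt_type))
    if expected:
        rel["type"] = expected

    return rel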
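
Note that this correction only rewrites relations whose type pair appears in the table; a relation between any other pair of types passes through unchanged. If a stricter policy is wanted, one option is to drop such relations. The sketch below illustrates that variant; `fix_or_drop` is a hypothetical name, and dropping (rather than flagging for review) is an assumption, not something the original prescribes.

```python
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES",
}


def fix_or_drop(relations, entity_types):
    """Correct the type of relations with a known endpoint pair; drop the rest.

    `entity_types` maps entity name -> entity type.
    """
    kept = []
    for rel in relations:
        pair = (entity_types.get(rel["source"]), entity_types.get(rel["target"]))
        expected = VALID_RELATIONS.get(pair)
        if expected is None:
            continue  # unknown endpoint or unsupported type pair
        kept.append({**rel, "type": expected})  # copy, do not mutate the input
    return kept


types = {"K5 Electric Toothbrush": "Product", "college students": "User", "braces": "Problem"}
rels = [
    {"source": "K5 Electric Toothbrush", "target": "college students", "type": "HAS_FEATURE"},
    {"source": "college students", "target": "braces", "type": "SOLVES"},
]
cleaned = fix_or_drop(rels, types)
```

Here the first relation is retyped to SUITABLE_FOR and the second is dropped, because (User, Problem) is not in the table.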

4.4 Rule enhancement (business control)

# Force brand priority: pin a high confidence for our own brand
if entity["name"] == "K5":
    entity["confidence"] = 0.95

5. Normalization

👉 Solves the synonym problem

5.1 Examples

"college students" → "college_student"

"soft toothbrush" → "soft bristles"

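5.2 Code

NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles",
}


def normalize(name):
    # Lowercase first so lookups are case-insensitive;
    # names without a mapping pass through lowercased.
    key = name.lower()
    return NORMALIZATION_MAP.get(key, key)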
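
Two things are easy to miss when applying this map: relation endpoints have to be normalized the same way as entity names, or they stop matching, and several surface forms can collapse into one entity that should then be stored only once. A minimal sketch of both, where `normalize_graph` is a hypothetical helper and the whitespace collapsing is an added assumption:

```python
NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles",
}


def normalize(name):
    # Lowercase and collapse whitespace before the lookup.
    key = " ".join(name.lower().split())
    return NORMALIZATION_MAP.get(key, key)


def normalize_graph(entities, relations):
    """Normalize entity names and relation endpoints, merging duplicate entities."""
    seen = set()
    norm_entities = []
    for e in entities:
        name = normalize(e["name"])
        if (name, e["type"]) not in seen:
            seen.add((name, e["type"]))
            norm_entities.append({**e, "name": name})
    norm_relations = [
        {**r, "source": normalize(r["source"]), "target": normalize(r["target"])}
        for r in relations
    ]
    return norm_entities, norm_relations
```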

6. Confidence Scoring System (Very Important)

6.1 Scoring formula

confidence = (
    0.5 * llm_score +
    0.3 * rule_score +
    0.2 * frequency
)

6.2 Example

Source	Score
LLM	0.7
Rule match	1.0
Frequency	0.6

👉 Final:

confidence = 0.5 × 0.7 + 0.3 × 1.0 + 0.2 × 0.6 = 0.77
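
As a check on the arithmetic, plugging the table values into the formula from 6.1 gives 0.5 × 0.7 + 0.3 × 1.0 + 0.2 × 0.6 = 0.35 + 0.30 + 0.12 = 0.77. The sketch below wraps the formula in a function; the name `combined_confidence` and the clamping to [0, 1] are illustrative additions, and the weights are the ones from 6.1.

```python
def combined_confidence(llm_score, rule_score, frequency, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three signals, clamped to the range [0, 1]."""
    w_llm, w_rule, w_freq = weights
    score = w_llm * llm_score + w_rule * rule_score + w_freq * frequency
    return max(0.0, min(1.0, score))


example = round(combined_confidence(0.7, 1.0, 0.6), 2)  # 0.77
```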

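7. Writing to Neo4j (Automated)

7.1 Writing entities

MERGE (p:Product {name: "K5 Electric Toothbrush"})
SET p.confidence = 0.95

7.2 Writing relations

MATCH (p:Product {name: "K5 Electric Toothbrush"})
MATCH (f:Feature {name: "soft bristles"})
// MERGE on the relationship alone, then SET the property, so that a changed
// confidence updates the existing edge instead of creating a second one.
MERGE (p)-[r:HAS_FEATURE]->(f)
SET r.confidence = 0.9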
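
In an automated pipeline the names and scores come from extracted text, so they are better passed as query parameters than pasted into the Cypher string. Labels and relationship types cannot be parameterized in Cypher, which is why the sketch below checks them against a whitelist before formatting them in. The helper names are hypothetical and the sketch only builds the query and parameters; it does not open a database connection.

```python
ALLOWED_LABELS = {"Product", "Brand", "Feature", "User", "Scenario", "Problem", "Category"}
ALLOWED_REL_TYPES = {"HAS_FEATURE", "SUITABLE_FOR", "SOLVES", "BELONGS_TO", "OWNS"}


def build_entity_merge(entity):
    """Return (query, params) that upserts one entity node."""
    label = entity["type"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    query = f"MERGE (n:{label} {{name: $name}}) SET n.confidence = $confidence"
    params = {"name": entity["name"], "confidence": entity.get("confidence", 0.0)}
    return query, params


def build_relation_merge(rel, src_label, tgt_label):
    """Return (query, params) that upserts one relationship between existing nodes."""
    if src_label not in ALLOWED_LABELS or tgt_label not in ALLOWED_LABELS:
        raise ValueError("unknown endpoint label")
    if rel["type"] not in ALLOWED_REL_TYPES:
        raise ValueError(f"unknown relation type: {rel['type']!r}")
    query = (
        f"MATCH (a:{src_label} {{name: $source}}) "
        f"MATCH (b:{tgt_label} {{name: $target}}) "
        f"MERGE (a)-[r:{rel['type']}]->(b) "
        "SET r.confidence = $confidence"
    )
    params = {
        "source": rel["source"],
        "target": rel["target"],
        "confidence": rel.get("confidence", 0.0),
    }
    return query, params
```

With the official Neo4j Python driver, each returned pair could then be handed to a transaction's `run(query, params)` call; the connection setup is left out here because it depends on the deployment.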

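8. Complete Pipeline Code (Core)

def pipeline(text):
    # 1. LLM extraction
    raw = llm_extract(text)

    # 2. Validate entities
    entities = [e for e in raw["entities"] if validate_entity(e)]

    # 3. Normalize entity names and relation endpoints together,
    #    so that relations still point at the renamed entities
    for e in entities:
        e["name"] = normalize(e["name"])
    relations = raw["relations"]
    for r in relations:
        r["source"] = normalize(r["source"])
        r["target"] = normalize(r["target"])

    # 4. Build the name -> type index
    entity_map = {e["name"]: e["type"] for e in entities}

    # 5. Correct relations, skipping any whose endpoints failed validation
    relations = [
        fix_relation(r, entity_map)
        for r in relations
        if r["source"] in entity_map and r["target"] in entity_map
    ]

    # 6. Score (fixed placeholder; the weighted formula is in 6.1)
    for r in relations:
        r["confidence"] = 0.8

    # 7. Write to Neo4j
    save_to_neo4j(entities, relations)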

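9. HIC Human Intervention Points (Required)

In the admin backend, reviewers can:

✔ Delete incorrect relations

✔ Change entity types

✔ Raise the weight of a given product

✔ Lock core relations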
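
How these reviewer decisions feed back into the graph is not spelled out here. One possible shape, shown only as a sketch: keep the decisions as sets of (source, target, type) triples and apply them after scoring. `apply_review` is a hypothetical helper, and pinning locked relations at confidence 1.0 is an assumption about what "locking" should mean.

```python
def apply_review(relations, deleted=frozenset(), locked=frozenset()):
    """Apply reviewer decisions to a list of relation dicts.

    `deleted` and `locked` are sets of (source, target, type) triples.
    Deleted relations are removed; locked ones are marked and pinned at 1.0.
    """
    result = []
    for r in relations:
        key = (r["source"], r["target"], r["type"])
        if key in deleted:
            continue
        if key in locked:
            r = {**r, "confidence": 1.0, "locked": True}
        result.append(r)
    return result
```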

10. Key Optimizations (Where You Pull Ahead)

10.1 Multi-model voting

GPT + Claude + local model → vote
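
One simple way to realize the vote is to keep a relation only if enough models independently produced the same (source, target, type) triple. The sketch below assumes each model's output has already been parsed and normalized into a list of relation dicts; `vote_relations` and the `min_votes` threshold are illustrative choices, not part of the original design.

```python
from collections import Counter


def vote_relations(model_outputs, min_votes=2):
    """Keep relation triples proposed by at least `min_votes` models.

    `model_outputs` is a list with one list of relation dicts per model.
    """
    counts = Counter()
    for relations in model_outputs:
        # A set ensures each model contributes at most one vote per triple.
        triples = {(r["source"], r["target"], r["type"]) for r in relations}
        counts.update(triples)
    return [
        {"source": s, "target": t, "type": ty, "votes": n}
        for (s, t, ty), n in sorted(counts.items())
        if n >= min_votes
    ]
```

The vote count could also serve as an input to the confidence score, for example as the share of models that agreed.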

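10.2 Reinforcing high-frequency relations

if relation_count > 5:
    # Cap at 1.0 so the boosted score stays a valid confidence
    confidence = min(1.0, confidence + 0.1)

10.3 Blacklist mechanism

Keep specified competitors out of the KG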
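
A blacklist of this kind can be applied as a filter just before the write step. The sketch below is one minimal way to do it; `apply_blacklist` is a hypothetical helper, the blocked name is a placeholder, and matching on lowercased exact names is an assumption (a real list would probably need aliases as well).

```python
BLOCKED_NAMES = {"competitor x"}  # placeholder entry, stored lowercased


def apply_blacklist(entities, relations, blocked=BLOCKED_NAMES):
    """Drop blocked entities and every relation that touches one of them."""
    kept_entities = [e for e in entities if e["name"].lower() not in blocked]
    kept_relations = [
        r for r in relations
        if r["source"].lower() not in blocked and r["target"].lower() not in blocked
    ]
    return kept_entities, kept_relations
```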

11. End Result

Input:

Any web page / review / product description

Output:

Automatically generated:

✔ Product nodes

✔ Feature relations

✔ User-fit relations

✔ Problem-solving paths

Author: tsai-spr