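A complete, enterprise-grade rollout plan for an automatic entity/relation extraction system (LLM + rules) that plugs directly into the KG + HIC + Path control system described earlier.
There is a single goal:
Automate "text → knowledge graph (Neo4j)" while keeping the result controllable, correctable, and able to evolve.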
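1. Overall system architecture (production-grade)
Input text
↓
Preprocessing (cleaning / segmentation)
↓
Candidate extraction (LLM) 🔥
↓
Rule validation (Rule Engine) 🔥
↓
Normalization
↓
Confidence scoring
↓
Write to Neo4j
↓
HIC human review (optional)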
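The preprocessing stage is the only step above with no code elsewhere in this plan, so here is a minimal sketch of it. The function names `clean` and `segment`, the tag-stripping rule, and the 500-character chunk size are all assumptions for illustration, not part of the original design.

```python
import re

def clean(text: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def segment(text: str, max_chars: int = 500) -> list:
    """Split after sentence-ending punctuation, then pack sentences into chunks.

    max_chars is an assumed budget; tune it to the LLM's context window.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the extraction step independently, which keeps prompts short and makes failures easier to retry.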
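2. Extraction target definition (schema alignment)
👉 These must match your KG schema exactly (otherwise everything downstream breaks)
2.1 Entity types
Product
Brand
Feature
User
Scenario
Problem
Category
2.2 Relation types
HAS_FEATURE
SUITABLE_FOR
SOLVES
BELONGS_TO
OWNS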
3. LLM extraction layer (core)
3.1 Input text (example)
It features soft bristles and a 2-minute timer to protect sensitive gums.
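3.2 Prompt design (critical 🔥)
You are an information extraction system.
Task:
Extract the following from the text:
1. Entities (each must be assigned a type)
2. Relations between entities
Output JSON:
{
  "entities": [
    {"name": "", "type": ""}
  ],
  "relations": [
    {"source": "", "target": "", "type": ""}
  ]
}
Rules:
- Use only the following types:
  Product / Feature / User / Problem / Scenario
- Do not output anything unrelated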
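LLM replies often wrap the JSON in extra prose, so the extraction step benefits from a tolerant parser. The sketch below assembles the prompt above and parses a reply; `build_prompt` and `parse_llm_output` are hypothetical helper names, and the actual model call is deliberately left out because the original does not say which LLM API is used.

```python
import json

ENTITY_TYPES = ["Product", "Feature", "User", "Problem", "Scenario"]

PROMPT_HEADER = (
    "You are an information extraction system.\n"
    "Extract entities (each with a type) and the relations between them.\n"
    'Output JSON: {"entities": [{"name": "", "type": ""}], '
    '"relations": [{"source": "", "target": "", "type": ""}]}\n'
)

def build_prompt(text: str) -> str:
    """Assemble the extraction prompt for one chunk of text."""
    rules = (
        "Use only these types: " + " / ".join(ENTITY_TYPES) + "\n"
        "Do not output anything unrelated.\n"
    )
    return PROMPT_HEADER + rules + "Text:\n" + text

def parse_llm_output(raw: str) -> dict:
    """Pull the outermost JSON object out of a reply; return empty lists on failure."""
    empty = {"entities": [], "relations": []}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    return {
        "entities": data.get("entities") or [],
        "relations": data.get("relations") or [],
    }
```

Returning empty lists instead of raising keeps a single malformed reply from stopping a batch run; whether to retry or log such replies is a separate design choice.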
3.3 Output example
{
  "entities": [
    {"name": "K5 Electric Toothbrush", "type": "Product"},
    {"name": "college students", "type": "User"},
    {"name": "braces", "type": "Problem"},
    {"name": "soft bristles", "type": "Feature"},
    {"name": "2-minute timer", "type": "Feature"},
    {"name": "sensitive gums", "type": "Problem"}
  ],
  "relations": [
    {"source": "K5 Electric Toothbrush", "target": "college students", "type": "SUITABLE_FOR"},
    {"source": "K5 Electric Toothbrush", "target": "soft bristles", "type": "HAS_FEATURE"},
    {"source": "soft bristles", "target": "sensitive gums", "type": "SOLVES"}
  ]
}
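4. Rule engine 🔥 the key differentiator
👉 The LLM will make things up; the rules correct and constrain it
4.1 Entity validation rules
ALLOWED_TYPES = {
    "Product", "Feature", "User", "Problem", "Scenario"
}

def validate_entity(e):
    # Reject any entity whose type is not in the allowed set
    return e.get("type") in ALLOWED_TYPES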
4.2 Relation validity constraints
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES"
}
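4.3 Relation correction (automatic error fixing 🔥)
def fix_relation(rel, entities_map):
    # Look up endpoint types; .get avoids a KeyError for entities dropped by validation
    src_type = entities_map.get(rel["source"])
    tgt_type = entities_map.get(rel["target"])
    expected = VALID_RELATIONS.get((src_type, tgt_type))
    # Overwrite the LLM's relation type when the type pair has exactly one legal relation
    if expected:
        rel["type"] = expected
    return rel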
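The correction step leaves a relation untouched when its type pair is not in the table, so illegal pairs can still reach the graph. A stricter variant, sketched here as an assumption rather than the original design, drops such relations instead. `fix_or_drop` is a hypothetical name.

```python
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES",
}

def fix_or_drop(rel, entity_types):
    """Return a corrected copy of rel, or None when the type pair is not allowed."""
    src_type = entity_types.get(rel["source"])
    tgt_type = entity_types.get(rel["target"])
    expected = VALID_RELATIONS.get((src_type, tgt_type))
    if expected is None:
        return None
    # Copy rather than mutate so the raw LLM output stays available for auditing
    return {**rel, "type": expected}

entity_types = {
    "soft bristles": "Feature",
    "sensitive gums": "Problem",
    "college students": "User",
}
wrong = {"source": "soft bristles", "target": "sensitive gums", "type": "SUITABLE_FOR"}
fixed = fix_or_drop(wrong, entity_types)  # type becomes "SOLVES"
```

Dropped relations are good candidates for the human review queue, since they may point to a gap in the constraint table rather than an LLM mistake.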
4.4 Rule enhancement (business control)
if entity["name"] == "K5":
    entity["confidence"] = 0.95
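Hard-coded `if` statements get unwieldy as business rules accumulate. One possible arrangement, assumed here and not prescribed by the original, is to keep the overrides in a lookup table that reviewers can edit. `CONFIDENCE_OVERRIDES` and `apply_business_rules` are hypothetical names.

```python
# Hypothetical override table; in practice it could be loaded from the admin console
CONFIDENCE_OVERRIDES = {
    "K5": 0.95,
}

def apply_business_rules(entity, overrides=CONFIDENCE_OVERRIDES):
    """Return a copy of entity with its confidence pinned when an override exists."""
    pinned = overrides.get(entity["name"])
    if pinned is None:
        return entity
    return {**entity, "confidence": pinned}
```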
5. Normalization
👉 Solves the synonym problem
5.1 Example
"soft toothbrush" → "soft bristles"
5.2 Code
NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles"
}

def normalize(name):
    # Map known synonyms to their canonical form; otherwise just lowercase the name
    return NORMALIZATION_MAP.get(name.lower(), name.lower())
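Normalizing entity names alone is not enough: relation endpoints must be normalized the same way, and synonyms that collapse to one name should become one node. The sketch below shows one way to do both; `normalize_graph` and the whitespace collapsing are assumptions added for illustration.

```python
NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles",
}

def normalize(name):
    """Lowercase, collapse whitespace, then map known synonyms to a canonical name."""
    key = " ".join(name.lower().split())
    return NORMALIZATION_MAP.get(key, key)

def normalize_graph(entities, relations):
    """Normalize entity names and relation endpoints, dropping duplicate entities."""
    seen, unique_entities = set(), []
    for entity in entities:
        item = {**entity, "name": normalize(entity["name"])}
        key = (item["name"], item["type"])
        if key not in seen:
            seen.add(key)
            unique_entities.append(item)
    fixed_relations = [
        {**r, "source": normalize(r["source"]), "target": normalize(r["target"])}
        for r in relations
    ]
    return unique_entities, fixed_relations
```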
6. Confidence scoring (critical)
6.1 Scoring formula
score = (
    0.5 * llm_score +
    0.3 * rule_score +
    0.2 * frequency
)
6.2 Example
| Source | Score |
|---|---|
| LLM | 0.7 |
| Rule match | 1.0 |
| Frequency | 0.6 |
👉 Final: 0.5 × 0.7 + 0.3 × 1.0 + 0.2 × 0.6 = 0.77
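The weighted sum can be wrapped in a small function. This is a sketch under the assumption that all three inputs are already scaled to the range 0 to 1; the clamping and rounding are additions for safety, not part of the original formula.

```python
def confidence_score(llm_score, rule_score, frequency):
    """Weighted sum with the 0.5 / 0.3 / 0.2 weights from the scoring formula."""
    score = 0.5 * llm_score + 0.3 * rule_score + 0.2 * frequency
    # Clamp to [0, 1] in case an input falls outside the expected range
    return round(min(max(score, 0.0), 1.0), 4)

example = confidence_score(llm_score=0.7, rule_score=1.0, frequency=0.6)
```

With the values from the table, the three terms are 0.35, 0.30, and 0.12.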
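7. Writing to Neo4j (automated)
7.1 Writing entities
MERGE (p:Product {name: "K5 Electric Toothbrush"})
SET p.confidence = 0.95
7.2 Writing relations
MATCH (p:Product {name: "K5 Electric Toothbrush"})
MATCH (f:Feature {name: "soft bristles"})
MERGE (p)-[r:HAS_FEATURE]->(f)
SET r.confidence = 0.9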
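To run these statements from the pipeline, the values should be passed as query parameters rather than pasted into the Cypher text. Labels and relationship types cannot be parameterized in Cypher, so the sketch below interpolates them only after checking them against a whitelist. `save_entity` and `save_relation` are hypothetical helpers; `tx` is assumed to be any object with a `run(query, **params)` method, such as a transaction from the official Neo4j Python driver.

```python
ALLOWED_LABELS = {"Product", "Feature", "User", "Problem", "Scenario"}
ALLOWED_REL_TYPES = {"HAS_FEATURE", "SUITABLE_FOR", "SOLVES"}

def save_entity(tx, entity):
    """MERGE one node by name and set its confidence."""
    label = entity["type"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    tx.run(
        f"MERGE (n:{label} {{name: $name}}) SET n.confidence = $confidence",
        name=entity["name"],
        confidence=entity.get("confidence", 0.0),
    )

def save_relation(tx, rel, entity_types):
    """MERGE one relationship between two existing nodes."""
    src_label = entity_types[rel["source"]]
    tgt_label = entity_types[rel["target"]]
    rel_type = rel["type"]
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"unexpected relation type: {rel_type}")
    if src_label not in ALLOWED_LABELS or tgt_label not in ALLOWED_LABELS:
        raise ValueError("unexpected endpoint label")
    tx.run(
        f"MATCH (a:{src_label} {{name: $source}}) "
        f"MATCH (b:{tgt_label} {{name: $target}}) "
        f"MERGE (a)-[r:{rel_type}]->(b) "
        "SET r.confidence = $confidence",
        source=rel["source"],
        target=rel["target"],
        confidence=rel.get("confidence", 0.0),
    )
```

Setting the confidence with `SET` after the `MERGE`, instead of placing it inside the relationship pattern, means a re-run with a new score updates the existing relationship rather than creating a second one.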
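8. Full pipeline code (core)
def pipeline(text):
    # 1. LLM extraction
    raw = llm_extract(text)
    # 2. Validate entities
    entities = [e for e in raw["entities"] if validate_entity(e)]
    # 3. Build a name -> type index
    entity_map = {e["name"]: e["type"] for e in entities}
    # 4. Fix relations, keeping only those whose endpoints both passed validation
    relations = [
        fix_relation(r, entity_map)
        for r in raw["relations"]
        if r["source"] in entity_map and r["target"] in entity_map
    ]
    # 5. Normalize entity names and relation endpoints so they still match
    for e in entities:
        e["name"] = normalize(e["name"])
    for r in relations:
        r["source"] = normalize(r["source"])
        r["target"] = normalize(r["target"])
    # 6. Score (placeholder constant; swap in the formula from section 6)
    for r in relations:
        r["confidence"] = 0.8
    # 7. Write to Neo4j
    save_to_neo4j(entities, relations)
    return entities, relations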
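9. HIC human intervention points (required)
The admin console supports:
✔ Changing an entity's type
✔ Raising the weight of a product
✔ Locking core relations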
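One way the pipeline could cooperate with these intervention points is to skip relations a reviewer has locked and to queue low-confidence relations for review instead of writing them. The sketch below is an assumption about how that might look; `route_for_review` and the 0.6 threshold are hypothetical.

```python
REVIEW_THRESHOLD = 0.6  # assumed cut-off; tune it against reviewer capacity

def triple(rel):
    """Identity of a relation, independent of its confidence."""
    return (rel["source"], rel["type"], rel["target"])

def route_for_review(relations, locked_triples, threshold=REVIEW_THRESHOLD):
    """Split relations into auto-accepted and needs-review; locked ones are skipped."""
    accepted, needs_review = [], []
    for rel in relations:
        if triple(rel) in locked_triples:
            continue  # a reviewer locked this relation; the pipeline must not touch it
        if rel.get("confidence", 0.0) < threshold:
            needs_review.append(rel)
        else:
            accepted.append(rel)
    return accepted, needs_review
```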
10. Key optimizations (where you pull ahead)
10.1 Multi-model voting
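A simple form of multi-model voting is to run the same text through several models and keep only the relations that enough of them agree on. The sketch below assumes each model's output has already been parsed and normalized into a list of relation dicts; `vote_relations` and the `min_votes` default are hypothetical.

```python
from collections import Counter

def vote_relations(model_outputs, min_votes=2):
    """Keep relation triples proposed by at least min_votes models."""
    counts = Counter()
    for relations in model_outputs:
        # A set ensures each model votes at most once per triple
        counts.update({(r["source"], r["type"], r["target"]) for r in relations})
    return [
        {"source": s, "type": t, "target": g, "votes": n}
        for (s, t, g), n in sorted(counts.items())
        if n >= min_votes
    ]
```

The vote count can also feed the confidence score, for example as the share of models that agreed.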
10.2 High-frequency relation reinforcement
confidence += 0.1
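A sketch of how the boost might be applied: the 0.1 increment comes from the line above, while the occurrence threshold of 3 and the cap at 1.0 are assumptions added so repeated boosts cannot push a score out of range. `reinforce` is a hypothetical name.

```python
def reinforce(confidence, occurrences, min_occurrences=3, boost=0.1):
    """Raise confidence for relations seen often enough, capped at 1.0."""
    if occurrences >= min_occurrences:
        confidence = min(1.0, confidence + boost)
    return confidence
```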
10.3 Blacklist mechanism
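A blacklist can be as simple as a set of entity names that should never enter the graph, applied together with a cleanup of any relation that touches them. The entries below and the name `apply_blacklist` are hypothetical; real entries would come from review feedback.

```python
# Hypothetical entries: vague mentions the LLM tends to extract as entities
BLACKLIST = {"it", "this product", "best choice"}

def apply_blacklist(entities, relations, blacklist=BLACKLIST):
    """Drop blacklisted entities and every relation with a dropped endpoint."""
    kept = [e for e in entities if e["name"].strip().lower() not in blacklist]
    kept_names = {e["name"] for e in kept}
    kept_relations = [
        r for r in relations
        if r["source"] in kept_names and r["target"] in kept_names
    ]
    return kept, kept_relations
```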
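11. End result
Input:
Output:
✔ Product nodes
✔ Feature relations
✔ User suitability relations
✔ Problem-solving paths
12. The essence in one sentence (very important)
This system = "let AI automatically build a controllable knowledge graph for you"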
Next step (strongly recommended)
You can now move straight on to 👇