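A complete, production-ready plan for an enterprise-grade entity/relation extraction system (LLM + rules) that plugs directly into the KG + HIC + Path control system described earlier.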

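There is only one goal:

Automate the path from text to knowledge graph (Neo4j), while keeping the result controllable, correctable, and able to evolve.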


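1. Overall system architecture (production-grade)

Data sources (web pages / product listings / reviews / PDFs)
↓
Preprocessing (cleaning / segmentation)
↓
Candidate extraction (LLM) 🔥
↓
Rule validation (Rule Engine) 🔥
↓
Normalization
↓
Confidence scoring
↓
Write to Neo4j
↓
HIC human review (optional)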
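
The preprocessing stage is the only one in this flow that gets no code later in the post, so here is a minimal sketch of it. The function names `clean_text` and `segment`, the tag-stripping regex, and the 500-character chunk size are all assumptions for illustration, not part of the original design.

```python
import re


def clean_text(raw: str) -> str:
    """Strip HTML tags and collapse whitespace (a deliberately crude cleaner)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def segment(text: str, max_chars: int = 500) -> list[str]:
    """Split after sentence-ending punctuation, then pack sentences into chunks."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the LLM separately, which keeps prompts short and makes it easy to trace an extracted relation back to the text it came from.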

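2. Extraction target definition (schema alignment)

👉 This must match your KG schema exactly (otherwise everything downstream breaks)


2.1 Entity types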

Product
Brand
Feature
User
Scenario
Problem
Category

2.2 Relation types

HAS_FEATURE
SUITABLE_FOR
SOLVES
BELONGS_TO
OWNS
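
One way to keep the extractor and the KG aligned is to hold the schema in a single place in code and check every extraction against it. The sketch below does that with the type names listed above; the constant names and the `schema_violations` helper are assumptions for illustration.

```python
ENTITY_TYPES = frozenset(
    {"Product", "Brand", "Feature", "User", "Scenario", "Problem", "Category"}
)
RELATION_TYPES = frozenset(
    {"HAS_FEATURE", "SUITABLE_FOR", "SOLVES", "BELONGS_TO", "OWNS"}
)


def schema_violations(extraction: dict) -> list[str]:
    """Return one message per entity or relation whose type is not in the schema."""
    problems = []
    for e in extraction.get("entities", []):
        if e.get("type") not in ENTITY_TYPES:
            problems.append(f"unknown entity type: {e.get('type')!r}")
    for r in extraction.get("relations", []):
        if r.get("type") not in RELATION_TYPES:
            problems.append(f"unknown relation type: {r.get('type')!r}")
    return problems
```

An empty list means the extraction uses only types the KG knows about.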

3. LLM extraction layer (core)


3.1 Input text (example)

K5 Electric Toothbrush is designed for college students with braces.
It features soft bristles and a 2-minute timer to protect sensitive gums.

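3.2 Prompt design (critical 🔥)

You are an information extraction system.

Task:
Extract from the text:
1. Entities (each must be assigned a type)
2. Relations between entities

Output JSON:

{
  "entities": [
    {"name": "", "type": ""}
  ],
  "relations": [
    {"source": "", "target": "", "type": ""}
  ]
}

Rules:
- Entity types must be one of:
  Product / Feature / User / Problem / Scenario
- Do not output anything unrelated to the task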
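
Generating the prompt from a template, rather than hand-editing it, keeps the allowed type list in the prompt from drifting away from the list the rule engine enforces. This is a minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the trailing "Text:" section is an assumption about how the input is attached.

```python
PROMPT_TEMPLATE = """You are an information extraction system.

Task: extract from the text
1. Entities (each must be assigned a type)
2. Relations between entities

Output JSON:
{{"entities": [{{"name": "", "type": ""}}],
 "relations": [{{"source": "", "target": "", "type": ""}}]}}

Rules:
- Entity types must be one of: {types}
- Do not output anything unrelated to the task

Text:
{text}
"""

DEFAULT_TYPES = ("Product", "Feature", "User", "Problem", "Scenario")


def build_prompt(text: str, entity_types=DEFAULT_TYPES) -> str:
    """Fill the template; doubled braces in the template become literal JSON braces."""
    return PROMPT_TEMPLATE.format(types=" / ".join(entity_types), text=text.strip())
```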


3.3 Output example

{
  "entities": [
    {"name": "K5 Electric Toothbrush", "type": "Product"},
    {"name": "college students", "type": "User"},
    {"name": "braces", "type": "Problem"},
    {"name": "soft bristles", "type": "Feature"},
    {"name": "2-minute timer", "type": "Feature"},
    {"name": "sensitive gums", "type": "Problem"}
  ],
  "relations": [
    {"source": "K5 Electric Toothbrush", "target": "college students", "type": "SUITABLE_FOR"},
    {"source": "K5 Electric Toothbrush", "target": "soft bristles", "type": "HAS_FEATURE"},
    {"source": "soft bristles", "target": "sensitive gums", "type": "SOLVES"}
  ]
}
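
In practice a model may wrap this JSON in extra prose even when told not to, so the reply usually needs defensive parsing before it reaches the rule engine. The sketch below assumes the reply contains exactly one top-level JSON object; `parse_extraction` is a hypothetical helper, not part of any LLM SDK.

```python
import json


def parse_extraction(reply: str) -> dict:
    """Pull the JSON object out of a model reply and drop malformed items."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])

    # Keep only entities that carry both a name and a type
    entities = [
        e for e in (data.get("entities") or [])
        if isinstance(e, dict) and e.get("name") and e.get("type")
    ]
    # Keep only relations that carry all three required fields
    relations = [
        r for r in (data.get("relations") or [])
        if isinstance(r, dict) and all(r.get(k) for k in ("source", "target", "type"))
    ]
    return {"entities": entities, "relations": relations}
```

Replies that contain no JSON at all raise `ValueError`, and invalid JSON raises `json.JSONDecodeError`, so the caller can decide whether to retry or skip the segment.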

4. Rule engine 🔥 the key differentiator

👉 The LLM will make things up; the rules are there to correct and constrain it


4.1 Entity validation rules

ALLOWED_TYPES = {
    "Product", "Feature", "User", "Problem", "Scenario"
}

def validate_entity(e):
    # Keep only entities whose type is in the allowed set
    return e.get("type") in ALLOWED_TYPES


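4.2 Relation validity constraints

# Maps (source type, target type) to the one relation type allowed between them
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES"
}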
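
Besides repairing relation types, the same table can be used to classify each relation before deciding what to do with it. The sketch below is one possible check; the function name `check_relation` and its three return labels are assumptions for illustration.

```python
VALID_RELATIONS = {
    ("Product", "Feature"): "HAS_FEATURE",
    ("Product", "User"): "SUITABLE_FOR",
    ("Feature", "Problem"): "SOLVES",
}


def check_relation(rel: dict, entity_types: dict) -> str:
    """Classify a relation as 'ok', 'wrong_type', or 'illegal_pair'.

    entity_types maps entity name -> entity type.
    """
    pair = (entity_types.get(rel["source"]), entity_types.get(rel["target"]))
    expected = VALID_RELATIONS.get(pair)
    if expected is None:
        # No relation is defined between these two entity types
        return "illegal_pair"
    return "ok" if rel["type"] == expected else "wrong_type"
```

A relation labeled `wrong_type` can be repaired automatically, while an `illegal_pair` (for example Product to Problem) has no valid type to fall back on and is a candidate for dropping or human review.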

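4.3 Relation repair (automatic correction 🔥)

def fix_relation(rel, entities_map):
    # Look up endpoint types; .get avoids a KeyError when an endpoint
    # was dropped during entity validation
    src_type = entities_map.get(rel["source"])
    tgt_type = entities_map.get(rel["target"])
    expected = VALID_RELATIONS.get((src_type, tgt_type))

    # Overwrite the LLM's relation type when the type pair has a known answer
    if expected:
        rel["type"] = expected

    return rel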


4.4 Rule enhancement (business control)

# Force brand priority
def apply_business_rules(entity):
    if entity["name"] == "K5":
        entity["confidence"] = 0.95
    return entity
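
Once there are more than a handful of such rules, hard-coded `if` statements get hard to audit. One option is to keep the overrides in a lookup table, as in this sketch; `CONFIDENCE_OVERRIDES` and `apply_overrides` are hypothetical names, and the table holds only the single rule shown above.

```python
CONFIDENCE_OVERRIDES = {"K5": 0.95}  # entity name -> forced confidence


def apply_overrides(entities, overrides=CONFIDENCE_OVERRIDES):
    """Return copies of the entities with forced confidence values applied."""
    result = []
    for e in entities:
        e = dict(e)  # copy so the caller's data is not mutated
        if e["name"] in overrides:
            e["confidence"] = overrides[e["name"]]
        result.append(e)
    return result
```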

5. Normalization

👉 Solves the synonym problem


5.1 Example

"college students" → "college_student"
"soft toothbrush" → "soft bristles"

5.2 Code

NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles"
}

def normalize(name):
    # Lowercase and trim before lookup; unknown names pass through unchanged
    key = name.strip().lower()
    return NORMALIZATION_MAP.get(key, key)
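
After normalization, several extracted entities can collapse into the same name ("students" and "college students" both become "college_student"), so it usually makes sense to deduplicate afterwards. A minimal sketch, with `normalize_entities` as a hypothetical helper that redefines the map and `normalize` so the block runs on its own:

```python
NORMALIZATION_MAP = {
    "college students": "college_student",
    "students": "college_student",
    "soft toothbrush": "soft bristles",
}


def normalize(name: str) -> str:
    key = name.strip().lower()
    return NORMALIZATION_MAP.get(key, key)


def normalize_entities(entities):
    """Normalize names and keep one entity per (name, type), in first-seen order."""
    seen = {}
    for e in entities:
        key = (normalize(e["name"]), e["type"])
        seen.setdefault(key, {"name": key[0], "type": key[1]})
    return list(seen.values())
```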


6. Confidence scoring (very important)


6.1 Scoring formula

confidence = (
    0.5 * llm_score +
    0.3 * rule_score +
    0.2 * frequency
)

6.2 Example

Source        Score
LLM           0.7
Rule match    1.0
Frequency     0.6

👉 Result:

confidence = 0.5 * 0.7 + 0.3 * 1.0 + 0.2 * 0.6 = 0.77
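
The formula can be wrapped in a small function so the weights live in one place and out-of-range inputs are caught early. This is a sketch under the assumption that all three inputs are already scaled to the range 0 to 1; the name `combined_confidence` is hypothetical.

```python
def combined_confidence(llm_score: float, rule_score: float, frequency: float,
                        weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three signals; every input must lie in [0, 1]."""
    scores = (llm_score, rule_score, frequency)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return sum(w * s for w, s in zip(weights, scores))


# Worked example from the table: 0.35 + 0.30 + 0.12
print(round(combined_confidence(0.7, 1.0, 0.6), 2))
```

Because the weights sum to 1, the result stays in the same 0 to 1 range as the inputs.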

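7. Writing to Neo4j (automated)


7.1 Writing entities

MERGE (p:Product {name: "K5 Electric Toothbrush"})
SET p.confidence = 0.95

7.2 Writing relations

// MERGE on the bare pattern and SET the property afterwards, so a changed
// confidence updates the existing relationship instead of creating a duplicate
MATCH (p:Product {name: "K5 Electric Toothbrush"})
MATCH (f:Feature {name: "soft bristles"})
MERGE (p)-[r:HAS_FEATURE]->(f)
SET r.confidence = 0.9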
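
When these statements are generated from extracted data, values should be passed as query parameters rather than pasted into the query string. Cypher does not allow labels or relationship types to be parameters, so those have to be checked against a whitelist before being interpolated. The sketch below only builds the query text and parameter dict; `entity_query` and `relation_query` are hypothetical helpers, and the result is meant to be handed to whatever driver call you use (for example `session.run(query, params)` in the official Python driver).

```python
ALLOWED_LABELS = {"Product", "Brand", "Feature", "User", "Scenario", "Problem", "Category"}
ALLOWED_REL_TYPES = {"HAS_FEATURE", "SUITABLE_FOR", "SOLVES", "BELONGS_TO", "OWNS"}


def entity_query(entity: dict):
    """Build a parameterized MERGE for one entity."""
    label = entity["type"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label not allowed: {label!r}")
    query = f"MERGE (n:{label} {{name: $name}}) SET n.confidence = $confidence"
    return query, {"name": entity["name"], "confidence": entity["confidence"]}


def relation_query(rel: dict, src_label: str, tgt_label: str):
    """Build a parameterized MATCH/MERGE for one relation."""
    if src_label not in ALLOWED_LABELS or tgt_label not in ALLOWED_LABELS:
        raise ValueError("label not allowed")
    rel_type = rel["type"]
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"relation type not allowed: {rel_type!r}")
    query = (
        f"MATCH (s:{src_label} {{name: $source}}) "
        f"MATCH (t:{tgt_label} {{name: $target}}) "
        f"MERGE (s)-[r:{rel_type}]->(t) "
        "SET r.confidence = $confidence"
    )
    params = {"source": rel["source"], "target": rel["target"],
              "confidence": rel["confidence"]}
    return query, params
```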

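8. Complete pipeline code (core)

def pipeline(text):

    # 1. LLM extraction
    raw = llm_extract(text)

    # 2. Validate entities
    entities = [e for e in raw["entities"] if validate_entity(e)]

    # 3. Build a name -> type index
    entity_map = {e["name"]: e["type"] for e in entities}

    # 4. Repair relations, skipping any whose endpoints failed validation
    relations = [
        fix_relation(r, entity_map)
        for r in raw["relations"]
        if r["source"] in entity_map and r["target"] in entity_map
    ]

    # 5. Normalize entity names and relation endpoints together,
    #    so the names stored on relations still match the entities
    for e in entities:
        e["name"] = normalize(e["name"])
    for r in relations:
        r["source"] = normalize(r["source"])
        r["target"] = normalize(r["target"])

    # 6. Score (fixed placeholder; the scoring formula can replace it)
    for r in relations:
        r["confidence"] = 0.8

    # 7. Write to Neo4j
    save_to_neo4j(entities, relations)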


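9. HIC human intervention points (required)


The admin console can:

✔ Delete incorrect relations
✔ Change entity types
✔ Boost the weight of a product
✔ Lock core relations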
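
To make two of these actions concrete, here is a sketch of how reviewer decisions might be applied to a batch of relations before writing. The data shapes are assumptions: reviewer decisions are modeled as sets of (source, type, target) triples, and "locking" is modeled as a `locked` flag plus a confidence of 1.0. The original does not specify how locking is stored.

```python
def apply_review(relations, deleted=frozenset(), locked=frozenset()):
    """Apply human decisions: drop deleted triples, flag locked ones.

    deleted / locked are sets of (source, type, target) tuples (assumed shape).
    """
    result = []
    for r in relations:
        key = (r["source"], r["type"], r["target"])
        if key in deleted:
            continue  # a reviewer removed this relation
        r = dict(r)   # copy so the input list is left untouched
        if key in locked:
            r["locked"] = True
            r["confidence"] = 1.0  # assumption: locked relations are fully trusted
        result.append(r)
    return result
```

Downstream automation would then need to skip any relation with `locked` set, so later extraction runs cannot overwrite a reviewer's decision.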

10. Key optimizations (where the system pulls ahead)


10.1 Multi-model voting

GPT + Claude + local model → vote
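
A simple form of voting is to keep a relation only when enough models independently produced the same triple. The sketch below assumes each model's output has already been parsed into the JSON shape from section 3.3 and normalized, so that identical relations compare equal; `vote` and `min_votes` are hypothetical names.

```python
from collections import Counter


def vote(extractions, min_votes=2):
    """Keep relations proposed by at least min_votes of the model outputs."""
    counts = Counter()
    for ex in extractions:
        # A set ensures each model contributes at most one vote per triple
        triples = {(r["source"], r["type"], r["target"]) for r in ex["relations"]}
        counts.update(triples)
    return [
        {"source": s, "type": t, "target": o, "votes": n}
        for (s, t, o), n in sorted(counts.items())
        if n >= min_votes
    ]
```

The vote count could also feed the confidence score, for example as the share of models that agreed.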

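10.2 Reinforcing high-frequency relations

if relation_count > 5:
    # Cap at 1.0 so repeated boosts cannot push confidence out of range
    confidence = min(1.0, confidence + 0.1)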
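
To apply this rule, something has to count how often each relation was seen across documents. The sketch below aggregates individual mentions into one record per triple and applies the boost; taking the maximum mention confidence as the base value is an assumption, as are the names `aggregate_mentions`, `threshold`, and `bonus`.

```python
from collections import Counter


def aggregate_mentions(mentions, threshold=5, bonus=0.1):
    """Merge repeated relation mentions and boost those seen more than threshold times."""
    counts = Counter()
    best = {}
    for m in mentions:
        key = (m["source"], m["type"], m["target"])
        counts[key] += 1
        best[key] = max(best.get(key, 0.0), m["confidence"])

    result = []
    for key, n in counts.items():
        conf = best[key]
        if n > threshold:
            conf = min(1.0, conf + bonus)  # capped boost
        result.append({"source": key[0], "type": key[1], "target": key[2],
                       "count": n, "confidence": conf})
    return result
```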

10.3 Blacklist mechanism

Block selected competitors from entering the KG
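
A blacklist filter has to remove both the blocked entities and every relation that touches them, otherwise the write step would reference nodes that no longer exist. A minimal sketch, in which "acmebrush" is a made-up competitor name and `apply_blacklist` is a hypothetical helper:

```python
BLACKLIST = {"acmebrush"}  # hypothetical competitor name, stored lowercased


def apply_blacklist(entities, relations, blacklist=BLACKLIST):
    """Drop blacklisted entities and any relation that references them."""
    blocked = {e["name"] for e in entities
               if e["name"].strip().lower() in blacklist}
    kept_entities = [e for e in entities if e["name"] not in blocked]
    kept_relations = [
        r for r in relations
        if r["source"] not in blocked and r["target"] not in blocked
    ]
    return kept_entities, kept_relations
```

Running this before the Neo4j write keeps blocked names out of the graph entirely, rather than cleaning them up afterwards.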

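11. Final result


Input:

Any web page / review / product description

Output:

Automatically generated:
✔ Product nodes
✔ Feature relations
✔ User suitability relations
✔ Problem-solving paths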

12. The essence in one sentence (very important)

This system = "let AI automatically build a controllable knowledge graph for you"


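Next steps (strongly recommended)

From here you can move straight on to 👇


👉 1️⃣ Batch crawling + extraction (automatic KG construction)

👉 2️⃣ Real-time KG updates (stream processing)

👉 3️⃣ Integration with the recommender system (affecting ranking in real time)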
