Required skills:
Python
Required experience:
5-10 years of experience
Job description:
Project ID: [143985]
Main requirements:
1. Crawler: fetch specific XML files from the US Patent Center
2. Text extraction: extract the text content from the XML files → doc
3. NER & NED/NEL: extract knowledge concepts (entities in some classes in Wikidata) from the doc; any of the following approaches may be used:
   1. annotate a training set + fine-tune a NED model (e.g. BLINK, LUKE on GitHub)
   2. annotate a training set + PatentBERT + train NER using DyGIE++ (on GitHub)
   3. spaCy pipeline: annotate a training set + PatentBERT train NER + Wiki NEL
   4. spaCy pipeline: annotate a training set + PatentBERT train NER + entity clustering
   5. joint NER & NED using merged knowledge base
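To make requirement 2 concrete, here is a minimal sketch of the XML-to-doc step using only the Python standard library. The tag names (title, claim, p) and the sample XML are hypothetical placeholders, since the posting does not specify the Patent Center schema; they would need to be replaced with the real element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: the real Patent Center schema is not given here.
TEXT_TAGS = ("title", "claim", "p")


def xml_to_doc(xml_text: str, text_tags=TEXT_TAGS) -> str:
    """Return one line of plain text per matching element, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if element.tag in text_tags:
            # itertext() also picks up text inside nested inline tags such as <b>.
            raw = "".join(element.itertext())
            # Collapse runs of whitespace and newlines into single spaces.
            cleaned = " ".join(raw.split())
            if cleaned:
                lines.append(cleaned)
    return "\n".join(lines)


# Made-up sample input, for illustration only.
SAMPLE = (
    "<application><title>Widget</title>"
    "<claims><claim>A widget comprising a <b>spring</b>.</claim></claims>"
    "</application>"
)
doc = xml_to_doc(SAMPLE)
print(doc)
```

The resulting doc string is what the NER / NED step would take as input. Note that if one listed tag is nested inside another, its text would appear twice, so the tag list should only name non-overlapping elements.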
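Optional requirements:
1. Build a two-mode network of the extracted entities and load it into Neo4j
The language is Python. Most of the models have GitHub repos and papers for reference. Google Cloud GPUs are available. Only the code needs to be delivered; we can run the training ourselves.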
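For the optional two-mode network, one possible reading is a bipartite graph with documents on one side and linked Wikidata entities on the other. The sketch below works under that assumption: the node labels Document and Entity, the relationship type MENTIONS, and the IDs in the sample data are all hypothetical and not part of the posting. It only prepares the edge list and the parameter dictionaries for a Cypher MERGE statement; actually running the statement against Neo4j is left out.

```python
from typing import Dict, List, Set, Tuple

# Assumed schema: (:Document {id})-[:MENTIONS]->(:Entity {qid}).
# MERGE creates each node or relationship only if it does not exist yet.
CYPHER = (
    "MERGE (d:Document {id: $doc_id}) "
    "MERGE (e:Entity {qid: $qid}) "
    "MERGE (d)-[:MENTIONS]->(e)"
)


def two_mode_edges(doc_entities: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Flatten a document -> entity-ID mapping into sorted (doc, entity) edges."""
    return sorted(
        (doc_id, qid)
        for doc_id, qids in doc_entities.items()
        for qid in qids
    )


def cypher_params(edges: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one parameter dictionary per edge for the CYPHER statement."""
    return [{"doc_id": doc_id, "qid": qid} for doc_id, qid in edges]


# Placeholder IDs, not real patent numbers or Wikidata QIDs.
mentions = {"doc-1": {"Q_A", "Q_B"}, "doc-2": {"Q_A"}}
edges = two_mode_edges(mentions)
params = cypher_params(edges)
print(edges)
```

Each dictionary in params can be passed together with CYPHER to whichever Neo4j client is used. Because the statement uses MERGE rather than CREATE, re-running the load should not duplicate nodes or relationships.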