Introduction to the OHSUMED Dataset
1. Introduction to the OHSUMED dataset
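This experiment uses the OHSUMED test collection, which was also used in the filtering track of the 9th Text REtrieval Conference (TREC-9). The collection was built by William Hersh and his colleagues. Its documents come from the medical information database MEDLINE and consist of the titles and/or abstracts from 270 medical journals over the five years from 1987 to 1991, for a total of 348,566 documents. An OHSUMED document consists of 8 fields, with the following meanings: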
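- .I  OHSUMED sequence number of the article, from 1 to 348566
- .U  MEDLINE identifier
- .S  Source of the article
- .M  MeSH index terms
- .T  Title of the article
- .P  Publication type of the article
- .W  Abstract of the article
- .A  Authors of the article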
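The OHSUMED authors also constructed 106 queries for the collection. The queries come from search requests that physicians submitted while treating patients, and each query has two parts: a brief description of the patient's condition and a description of the information needed. An OHSUMED query consists of the following 3 fields: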
- .I  Sequence number of the query, from 1 to 106
- .B  Patient information
- .W  Information request
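Documents and queries share the same dot-tag layout, so one reader can cover both. The following is a minimal sketch in Python, not an official loader. It assumes that a record starts at each `.I` tag, that a tag's value may sit either on the tag line itself or on the lines that follow, and that no content line begins with a tag. The function name `parse_records` and the sample strings are made up for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, List

# A tag line is a dot plus one of the field letters listed above,
# optionally followed by whitespace and a value on the same line.
TAG = re.compile(r"^\.([IUSMTPWAB])(?:\s+(.*))?$")


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, keyed by field letter (assumed layout)."""
    record: Dict[str, List[str]] = {}
    field = None
    for raw in lines:
        line = raw.rstrip()
        match = TAG.match(line)
        if match:
            tag, value = match.group(1), match.group(2)
            # A new .I tag closes the previous record, if there is one.
            if tag == "I" and record:
                yield {k: " ".join(v) for k, v in record.items()}
                record = {}
            field = tag
            record[field] = [value] if value else []
        elif field is not None and line.strip():
            # Continuation line: append it to the field opened last.
            record[field].append(line.strip())
    if record:
        yield {k: " ".join(v) for k, v in record.items()}


# Made-up sample text, not taken from the real collection.
SAMPLE_DOC = """.I 1
.U
00000001
.T
Example title.
.W
First line of an example abstract.
Second line of an example abstract.
.I 2
.U
00000002
.T
Another example title.
"""

SAMPLE_QUERY = """.I 1
.B
example patient description
.W
example information request
"""

docs = list(parse_records(SAMPLE_DOC.splitlines()))
queries = list(parse_records(SAMPLE_QUERY.splitlines()))
print(len(docs), docs[0]["T"], queries[0]["W"])
```

Because the function takes any iterable of lines, an open file handle can be passed directly, which avoids loading a whole yearly file into memory.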
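Based on the document and query collections above, OHSUMED annotated a total of 16,140 query-document pairs. Each pair was labeled as relevant (definitely relevant), partially relevant, or not relevant. The final annotations contain 2,557 relevant, 2,932 partially relevant, and 12,498 not relevant query-document pairs. A document may be given more than one level; in the experiments of this section, the highest level it received is taken as its final label.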
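The rule of keeping the highest level for a pair can be sketched as follows. This is only an illustration under stated assumptions: the judgments are taken to be already loaded as `(query_id, doc_id, label)` triples, the labels use the single-letter codes `d`, `p`, and `n` from the readme below, and the name `highest_labels` and the sample triples are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Ordering of the three judgment levels, from lowest to highest.
RANK = {"n": 0, "p": 1, "d": 2}


def highest_labels(
    judgments: Iterable[Tuple[int, int, str]]
) -> Dict[Tuple[int, int], str]:
    """Keep, for every (query, document) pair, the highest label seen."""
    best: Dict[Tuple[int, int], str] = {}
    for query_id, doc_id, label in judgments:
        key = (query_id, doc_id)
        if key not in best or RANK[label] > RANK[best[key]]:
            best[key] = label
    return best


# Made-up judgments: pair (1, 10) was judged twice with different levels.
sample = [(1, 10, "p"), (1, 10, "d"), (1, 11, "n"), (2, 10, "p"), (2, 10, "n")]
merged = highest_labels(sample)
print(merged)
```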
Here are the files, their uncompressed size, and a description of their content:
1) ohsumed.87 (60,303,307): Contains the MEDLINE documents for the year 1987. The format for each of the MEDLINE document files follows the conventions of the SMART system, with each field defined as below (NLM designator in parentheses):
    .I    Sequential identifier
    .U    MEDLINE identifier (UI)
    .M    Human-assigned MeSH terms (MH)
    .T    Title (TI)
    .P    Publication type (PT)
    .W    Abstract (AB)
    .A    Author (AU)
    .S    Source (SO)
(Note: Some references have their abstracts truncated at 250 words, while some have no abstracts at all.)
2) ohsumed.88 (78,585,929): Contains the MEDLINE documents for the year 1988, formatted as above.
3) ohsumed.89 (84,719,077): Contains the MEDLINE documents for the year 1989, formatted as above.
4) ohsumed.90 (86,754,890): Contains the MEDLINE documents for the year 1990, formatted as above.
5) ohsumed.91 (89,761,122): Contains the MEDLINE documents for the year 1991, formatted as above.
6) queries (11,591): Contains the 106 queries in the test set, with patient and topic information, in the format:
    .I    Sequential identifier
    .B    Patient information
    .W    Information request
7) drel.ui (26,919): Contains the query-document pairs rated as definitely relevant, with documents listed by MEDLINE UI.
8) drel.i (21,709): Contains the query-document pairs rated as definitely relevant, with documents listed by sequential number (from the .I field).
9) pdrel.ui (57,831): Contains the query-document pairs rated as definitely or possibly relevant, with documents listed by MEDLINE UI.
10) pdrel.i (46,664): Contains the query-document pairs rated as definitely or possibly relevant, with documents listed by sequential number (from the .I field).
11) judged (368,366): Contains a list of all documents retrieved by any of the five original searchers or by SMART, sorted first by query number and then by document number, along with their relevance judgments. The relevance judgments are either d (definitely relevant), p (possibly relevant), or n (not relevant). The relevance1 judgment is the original relevance judgment made on the documents retrieved by the original searchers. The relevance2 judgment is a second relevance judgment made to assess the interobserver reliability of the relevance1 judgments. The relevance3 judgment is either the relevance judgment made on documents retrieved by SMART but not by the original searchers, or another relevance judgment on an originally retrieved document to assess interobserver reliability.
12) ui (3,137,094): Contains the MEDLINE UIs for all 348,566 documents in the test database, listed one per line.
13) readme: This file.
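The judged file described in item 11 carries repeated judgments so that interobserver reliability can be assessed. One common way to summarise such agreement is the raw agreement rate together with Cohen's kappa. The sketch below is an assumption-laden illustration: the readme does not say which statistic the OHSUMED authors used, the function name `agreement` is hypothetical, and the input is taken to be a list of `(relevance1, relevance2)` label pairs already extracted from the file.

```python
import math
from collections import Counter
from typing import List, Tuple


def agreement(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for paired labels."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one doubly judged pair")
    # Fraction of pairs on which both judges chose the same label.
    observed = sum(a == b for a, b in pairs) / n
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    # Agreement expected by chance from each judge's label frequencies.
    expected = sum(first[c] * second[c] for c in first) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label throughout: kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)


# Made-up label pairs using the d / p / n codes from the readme.
sample_pairs = [("d", "d"), ("p", "p"), ("n", "n"), ("n", "p")]
observed, kappa = agreement(sample_pairs)
print(f"observed={observed:.3f} kappa={kappa:.3f}")
```

Kappa corrects the raw agreement rate for the agreement two judges would reach by chance, which matters here because most judged documents are not relevant.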
http://ir.ohsu.edu/ohsumed/ohsumed.html