How to read SGML files using Java
I've got a text categorisation test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
The DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.
My question is how to write a Java program that reads those 21578 documents, or splits them into 21578 separate text files.
Last edited by WXY595; 01-16-2006 at 11:48 AM.
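One possible approach, as a rough sketch rather than a definitive answer: since every article in the collection sits between <REUTERS ...> and </REUTERS> tags, a full SGML parser may not be needed, and a regular expression can cut each file into articles. The class name ReutersSplitter and its helper methods below are made up for illustration. The sketch assumes each opening REUTERS tag carries a NEWID attribute that can serve as the output file name, and that the article text is inside <BODY>...</BODY> (some articles have no BODY, in which case an empty file is written). Reading the files as ISO-8859-1 is also an assumption, chosen so that stray non-UTF-8 bytes do not abort the read.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: splits Reuters-21578 .sgm files into one text file per article.
class ReutersSplitter {
    // One <REUTERS ...> ... </REUTERS> element per article; DOTALL lets '.' span newlines.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS\\b[^>]*>.*?</REUTERS>", Pattern.DOTALL);
    private static final Pattern NEWID = Pattern.compile("NEWID=\"(\\d+)\"");
    private static final Pattern BODY =
            Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    // Returns the raw markup of every article found in one file's contents.
    public static List<String> splitArticles(String sgml) {
        List<String> articles = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgml);
        while (m.find()) {
            articles.add(m.group());
        }
        return articles;
    }

    // Returns the NEWID attribute value of an article (assumed to be present).
    public static String newId(String article) {
        Matcher m = NEWID.matcher(article);
        if (!m.find()) {
            throw new IllegalArgumentException("article has no NEWID attribute");
        }
        return m.group(1);
    }

    // Returns the BODY text with the basic entities decoded, or "" if there is no BODY.
    public static String body(String article) {
        Matcher m = BODY.matcher(article);
        if (!m.find()) {
            return "";
        }
        // Decode &amp; last so that "&amp;lt;" does not turn into "<".
        return m.group(1)
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    // Usage: java ReutersSplitter <directory with .sgm files> <output directory>
    public static void main(String[] args) throws IOException {
        Path inDir = Paths.get(args[0]);
        Path outDir = Paths.get(args[1]);
        Files.createDirectories(outDir);
        for (int i = 0; i <= 21; i++) {
            Path in = inDir.resolve(String.format("reut2-%03d.sgm", i));
            String sgml = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
            for (String article : splitArticles(sgml)) {
                Path out = outDir.resolve(newId(article) + ".txt");
                Files.write(out, body(article).getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```

If the title, date, or topic labels are needed as well, the same pattern idea should extend to the other tags; a real SGML parser driven by lewis.dtd would be the stricter alternative.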