How to read SGML files using Java
I've got a text categorisation test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
The DTD file lewis.dtd is included in the distribution. Following the document type declaration line are the individual Reuters articles, marked up with SGML tags.
My question is: how do I write a Java program that reads those 21578 documents, or transforms them into 21578 separate text files?
Last edited by WXY595; 01-16-2006 at 11:48 AM.
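One possible approach, offered as a rough sketch rather than a definitive answer: since the files are SGML and not well-formed XML, a standard XML parser may reject them, so the example below simply scans each file with a regular expression. It assumes that every article is wrapped in <REUTERS ...> ... </REUTERS> tags (check this against the README and lewis.dtd in your copy of the distribution). The class name ReutersSplitter, the method names, and the doc-00001.txt output naming are made up for illustration. Files are decoded as ISO-8859-1 because that charset accepts any byte sequence without throwing; numeric character references such as &#3; are left untouched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReutersSplitter {

    // Assumption: each article is enclosed in <REUTERS ...> ... </REUTERS>.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS\\b[^>]*>(.*?)</REUTERS>", Pattern.DOTALL);

    // Matches any remaining markup tag inside an article.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the plain text of every article found in the content of one SGML file. */
    public static List<String> splitDocuments(String sgml) {
        List<String> docs = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgml);
        while (m.find()) {
            docs.add(stripMarkup(m.group(1)));
        }
        return docs;
    }

    /** Replaces tags with spaces and decodes the three most common entities. */
    static String stripMarkup(String s) {
        String text = TAG.matcher(s).replaceAll(" ");
        // &amp; is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
        text = text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory holding the .sgm files; args[1]: output directory.
        Path inDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path outDir = Paths.get(args.length > 1 ? args[1] : "out");
        Files.createDirectories(outDir);

        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir, "*.sgm")) {
            for (Path file : files) {
                String sgml = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                for (String doc : splitDocuments(sgml)) {
                    count++;
                    Path out = outDir.resolve(String.format("doc-%05d.txt", count));
                    Files.write(out, doc.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
        System.out.println("Wrote " + count + " documents to " + outDir);
    }
}
```

Run it with something like "java ReutersSplitter path/to/reuters21578 out". Note that DirectoryStream does not guarantee any particular file order, so the output numbering will not necessarily follow the original document order; sort the paths first if that matters. If you only want the article body, or need the topic labels for categorisation, you could apply the same regular-expression idea to the relevant inner tags instead of stripping all markup.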