PDF questions and ArchivesData center connections

August 16, 2017 at 7:56 pm #118255

Lynn Yarmey
Member

Hello Archives and Records Professionals IG!
I am hoping you might be able to help me answer a request I just received from a data center friend on archival practices around PDF migration. My apologies if this is a distraction to the group or better addressed elsewhere (?); offline information would be lovely if that is more appropriate.
The broader connection for the IG might be that data center folks, not necessarily trained in archival practices or up to speed in the institutional records environments, also archive more traditional records with their data holdings. Making (additional) connections across these communities would be an amazing thing, imo.
Sincere thanks!
Lynn
————————————————————————-
My basic questions are: (1) how are you creating PDF/A files, (2) are you validating them, and (3) how are you addressing validation errors? (4) Lastly, any thoughts on where I can find answers to similar questions online? I seem to be having zero luck, but I can’t be the only one having these issues.
Background:
We’re in the process of handling the backlog conversion of our repository’s materials into archival versions. We’re a discipline-specific repo, and I’ve worked to ensure that we only accession materials that have a clear migration path, but as you can imagine, it’d be nice if it were only that easy. My current challenge is in working with the 19,000 PDF files in our system (many are quite large).
We’ve been analyzing various tools that can create PDF/A files, comparing Adobe Acrobat, Ghostscript, and ABBYY FineReader on how well they converted a small test set of PDFs and on how well they’d work for larger batches in an automated manner. We then took the resulting PDF/A files and validated them using veraPDF, Adobe’s preflight, and PDFBox’s preflight tools.
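For readers who want to try the Ghostscript-plus-veraPDF combination mentioned above, here is a minimal Python sketch that only builds the two command lines. It is not the poster's workflow. The file names are hypothetical, the choice of PDF/A-2b is an assumption, and the flags should be checked against the documentation of the installed Ghostscript and veraPDF versions. Ghostscript also needs a PDFA_def.ps file edited to point at an ICC profile.

```python
def ghostscript_pdfa_command(src, dst, pdfa_def="PDFA_def.ps", part=2):
    """Build (but do not run) a Ghostscript command line for PDF/A output.

    `pdfa_def` is the PostScript definition file shipped with Ghostscript;
    it has to be edited locally to reference an ICC colour profile.
    """
    return [
        "gs",
        f"-dPDFA={part}",                 # target PDF/A part (assumed: 2)
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",  # assumption: an RGB output intent
        "-dPDFACompatibilityPolicy=1",    # omit features PDF/A forbids, keep going
        f"-sOutputFile={dst}",
        str(pdfa_def),
        str(src),
    ]


def verapdf_command(pdf, flavour="2b"):
    """Build (but do not run) a veraPDF CLI call for one file."""
    return ["verapdf", "--flavour", flavour, str(pdf)]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(ghostscript_pdfa_command("report.pdf", "report_pdfa.pdf")))
    print(" ".join(verapdf_command("report_pdfa.pdf")))
```

Each list can be passed to subprocess.run once the tools are installed. Keeping command construction separate from execution makes the settings easy to review and to log alongside the converted file.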
We’ve also looked at tools like Archivematica, but shied away as they don’t support all of the file formats that we support, and don’t address some of these issues.
Challenges:
In our initial tests a few years ago, none of the tools created valid PDF/A files in all cases. Currently Adobe Acrobat is doing better, but it lacks useful tools for automating hundreds or thousands of conversions. Archivematica uses Ghostscript to convert files, but it seems to be the worst of the bunch in terms of validation errors. We use ABBYY FineReader because of its superior OCR and batch tools, but it also produces errors at the moment. I’ve yet to find a good tool beyond Acrobat, when it works.
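Since no converter succeeds on every file, a batch run over thousands of PDFs needs to survive individual failures and record what happened to each file. The sketch below shows one way to do that in Python. The function name run_batch, the CSV layout, and the three status labels are hypothetical, and the convert and validate callables stand in for whichever tools are actually used.

```python
import csv


def run_batch(pdf_paths, convert, validate, log_path):
    """Convert and validate each file, writing one CSV row per file.

    convert(src) must return the path of the converted file.
    validate(path) must return a (is_valid, detail) pair.
    Any exception from either step is logged and the batch continues.
    """
    counts = {"valid": 0, "invalid": 0, "failed": 0}
    with open(log_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source", "status", "detail"])
        for src in pdf_paths:
            try:
                converted = convert(src)
                ok, detail = validate(converted)
                status = "valid" if ok else "invalid"
            except Exception as exc:  # one bad file should not stop the run
                status, detail = "failed", str(exc)
            counts[status] += 1
            writer.writerow([str(src), status, detail])
    return counts
```

The returned counts give a quick comparison between tools on the same test set, and the CSV log lists the files that need a second look.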

Replies
August 17, 2017 at 10:58 am #131589

Marjan Grootveld
Member

Dear Lynn and colleagues,
Data management at the DANS research data archive includes conversion of document formats such as .doc, .docx, .rtf to PDF/A for long-term preservation. We make use of a Python script for processing the files; depending on the amount we manually check all converted files or a representative sample.
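The "check everything or check a sample" step could be sketched as follows. This is not the DANS script. The threshold, the sample fraction, and the use of a simple random sample as the "representative" sample are all assumptions made for illustration.

```python
import random


def files_to_check(converted, check_all_up_to=50, sample_fraction=0.1, seed=2017):
    """Return the converted files a person should inspect by hand.

    Small batches are checked in full; larger batches get a reproducible
    random sample. All three parameters are hypothetical defaults.
    """
    files = sorted(converted)
    if len(files) <= check_all_up_to:
        return files
    k = max(1, round(len(files) * sample_fraction))
    # A fixed seed lets the same sample be drawn again for an audit.
    return sorted(random.Random(seed).sample(files, k))
```

A stratified sample (for example by source format or file size) may be more representative than a simple random one when the batch is heterogeneous.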
In the words of our preservation officer, Valentijn Gilissen: “We do not convert PDF files to PDF/A. We were looking into options to do so three years ago, but opted against it after consulting with Johan van der Knijff, Digital Preservation Researcher at the National Library of the Netherlands (KB). He argued that PDF/A documents are PDF files with additional restrictions, which means that you will lose properties should the original file contain features that are not allowed by the PDF/A standard.
The type of source material makes a difference: digitized or born-digital material. You are more likely to succeed in conversion of digitized PDF files to PDF/A, but you might argue that the PDF/A standard does not actually offer a huge benefit over the original PDF in those cases. The standard offers more advantages if the PDF is born-digital, but conversion options had always been limited and included the risk of information loss, with little to no options for validation.
The conclusion was to only pursue PDF->PDF/A conversion strategies if the advantages of going for the PDF/A standard are clear, if you are aware of all the risks and if you have a validation strategy.
The majority of PDF files that DANS ingests is digitized material; we tend to receive born-digital documents in Microsoft Word or other formats which we can easily convert to PDF/A. For the born-digital PDF files, we have so far decided to stick with the original PDF because they are robust enough as they are. There is also the matter of the lack of proper tools: there may have been improvement on this front over the past years, but we are not aware of any trustworthy methods.”
Kind regards,
Marjan
Dr Marjan Grootveld
Senior policy officer
+31(0)6 12 10 15 14
Skype: mgrootveld1
***@***.***
Office days Monday – Thursday
DANS: Netherlands Institute for Permanent Access to Digital Research Resources
Anna van Saksenlaan 51 | 2593 HW The Hague | P.O. Box 93067 | 2509 AB The Hague | +31 70 349 44 50 | ***@***.*** | dans.knaw.nl
DANS is an institute of the Dutch Academy KNAW and funding organisation NWO.
August 18, 2017 at 9:07 am #131580

Rebecca Taylor-Grant
Member

Thanks for the response, Marjan!
And just to note, the chairs are happy for the mailing list to be used for this type of query.
Best regards,
Rebecca Grant
Research Data Manager
Open Research Group
Springer Nature
The Campus | Trematon Walk | Wharfdale Road | London N1 9FN
T: +44 (0) 2070144273
***@***.***
http://www.springernature.com
ORCiD: 0000-0002-7614-0806