Erratum
Redundant extra line in response records
Originally reported by Greg Lindahl.
The WARC files of the August 2018 crawl contain a redundant empty line between the HTTP headers and the payload
of WARC response records. This extra line may cause the following problems when processing the WARC files:
- Because WARC readers/parsers assume only a single empty line, the extracted payload content starts with \r\n. While leading newlines are usually ignored by HTML processors, document parsers for binary formats (PDF, office documents, etc.) are likely to fail.
- The length of the payload in the optional HTTP Content-Length header is off by 2 bytes. This may also cause WARC processors to fail.
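Both problems can be reproduced on a small hand-made HTTP block. The following is a minimal sketch in Python, not a tested fix for any particular WARC library: the function names `split_http_block` and `strip_redundant_crlf` and the sample record are hypothetical and made up for illustration. It assumes the reader splits headers from payload at the first empty line, and that the workaround is applied only to records from the affected crawl, since a payload elsewhere could legitimately begin with \r\n.

```python
def split_http_block(block: bytes) -> tuple[bytes, bytes]:
    """Split an HTTP block into (headers, payload) at the first empty line."""
    head, sep, payload = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/payload separator found")
    return head, payload


def strip_redundant_crlf(payload: bytes) -> bytes:
    """Drop one leading CRLF, the redundant line described in this erratum.

    Assumption: only call this for records of the affected crawl.
    """
    return payload[2:] if payload.startswith(b"\r\n") else payload


# Hypothetical record with the redundant empty line before the payload.
block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/pdf\r\n"
    b"Content-Length: 8\r\n"
    b"\r\n"      # regular separator between headers and payload
    b"\r\n"      # redundant extra line
    b"%PDF-1.4"  # 8 bytes of payload
)

head, payload = split_http_block(block)
fixed = strip_redundant_crlf(payload)

print(len(payload))  # 10: two bytes more than Content-Length states
print(fixed)         # b'%PDF-1.4'
```

Stripping exactly one CRLF, rather than all leading whitespace, keeps the change as small as possible for binary payloads.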
Please see this issue on GitHub for more information. We apologise for this bug!
Affected Crawls
Affected Web Graphs
No items found.