Fixing garbled text when scraping GBK pages with got

Posted by 松一老贼 on March 17, 2022

Official documentation

responseType

Type: 'text' | 'json' | 'buffer'

Default: 'text'

The parsing method.

The promise also has .text(), .json() and .buffer() methods which return another Got promise for the parsed body. It’s like setting the options to {responseType: 'json', resolveBodyOnly: true} but without affecting the main Got promise.

import got from 'got';

const responsePromise = got('https://httpbin.org/anything');
const bufferPromise = responsePromise.buffer();
const jsonPromise = responsePromise.json();

const [response, buffer, json] = await Promise.all([responsePromise, bufferPromise, jsonPromise]);
// `response` is an instance of Got Response
// `buffer` is an instance of Buffer
// `json` is an object

Note:

  • When using streams, this option is ignored.

Note:

  • 'buffer' will return the raw body buffer. Any modifications will also alter the result of .text() and .json(). Before overwriting the buffer, please copy it first via Buffer.from(buffer). See nodejs/node#27080
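The copy-before-modify advice in the note can be illustrated with plain Node.js buffers. This is only a sketch: the variable names are made up for illustration and no got request is involved. Buffer.from(buffer) allocates new memory, while subarray() returns a view over the same memory.

```javascript
// Hypothetical stand-in for a raw response body.
const body = Buffer.from('hello', 'utf8');

// Buffer.from(buffer) copies the bytes, so edits to the copy stay local.
const safeCopy = Buffer.from(body);
safeCopy[0] = 0x6a; // 'j'
console.log(body.toString('utf8'));     // hello
console.log(safeCopy.toString('utf8')); // jello

// subarray() shares memory with the original, so edits leak through.
const otherBody = Buffer.from('hello', 'utf8');
const view = otherBody.subarray(0);
view[0] = 0x6a; // 'j'
console.log(otherBody.toString('utf8')); // jello
```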

A concrete case

I ran into this problem while using RSSHub to scrape a GBK-encoded page:

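// `got` and `rootUrl` come from the surrounding RSSHub route.
const cheerio = require('cheerio');
const iconvLite = require('iconv-lite');

const data = await got({
  method: 'get',
  url: rootUrl,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
  },
  // Ask for the body as a Buffer. The bytes must still be raw when they are returned:
  // once got has decoded them as UTF-8, decoding the result again as GBK is meaningless.
  responseType: 'buffer',
});
// Decode the raw bytes as GBK with the iconv-lite library, then parse the HTML.
const $ = cheerio.load(iconvLite.decode(Buffer.from(data.rawBody), 'gbk'));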
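Why the body has to stay a buffer can be checked without any network request. The sketch below hard-codes the GBK bytes for "中文" and uses Node's built-in TextDecoder instead of iconv-lite. That is an assumption about the runtime: the 'gbk' label is only available when Node is built with full ICU, which is the default for official builds. The helper name decodeGbk is made up for illustration.

```javascript
// "中文" encoded as GBK: 中 = D6 D0, 文 = CE C4.
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Wrong order: treat the bytes as UTF-8 first. Every invalid sequence is
// replaced with U+FFFD, so the original bytes cannot be recovered afterwards.
const wrongText = gbkBytes.toString('utf8');
const roundTripped = Buffer.from(wrongText, 'utf8');
console.log(wrongText.includes('\uFFFD'));  // true
console.log(roundTripped.equals(gbkBytes)); // false

// Right order: decode the untouched bytes as GBK.
// Assumes a Node build with full ICU; otherwise 'gbk' throws a RangeError.
function decodeGbk(bytes) {
  return new TextDecoder('gbk').decode(bytes);
}
console.log(decodeGbk(gbkBytes)); // 中文
```

Once the replacement characters appear, the information is already lost, which is why responseType: 'buffer' has to be set on the request itself and not patched up afterwards.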