Puppeteer

基于 Puppeteer 的编程控制

Puppeteer 是 Google
通过 headless 参数来指定是否启用 Headless 模式,默认情况下是启用的。此外,在我们使用 npm 安装 Puppeteer 的时候其会自动下载指定版本的 Chromium 从而保证接口的开箱即用性,也可以通过 executablePath 参数指定启动版本:
1
const browser = await puppeteer.launch({ headless: false }); // default is true
2
3
const browser = await puppeteer.launch({ executablePath: '/path/to/Chrome' });
Copied!
在大规模部署的情况下,我们需要控制 Puppeteer 连接到远端的服务化方式部署的 Headless Chrome 集群,此时就可以使用 connect 函数连接到 Headless Chrome 实例:
1
puppeteer.connect({
2
browserWSEndpoint:
3
'ws://{remoteip}:9222/devtools/browser/fa60c034-422d-4f2c-bbeb-17a2cfd690f2'
4
});
Copied!
1
import { launch } from 'puppeteer';
2
(async () => {
3
const browser = await launch({ headless: false });
4
const page = await browser.newPage();
5
await page.goto('https://example.com', { waitUntil: 'networkidle' });
6
await page.addScriptTag({
7
url: 'https://code.jquery.com/jquery-3.2.1.min.js'
8
});
9
await page.close();
10
await browser.close();
11
})();
Copied!

动态渲染

动态代理

1
const puppeteer = require('puppeteer');
2
3
(async () => {
4
const browser = await puppeteer.launch({
5
// Launch chromium using a proxy server on port 9876.
6
// More on proxying:
7
// https://www.chromium.org/developers/design-documents/network-settings
8
args: ['--proxy-server=127.0.0.1:9876']
9
});
10
11
//加隧道代理 加headers头即可
12
await page.setExtraHTTPHeaders({
13
'Proxy-Authorization':
14
'Basic ' + Buffer.from(`${username}:${password}`).toString('base64')
15
});
16
17
const page = await browser.newPage();
18
await page.goto('https://google.com');
19
await browser.close();
20
})();
Copied!

页面操作

脚本执行

1
const puppeteer = require('puppeteer');
2
3
(async () => {
4
const browser = await puppeteer.launch();
5
6
const page = await browser.newPage();
7
8
await page.goto('https://example.com'); // Get the "viewport" of the page, as reported by the page.
9
10
const dimensions = await page.evaluate(() => {
11
return {
12
width: document.documentElement.clientWidth,
13
14
height: document.documentElement.clientHeight,
15
16
deviceScaleFactor: window.devicePixelRatio
17
};
18
});
19
20
console.log('Dimensions:', dimensions);
21
22
await browser.close();
23
})();
Copied!
如果需要传递参数,则在 evaluate 的后续参数传入需要传入的参数:
1
const links = await page.evaluate(evalVar => {
2
console.log(evalVar); // should be defined now
3
// ...
4
}, evalVar);
Copied!
在 Puppeteer 中我们还可以添加外部的脚本执行操作:
1
const puppeteer = require('puppeteer');
2
3
(async () => {
4
const browser = await puppeteer.launch({ headless: false });
5
const page = await browser.newPage();
6
await page.goto('https://google.com');
7
await page.addScriptTag({
8
url: 'https://rawgithub.com/marmelab/gremlins.js/master/gremlins.min.js'
9
});
10
await page.evaluate(() => {
11
window.gremlins.createHorde().unleash();
12
});
13
})();
Copied!

页面保存

1
const puppeteer = require('puppeteer');
2
3
(async () => {
4
const browser = await puppeteer.launch();
5
6
const page = await browser.newPage();
7
8
await page.goto('https://news.ycombinator.com', { waitUntil: 'networkidle' });
9
10
await page.pdf({ path: 'hn.pdf', format: 'A4' });
11
12
await browser.close();
13
})();
Copied!

监听网页请求

1
const puppeteer = require('puppeteer');
2
3
puppeteer.launch().then(async browser => {
4
const page = await browser.newPage();
5
await page.setRequestInterception(true);
6
page.on('request', interceptedRequest => {
7
if (
8
interceptedRequest.url().endsWith('.png') ||
9
interceptedRequest.url().endsWith('.jpg')
10
)
11
interceptedRequest.abort();
12
else interceptedRequest.continue();
13
});
14
await page.goto('https://example.com');
15
await browser.close();
16
});
Copied!

端到端测试

Last modified 2yr ago