video-scraper-core - DEV


video-scraper-core

An npm package that provides an abstract class to scrape videos with Puppeteer.

Install

To install video-scraper-core, run:

$ npm install video-scraper-core

Project purpose

This module was written because videos hosted on some websites can be watched only in the browser and are difficult to download. Even with browser tools, downloading the video may be difficult or impossible. One approach that always works is to record the screen while the video plays, but doing this manually is too time-consuming.

This is why I wrote this module, which uses puppeteer and puppeteer-stream under the hood to open a Google Chrome browser, play the video and record it.

The module is written in TypeScript, uses Webpack to reduce the bundle size (even though most of the size comes from the Puppeteer browser), uses euberlog for scoped debug logging and is highly configurable.

How does it work

The module provides an abstract class that you can extend to create your own scraper. By overriding some simple methods, you can adapt the scraper to your needs.

The scraper:

  • launches a browser window
  • loads the url page
  • calls the hook afterPageLoaded, for example if a login is needed
  • if not already specified by the options, parses the video duration
  • if fullscreen is specified by the options, clicks the fullscreen button
  • clicks the play button
  • waits an optional delay specified by the options
  • starts the browser window recording
  • waits until the video is finished / the specified duration is reached
  • waits an optional delay specified by the options
  • closes the browser window and the video is saved
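The conditional steps in the list above can be illustrated with a small planning function. This is only a sketch of the step order described here, not the module's implementation: StepOptions and planSteps are hypothetical names, and the assumption that a delay of 0 skips the wait is mine.

```typescript
// Hypothetical subset of the scraping options that influence which steps run.
interface StepOptions {
    duration: number | null; // null: the duration is parsed from the page
    fullscreen: boolean;
    delayAfterVideoStarted: number; // milliseconds
    delayAfterVideoFinished: number; // milliseconds
}

// Returns, in order, the steps listed above that would run for the given options.
function planSteps(options: StepOptions): string[] {
    const steps: string[] = ['launch browser window', 'load url page', 'call afterPageLoaded hook'];
    if (options.duration === null) {
        steps.push('parse video duration');
    }
    if (options.fullscreen) {
        steps.push('click fullscreen button');
    }
    steps.push('click play button');
    // Assumption: a delay of 0 means the wait is skipped.
    if (options.delayAfterVideoStarted > 0) {
        steps.push(`wait ${options.delayAfterVideoStarted} ms`);
    }
    steps.push('start recording', 'wait for the video duration');
    if (options.delayAfterVideoFinished > 0) {
        steps.push(`wait ${options.delayAfterVideoFinished} ms`);
    }
    steps.push('close browser window and save video');
    return steps;
}

// With the documented defaults, the duration is parsed and fullscreen is skipped.
const defaultPlan = planSteps({
    duration: null,
    fullscreen: false,
    delayAfterVideoStarted: 0,
    delayAfterVideoFinished: 15000
});
console.log(defaultPlan.join('\n'));
```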

Project usage

An example to create a scraper for TumConf:

import { VideoScraperCore, ScrapingOptions, BrowserOptions } from 'video-scraper-core';
import { Page } from 'puppeteer';
import { Logger } from 'euberlog';

// Extend VideoScraperCore to create the scraper class
export class TumConfScraper extends VideoScraperCore {
    // The passcode used to login
    private readonly passcode: string;

    // The constructor that allows the passcode to be specified
    constructor(passcode: string, browserOptions: BrowserOptions) {
        super(browserOptions);
        this.passcode = passcode;
    }

    // The selector of the full screen button
    protected getFullScreenSelector(): string {
        return '.vjs-fullscreen-toggle-control-button';
    }
    // The selector of the play button
    protected getPlayButtonSelector(): string {
        return '.vjs-play-control';
    }
    // The selector of the video time duration
    protected getVideoDurationSelector(): string {
        return '.vjs-time-range-duration';
    }

    // After the page is loaded, login by using puppeteer
    protected async afterPageLoaded(_options: ScrapingOptions, page: Page, logger: Logger): Promise<void> {
        logger.debug('Putting the passcode to access the video');
        await page.waitForSelector('input#password');
        await page.$eval(
            'input#password',
            (el: HTMLInputElement, passcode: string) => (el.value = passcode),
            this.passcode
        );

        logger.debug('Clicking the button to access the video');
        await page.waitForSelector('.btn-primary.submit');
        await page.$eval('.btn-primary.submit', (button: HTMLButtonElement) => button.click());
    }
}

async function main() {
    // Create an instance of the scraper
    const scraper = new TumConfScraper('mypasscode', { debug: true });
    // Launch the Chrome browser
    await scraper.launch();
    // Scrape and save the video
    await scraper.scrape('https://videourl.com', './saved.webm');
    // Close the browser
    await scraper.close();
}
main();

API

The documentation site is: video-scraper-core documentation

The documentation for development site is: video-scraper-core dev documentation

VideoScraperCore

The VideoScraperCore class, which can be extended to scrape a video from a website and save it to a file.

Constructor:

VideoScraperCore(options)

Parameters:

  • options: Optional. A BrowserOptions object that specifies the options for this instance.

Public methods:

  • setBrowserOptions(options: BrowserOptions): void: Changes the browser options with the ones given by the options parameter.
  • launch(): Promise: Launches the browser window.
  • close(): Promise: Closes the browser window.
  • scrape(url: string, destPath: string, options: ScrapingOptions): Promise: Scrapes the video at url and saves it to destPath. The ScrapingOptions are optional.

Protected methods:

  • handleDurationText(durationText: string): number: Given the duration text gotten from the page's HTML (e.g. 1:30:23), it returns the duration in milliseconds.
  • getVideoDuration(page: Page, logger: Logger): Promise: Gets the video duration by parsing the given page.
  • setVideoToFullScreen(page: Page, logger: Logger): Promise: Puts the video in fullscreen by clicking the fullscreen button.
  • playVideo(page: Page, logger: Logger): Promise: Plays the video by clicking the play button.
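As an illustration of the conversion that handleDurationText is described as performing, here is a minimal sketch that turns a text such as 1:30:23 into milliseconds. The function name parseDurationText and the accepted formats (ss, mm:ss, hh:mm:ss) are assumptions; the module's own parsing may differ.

```typescript
// Hypothetical stand-in for handleDurationText: converts "hh:mm:ss", "mm:ss"
// or "ss" into milliseconds. Throws on any other format.
function parseDurationText(durationText: string): number {
    const text = durationText.trim();
    if (!/^\d+(:\d+){0,2}$/.test(text)) {
        throw new Error(`Unsupported duration text: "${durationText}"`);
    }
    // Fold the parts from left to right: each step multiplies the running
    // total by 60 before adding the next, smaller unit.
    const seconds = text
        .split(':')
        .map(part => Number(part))
        .reduce((total, part) => total * 60 + part, 0);
    return seconds * 1000;
}

console.log(parseDurationText('1:30:23')); // 5423000
console.log(parseDurationText('2:05')); // 125000
```

Overriding the real handleDurationText in a subclass would be the place for a site whose player shows the duration in another format.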

Protected and abstract methods:

  • afterPageLoaded(options: ScrapingOptions, page: Page, logger: Logger): Promise: This method is called after the page at the specified url is loaded. It can be used for things such as logging in, if a login is required before reaching the video page.
  • getVideoDurationSelector(): string: Returns the video duration selector, which is used by the method getVideoDuration to extract the video duration text from the page.
  • getFullScreenSelector(): string: Returns the video full screen selector, which is used by the method setVideoToFullScreen to put the video in full screen.
  • getPlayButtonSelector(): string: Returns the video play button selector, which is used by the method playVideo to play the video.

BrowserOptions

The options given to the VideoScraperCore constructor.

Parameters:

  • debug: Default value: false. If true, debug logs are shown.
  • debugScope: Default value: 'VideoScraperCore'. The scope given to the euberlog debug logger.
  • browserExecutablePath: Default value: '/usr/bin/google-chrome'. The path to the browser executable.
  • windowSize: Default value: { width: 1920, height: 1080 }. The size of the browser window.
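To show how the documented defaults combine with user-supplied values, here is a minimal sketch. BrowserOptionsSketch and resolveBrowserOptions are hypothetical names that mirror the list above, and the shallow merge is an assumption about how the module fills in missing options. The executable path in the usage line is also just an example.

```typescript
// Local, hypothetical mirror of the documented BrowserOptions.
interface BrowserOptionsSketch {
    debug: boolean;
    debugScope: string;
    browserExecutablePath: string;
    windowSize: { width: number; height: number };
}

// Default values as documented above.
const DEFAULT_BROWSER_OPTIONS: BrowserOptionsSketch = {
    debug: false,
    debugScope: 'VideoScraperCore',
    browserExecutablePath: '/usr/bin/google-chrome',
    windowSize: { width: 1920, height: 1080 }
};

// Assumption: options that are not given fall back to the defaults (shallow merge).
function resolveBrowserOptions(options: Partial<BrowserOptionsSketch> = {}): BrowserOptionsSketch {
    return { ...DEFAULT_BROWSER_OPTIONS, ...options };
}

// Example: enable debug logs and point to a browser installed elsewhere.
const resolved = resolveBrowserOptions({
    debug: true,
    browserExecutablePath: '/opt/google/chrome/chrome' // example path, adjust to your system
});
console.log(resolved);
```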

ScrapingOptions

The options given to a scrape method.

Parameters:

  • duration: Default value: null. The duration in milliseconds of the recorded video.
  • delayAfterVideoStarted: Default value: 0. The delay in milliseconds to wait after the play button has been clicked.
  • delayAfterVideoFinished: Default value: 15_000. The delay in milliseconds to wait after the duration has elapsed and before the recording is stopped.
  • fullscreen: Default value: false. If true, the video will be put in fullscreen before it is recorded.
  • audio: Default value: true. If true, the audio will be recorded.
  • video: Default value: true. If true, the video will be recorded.
  • mimeType: Default value: 'video/webm'. The mimetype of the recorded video or audio.
  • audioBitsPerSecond: Default value: undefined. The chosen bitrate for the audio component of the media. If not specified, it will be adaptive, depending upon the sample rate and the number of channels.
  • videoBitsPerSecond: Default value: undefined. The chosen bitrate for the video component of the media. If not specified, the rate will be 2.5Mbps.
  • frameSize: Default value: 20. The number of milliseconds to record into each packet.
  • useGlobalDebug: Default value: true. If true, the global logger will be used, ignoring other debug options in this object.
  • debug: Default value: null. If null, whether the debug log is shown depends on the passed BrowserOptions. Otherwise, if useGlobalDebug is false, this specifies whether the debug log is shown.
  • debugScope: Default value: null. If useGlobalDebug is true, this will be ignored. Otherwise, this specifies the euberlog logger scope for the debug log of this scrape.
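The interaction between useGlobalDebug, debug and debugScope can be sketched as follows. The names DebugSettings, ScrapeDebugOptions and resolveDebug are hypothetical, and the fallback of a null debugScope to the scope from BrowserOptions is an assumption, since the description above only covers the null case for debug.

```typescript
// Debug-related settings coming from BrowserOptions (hypothetical local type).
interface DebugSettings {
    debug: boolean;
    debugScope: string;
}

// Debug-related subset of ScrapingOptions (hypothetical local type).
interface ScrapeDebugOptions {
    useGlobalDebug: boolean;
    debug: boolean | null;
    debugScope: string | null;
}

// Decides which debug settings apply to a single scrape.
function resolveDebug(browser: DebugSettings, scrape: ScrapeDebugOptions): DebugSettings {
    // With useGlobalDebug, the per-scrape debug options are ignored.
    if (scrape.useGlobalDebug) {
        return { debug: browser.debug, debugScope: browser.debugScope };
    }
    return {
        // null means "look at the BrowserOptions".
        debug: scrape.debug === null ? browser.debug : scrape.debug,
        // Assumption: a null scope also falls back to the BrowserOptions scope.
        debugScope: scrape.debugScope === null ? browser.debugScope : scrape.debugScope
    };
}

const globalSettings: DebugSettings = { debug: false, debugScope: 'VideoScraperCore' };
console.log(resolveDebug(globalSettings, { useGlobalDebug: false, debug: true, debugScope: 'Lesson 1' }));
```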

Errors

There are also some error classes that can be thrown by this module:

  • VideoScraperCoreError: The base error class of the video-scraper-core module.
  • VideoScraperCoreBrowserNotLaunchedError: The error extending VideoScraperCoreError that is thrown when an action is attempted on a browser that has not been launched.
  • VideoScraperCoreDuringBrowserLaunchError: The error extending VideoScraperCoreError that is thrown when an error occurs during the launch of a browser.
  • VideoScraperCoreDuringBrowserCloseError: The error extending VideoScraperCoreError that is thrown when an error occurs while a browser is being closed.
  • VideoScraperCoreDuringScrapingError: The error extending VideoScraperCoreError that is thrown when an error occurs during a video scraping.
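Because all of these errors extend VideoScraperCoreError, a caller can distinguish them with instanceof. The sketch below uses local stand-in classes with the documented names so that it runs on its own. In real code they would presumably be imported from the package, and their constructor signatures here are assumptions. describeFailure is a hypothetical helper.

```typescript
// Local stand-ins for the documented error classes (constructors are assumed).
class VideoScraperCoreError extends Error {
    constructor(message: string) {
        super(message);
        this.name = new.target.name;
        // Keeps instanceof working when compiling to older targets.
        Object.setPrototypeOf(this, new.target.prototype);
    }
}
class VideoScraperCoreBrowserNotLaunchedError extends VideoScraperCoreError {}
class VideoScraperCoreDuringScrapingError extends VideoScraperCoreError {}

// Hypothetical helper: maps a caught value to a message, most specific class first.
function describeFailure(error: unknown): string {
    if (error instanceof VideoScraperCoreBrowserNotLaunchedError) {
        return 'The browser is not launched: call launch() before scrape().';
    }
    if (error instanceof VideoScraperCoreDuringScrapingError) {
        return `Scraping failed: ${error.message}`;
    }
    if (error instanceof VideoScraperCoreError) {
        return `Scraper error: ${error.message}`;
    }
    return 'Unknown error';
}

try {
    throw new VideoScraperCoreDuringScrapingError('play button not found');
} catch (error) {
    console.log(describeFailure(error));
}
```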

Notes

  • The default browser is Google Chrome at /usr/bin/google-chrome, because Chromium did not support the BBB videos. You can always change the browser executable path in the options.
  • By default (if the duration option is null), the duration of the recording is detected automatically by looking at the vjs player of the page, and a stopping delay of 15 seconds is added.
  • This module can be used only in headful mode.

Projects using this module
