Table 31-1 Static Capture Mode and Archive Mode

Static Mode: Static mode supports rapid deployment and high-availability scenarios.
Archive Mode: Archive mode is used to maintain copies of websites on a regular basis, for compliance purposes or similar reasons.

In static mode, a crawled site is stored as files ready to be served. Only the latest capture is kept (the previously stored files are overwritten). In archive mode, all crawled sites are kept and stored as zip files (archives) in time-stamped folders. Pointers to the zip files are created in the Site Capture database.

You can initiate static crawl sessions manually from the application interface or after a publishing session. However, you can manage the downloaded sites from the Site Capture file system only. Like static sessions, you can manually initiate archive crawl sessions from the Site Capture interface or after a publishing session. However, because the zip files are referenced by pointers in the Site Capture database, you can manage them from the Site Capture interface: you can download the files, preview the archived sites, and set capture schedules.

For any capture mode, logs are generated after the crawl session to provide such information as crawled URLs, HTTP status, and network conditions. In static capture, you must obtain the logs from the file system; in archive capture, you can download them from the Site Capture interface. For any capture mode, you have the option of configuring crawlers to email reports as soon as they are generated.

Starting any type of site capture process requires you to define a crawler in the Site Capture interface. To help you get started quickly, Site Capture comes with two sample crawlers, Sample and FirstSiteII. This guide assumes the crawlers were installed during the Site Capture installation process, and it uses the Sample crawler primarily. To create your own crawler, you must name the crawler (typically, after the target site) and upload a groovy text file, which controls the site capture process. You must code the groovy file with methods in the BaseConfigurator class that specify, at a minimum, the starting URIs and link extraction logic for the crawler.

Although the groovy file controls the site capture process, the capture mode is set outside the file. To use a crawler for publishing-triggered site capture, you must take an additional step: you must name the crawler and specify its capture mode on the publishing destination definition on the WebCenter Sites source system that is integrated with Site Capture, as described in Configuring Site Capture with the Configurator in Installing and Configuring Oracle WebCenter Sites. (On every publishing destination definition, you can specify one or more crawlers, but only a single capture mode.) Information about the successful start of crawler sessions is stored in the Site Capture file system and in the log files (futuretense.txt, by default) of the WebCenter Sites source and target systems. The exercises in this chapter cover both types of crawler scenarios: manual and publishing-triggered.

Every crawler is controlled by its own groovy file. The file is stored in a custom folder structure. When you define a crawler, Site Capture creates a folder bearing the name of the crawler (Sample, in our scenario) and places that folder in the following path: /fw-site-capture/crawler/. Within the folder, Site Capture creates an /app subfolder, to which it uploads the groovy file from your local computer. When the crawler is used for the first time in a given mode, Site Capture creates additional subfolders to store the sites captured in that mode.

Your sample groovy file specifies a sample starting URI, which you will reset when you create the crawler in the next step. (In addition to the starting URI, you can set crawl depth and similar parameters, call post-crawl commands, and implement interfaces to define logic specific to your target sites.) At this point, you have the option to either customize the downloaded groovy file now, or first create the crawler and then customize its groovy file (which is editable in the Site Capture interface). To do the latter, continue to the next step, Defining a Crawler.

Enter a comma-separated array, as shown in the example below:

Your configuration file includes the createLinkExtractor method, which calls the logic for extracting the links to be crawled. The links are extracted from the markup that is downloaded during the crawl session.
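The groovy configuration described in this section supplies at least the starting URIs and the link extraction logic; the example the text refers to is not included in this excerpt. The sketch below is a hypothetical illustration only: BaseConfigurator and createLinkExtractor are named in the text, while the class name CrawlerConfigurator, the getStartUri method, the PatternLinkExtractor helper, and the URLs are assumptions that may not match your Site Capture release.

```groovy
// Hypothetical crawler configuration sketch -- not a verified Site Capture API.
// BaseConfigurator and createLinkExtractor come from the text above; the other
// names (CrawlerConfigurator, getStartUri, PatternLinkExtractor) are assumed.
class CrawlerConfigurator extends BaseConfigurator {

    // Starting URIs for the crawl, entered as a comma-separated array.
    String[] getStartUri() {
        return ["http://www.example.com/home", "http://www.example.com/products"]
    }

    // Calls the logic for extracting the links to be crawled from the
    // markup that is downloaded during the crawl session.
    LinkExtractor createLinkExtractor() {
        // An assumed regex-based extractor that follows links ending in .html;
        // a site-specific LinkExtractor implementation could be substituted.
        return new PatternLinkExtractor(/['"]([^'"]*\.html)['"]/, 1)
    }
}
```

Returning the starting URIs as an array matches the text's instruction to enter a comma-separated array; the regex extractor is only one plausible way to implement the link extraction logic the createLinkExtractor method is said to call.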
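The crawler folder structure described earlier can be visualized as follows for a crawler named Sample. This layout is inferred from the text; the names of the mode-specific capture subfolders were lost in extraction and are left as an ellipsis rather than guessed.

```
/fw-site-capture/crawler/Sample        created when the crawler is defined
/fw-site-capture/crawler/Sample/app    receives the groovy file uploaded
                                       from your local computer
/fw-site-capture/crawler/Sample/...    subfolders created the first time the
                                       crawler runs in a given capture mode
```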