I need to crawl pages from a domain using the Screaming Frog app. But the site is huge, more than 2 million pages, so my computer runs out of RAM and crashes around the 400,000-page mark, before I can finish my work.
For this reason, I need to do a more targeted crawl: the correct result is no more than 50 thousand valid pages. So I need the correct expressions for the "exclude" function to achieve this.
On the site there are "user profiles" like this:
[login to view URL]
[login to view URL]
[login to view URL]
...
[login to view URL]
This is what I need: to crawl all of these URLs and export every user-profile URL to an Excel file.
But from each profile I get more than 5 results like the ones below, and I'm only interested in result a). The a) URL is the one I need; b) is a folder, I suppose, and the rest are subfolders. I don't need any of the results from b) onwards:
a) [login to view URL]
b) [login to view URL]
c) [login to view URL]
d) [login to view URL]
e) [login to view URL]
f) [login to view URL]
For example, for case c) I tried exclusion expressions like:
EXCLUDE * Following. *
EXCLUDE [login to view URL]*
EXCLUDE [login to view URL]*/following.*
[login to view URL]
The URL [login to view URL] was not crawled ... THIS IS CORRECT! But the URL [login to view URL] was not crawled either. THIS IS NOT CORRECT.
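A likely cause of this behaviour: as I understand the Screaming Frog user guide, exclude patterns are regexes matched against the complete URL, so a pattern that is too broad (or lacks the right anchoring) will swallow the profile URL itself along with its subfolders. Since the real domain and usernames are redacted above, here is a minimal sketch with hypothetical paths (`example.com/users/alice` etc. are my assumptions, not the real site) showing how to test exclude regexes that drop the "folder" and subfolder URLs while keeping the bare profile URL:

```python
import re

# Hypothetical URLs standing in for the redacted ones in the post:
urls = [
    "https://example.com/users/alice",            # a) profile page -> should be crawled
    "https://example.com/users/alice/",           # b) "folder" version -> exclude
    "https://example.com/users/alice/following",  # c) subfolder -> exclude
    "https://example.com/users/alice/followers",  # d) subfolder -> exclude
]

# Screaming Frog exclude patterns are regexes matched against the full URL,
# which is why the documented examples wrap patterns in .* on both sides.
exclude_patterns = [
    r".*/users/[^/]+/.+",  # anything below the profile slug (has a further path)
    r".*/users/[^/]+/$",   # the trailing-slash "folder" URL itself
]

def is_excluded(url):
    """True if any exclude pattern matches the whole URL."""
    return any(re.fullmatch(p, url) for p in exclude_patterns)

for url in urls:
    print(url, "->", "EXCLUDED" if is_excluded(url) else "crawl")
```

Running the patterns through a small script like this before pasting them into the Exclude configuration makes it easy to see whether the bare profile URL survives.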
The domain is [login to view URL]
I need a Screaming Frog expert who can help me with the correct expressions for this crawl.
Thanks!!
From the Screaming Frog user guide:
screamingfrog.co.uk/seo-spider/user-guide/general/#crawling
You may notice some URLs which are not within the /blog/ sub folder are crawled as well by default. This will be due to the 'check links outside of start folder' configuration. This configuration allows the SEO Spider to focus its crawl within the /blog/ directory, but still crawl links that are not within this directory, when they are linked to from inside it. However, it will not crawl any further onwards. This is useful as you may wish to find broken links that sit within the /blog/ sub folder, but don't have /blog/ within the URL structure. To only crawl URLs with /blog/, simply untick this configuration.
try that ;)