Automate Searching Websites For Emails, Social Media, Crypto, and Phone Numbers

As the title says, this post is about finding all sorts of useful data in websites.

At https://github.com/tomcaliendo/e-scraper there are 5 Python scripts:

1 – main.py – this script takes any URL in the included “domains.txt” file and creates two txt documents: one with all of the website’s domains and subdomains, and a second with all of the scraped email addresses.

2 – urlstxt_15cryptos_check.py – this one takes any URLs from the included text doc “urls.txt” and searches them for cryptocurrency addresses. It only checks the URLs given; it will not seek out any subdomains. It is set up to match the specific address formats of the 15 most popular cryptocurrencies.

3 – fifteensoc_intake_urlsdoc.py – searches for accounts that fit the criteria of each of the 15 most popular social media platforms.

4 – top_20_country_phone_numbers.py – checks for phone numbers that fit the formats of the 20 countries with the most phone numbers.

5 – USphone_search.py – searches for only US numbers, without looking for the US country code (not throwing shade at the rest of the world, I just happen to live in the US).

Scrape Email Addresses From Website

The process will find every page in a website and scrape email addresses from the content and source code. It is based on the Python script at https://github.com/0x4D-5A/e-scraper/tree/main, whose author deserves credit.

The script we will use is slightly different. I took that script and added one feature that creates a document with all of the website’s subdomains.
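The core extraction step can be sketched in a few lines. This is not the repo’s actual code, just a minimal illustration of the idea: fetch a page and pull out anything that matches an email pattern.

```python
import re
import urllib.request

# A simplified email pattern; the real script may use something stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return the unique email addresses found in page text or source code."""
    return sorted(set(EMAIL_RE.findall(text)))

def scrape_page(url):
    """Fetch one page and extract emails from its raw HTML."""
    with urllib.request.urlopen(url) as resp:
        return extract_emails(resp.read().decode("utf-8", errors="ignore"))
```

The full script also crawls every internal link it finds, which is how it builds the list of the website’s pages and subdomains.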

Before starting you need to have a github account and a gitpod.io account. To save time, just sign up for github and then gitpod.io will let you login with your github account.

1 – log into github and then in a second tab log into gitpod.io. Then in a third tab go to:

gitpod.io/#https://github.com/tomcaliendo/e-scraper

2 – Find where there is a text document named “domains.txt” in the top left corner. Click on it to open and put the target website’s domain in the file. Then hold down “ctrl” and hit “s” to save

In this example I put “search-ish.com”

SIDE NOTE: when you input the target domain/URL, make sure you do not include http:// or https://

If the domain has a www. then include that. Search-ish.com is typically rendered “https://search-ish.com” (though the site still loads if you type “www.search-ish.com”).

3 – Find at the bottom of the screen under terminal where it says “gitpod /workspace/e-scraper (main) $”

To the right, type in the following command and hit enter:
python main.py

Then, all of the website’s domains and subdomains will start listing out in the Terminal

Also, two documents will be created in the file list on the left side of the screen: one with the scraped email addresses and one with all of the website’s pages listed out

Scraping Crypto Addresses

Now we run a script that will look for addresses from any of the (as of this writing) 15 most popular cryptocurrencies.

1 – on the left side you will see a doc named “urls.txt”

Open it and put in any url of a webpage you want to search. You can choose one, or you can copy and paste all of the urls listed in the newly created file “search-ish.com…_internal_pages.txt”

I pasted in “https://search-ish.com/2023/01/08/hunch-ly-basics/”

then type the command:

python urlstxt_15cryptos_check.py

Hit enter and you will see the following lines appear below

Searching https://search-ish.com/2023/01/08/hunch-ly-basics/ for cryptocurrency addresses…
Findings saved to findings.txt

There is now a document on the left named “findings.txt”

Click on the document and you will see the following text:

URL: https://search-ish.com/2023/01/08/hunch-ly-basics/
Cryptocurrency addresses found:
Address: bc1q34aq5drpuwy3wg191hup9892qp6svr81dzyy7c (Type: Bitcoin)
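Under the hood, this kind of check is regex matching against each currency’s known address format. Here is a minimal sketch using two Bitcoin patterns; the actual script covers 15 currencies, and my patterns here are simplifications (real validation also involves checksums).

```python
import re

# Illustrative patterns only: legacy (base58) and bech32 Bitcoin formats.
BTC_PATTERNS = {
    "Bitcoin (legacy)": re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"),
    "Bitcoin (bech32)": re.compile(r"\bbc1[a-z0-9]{25,59}\b"),
}

def find_crypto_addresses(text):
    """Return (address, type) pairs for every pattern match in the text."""
    findings = []
    for label, pattern in BTC_PATTERNS.items():
        for match in pattern.findall(text):
            findings.append((match, label))
    return findings
```

Running this on a page containing the address from the example above would flag it as a bech32 Bitcoin address.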

Scrape Social Media, Phone Numbers

There are three more Python scripts in there available to use.

To run any of those scripts, just type the command “python” followed by the script’s file name, as you did before (like “python urlstxt_15cryptos_check.py”).

You can search for social media accounts, international phone numbers, or US phone numbers. The files are:

fifteensoc_intake_urlsdoc.py
top_20_country_phone_numbers.py
USphone_search.py
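These work the same way as the crypto script: regex patterns for each format. As a rough sketch of the kind of US-style matching USphone_search.py does (the pattern here is my own simplification, not the script’s actual code):

```python
import re

# A simplified US pattern: matches 555-123-4567, (555) 123-4567,
# and 555.123.4567, without requiring the +1 country code.
US_PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b")

def find_us_numbers(text):
    """Return every US-formatted phone number found in the text."""
    return US_PHONE_RE.findall(text)
```

The international script presumably does the same thing with one pattern per country’s numbering format.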

Find Company Employees Through Website Changes

Identify Changes in Website (including when employees were hired or left the company)

Changes in company websites can identify when employees were hired or left the company. The following tools will provide that information.

1 – https://www.aihitdata.com – This tool tracks and parses data from company websites, registration records, employees’ LinkedIn accounts, and possibly more. Notably, this tool also identifies changes in company websites, such as when new employee profiles or photos are posted on the website. AIHITdata also identifies when employee profiles are taken down from the website.

2 – The WayBack Machine and other internet archives – The WayBack Machine has an archive of copies of websites over time. There are other similar tools. Another post provides information about this. See:

Internet Archives

3 – The Wayback Machine also has a “Changes” feature that tells you when changes happened in a website over time – see previous post:

WayBack Machine’s “Changes” Feature

4 – https://carbondate.cs.odu.edu – The tool’s main purpose is to try to figure out when a website was created. But I use it to find if a website exists in different internet archives. If the tool finds your website in an archive, it will list out a URL that brings you to that archive’s earliest version of your website.

Tracking Photos – Using Carbon 14 Python script

The Carbon 14 script by Nixintel (https://nixintel.info/) will look at a webpage and determine when each photo was uploaded to the site.

I use this to figure out when employees started working at a company. I look at employee profiles on a website and find when the profile photo was uploaded. I am assuming the connection between profile and employment.

This also gives me some guidance on what to seek out in the Internet Archive. I will look for the archive of a website before and after the date the photo was uploaded. This should confirm my expectations and also show the previous employee in the same position.

More Employees

A company’s website will often have a list or profiles of employees, leadership, or board members.

Usually if you hover your mouse over one of the photos some additional details will show up, like title or email. This can be frustrating if you want to scrape all of the profiles but you need to hover your mouse over every single one to get those details.

A previous post addressed how to web scrape a page like that, including scraping all of the profiles and the details that show up when hovering your mouse. See the post:

How to Web Scrape Corporate Profiles with Python

Carbon 14

1 – Log into Github and then in a second tab log into Gitpod, then in a third tab paste and go to:

gitpod.io/#https://github.com/Lazza/Carbon14

2 – Hit continue and once a new “workspace” has opened up, type:

python -m pip install -r requirements.txt

3 – Type the following, but where it lists “examplewebsite.com,” put in the url for the specific page in the website that you want to search:

python carbon14.py

(ex. “python carbon14.py https://www.examplewebsite.com”)

4 – The output will show when each photo on the page was uploaded.

If you used the webpage with corporate profiles, the output will be a list of information on each photo, including the date uploaded and the photo URL. Each employee’s name will often be part of the URL to their profile photo. And of course you can also click on the URL to open the photo.
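Carbon14’s trick, as I understand it, is reading the HTTP Last-Modified header that many servers return for image files. A rough sketch of the idea (not the tool’s actual code):

```python
import urllib.request
from email.utils import parsedate_to_datetime

def parse_http_date(value):
    """Turn an HTTP date like 'Wed, 21 Oct 2015 07:28:00 GMT' into a datetime."""
    return parsedate_to_datetime(value) if value else None

def image_upload_date(img_url):
    """Ask the server (via a HEAD request) when an image file was last
    modified -- a rough proxy for when the photo was uploaded."""
    req = urllib.request.Request(img_url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return parse_http_date(resp.headers.get("Last-Modified"))
```

Keep in mind the header only shows when the file was last changed on the server, so a re-uploaded or migrated photo can carry a misleading date.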
