Synchronize your Medium blog with Python

TL;DR: If you write a lot of Medium articles, you may also want to show them on your own homepage or on other pages, like I do (see below for a screenshot of my homepage). For me, it was time-consuming and annoying to update the links on my homepage every time I wrote a new blog post. My homepage is built with Python, so why not always fetch the links from Medium automatically with web scraping?

Screenshot from my homepage, antonioblago.com.

We need to install and import the following libraries:

from bs4 import BeautifulSoup  ## for web scraping
import urllib.request  ## for the http request
import re  ## regex
from linkpreview import link_preview  ## for the link preview data (title, description, image)
import pandas as pd  ## for the dataframe
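
If the packages are not installed yet, they can be fetched from PyPI (urllib and re ship with Python, so only the third-party ones are needed; the package names below are the usual PyPI names):

pip install beautifulsoup4 linkpreview pandas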

We define the request with urllib for our Medium blog. For this, you need the URL of your Medium landing page. For me, it is antonioblago.medium.com.

url = "https://antonioblago.medium.com"  ## replace with your own Medium landing page
req = urllib.request.Request(url, headers={'User-agent': 'your bot 0.1'})
response = urllib.request.urlopen(req)
html = response.read()

# Parse the response
soup = BeautifulSoup(html, 'html.parser')

# Find all anchor tags
text = soup.find_all("a")

Next, we loop over all the links we found on this page and search for 'href="/', which is the pattern for the relevant URLs. How do I know that? Go to your page and open your browser's developer tools (F12). Then hover over the header of one of your articles and it will show you the link, see below.

Screenshot by author

list_urls = []
for item in text:
    # convert the tag to a string
    item = str(item)
    pos = item.find('href="/')
    # str.find returns -1 if the pattern is not found
    if pos > 0 and item.find("user_profile") > 0:
        print(item)

You will find a lot of links; not all of them are relevant, and most are duplicates.

Screenshot by author

Next, we only want the links that match href="/ followed by letters. We can use a regular expression for that. Why don't I use the "aria-label" instead? Because it can change whenever Medium updates its webpage. The link pattern should be more robust, I think.

# inside the loop over the anchor tags from above
result = re.search(r'href="\/+([a-z])\w+', item)
if result:
    print(item)

A short detour: regex (regular expressions)

When I am coding with regex, I use https://regexr.com/ to test my patterns. It is a great website to try them out. It is built for JavaScript, but the patterns work much the same in Python. It highlights your matches in real time, see below.

Screenshot by author

To check whether it would also work in Python, you can use https://pythex.org/. In both cases, we find the pattern we were looking for.

Screenshot by author
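
As a quick sanity check, here is a minimal snippet that runs the same pattern directly in Python against a sample anchor tag (the example string below is invented for illustration):

import re

# A made-up example of what an anchor tag from the profile page can look like
sample = '<a href="/my-first-article-1a2b3c4d5e6f?source=user_profile----0----">My first article</a>'

result = re.search(r'href="\/+([a-z])\w+', sample)
if result:
    print(result.group())  # prints: href="/my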

Let's continue

We want to find the start and the end of our pattern. The re library gives us the start() and end() methods on the match object for that. We also want the full URL, so we have to look for the end pattern as well. Then we prepend our landing page URL. To avoid duplicates in our list, we first check whether the URL is already in it.

list_urls = []
for item in text:
    item = str(item)
    result = re.search(r'href="\/+([a-z])\w+', item)
    if result:
        try:
            start = result.start()  ## start of the pattern
            end_item = item[start:]
            end = re.search(r'-"', end_item).end()  ## end of the pattern: the href value closes with '-"'

            url_extract_pos = item[start:start + end]  ## cut out the url
            url = "http://antonioblago.medium.com" + url_extract_pos[6:-1]  ## strip 'href="' and the closing quote

            if url not in list_urls:
                list_urls.append(url)
                print(url)
        except AttributeError:
            # no end pattern found in this anchor tag, skip it
            continue

Your output:

Screenshot by author

Now we want to get the data for our link previews. We save them into a list of dictionaries. Afterwards, we can put them into a DataFrame and upload it to a database.

list_of_links = []
for i in list_urls:
    preview = link_preview(i)

    dic_preview = {"title": preview.title,
                   "description": preview.description,
                   "image": preview.image,
                   "force_title": preview.force_title,
                   "absolute_image": preview.absolute_image,
                   "url": i}
    list_of_links.append(dic_preview)
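
Turning the list of dictionaries into a DataFrame is a single pandas call; how you persist it afterwards depends on your setup, so the CSV export below is just one simple option:

df = pd.DataFrame(list_of_links)
print(df.head())

# one simple way to persist the result; replace with your own database upload
df.to_csv("medium_articles.csv", index=False)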

That's it! You can run your script daily or weekly in the backend, as I do on pythonanywhere.com*. You can find the code on GitHub below.
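
Here is a minimal sketch of how the steps could be wrapped into one function that a scheduled task (for example on pythonanywhere.com) calls once a day; scrape_and_preview and the file name are just placeholders for the code from the sections above:

# Hypothetical wrapper: scrape_and_preview() stands for the scraping and
# link-preview code above, bundled into a function that returns list_of_links.
def sync_medium_links():
    list_of_links = scrape_and_preview()
    pd.DataFrame(list_of_links).to_csv("medium_articles.csv", index=False)

if __name__ == "__main__":
    sync_medium_links()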

Thanks for reading my article. I hope you liked it. Please feel free to like, share, and comment on it. Follow me for more content about cryptos, stocks, data analysis and web development.

Read my other article about stocks and Reddit:

Here you can find the code: https://github.com/AntonioBlago/sychronize_your_medium_blog.

❤❤❤Fork it, download it and give it a star. ❤❤❤

If you want to learn more about Data Science in Python, I recommend this book*.
