There is power in presenting related information together, especially when that information comes from sites you already know, trust, and use. Content aggregation is the process of gathering content from different sources into one place, so the user can find everything they need without hunting around for it. It beats scrolling endlessly through many different websites and trying to weed out the relevant information. In this article, you will learn how to create a simple, customized news aggregator site in Python from scratch. By the end of this tutorial, you will have a fully functional news aggregator built with Django.

What is a Content Aggregator?

A content aggregator is like a restaurant menu: it serves up different articles and press releases with a single click, so you don't have to hunt down multiple sources of information just to get what you want. Content aggregators organize media such as blog posts, news articles, and product descriptions in one convenient area and make them all available for viewing, so when there is news worth reading, you no longer have to go to the trouble of hopping from site to site to find it.

Requirements

  1. Django
  2. BeautifulSoup
  3. Requests Module

Note: We’ll be building this entire project in Django, a Python framework made specifically for web development. I will explain what I am doing as we go, but it is best if you have some knowledge of Django beforehand.

Setting Up The Project 

To start off, we’ll first have to install the Django framework, which can be done with pip:

pip install django

To create the project, run the following command:

django-admin startproject content_aggregator

After running the command above, go into the project directory and run the following command to create a Django application:

cd content_aggregator  #you can go to the project directory using this command
python manage.py startapp aggregator

Go to the content_aggregator folder in your IDE, open content_aggregator/settings.py, find the “INSTALLED_APPS” section, and add the application’s name as shown below:

INSTALLED_APPS = [
  'django.contrib.admin',
  'django.contrib.auth',
  'django.contrib.contenttypes',
  'django.contrib.sessions',
  'django.contrib.messages',
  'django.contrib.staticfiles',
  'aggregator',  #<-- here
]

Also, point the TEMPLATES setting at a project-level templates directory so your templates are found. Note that this snippet uses os.path.join, so make sure import os appears at the top of settings.py (recent Django versions define BASE_DIR with pathlib and no longer import os by default).

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': [os.path.join(BASE_DIR,'templates')],  #<-- here
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]

Open the content_aggregator/urls.py file and make the following changes:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('aggregator.urls')),  # '' (not '/') routes the site root to the app
]
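The include() call above expects a URLconf inside the aggregator app, which startapp does not create for you. A minimal sketch of what aggregator/urls.py could look like, assuming the view is named index as in the views section of this tutorial:

```python
# aggregator/urls.py
from django.urls import path
from . import views

urlpatterns = [
    # '' matches the root path handed down from the project urls.py
    path('', views.index, name='index'),
]
```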

Scraping the websites

To obtain the data for aggregation, we will use web scraping, a technique for extracting data from existing websites. To scrape the sites, we will use the requests and beautifulsoup modules, which are extremely helpful when it comes to crawling websites and extracting information. In this case, we will extract articles from Times of India and The Onion using these two Python modules.

We can start by going to The Onion's site, or any other website you like; the process is exactly the same.

Follow these steps to continue:

  1. Open the website and go to the developer tools by pressing F12 or navigating through the browser menu; a panel should appear on the right side or at the bottom of the browser window. Those are the developer tools.
  2. Press Ctrl (or Cmd) + Shift + C, or click the button with the arrow-on-a-box icon in the panel's top-left corner.
  3. Hover over the container of an article, which is a div in most cases, and click on it; it will be highlighted in the panel, where you can see its tag and class.

Inspecting The Onion's home page this way, we can see that each article heading is stored in an h4 tag. Now we can use this to get the data we need.
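To make that concrete, here is a small, self-contained sketch of how BeautifulSoup pulls headline text out of markup like the kind you see in the developer tools. The HTML snippet and the curated-item class name are invented for illustration, not taken from the real site:

```python
from bs4 import BeautifulSoup

# A stand-in for the markup you would see in the developer tools
sample_html = """
<div class="curated-item"><h4><a href="/a1">First headline</a></h4></div>
<div class="curated-item"><h4><a href="/a2">Second headline</a></h4></div>
<h4>Footer link</h4>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all matches every tag with the given name; class_ narrows it further,
# which lets you skip stray h4 tags such as the footer link above
divs = soup.find_all("div", class_="curated-item")
headlines = [d.find("h4").get_text(strip=True) for d in divs]
print(headlines)  # ['First headline', 'Second headline']
```

Filtering by the container's class is often more robust than grabbing every heading tag on the page and slicing off the extras.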

Writing the views

Now comes the main coding part, where we import the modules, fetch the data, and set up how everything in this project fits together.

To install the two Python modules we talked about earlier, requests and Beautiful Soup (published on PyPI as beautifulsoup4; the bs4 package is a shorthand that pulls it in), run the following commands:

pip install bs4
pip install requests

After installing both of the packages we can start working on the views:

import requests
from django.shortcuts import render
from bs4 import BeautifulSoup

# Note: this module-level code runs only once, when Django first imports
# the views, so the headlines stay the same until the server restarts.

# Getting news from Times of India
toi_r = requests.get("https://timesofindia.indiatimes.com/briefs")
toi_soup = BeautifulSoup(toi_r.content, "html.parser")

toi_headings = toi_soup.find_all('h2')
toi_headings = toi_headings[0:-13]  # slice off the trailing h2 tags that belong to the footer

toi_news = []
for th in toi_headings:
    toi_news.append(th.text)


# Getting news from The Onion
ht_r = requests.get("https://www.theonion.com/")
ht_soup = BeautifulSoup(ht_r.content, "html.parser")

ht_headings = ht_soup.find_all('h4')
ht_headings = ht_headings[2:]  # skip the first two h4 tags, which are not headlines

ht_news = []
for hth in ht_headings:
    ht_news.append(hth.text)


def index(req):
    return render(req, 'index.html', {'toi_news': toi_news, 'ht_news': ht_news})
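One caveat worth noting (my observation, not from the original tutorial): because the requests calls sit at module level, they run once at import time and the headlines stay frozen until the server restarts. A hedged sketch of a reusable helper you could call from inside index() instead, so each request re-fetches the pages; the function name and signature are my own invention:

```python
from bs4 import BeautifulSoup

def extract_headlines(html, tag, start=0, stop=None):
    """Parse raw HTML and return the stripped text of every matching tag,
    sliced with start/stop to drop non-news items such as footer links."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [h.get_text(strip=True) for h in soup.find_all(tag)]
    return texts[start:stop]

# Inside the view you would then do something like:
#   toi_news = extract_headlines(requests.get(TOI_URL).content, "h2", stop=-13)
#   ht_news  = extract_headlines(requests.get(ONION_URL).content, "h4", start=2)
```

The trade-off is that every page load now waits on two network requests, so for real use you might cache the results for a few minutes.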

Writing Templates

The next step is to create the templates directory referenced in the TEMPLATES setting and add an index.html file, which should look like this:

<!DOCTYPE html>
<html>
<head>
    <title>Content Aggregator</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
</head>
<body>
    <div class="jumbotron">
        <center>
          <h1>Content Aggregator</h1>
          <a href="/" class="btn btn-danger">Refresh News</a>
        </center>
    </div>
    <div class="container">
        <div class="row">
            <div class="col-6">
                    <h3 class="text-center">News from Times of India</h3>
                    {% for n in toi_news %}
                    <h5> -  {{n}} </h5>
                    <hr>
                    {% endfor %}
                    <br>
            </div>
            <div class="col-6">
                    <h3 class="text-center">News from The Onion</h3>
                    {% for htn in ht_news %}
                    <h5> - {{htn}} </h5>
                    <hr>
                    {% endfor %}
                    <br>
            </div>
        </div>


</div>
    <script
src="https://code.jquery.com/jquery-3.3.1.min.js"
integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8="
    crossorigin="anonymous"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.12.9/umd/popper.min.js" integrity="sha384-ApNbgh9B+Y1QKtv3Rn7W3mgPxhU9K/ScQsAP7hUibX39j7fakFPskvXusvfa0b4Q" crossorigin="anonymous"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/js/bootstrap.min.js" integrity="sha384-JZR6Spejh4U02d8jOt6vLEHfe/JQGiRRSQQxSfFWpi1MquVdAyjUar5+76PVCmYl" crossorigin="anonymous"></script>
</body>
</html>

Output

Now that everything is in place we can run the project, but first we have to apply the database migrations. Run both of the following commands:

python manage.py makemigrations
python manage.py migrate

Now we can start the development server:

python manage.py runserver

Go to http://127.0.0.1:8000/ in your browser; if everything worked, you should see the headlines from both sites listed side by side.

Final Words

That’s it: with this you have a simple two-site news aggregator built with Python. It can easily be modified to pull headlines from any two sites onto one page, and you can add your own sites by tweaking the code. You can also extend the functionality by scraping more data, such as URLs and images. Projects like this help you level up your skills and learn how the web works.
