Scraping real estate prices using Python and visualization using maps

TL;DR

An interactive map, accurate as of 13/08/2018, showing property prices per square meter in various areas of Tallinn:

https://dvas0004.github.io/TallinnRealEstate/

Data shown is for 3-bedroom apartments only (resource limitations). Green is less expensive, red is more expensive. Clicking on a data point shows a popup containing the actual price per square meter for that data point.


 

As any house/apartment hunter knows, finding the perfect place to call home is an arduous and drawn-out process. In this show-and-tell article I’ve used Python to scrape data from one of the most popular Estonian real-estate sites (https://kv.ee) and display the median price per square meter at different locations across Tallinn:

[Image: tallin_property_1]

The above is a screenshot of the final result, which you can browse here:
https://dvas0004.github.io/TallinnRealEstate/

Note: the map only shows results for 3-bedroom apartments due to resource limitations. Green is cheaper, red is more expensive.

Tip: click on the individual data points to display a popup showing the actual price per square meter.

Technical description

The actual code is posted at the end of this article. The main ingredients for this script were the Python “requests” and “requests_html” modules. Admittedly, I could have used just one module, but I wanted to try out the HTML parsing capabilities of the requests_html module. For simplicity’s sake, the script outputs a static HTML file which can then be loaded into the browser or published on GitHub Pages, like I did above. A more sophisticated approach would be to use a Python web framework like Flask to host the web page directly.

Scraping the data involved inspecting the web traffic between the browser and KV.EE, specifically when using the “Search by Map” functionality on the site. Once the appropriate search filters are set and the map is centered on the area you’d like to search within, pressing the “search” button issues a request via a URL similar to that shown on line 24 in the code below. The parameters I was particularly interested in were those describing the map area to search:

  • nelng / nelat : north east longitude / latitude (the top right corner of the map)
  • swlng / swlat : south west longitude / latitude (the bottom left corner of the map)

This allows us to issue a separate request for each area within which we’d like to scrape data, as was done in lines 154-185 of the code snippet below. The “get_area_objects” class method gets a list of object IDs representing apartments, and their corresponding co-ordinates.
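As a rough illustration of the idea (not the code from the script itself), a larger area can be cut into a grid of smaller map tiles, each of which becomes one set of nelat/nelng/swlat/swlng query parameters. The helper names, the tuple layout and the bounding box around Tallinn are assumptions made for this sketch; only the four parameter names come from the site.

```python
from typing import Dict, Iterator, Tuple

# (swlat, swlng, nelat, nelng). The tuple layout is an assumption of this
# sketch; only the four parameter names come from the site's search URL.
BBox = Tuple[float, float, float, float]


def split_bbox(bbox: BBox, rows: int, cols: int) -> Iterator[BBox]:
    """Yield rows * cols equal tiles covering bbox, south to north, west to east."""
    swlat, swlng, nelat, nelng = bbox
    lat_step = (nelat - swlat) / rows
    lng_step = (nelng - swlng) / cols
    for r in range(rows):
        for c in range(cols):
            yield (
                swlat + r * lat_step,
                swlng + c * lng_step,
                swlat + (r + 1) * lat_step,
                swlng + (c + 1) * lng_step,
            )


def bbox_to_params(bbox: BBox) -> Dict[str, str]:
    """Map one tile onto the four map-area query parameters."""
    swlat, swlng, nelat, nelng = bbox
    return {
        "swlat": f"{swlat:.6f}",
        "swlng": f"{swlng:.6f}",
        "nelat": f"{nelat:.6f}",
        "nelng": f"{nelng:.6f}",
    }


# Rough, illustrative bounding box around Tallinn (not taken from the script).
TALLINN = (59.35, 24.55, 59.50, 24.95)
tiles = list(split_bbox(TALLINN, rows=2, cols=2))
params = [bbox_to_params(tile) for tile in tiles]
```

Each dictionary in params could then be merged with the remaining search filters and passed to requests as the query string of one search request.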

At this stage, we have the co-ordinates for the apartments, but we still need their price and area in order to calculate the price per square meter. This is what the “get_object_details” class method does, and it is here that requests_html really shines, since it makes it very easy to extract the data we require.
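The arithmetic behind this step is simple enough to sketch. The snippet below is a minimal illustration, assuming the price and area have already been extracted as text such as “125 000 €” and “62,5 m²”; those formats and the function names are assumptions, not what the script or the site necessarily use.

```python
import re
from statistics import median
from typing import Iterable, Optional, Tuple


def parse_number(text: str) -> Optional[float]:
    """Extract the first number from strings such as '125 000 €' or '62,5 m²'."""
    # Drop regular and non-breaking spaces used as thousands separators.
    cleaned = text.replace("\u00a0", "").replace(" ", "")
    match = re.search(r"\d+(?:[.,]\d+)?", cleaned)
    if match is None:
        return None
    # Accept a decimal comma as well as a decimal point.
    return float(match.group(0).replace(",", "."))


def price_per_m2(price_text: str, area_text: str) -> Optional[float]:
    """Return price divided by area, or None if either value is unusable."""
    price = parse_number(price_text)
    area = parse_number(area_text)
    if price is None or not area:  # also guards against a zero area
        return None
    return price / area


def median_price_per_m2(listings: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Median price per square meter over (price_text, area_text) pairs."""
    values = [
        value
        for value in (price_per_m2(price, area) for price, area in listings)
        if value is not None
    ]
    return median(values) if values else None
```

Listings with a missing price or area are skipped rather than counted as zero, so a few incomplete adverts do not drag the median down.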

In the final stage, the “get_html” method uses Leaflet to build a map over which we display our data: circles representing the price per square meter. I used an elegant JavaScript function (perc2color) in line 111 to convert from number/price to color.
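To show the kind of mapping involved, here is a Python sketch of a price-to-color conversion. It is not the perc2color function used in the script, just one common way to interpolate from green through yellow to red; the function name and the low/high bounds are assumptions.

```python
def price_to_color(price: float, low: float, high: float) -> str:
    """Return a hex color from green (price <= low) to red (price >= high)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp to [0, 1]: 0 is the cheapest end, 1 the most expensive.
    t = min(max((price - low) / (high - low), 0.0), 1.0)
    if t < 0.5:
        # Green to yellow: the red channel ramps up.
        red, green = round(510 * t), 255
    else:
        # Yellow to red: the green channel ramps down.
        red, green = 255, round(510 * (1 - t))
    return f"#{red:02x}{green:02x}00"
```

With bounds of 1000 and 3000 per square meter, for example, a price of 1000 gives pure green, 2000 gives yellow, and anything at or above 3000 gives pure red.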

Lessons learnt: Of Spring Boot + OAuth2 + redirect URIs

TL;DR: make sure NGINX is set up correctly (proxy_set_header) before messing around with your code.

Scenario: Deploying a Spring Boot micro-service behind an NGINX reverse proxy gave us issues when using the default Google OAuth2 configuration as described here, namely the “Redirect URI Mismatch” error mentioned at the very end of the linked article.

Trying the solution based on security.oauth2.client.pre-established-redirect-uri, as mentioned in this article, didn’t make any difference. The Spring Security debug logs showed the framework was still redirecting requests to the default redirect URL.

First Attempt (an interesting detour but ultimately failed attempt)

What did make a difference was the following application property, which we found in the Spring Boot documentation:

spring.security.oauth2.client.registration.google.redirect-uri-template: http://abc.example.com/login/oauth2/callback/google

However, it is important to also set the redirectionEndpoint baseUri as part of the security configuration, for example:

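import org.springframework.context.annotation.Configuration
import org.springframework.security.config.annotation.web.builders.HttpSecurity
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter

@Configuration
class OAuth2LoginSecurityConfig : WebSecurityConfigurerAdapter() {
    @Throws(Exception::class)
    override fun configure(http: HttpSecurity) {
        http
                .authorizeRequests()
                    // The login page and favicon stay public
                    .antMatchers("/login.html", "/favicon.ico")
                        .permitAll()
                    // Everything else requires an authenticated user
                    .anyRequest().authenticated()
                    .and()
                        .oauth2Login()
                            // Path portion of the redirect-uri-template, with a wildcard for the registration id
                            .redirectionEndpoint().baseUri("/login/oauth2/callback/*")
    }
}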

Note how the baseUri is set to only the path portion of the redirect-uri-template, and does not include the hostname.

Second Attempt (correct solution)

The above almost got us to the correct solution; however, Spring Boot threw an “invalid redirect uri” error. Looking at the code, it transpires that Spring OAuth2 checks the redirect URI returned to it from the authentication provider (Google in our case) against a redirect URI it builds on the fly, using the hostname it detects from the original incoming HTTP request (i.e. the request coming from the front-end).

Since this HTTP request passes through an NGINX reverse proxy, Spring Boot was actually seeing “localhost” in the HTTP Host header. Because localhost does not match the host in the redirect-uri-template we set above (abc.example.com in our particular example), Spring Boot throws the invalid redirect uri error.
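The mismatch can be pictured with a toy model. The Python sketch below is not Spring Security’s actual code; it only mimics the idea of rebuilding the redirect URI from the Host header the application sees and comparing it with the expected one. The function name is hypothetical, and the path and hostname are taken from the example above.

```python
EXPECTED_REDIRECT_URI = "http://abc.example.com/login/oauth2/callback/google"


def redirect_uri_matches(
    host_header: str,
    expected_uri: str = EXPECTED_REDIRECT_URI,
    path: str = "/login/oauth2/callback/google",
    scheme: str = "http",
) -> bool:
    """Rebuild the redirect URI from the Host header and compare it exactly."""
    rebuilt = f"{scheme}://{host_header}{path}"
    return rebuilt == expected_uri


# Behind a proxy that does not forward the original Host header,
# the application sees the proxy's upstream host instead.
seen_via_proxy = redirect_uri_matches("localhost")

# With the original Host header preserved, the rebuilt URI matches.
seen_with_host_preserved = redirect_uri_matches("abc.example.com")
```

In this model seen_via_proxy is False and seen_with_host_preserved is True, which is the whole problem in miniature: the application code is fine, but the hostname it is handed is wrong.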

So it turns out the solution was really rather simple:

  1. Remove the changes introduced in the first attempt (i.e. the custom redirectionEndpoint settings and the redirect-uri-template)
  2. Instruct NGINX to copy the HTTP Host header it received from the original client into the HTTP Host header that it sends to the server, thereby preserving the original Host header, like so:
proxy_set_header Host $http_host;

This is equivalent to Apache’s ProxyPreserveHost setting.