<h1>Analyzing Who’s Hiring Data on HackerNews with Interactive Python</h1>
<p><em>2021-07-01</em></p>
<p>HackerNews has a monthly <a href="https://news.ycombinator.com/item?id=27699704">thread</a> where employers can post job listings. <a href="https://code.visualstudio.com/docs/python/jupyter-support-py">Interactive Mode in Visual Studio Code</a> is something I recently discovered that puts a notebook inside the editor. Here, I’m going to put those two things together and use Interactive Mode to lightly analyze job postings on HN.</p>
<p>The Python notebook is on GitHub: <a href="https://github.com/iblaine/hn-whoshiring-analysis">github.com/iblaine/hn-whoshiring-analysis</a>.</p>
<p><img src="/assets/images/projects/hn-analysis-walkthrough.gif" alt="" /></p>
<p><strong>Some questions we will be able to answer with this analysis…</strong></p>
<ul>
<li>How has the popularity of “Data Engineer” changed over time?</li>
<li>How has “remote” factored into job descriptions since covid?</li>
<li>What companies have been posting the most to HN Who’s Hiring threads over time?</li>
</ul>
<p><strong>Interactive Mode in Visual Studio Code</strong><br />
<a href="https://code.visualstudio.com/docs/python/jupyter-support-py">Interactive Mode in Visual Studio Code</a> is a notebook built into your IDE. A line with<br />
<code class="language-plaintext highlighter-rouge"># %%</code><br />
creates a new cell. That’s it. Each <code class="language-plaintext highlighter-rouge"># %%</code> marks a cell that can be run individually or stepped through in the debugger.</p>
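<p>For example, here is a minimal two-cell file (the data is a made-up placeholder):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># %% first cell: build some placeholder data
import pandas as pd

df = pd.DataFrame({"keyword": ["remote", "data engineer"], "cnt_total": [100, 40]})

# %% second cell: can be run or debugged independently of the one above
print(df.head())
</code></pre></div></div>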
<p><strong>Rough requirements for this analysis…</strong></p>
<ul>
  <li>Collect every HN Who’s Hiring post from Jan 2013 to July 2021 (103 HN threads & ~67,493 job postings).</li>
  <li>Collect the header of each item (its first line) and parse it into company_name, location, job_title, salary.</li>
  <li>Search the full text for keywords, counting both whether a keyword appears in an item and how often it appears.</li>
</ul>
<p><strong>Definitions</strong></p>
<ul>
  <li>post = An HN Who’s Hiring post, created on the first of every month.</li>
<li>item = When a user creates a new message in a post, we call that an item. A post contains many items.</li>
<li>cell = A block of code created with <code class="language-plaintext highlighter-rouge"># %%</code>.</li>
  <li>cnt_total = 1 if the keyword is found in an item.</li>
  <li>cnt_unique = the number of times a keyword is found in an item (see the counting sketch after this list).</li>
</ul>
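<p>To make those last two definitions concrete, here’s a minimal counting sketch (the function name and case-insensitive matching are my assumptions, not necessarily what the notebook does):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def count_keyword(item_text: str, keyword: str) -> tuple:
    occurrences = item_text.lower().count(keyword.lower())
    cnt_total = 1 if occurrences > 0 else 0  # found in this item at all, per the definition above
    cnt_unique = occurrences                 # every occurrence in this item, per the definition above
    return cnt_total, cnt_unique

# "remote" appears twice in this item: cnt_total == 1, cnt_unique == 2
print(count_keyword("Remote OK. Fully remote team.", "remote"))
</code></pre></div></div>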
<p><strong>General process for this notebook</strong></p>
<ol>
  <li>Collect every HN Who’s Hiring post. I use Google to search for what is hopefully the correct HN link, take the first result, and verify it. Selenium is used for the scraping.</li>
  <li>For every post on HN, get every item in that post. This is time consuming: a GET request for every item not already loaded.</li>
<li>Parse collected data into a dict.</li>
<li>Save dict to a file on disk.</li>
  <li>Denormalize data into a pandas dataframe (sketched after this list).</li>
<li>Analyze data.</li>
</ol>
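<p>As a rough sketch of steps 3–5 (the field names are illustrative assumptions, not the notebook’s actual schema):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Steps 3/4 produce something shaped roughly like this, keyed by month:
posts = {
    "2021-07": [
        {"company_name": "Acme", "text": "Data Engineer | Remote | $150k"},
        {"company_name": "Initech", "text": "Backend Engineer, NYC"},
    ],
}

# Step 5: denormalize into one row per item.
rows = [{"month": month, **item} for month, items in posts.items() for item in items]
df = pd.DataFrame(rows)
print(df)
</code></pre></div></div>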
<p><strong>Notes for this notebook</strong></p>
<ul>
  <li>There are 2 types of cells in this notebook: those for testing, marked <code class="language-plaintext highlighter-rouge"># skip / debug</code>, and the rest, which do the actual work.</li>
<li>I added <code class="language-plaintext highlighter-rouge">sys.exit(1)</code> to avoid running every cell all at once. Feel free to remove that as needed.</li>
<li>Data was collected from 2013-01-01 to 2021-07-01.</li>
<li>Google search results are used to find historical HN posts because that seemed convenient at the time.</li>
</ul>
<p><strong>Analysis…</strong></p>
<ul>
<li>Number of items in an HN Who’s Hiring post by month.
<img src="/assets/images/projects/graph-hnwhoshiring-by-month.png" alt="" /></li>
<li>How has the popularity of “Data Engineer” changed over time?
<img src="/assets/images/projects/graph-dataengineer-by-month.png" alt="" /></li>
<li>How has “remote” factored into job descriptions since covid?
<img src="/assets/images/projects/graph-remote-by-month.png" alt="" /></li>
<li>Which companies post the most?
Threads for: <a href="https://news.ycombinator.com/item?id=5304169">March 2013</a> & <a href="https://news.ycombinator.com/item?id=27699704">July 2021</a>
<img src="/assets/images/projects/data-top-posters.png" alt="" /></li>
</ul>
<p><strong>Conclusions</strong></p>
<ul>
  <li>Interactive Mode in Visual Studio Code is a nice tool. Probably not ideal for enterprise work, but it’s convenient and easy to use for quick analysis.</li>
  <li>“remote” is showing up more frequently post-covid. No real surprise there, but interesting nonetheless.</li>
  <li>Around 2018, Data Engineering became a real job, and its popularity keeps increasing.</li>
  <li>Apple is a frequent poster while other FAANG companies are not (the dirty source data may be a root cause).</li>
</ul>

<h1>The Anatomy of a Covid Appointment Bot</h1>
<p><em>2021-06-15</em></p>
<p>When covid vaccine appointments were initially rolled out, they were difficult to book, requiring users to navigate confusing forms, refresh pages, and pray for openings. To ease the frustration for nearby friends & family, I created a Python script to automate the steps to find covid appointments. Things snowballed from there into this Twitter bot. The bot ran for 3 months, generated 12,349 tweets and 106,353 clicks, and cost $1.25/day.</p>
<p><strong>TL;DR:</strong> I created a Twitter bot to help people find available covid appointments. It served a purpose by minimizing the pain of finding appointments, but keeping the searches accurate required constant maintenance.</p>
<p><img src="/assets/images/projects/covid-vaxster-the-cure.png" alt="" /></p>
<p>The general business logic to find a covid appointment was relatively simple. Look for a covid appointment, collect available appointment data, then post a tweet if necessary.
<img src="/assets/images/projects/covid-vaxster-logic.png" alt="" /></p>
<p><strong>Technical Notes</strong></p>
<ul>
<li>A persistent Amazon EC2 instance was used.</li>
  <li>Terraform was used to simplify deployments and to give me an opportunity to learn it. Initially I thought I’d spin up a container for each search, which would have removed most limits on scaling, but I never got beyond searching ~50 clinics at a time.</li>
<li>Searching was done using Python + <a href="https://github.com/SeleniumHQ/selenium/">selenium</a>.</li>
<li>A poor man’s CI/CD was implemented such that development was done on my mac, pushed to git, then the EC2 instance pulled down changes every hour.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sys import platform
prod = True if platform == "linux" else False # dev = my mac, prod = AWS
if prod == False:
<use mac libraries, see chrome windows>
elif prod == True:
<use EC2 libraries, headless chrome>
</code></pre></div> </div>
<p>This tiny feature made my life much easier, so I’m pointing it out here. There was a constant need to debug scraping logic and this hack cut down the effort needed to deploy the many incremental updates.</p>
</li>
  <li>Clinics were added via YAML using the following format (a loader sketch follows this list)…
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state:
california: # Name of the state
santa_barbara: # Name of the county
costco: # Name of the clinic
twitter_name: "Costco" # Name to use in a tweet
status: "active" # Can be active/inactive
url: "https://book.appointment-plus.com/d133yng2/#/book-appointment/select-a-location?_qk=lbvsv9hv4u" # Target URL to begin a search
cuttly_url: "https://cutt.ly/0vwl8qL" # URL to use in a tweet
city: "santa_maria" # Name of city where clinic is located
data: # Information to help the selenium actions
id: 477 # Misc info needed for this particular clinic
selectEmployeeButton: 1097 # Misc info needed for this particular clinic
</code></pre></div> </div>
<p>In this case we’re searching for a Costco clinic. A posted tweet would look something like…
<a href="https://twitter.com/CovidVaxster/status/1396707760612560897"><img src="/assets/images/projects/covid-vaxster-tweet.png" alt="" /></a></p>
</li>
  <li>The Twitter API was used to expose search results via <a href="https://twitter.com/covidvaxster">@CovidVaxster</a>. More on why I settled on Twitter instead of a website below…</li>
<li>My original idea was to present the data on a website, but the reality is this bot was difficult to scale. Daily maintenance was often required, sometimes several hours at a time. I instead focused my efforts on creating accurate searches for the 3 nearest counties in my area: San Luis Obispo, Santa Barbara, & Ventura.</li>
  <li>Several clinics required pagination to collect the number of available appointment slots, which gave users some sense of how quickly they needed to book. Where possible I added logic to paginate through sites, though that also added to maintenance costs.</li>
  <li>Traffic monitoring was done through <a href="https://cutt.ly/">cutt.ly</a>. If a user clicked a link in a tweet, then <a href="https://cutt.ly/">cutt.ly</a> recorded a click. This way I could get a sense of whether or not the tweets were being used; they were clicked at least 100k times, though based on UserAgent analysis roughly 40% of that traffic appears to have been bots.</li>
  <li>Twitter tags were used to search for appointments by city. Search for <a href="https://twitter.com/search?q=%23vaxster%20%23santabarbara">“#vaxster #santabarbara”</a> and you’ll see the most recent available appointments, BUT Twitter’s spam filter has removed most tweets. The tags did reference all tweets at one point… <code class="language-plaintext highlighter-rouge">¯\_(ツ)_/¯</code></li>
</ul>
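<p>For illustration, here is how config.yml might be loaded into a flat list of active clinics (a sketch using PyYAML; the nesting follows the sample above, but the real loader may differ):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import yaml  # PyYAML

with open("config.yml") as f:
    config = yaml.safe_load(f)

active_clinics = []
for state_name, counties in config["state"].items():
    for county_name, clinics in counties.items():
        for clinic_name, clinic in clinics.items():
            if clinic["status"] == "active":
                active_clinics.append(
                    {"state": state_name, "county": county_name, "clinic": clinic_name, **clinic}
                )
</code></pre></div></div>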
<p><strong>Vaxster Code</strong></p>
<p>Vaxster code to manage the search logic, plus helper code to manage Terraform & Twitter…</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Vaxster code
.
├── clinic.py Manages generic business logic
├── clinic_libs Collection of selenium scripts designed to navigate individual clinics
│ ├── __init__.py
│ ├── costco.py Selenium steps to check costco
│ ├── cvs.py Selenium steps to check cvs
│ ├── goleta_valley_cottage_frame.py Selenium steps to check a local hospital v1
│ ├── goleta_valley_cottage_v1.py Selenium steps to check a local hospital v2, needed after the hospital refactored some links
│ ├── mhealthcheckin_v1.py Selenium steps to check albertsons/ralphs/savon v1
│ ├── mhealthcheckin_v2.py Selenium steps to check albertsons/ralphs/savon v2, needed after mhealthcheckin refactored their UI
│ ├── santa_barbara_medcenter.py Selenium steps to check a local clinic in Santa Barbara
│ ├── slocounty_ca_gov.py Selenium steps to check San Luis Obispo
│ └── walgreens.py Selenium steps to check walgreens
├── config.yml YAML file for clinic information
├── county.py Manages counties, forks parallel selenium threads
├── requirements.txt
├── state.py Intended to manage multiple states but I never managed to move beyond California
└── vaxster.py Mostly reads YAML settings then kicks off states.py
Helper code
.
├── collect_stats.py Collect click stats from cutt.ly, post results to twitter
├── terraform
│ ├── main.tf
│ ├── outputs.tf
│ ├── provider.tf
│ ├── user_data.tpl Commands to set up a new EC2 instance: add AWS creds, install chrome driver, clone git repo, etc.
│ └── variables.tf
└── twitter_helper.py Create tweets using the twitter API
</code></pre></div></div>
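<p>Since county.py forks parallel selenium threads, here is a minimal sketch of that fan-out with concurrent.futures (an assumption about the mechanism, not the actual implementation; each worker must create its own selenium driver because drivers are not thread-safe):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from concurrent.futures import ThreadPoolExecutor

def search_clinic(clinic: dict) -> list:
    """Placeholder for one of the clinic_libs selenium scripts."""
    return []  # open appointment slots found for this clinic

def search_county(clinics: list, max_workers: int = 5) -> list:
    # Fan out one search per clinic; in practice ~50 clinics at a time was the ceiling.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search_clinic, clinics))
    return [slot for found in results for slot in found]
</code></pre></div></div>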
<p><strong>The evolution of clinic searches</strong></p>
<p>Clinics went through some not so obvious changes over time, where demand and supply fluctuated dramatically. Here’s a brief description of how that went.</p>
<ul>
<li><u>Initial Rollout</u><br />
January 2021 - February 15th, 2021<br />
Clinics began to allow the public to book their own appointments. Period of high demand with limited vaccine appointment availability. This period probably marked the point where this bot delivered the most value. There was little information on how to book an appointment, let alone find availability.</li>
<li><u>The Winter Storm</u><br />
February 15th, 2021 - Feb 28th, 2021<br />
During this period, a <a href="https://en.wikipedia.org/wiki/February_13%E2%80%9317,_2021_North_American_winter_storm">massive winter storm</a> hit the midwest and impacted the vaccine supply chain. Many existing appointments were cancelled and new appointments were difficult to find. I disabled most searches during this time because supply was so low.</li>
<li><u>High Availability / High Demand</u><br />
March 2021 - April 2021<br />
This period saw a lot of activity: appointments were widely available and in high demand. Supply chain problems slowly went away and clinics made needed improvements to their booking processes, which also meant constant script updates were needed to keep up with website changes.</li>
<li><u>High Availability / Low Demand</u><br />
This period marked the beginning of the end of the need for a vaccine bot; vaccine supply chain problems were mostly solved by now. At one point activity dropped significantly for a few days because several clinics updated their websites at around the same time. As clinic booking workflows grew in complexity, keeping up with the necessary updates became increasingly difficult and costly.</li>
</ul>
<p><img src="/assets/images/projects/covid-vaxster-timeline.png" alt="" /></p>
<p><strong>Maintenance</strong></p>
<p>Maintenance was the biggest surprise. At best, every new clinic was a few lines in a YAML file. At worst, every new clinic was a new Python class. To expand on that, here’s a generalization of the different types of clinics…</p>
<ul>
<li><u>Required Login or Captcha</u><br />
This included any clinic that required a login or a captcha to search for appointments. I excluded these clinics so as to ‘do no evil’ on the interwebz. Any clinic requiring a captcha to search appointments was not meant to be searched.</li>
<li><u>Nationwide Clinics</u><br />
This included CVS, Ralphs, Albertsons, & RiteAid. The nice thing about these clinics was once a selenium script was created for one clinic, the logic could be repurposed for many clinics.</li>
<li><u>Hospital Chains</u><br />
Local hospital chains were pretty valuable: they had consistently reliable appointment information.</li>
<li><u>Urgent Care Clinics</u><br />
This included privately owned urgent care clinics. The supply on these clinics was very low, but consistent. They were hidden gems of appointments that showed up for a few days at a time.</li>
<li><u>Mass Vaccination Sites</u><br />
This includes one-off mass vaccination sites. Their web interfaces were bespoke: I could put a few hours into scraping one, get a lot of appointments for a short time, then never be able to reuse that code again.</li>
</ul>
<p>Keeping searches running and delivering accurate appointment data was an ongoing chore. Part of me thinks the government dropped the ball by not rolling out a nationwide covid appointment booking website, but that now seems like a near-impossible task.</p>
<p><strong>Lessons learned</strong></p>
<ul>
  <li>Scaling a search engine across clinics is brutally difficult: every clinic, county, and state may have its own way of managing covid appointments.</li>
<li>Terraform is convenient, but figuring out VPCs, SGs, and IAM roles is a chore.</li>
<li>Twitter is full of spam. 40% of my clicks seem to have been generated by bots.</li>
</ul>

<h1>Detecting Data Anomalies In Your Data</h1>
<p><em>2020-08-09</em></p>
<p>This project is still in an alpha phase… and I do hope it gets to a point where it can be shared, used, poked at, prodded, and improved.</p>
<p><strong>What?</strong>
The Data Anomaly Detection Tool (“DAD Tool”) is a tool that can assess data for abnormalities at scale.</p>
<p><strong>Why?</strong>
Newly generated data can be difficult to trust. Every situation is unique and some organizations are going to struggle more than others. Generally speaking, an organization that consumes a lot of unstructured data is going to have a more challenging time trusting the data it consumes. With that in mind, we want tools that can improve our ability to trust our data.</p>
<p><strong>How?</strong>
The DAD Tool was created using Python, Flask, and AWS… for the most part.</p>
<p><strong>What else?</strong>
Here’s a deck that was created to describe the tool: <a href="https://www.slideshare.net/iblaine/using-airflow-for-tools-development">https://www.slideshare.net/iblaine/using-airflow-for-tools-development</a></p>
<p>What’s interesting about the DAD Tool is how the architecture uses lightweight Python classes for the math behind the statistics, takes advantage of databases to do the heavy lifting of transporting data, and uses Airflow as part of the backend. All that is a longer conversation, but basically the DAD Tool aggregates statistical requirements, leverages Airflow as an outsourced orchestration tool, and simplifies the execution of tests for users. With a few clicks, new tests can be added to assess every column of every table in a schema.</p>

<h1>Strandbeest Bike</h1>
<p><em>2019-12-19</em></p>
<p>Some friends and I built a strandbeest bike that’s part bike, part strandbeest. Here it is on an episode of Tosh.o.</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/hxeOGIpN41k" frameborder="0" allowfullscreen=""></iframe>
<p>FWIW, I did not get permission to put this video up on YouTube, but Tosh.o didn’t seek permission to put our video up on their TV show so… we are even.</p>
<p>Here is the bike on display at an exhibit at the Exploratorium in San Francisco for a strandbeest event.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest_bike_exploratorium.jpg" alt="The Strandbeest bike at the Exploratorium" /></p>
<h4 id="answers-to-common-questions">Answers to common questions…</h4>
<p>Q. Why not put legs on the front?</p>
<p>A. Putting legs on the front would probably prevent it from turning. Plus the idea is to make it half bike, half legs. It needs at least one wheel.</p>
<p>Q. Is it fast?</p>
<p>A. No</p>
<p>Q. Can it climb hills?</p>
<p>A. Probably not. It needs a level surface with as few curves as possible.</p>
<hr />
<h2 id="inventory-to-get-started">Inventory to get started</h2>
<p>Materials</p>
<ul>
<li>160′ of 1/2″ steel pipe</li>
<li>3′ of 5/8″ cold rolled steel rod</li>
<li>10’x1″ of 1/8″ steel plate</li>
<li>120 bushings (part# MYI-05-05 from igus.com)</li>
</ul>
<p>Equipment</p>
<ul>
<li>Arc welder</li>
<li>Drill press</li>
<li>Pipe cutter</li>
<li>Graduated drill bit for cutting 1/2 steel</li>
</ul>
<hr />
<h2 id="building-the-strandbeest-bike">Building the strandbeest bike</h2>
<p>The first step was to create a 3D model of the bike.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest_bike_3D_model2.jpg" alt="" /></p>
<hr />
<p>Here’s the strandbeest mechanism. We just need four of these on the back of a bike.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/Strandbeest-Walking-Animation.gif" alt="" /></p>
<hr />
<p>Framing a test leg out of wood.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/jp-fitting-wooden-leg.jpg" alt="" /></p>
<hr />
<p>Lots of 1/2 inch steel to be used for building legs.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/raw-steel.jpg" alt="" /></p>
<hr />
<p>Creating leg parts out of steel.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/21-bracket-cutting.jpg" alt="" /></p>
<hr />
<p>Fabricating some sides of a leg.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/leg-part-sides.jpg" alt="" /></p>
<hr />
<p>Leg parts waiting to be assembled.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-leg-parts-sorted.jpg" alt="" /></p>
<hr />
<p>Curved bars were used to avoid getting in the way of the crankshaft.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-leg-assembly-curved-bar.jpg" alt="" /></p>
<hr />
<p>Frame construction.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/25-frame-construction.jpg" alt="" /></p>
<hr />
<p>Creating the crankshaft.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/26-crankshaft-more-welding.jpg" alt="" /></p>
<hr />
<p>Parts waiting for assembly.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-parts-set-up-for-assembly.jpg" alt="" /></p>
<hr />
<h2 id="assembling-the-strandbeest-bike">Assembling the strandbeest bike</h2>
<p>Sorting the parts</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/nFIwtX1wF-w" frameborder="0" allowfullscreen=""></iframe>
<p>Building the legs (part 1 of 2)</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/2jjiput_nok" frameborder="0" allowfullscreen=""></iframe>
<p>Building the legs (part 2 of 2)</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/pXLGaVCF370" frameborder="0" allowfullscreen=""></iframe>
<p>Attaching the legs and crankshaft to the frame</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/TO_TPZACS3Y" frameborder="0" allowfullscreen=""></iframe>
<p>Attaching the frame to the bike.</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/2I_YZfY0iJ8" frameborder="0" allowfullscreen=""></iframe>
<hr />
<h2 id="riding-the-strandbeest-bike">Riding the strandbeest bike</h2>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest-on-display.jpg" alt="" /></p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/YKAxKrxFys8" frameborder="0" allowfullscreen=""></iframe>Some friends and I built a strandbeest bike that’s part bike, part strandbeest. Here it is on an episode of Tosh.o.Data Engineers need to be better at Systems Design2019-04-30T09:48:55+00:002019-04-30T09:48:55+00:00https://iblaine.github.io/data-engineers-need-to-be-better-at-system-design<p>Systems Design in Data Engineering is becoming increasingly important. As the data industry becomes increasingly complex, the cost to build frameworks is increasing, as well as the need for good systems design skills. These examples have been taken from LinkedIn, Airbnb, and Chegg.</p>
<p><strong>First, some horror stories to demonstrate when Data Engineering goes wrong.</strong></p>
<p><strong>2,000 line ETL job in Python</strong></p>
<p>This ETL job was written entirely in Python to ingest data, transform it, and write to several tables. It was difficult to maintain and understand, and it had slowly grown out of control over the years; no one wanted to touch it for fear of permanently breaking it. To avoid this problem, the process should have stuck to the principle that a single process should have a single purpose.</p>
<p><strong>The magical ETL tool</strong></p>
<p>This ETL tool was written in house with good intentions but eventually had to be replaced. This tool could ingest an event stream, structure unstructured data into automatically generated tables, then create pseudo facts and dimensions. It was impressively feature rich but also complex and too closely coupled with a specific database. It began to fail as the needs of the business outgrew the primary goals of the tool. To avoid this problem, it should have been split into multiple services.</p>
<p><strong>Using the wrong tool for the wrong job</strong></p>
<p>This data pipeline was feature rich and built with too many tools. It included a complex tech stack of Informatica, Informatica Cloud, SQL, MS SQL stored procedures & PowerShell. Informatica was used only as a dependency manager, which was a red flag. The original developers were under pressure to deliver; they picked technologies they were unfamiliar with, then used them incorrectly. The framework created facts with attributes and dimensions with measures. Technologies were used for the wrong reasons and design patterns were broken.</p>
<p><strong>How could these problems have been avoided?</strong></p>
<ul>
<li>The Single Responsibility Principle should be followed for nodes within data pipelines</li>
  <li>ETL processes should be idempotent (see the sketch after this list)</li>
<li>Scope creep should be recognized as it happens</li>
<li>Avoid committing code in bulk. Small commits speed up the process to find problems and make deployments easier to digest.</li>
</ul>
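<p>On idempotency, a minimal example (table and column names are assumed for illustration): re-running the load for the same day replaces that day’s rows instead of appending duplicates.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

def load_daily_events(conn: sqlite3.Connection, ds: str, rows: list) -> None:
    with conn:  # one transaction: the delete and insert succeed or fail together
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany("INSERT INTO events (ds, payload) VALUES (?, ?)", [(ds, r) for r in rows])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, payload TEXT)")
load_daily_events(conn, "2019-04-30", ["a", "b"])
load_daily_events(conn, "2019-04-30", ["a", "b"])  # rerun: still 2 rows, not 4
</code></pre></div></div>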
<p><strong>Data Engineering is now Software Engineering</strong></p>
<p>Data Warehouse Architectures used to be built from the offerings of a few dozen large software vendors like SAP, Business Objects, Cognos, Informatica, Oracle and Teradata. Frameworks were the software you purchased, with Data Engineers building scripts to take care of the long tail of requirements. Today’s industry is more complex. There are hundreds of solutions to pick from between Amazon, Google, Apache and others. Companies build their own frameworks from scratch and tailor them to the tools they prefer. Scripts are no longer enough to successfully use today’s tools. Data Engineers need to be proficient at Systems Design.</p>
<p><strong>How can Data Engineers make the shift to think like Software Engineers?</strong></p>
<ul>
<li>For systems design, brush up using an online course in Software Design and Architecture</li>
<li>For industry knowledge, listen to podcasts and read blogs such as Data Engineering Weekly and SF Data Weekly</li>
  <li>For general coding, practice on LeetCode. In algorithms, Data Engineers should be able to solve all easy challenges, most medium challenges, and some hard challenges.</li>
</ul>
<p>To be fair, some of these problems listed above were caused by me, fixed by me or refactored by me. No one wants to learn from mistakes but it’s better than making the same mistake twice.</p>
<p><a href="https://www.linkedin.com/pulse/data-engineers-need-better-systems-design-blaine-elliott/">https://www.linkedin.com/pulse/data-engineers-need-better-systems-design-blaine-elliott/</a></p>blaineSystems Design in Data Engineering is becoming increasingly important. As the data industry becomes increasingly complex, the cost to build frameworks is increasing, as well as the need for good systems design skills. These examples have been taken from LinkedIn, Airbnb, and Chegg.Defining the role of a Data Engineer2019-02-05T09:25:38+00:002019-02-05T09:25:38+00:00https://iblaine.github.io/defining-the-role-of-a-data-engineer<p>The role of the Data Engineer has evolved rapidly over the past few years and created a lot of confusion. Some people with the title Data Engineer spend most their time writing SQL queries while others are creating databases from scratch. This leads to a lot of confusion for Engineering and Recruiting departments. Companies may not know when to shift responsibilities to Data Engineers and Recruiting departments may have trouble creating job postings. I’m going to try to clear that up in this post.</p>
<p>Data Engineering was born out of the need to have a Business Intelligence Analyst with most of the skills of a Software Engineer. The market used to be dominated by a few large companies and the scope of technologies was relatively narrow. Since then, the industry has been infused with many different types of technologies, which has greatly increased the scope of responsibilities. Today’s Data Engineer has most of the skills of a Software Engineer with additional skills to create solutions specific to data.</p>
<p><strong>Examples of tasks for Data Engineers</strong></p>
<ul>
<li>Administer AWS EC2 instances</li>
<li>Build dimensional data marts per Kimball or Inmon</li>
<li>Build tools for managing data in Python, PHP, Ruby or Go</li>
<li>Containerize applications and services using docker, swarm or kubernetes</li>
<li>Create backup and recovery plans</li>
<li>Create CI/CD processes to manage deployments</li>
<li>Create data modeling diagrams using Visio, ERwin or LucidCharts</li>
<li>Create data pipelines for dependency management in Airflow, Gobblin, Luigi, Nifi, or Jenkins</li>
  <li>Create ELT jobs using Spark, Hive, Python, Scala, SQL, Java, Pig or PL/SQL</li>
<li>Create ETL jobs using SSIS, Informatica, Talend or Pentaho</li>
<li>Create MapReduce functions in Java</li>
<li>Create REST APIs in Python or Java to ingest data from third party tools such as SalesForce or Zuora</li>
<li>Create reports in Tableau, Qlik or Looker</li>
<li>Deploy streaming tools such as Kafka, Flink, Streamsets, Beam or AWS Kinesis</li>
  <li>Deploy and optimize… distributed databases such as Hadoop, Spark, Druid or ElasticSearch</li>
<li>Deploy and optimize… MPP databases such as Redshift, Teradata, or Snowflake</li>
<li>Deploy and optimize… NewSQL databases such as Rockset or TiDB</li>
<li>Deploy and optimize… NoSQL databases such as MongoDB, HBase or Cassandra</li>
<li>Deploy and optimize… RDBMS databases such as MySQL, Postgres, SQL Server, Oracle</li>
<li>Design and develop data security policies</li>
<li>Manage AI/ML modeling data pipelines in Pandas, Sci-kit Learn, or PySpark</li>
  <li>Migrate services between Amazon AWS, the Google Cloud Platform and on-prem solutions</li>
<li>Support end users with a large range of technical skills, from data scientists to management</li>
<li>Track down data quality issues</li>
<li>Train companies and departments on best practices related to databases and data platforms</li>
</ul>
<p>All these tasks above can fall under the scope of Data Engineering. This is a problem because all these tasks together involve a wide range of skills. It’s unrealistic to expect Data Engineers to be experts in all of these tasks. Part of the solution is that Data Engineers must be very good at learning new technologies. Another part is to create categories of Data Engineers, and we’re starting to see that today. Titles like Data Infrastructure Engineer and Data Modeling Engineer now exist. Here is a list of common specialties within Data Engineering:</p>
<p><strong>Data Engineer</strong></p>
<p>Primary purpose is to cover Data Engineering needs for the entire company. Typically comfortable with many skills but not a master of any one of them. Should understand cloud services, databases, ETL, data pipelines, SQL, at least one object-oriented language, and unix. Must be a quick learner because this type of role has you moving in many directions. Usually found at small companies or those with new Data Engineering departments.</p>
<p><strong>Sr Data Engineer</strong></p>
<p>All the skills of a Data Engineer plus the ability to lead projects that impact a single department. Comfortable with ‘Big Data’ concepts. Can contribute to existing frameworks.</p>
<p><strong>Staff Data Engineer</strong></p>
<p>All the skills of a Sr Data Engineer plus the ability to lead projects that impact the entire company. Has the ability to create frameworks from scratch. Comfortable making Build vs Buy decisions plus defining and executing roadmaps for large projects.</p>
<p><strong>Data Architect</strong></p>
<p>Primary purpose is to act as a technical leader for all things related to data. Typically provides guidance for projects, does not have direct reports and leads data governance efforts. Should have all the skills of a Staff Data Engineer. Given a project with a dozen Software Engineers and a few Data Engineers, there will be one Data Architect who provides guidance for data related technology decisions. This role is commonly found at large companies and consulting companies.</p>
<p><strong>Data Engineer - Analytics</strong></p>
<p>Primary purpose is to deliver analytical reports and analysis. Should have a good understanding of reporting tools like Tableau, Qlik and Looker. Should be able to optimize queries and databases. This role shares a lot in common with a Business Intelligence Analyst.</p>
<p><strong>Data Engineer - Data Pipelines / ETL</strong></p>
<p>Primary purpose is to create and manage data pipelines. Should be able to create their own tools for managing data pipelines or deploy tools like Airflow and Informatica. Has a good understanding of design patterns for batch processing and stream processing.</p>
<p><strong>Data Engineer - Infrastructure</strong></p>
<p>Primary purpose is to create and manage infrastructure projects. Should be able to create tools and build applications. Has a good understanding of OOP, open source projects and cloud services. Many open source projects are spawned by these types of Data Engineers.</p>
<p><strong>Data Engineer - Modeling</strong></p>
<p>Primary purpose is to create and manage dimensional models. Should be able to debate Kimball and Inmon design patterns, and has an in-depth understanding of database optimization.</p>
<p><strong>The future of Data Engineering roles</strong></p>
<p>For the future, we know that data is not going to go away; it is only going to grow. Technology will keep adjusting to handle the challenges that come with big data, both on the infrastructure side and the application/tooling side. Today, we are seeing large growth in the Data Engineer - Infrastructure role to keep up with the explosion of big data applications under Apache, AWS and GCP. Data must be accessible before it can be applied, and we need people to help bridge the gap from raw data to applied data; those roles may be classified as Data Engineers, Data Scientists, or ML Engineers.</p>
<p>Thank you Cindy Rottinghuis for your contributions and editing.</p>
<p>#dataengineering #dataengineer</p>
<p><a href="https://www.linkedin.com/pulse/defining-role-data-engineer-blaine-elliott/">https://www.linkedin.com/pulse/defining-role-data-engineer-blaine-elliott/</a></p>blaineThe role of the Data Engineer has evolved rapidly over the past few years and created a lot of confusion. Some people with the title Data Engineer spend most their time writing SQL queries while others are creating databases from scratch. This leads to a lot of confusion for Engineering and Recruiting departments. Companies may not know when to shift responsibilities to Data Engineers and Recruiting departments may have trouble creating job postings. I’m going to try to clear that up in this post.