<h1>Analyzing Who’s Hiring Data on HackerNews with Interactive Python</h1>
<p><em>2021-07-01</em></p>
<p>HackerNews has a monthly <a href="https://news.ycombinator.com/item?id=27699704">thread</a> where employers can post job listings. <a href="https://code.visualstudio.com/docs/python/jupyter-support-py">Interactive Mode in Visual Studio Code</a> is something I recently discovered that puts a notebook inside the editor. Here, I’m going to put those two things together and use Interactive Mode to lightly analyze job postings on HN.</p>
<p>The Python notebook is on GitHub: <a href="https://github.com/iblaine/hn-whoshiring-analysis">github.com/iblaine/hn-whoshiring-analysis</a>.</p>
<p><img src="/assets/images/projects/hn-analysis-walkthrough.gif" alt="" /></p>
<p><strong>Some questions we will be able to answer with this analysis…</strong></p>
<ul>
<li>How has the popularity of “Data Engineer” changed over time?</li>
<li>How has “remote” factored into job descriptions since covid?</li>
<li>What companies have been posting the most to HN Who’s Hiring threads over time?</li>
</ul>
<p><strong>Interactive Mode in Visual Studio Code</strong><br />
<a href="https://code.visualstudio.com/docs/python/jupyter-support-py">Interactive Mode in Visual Studio Code</a> is a notebook built into your IDE. A line with<br />
<code class="language-plaintext highlighter-rouge"># %%</code><br />
creates a new cell. That’s it. Each <code class="language-plaintext highlighter-rouge"># %%</code> marks a cell that can be run individually or stepped through in the debugger.</p>
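<p>For example, here is a minimal two-cell file (the data is a made-up placeholder):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># %% first cell: build some placeholder data
import pandas as pd

df = pd.DataFrame({"keyword": ["remote", "data engineer"], "cnt_total": [100, 40]})

# %% second cell: can be run or debugged independently of the one above
print(df.head())
</code></pre></div></div>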
<p><strong>Rough requirements for this analysis…</strong></p>
<ul>
  <li>Collect every HN Who’s Hiring post from Jan 2013 to July 2021 (103 HN threads & ~67,493 job postings).</li>
  <li>Collect the header of each item (its first line) and parse it into company_name, location, job_title, salary.</li>
  <li>Search the full text for keywords, counting both whether a keyword appears in an item and how often it appears.</li>
</ul>
<p><strong>Definitions</strong></p>
<ul>
  <li>post = An HN Who’s Hiring post, created on the first of every month.</li>
<li>item = When a user creates a new message in a post, we call that an item. A post contains many items.</li>
<li>cell = A block of code created with <code class="language-plaintext highlighter-rouge"># %%</code>.</li>
  <li>cnt_total = 1 if the keyword is found in an item.</li>
  <li>cnt_unique = the number of times a keyword is found in an item (see the counting sketch after this list).</li>
</ul>
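<p>To make those last two definitions concrete, here’s a minimal counting sketch (the function name and case-insensitive matching are my assumptions, not necessarily what the notebook does):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def count_keyword(item_text: str, keyword: str) -> tuple:
    occurrences = item_text.lower().count(keyword.lower())
    cnt_total = 1 if occurrences > 0 else 0  # found in this item at all, per the definition above
    cnt_unique = occurrences                 # every occurrence in this item, per the definition above
    return cnt_total, cnt_unique

# "remote" appears twice in this item: cnt_total == 1, cnt_unique == 2
print(count_keyword("Remote OK. Fully remote team.", "remote"))
</code></pre></div></div>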
<p><strong>General process for this notebook</strong></p>
<ol>
  <li>Collect every HN Who’s Hiring post. I use Google to search for what is hopefully the correct HN link, take the first result, and verify it. Selenium is used for the scraping.</li>
  <li>For every post on HN, get every item in that post. This is time consuming: a GET request for every item not already loaded.</li>
<li>Parse collected data into a dict.</li>
<li>Save dict to a file on disk.</li>
  <li>Denormalize data into a pandas dataframe (sketched after this list).</li>
<li>Analyze data.</li>
</ol>
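<p>As a rough sketch of steps 3–5 (the field names are illustrative assumptions, not the notebook’s actual schema):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Steps 3/4 produce something shaped roughly like this, keyed by month:
posts = {
    "2021-07": [
        {"company_name": "Acme", "text": "Data Engineer | Remote | $150k"},
        {"company_name": "Initech", "text": "Backend Engineer, NYC"},
    ],
}

# Step 5: denormalize into one row per item.
rows = [{"month": month, **item} for month, items in posts.items() for item in items]
df = pd.DataFrame(rows)
print(df)
</code></pre></div></div>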
<p><strong>Notes for this notebook</strong></p>
<ul>
  <li>There are 2 types of cells in this notebook: those for testing, marked <code class="language-plaintext highlighter-rouge"># skip / debug</code>, and the rest, which do the actual work.</li>
<li>I added <code class="language-plaintext highlighter-rouge">sys.exit(1)</code> to avoid running every cell all at once. Feel free to remove that as needed.</li>
<li>Data was collected from 2013-01-01 to 2021-07-01.</li>
<li>Google search results are used to find historical HN posts because that seemed convenient at the time.</li>
</ul>
<p><strong>Analysis…</strong></p>
<ul>
<li>Number of items in an HN Who’s Hiring post by month.
<img src="/assets/images/projects/graph-hnwhoshiring-by-month.png" alt="" /></li>
<li>How has the popularity of “Data Engineer” changed over time?
<img src="/assets/images/projects/graph-dataengineer-by-month.png" alt="" /></li>
<li>How has “remote” factored into job descriptions since covid?
<img src="/assets/images/projects/graph-remote-by-month.png" alt="" /></li>
<li>Which companies post the most?
Threads for: <a href="https://news.ycombinator.com/item?id=5304169">March 2013</a> & <a href="https://news.ycombinator.com/item?id=27699704">July 2021</a>
<img src="/assets/images/projects/data-top-posters.png" alt="" /></li>
</ul>
<p><strong>Conclusions</strong></p>
<ul>
  <li>Interactive Mode in Visual Studio Code is a nice tool. Probably not ideal for enterprise work, but it’s convenient and easy to use for quick analysis.</li>
  <li>“remote” is showing up more frequently post-covid. No real surprise there, but interesting nonetheless.</li>
  <li>Around 2018, Data Engineering became a real job, and its popularity keeps increasing.</li>
  <li>Apple is a frequent poster while other FAANG companies are not (the dirty source data may be a root cause).</li>
</ul>

<h1>The Anatomy of a Covid Appointment Bot</h1>
<p><em>2021-06-15</em></p>
<p>When covid vaccine appointments were initially rolled out, they were difficult to book, requiring users to navigate confusing forms, refresh pages, and pray for openings. To ease the frustration for nearby friends & family, I created a Python script to automate the steps to find covid appointments. Things snowballed from there into this Twitter bot. The bot ran for 3 months, generated 12,349 tweets and 106,353 clicks, and cost $1.25/day.</p>
<p><strong>TL;DR:</strong> I created a Twitter bot to help people find available covid appointments. It served a purpose by minimizing the pain of finding appointments, but keeping the searches accurate required constant maintenance.</p>
<p><img src="/assets/images/projects/covid-vaxster-the-cure.png" alt="" /></p>
<p>The general business logic to find a covid appointment was relatively simple. Look for a covid appointment, collect available appointment data, then post a tweet if necessary.
<img src="/assets/images/projects/covid-vaxster-logic.png" alt="" /></p>
<p><strong>Technical Notes</strong></p>
<ul>
<li>A persistent Amazon EC2 instance was used.</li>
  <li>Terraform was used to simplify deployments and to give me an opportunity to learn it. Initially I thought I’d spin up a container for each search, which would have removed most limits on scaling, but I never got beyond searching ~50 clinics at a time.</li>
<li>Searching was done using Python + <a href="https://github.com/SeleniumHQ/selenium/">selenium</a>.</li>
<li>A poor man’s CI/CD was implemented such that development was done on my mac, pushed to git, then the EC2 instance pulled down changes every hour.
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sys import platform
prod = True if platform == "linux" else False # dev = my mac, prod = AWS
if prod == False:
<use mac libraries, see chrome windows>
elif prod == True:
<use EC2 libraries, headless chrome>
</code></pre></div> </div>
<p>This tiny feature made my life much easier, so I’m pointing it out here. There was a constant need to debug scraping logic and this hack cut down the effort needed to deploy the many incremental updates.</p>
</li>
  <li>Clinics were added via YAML using the following format (a loader sketch follows this list)…
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state:
california: # Name of the state
santa_barbara: # Name of the county
costco: # Name of the clinic
twitter_name: "Costco" # Name to use in a tweet
status: "active" # Can be active/inactive
url: "https://book.appointment-plus.com/d133yng2/#/book-appointment/select-a-location?_qk=lbvsv9hv4u" # Target URL to begin a search
cuttly_url: "https://cutt.ly/0vwl8qL" # URL to use in a tweet
city: "santa_maria" # Name of city where clinic is located
data: # Information to help the selenium actions
id: 477 # Misc info needed for this particular clinic
selectEmployeeButton: 1097 # Misc info needed for this particular clinic
</code></pre></div> </div>
<p>In this case we’re searching for a Costco clinic. A posted tweet would look something like…
<a href="https://twitter.com/CovidVaxster/status/1396707760612560897"><img src="/assets/images/projects/covid-vaxster-tweet.png" alt="" /></a></p>
</li>
  <li>The Twitter API was used to expose search results via <a href="https://twitter.com/covidvaxster">@CovidVaxster</a>. More on why I settled on Twitter instead of a website below…</li>
<li>My original idea was to present the data on a website, but the reality is this bot was difficult to scale. Daily maintenance was often required, sometimes several hours at a time. I instead focused my efforts on creating accurate searches for the 3 nearest counties in my area: San Luis Obispo, Santa Barbara, & Ventura.</li>
  <li>Several clinics required pagination to collect the number of available appointment slots, which gave users some sense of how quickly they needed to book. Where possible I added logic to paginate through sites, though that also added to maintenance costs.</li>
  <li>Traffic monitoring was done through <a href="https://cutt.ly/">cutt.ly</a>. If a user clicked a link in a tweet, then <a href="https://cutt.ly/">cutt.ly</a> recorded a click. This way I could get a sense of whether or not the tweets were being used; they were clicked at least 100k times, though based on UserAgent analysis roughly 40% of that traffic appears to have been bots.</li>
  <li>Twitter tags were used to search for appointments by city. Search for <a href="https://twitter.com/search?q=%23vaxster%20%23santabarbara">“#vaxster #santabarbara”</a> and you’ll see the most recent available appointments, BUT Twitter’s spam filter has removed most tweets. The tags did reference all tweets at one point… <code class="language-plaintext highlighter-rouge">¯\_(ツ)_/¯</code></li>
</ul>
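<p>For illustration, here is how config.yml might be loaded into a flat list of active clinics (a sketch using PyYAML; the nesting follows the sample above, but the real loader may differ):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import yaml  # PyYAML

with open("config.yml") as f:
    config = yaml.safe_load(f)

active_clinics = []
for state_name, counties in config["state"].items():
    for county_name, clinics in counties.items():
        for clinic_name, clinic in clinics.items():
            if clinic["status"] == "active":
                active_clinics.append(
                    {"state": state_name, "county": county_name, "clinic": clinic_name, **clinic}
                )
</code></pre></div></div>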
<p><strong>Vaxster Code</strong></p>
<p>Vaxster code to manage the search logic, plus helper code to manage Terraform & Twitter…</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Vaxster code
.
├── clinic.py Manages generic business logic
├── clinic_libs Collection of selenium scripts designed to navigate individual clinics
│ ├── __init__.py
│ ├── costco.py Selenium steps to check costco
│ ├── cvs.py Selenium steps to check cvs
│ ├── goleta_valley_cottage_frame.py Selenium steps to check a local hospital v1
│ ├── goleta_valley_cottage_v1.py Selenium steps to check a local hospital v2, needed after the hospital refactored some links
│ ├── mhealthcheckin_v1.py Selenium steps to check albertsons/ralphs/savon v1
│ ├── mhealthcheckin_v2.py Selenium steps to check albertsons/ralphs/savon v2, needed after mhealthcheckin refactored their UI
│ ├── santa_barbara_medcenter.py Selenium steps to check a local clinic in Santa Barbara
│ ├── slocounty_ca_gov.py Selenium steps to check San Luis Obispo
│ └── walgreens.py Selenium steps to check walgreens
├── config.yml YAML file for clinic information
├── county.py Manages counties, forks parallel selenium threads
├── requirements.txt
├── state.py Intended to manage multiple states but I never managed to move beyond California
└── vaxster.py Mostly reads YAML settings then kicks off states.py
Helper code
.
├── collect_stats.py Collect click stats from cutt.ly, post results to twitter
├── terraform
│ ├── main.tf
│ ├── outputs.tf
│ ├── provider.tf
│ ├── user_data.tpl Commands to set up a new EC2 instance: add AWS creds, install chrome driver, clone git repo, etc.
│ └── variables.tf
└── twitter_helper.py Create tweets using the twitter API
</code></pre></div></div>
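<p>Since county.py forks parallel selenium threads, here is a minimal sketch of that fan-out with concurrent.futures (an assumption about the mechanism, not the actual implementation; each worker must create its own selenium driver because drivers are not thread-safe):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from concurrent.futures import ThreadPoolExecutor

def search_clinic(clinic: dict) -> list:
    """Placeholder for one of the clinic_libs selenium scripts."""
    return []  # open appointment slots found for this clinic

def search_county(clinics: list, max_workers: int = 5) -> list:
    # Fan out one search per clinic; in practice ~50 clinics at a time was the ceiling.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search_clinic, clinics))
    return [slot for found in results for slot in found]
</code></pre></div></div>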
<p><strong>The evolution of clinic searches</strong></p>
<p>Clinics went through some not so obvious changes over time, where demand and supply fluctuated dramatically. Here’s a brief description of how that went.</p>
<ul>
<li><u>Initial Rollout</u><br />
January 2021 - February 15th, 2021<br />
Clinics began to allow the public to book their own appointments. Period of high demand with limited vaccine appointment availability. This period probably marked the point where this bot delivered the most value. There was little information on how to book an appointment, let alone find availability.</li>
<li><u>The Winter Storm</u><br />
February 15th, 2021 - Feb 28th, 2021<br />
During this period, a <a href="https://en.wikipedia.org/wiki/February_13%E2%80%9317,_2021_North_American_winter_storm">massive winter storm</a> hit the midwest and impacted the vaccine supply chain. Many existing appointments were cancelled and new appointments were difficult to find. I disabled most searches during this time because supply was so low.</li>
<li><u>High Availability / High Demand</u><br />
March 2021 - April 2021<br />
This period saw a lot of activity: appointments were widely available and in high demand. Supply chain problems slowly went away and clinics made needed improvements to their booking processes, which also meant constant script updates were needed to keep up with website changes.</li>
<li><u>High Availability / Low Demand</u><br />
This period marked the beginning of the end of the need for a vaccine bot; vaccine supply chain problems were mostly solved by now. At one point activity dropped significantly for a few days because several clinics updated their websites at around the same time. As clinic booking workflows grew in complexity, keeping up with the necessary updates became increasingly difficult and costly.</li>
</ul>
<p><img src="/assets/images/projects/covid-vaxster-timeline.png" alt="" /></p>
<p><strong>Maintenance</strong></p>
<p>Maintenance was the biggest surprise. At best, every new clinic was a few lines in a YAML file. At worst, every new clinic was a new Python class. To expand on that, here’s a generalization of the different types of clinics…</p>
<ul>
<li><u>Required Login or Captcha</u><br />
This included any clinic that required a login or a captcha to search for appointments. I excluded these clinics so as to ‘do no evil’ on the interwebz. Any clinic requiring a captcha to search appointments was not meant to be searched.</li>
<li><u>Nationwide Clinics</u><br />
This included CVS, Ralphs, Albertsons, & RiteAid. The nice thing about these clinics was once a selenium script was created for one clinic, the logic could be repurposed for many clinics.</li>
<li><u>Hospital Chains</u><br />
Local hospital chains were pretty valuable: they had consistently reliable appointment information.</li>
<li><u>Urgent Care Clinics</u><br />
This included privately owned urgent care clinics. The supply on these clinics was very low, but consistent. They were hidden gems of appointments that showed up for a few days at a time.</li>
<li><u>Mass Vaccination Sites</u><br />
This includes one-off mass vaccination sites. Their web interfaces were bespoke: I could put a few hours into scraping one, get a lot of appointments for a short time, then never be able to reuse that code again.</li>
</ul>
<p>Keeping searches running and delivering accurate appointment data was an ongoing chore. Part of me thinks the government dropped the ball by not rolling out a nationwide covid appointment booking website, but that now seems like a near-impossible task.</p>
<p><strong>Lessons learned</strong></p>
<ul>
  <li>Scaling a search engine across clinics is brutally difficult: every clinic, county, and state may have its own way of managing covid appointments.</li>
<li>Terraform is convenient, but figuring out VPCs, SGs, and IAM roles is a chore.</li>
<li>Twitter is full of spam. 40% of my clicks seem to have been generated by bots.</li>
</ul>

<h1>Detecting Data Anomalies In Your Data</h1>
<p><em>2020-08-09</em></p>
<p>This project is still in an alpha phase… and I do hope it gets to a point where it can be shared, used, poked at, prodded, and improved.</p>
<p><strong>What?</strong>
The Data Anomaly Detection Tool (“DAD Tool”) is a tool that can assess data for abnormalities at scale.</p>
<p><strong>Why?</strong>
Newly generated data can be difficult to trust. Every situation is unique and some organizations are going to struggle more than others. Generally speaking, an organization that consumes a lot of unstructured data is going to have a more challenging time trusting the data it consumes. With that in mind, we want tools that can improve our ability to trust our data.</p>
<p><strong>How?</strong>
The DAD Tool was created using Python, Flask, and AWS… for the most part.</p>
<p><strong>What else?</strong>
Here’s a deck that was created to describe the tool: <a href="https://www.slideshare.net/iblaine/using-airflow-for-tools-development">https://www.slideshare.net/iblaine/using-airflow-for-tools-development</a></p>
<p>What’s interesting about the DAD Tool is how the architecture uses lightweight Python classes for the math behind the statistics, takes advantage of databases to do the heavy lifting of transporting data, and uses Airflow as part of the backend. All that is a longer conversation, but basically the DAD Tool aggregates statistical requirements, leverages Airflow as an outsourced orchestration tool, and simplifies the execution of tests for users. With a few clicks, new tests can be added to assess every column of every table in a schema.</p>

<h1>Strandbeest Bike</h1>
<p><em>2019-12-19</em></p>
<p>Some friends and I built a strandbeest bike that’s part bike, part strandbeest. Here it is on an episode of Tosh.o.</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/hxeOGIpN41k" frameborder="0" allowfullscreen=""></iframe>
<p>FWIW, I did not get permission to put this video up on YouTube, but Tosh.o didn’t seek permission to put our video up on their TV show so… we are even.</p>
<p>Here is the bike on display at an exhibit at the Exploratorium in San Francisco for a strandbeest event.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest_bike_exploratorium.jpg" alt="The Strandbeest bike at the Exploratorium" /></p>
<h4 id="answers-to-common-questions">Answers to common questions…</h4>
<p>Q. Why not put legs on the front?</p>
<p>A. Putting legs on the front would probably prevent it from turning. Plus the idea is to make it half bike, half legs. It needs at least one wheel.</p>
<p>Q. Is it fast?</p>
<p>A. No</p>
<p>Q. Can it climb hills?</p>
<p>A. Probably not. It needs a level surface with as few curves as possible.</p>
<hr />
<h2 id="inventory-to-get-started">Inventory to get started</h2>
<p>Materials</p>
<ul>
<li>160′ of 1/2″ steel pipe</li>
<li>3′ of 5/8″ cold rolled steel rod</li>
<li>10’x1″ of 1/8″ steel plate</li>
<li>120 bushings (part# MYI-05-05 from igus.com)</li>
</ul>
<p>Equipment</p>
<ul>
<li>Arc welder</li>
<li>Drill press</li>
<li>Pipe cutter</li>
<li>Graduated drill bit for cutting 1/2 steel</li>
</ul>
<hr />
<h2 id="building-the-strandbeest-bike">Building the strandbeest bike</h2>
<p>The first step was to create a 3D model of the bike.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest_bike_3D_model2.jpg" alt="" /></p>
<hr />
<p>Here’s the strandbeest mechanism. We just need four of these on the back of a bike.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/Strandbeest-Walking-Animation.gif" alt="" /></p>
<hr />
<p>Framing a test leg out of wood.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/jp-fitting-wooden-leg.jpg" alt="" /></p>
<hr />
<p>Lots of 1/2 inch steel to be used for building legs.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/raw-steel.jpg" alt="" /></p>
<hr />
<p>Creating leg parts out of steel.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/21-bracket-cutting.jpg" alt="" /></p>
<hr />
<p>Fabricating some sides of a leg.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/leg-part-sides.jpg" alt="" /></p>
<hr />
<p>Leg parts waiting to be assembled.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-leg-parts-sorted.jpg" alt="" /></p>
<hr />
<p>Curved bars were used to avoid getting in the way of the crankshaft.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-leg-assembly-curved-bar.jpg" alt="" /></p>
<hr />
<p>Frame construction.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/25-frame-construction.jpg" alt="" /></p>
<hr />
<p>Creating the crankshaft.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/26-crankshaft-more-welding.jpg" alt="" /></p>
<hr />
<p>Parts waiting for assembly.</p>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/23-parts-set-up-for-assembly.jpg" alt="" /></p>
<hr />
<h2 id="assembling-the-strandbeest-bike">Assembling the strandbeest bike</h2>
<p>Sorting the parts</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/nFIwtX1wF-w" frameborder="0" allowfullscreen=""></iframe>
<p>Building the legs (part 1 of 2)</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/2jjiput_nok" frameborder="0" allowfullscreen=""></iframe>
<p>Building the legs (part 2 of 2)</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/pXLGaVCF370" frameborder="0" allowfullscreen=""></iframe>
<p>Attaching the legs and crankshaft to the frame</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/TO_TPZACS3Y" frameborder="0" allowfullscreen=""></iframe>
<p>Attaching the frame to the bike.</p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/2I_YZfY0iJ8" frameborder="0" allowfullscreen=""></iframe>
<hr />
<h2 id="riding-the-strandbeest-bike">Riding the strandbeest bike</h2>
<p><img src="https://raw.githubusercontent.com/iblaine/iblaine.github.io/master/assets/images/projects/strandbeest-on-display.jpg" alt="" /></p>
<iframe width="560" height="310" src="https://www.youtube.com/embed/YKAxKrxFys8" frameborder="0" allowfullscreen=""></iframe>Some friends and I built a strandbeest bike that’s part bike, part strandbeest. Here it is on an episode of Tosh.o.Data Engineers need to be better at Systems Design2019-04-30T09:48:55+00:002019-04-30T09:48:55+00:00https://iblaine.github.io/data-engineers-need-to-be-better-at-system-design<p>Systems Design in Data Engineering is becoming increasingly important. As the data industry becomes increasingly complex, the cost to build frameworks is increasing, as well as the need for good systems design skills. These examples have been taken from LinkedIn, Airbnb, and Chegg.</p>
<p><strong>First, some horror stories to demonstrate when Data Engineering goes wrong.</strong></p>
<p><strong>2,000 line ETL job in Python</strong></p>
<p>This ETL job was written entirely in Python to ingest data, transform it, and write to several tables. It was difficult to maintain and understand, and it had slowly grown out of control over the years; no one wanted to touch it for fear of permanently breaking it. To avoid this problem, the process should have stuck to the principle that a single process should have a single purpose.</p>
<p><strong>The magical ETL tool</strong></p>
<p>This ETL tool was written in house with good intentions but eventually had to be replaced. This tool could ingest an event stream, structure unstructured data into automatically generated tables, then create pseudo facts and dimensions. It was impressively feature rich but also complex and too closely coupled with a specific database. It began to fail as the needs of the business outgrew the primary goals of the tool. To avoid this problem, it should have been split into multiple services.</p>
<p><strong>Using the wrong tool for the wrong job</strong></p>
<p>This data pipeline was feature rich and built with too many tools. It included a complex tech stack of Informatica, Informatica Cloud, SQL, MS SQL stored procedures & PowerShell. Informatica was used only as a dependency manager, which was a red flag. The original developers were under pressure to deliver; they picked technologies they were unfamiliar with, then used them incorrectly. The framework created facts with attributes and dimensions with measures. Technologies were used for the wrong reasons and design patterns were broken.</p>
<p><strong>How could these problems have been avoided?</strong></p>
<ul>
<li>The Single Responsibility Principle should be followed for nodes within data pipelines</li>
  <li>ETL processes should be idempotent (see the sketch after this list)</li>
<li>Scope creep should be recognized as it happens</li>
<li>Avoid committing code in bulk. Small commits speed up the process to find problems and make deployments easier to digest.</li>
</ul>
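<p>On idempotency, a minimal example (table and column names are assumed for illustration): re-running the load for the same day replaces that day’s rows instead of appending duplicates.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sqlite3

def load_daily_events(conn: sqlite3.Connection, ds: str, rows: list) -> None:
    with conn:  # one transaction: the delete and insert succeed or fail together
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany("INSERT INTO events (ds, payload) VALUES (?, ?)", [(ds, r) for r in rows])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, payload TEXT)")
load_daily_events(conn, "2019-04-30", ["a", "b"])
load_daily_events(conn, "2019-04-30", ["a", "b"])  # rerun: still 2 rows, not 4
</code></pre></div></div>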
<p><strong>Data Engineering is now Software Engineering</strong></p>
<p>Data Warehouse Architectures used to be built from the offerings of a few dozen large software vendors like SAP, Business Objects, Cognos, Informatica, Oracle and Teradata. Frameworks were the software you purchased, with Data Engineers building scripts to take care of the long tail of requirements. Today’s industry is more complex. There are hundreds of solutions to pick from between Amazon, Google, Apache and others. Companies build their own frameworks from scratch and tailor them to the tools they prefer. Scripts are no longer enough to successfully use today’s tools. Data Engineers need to be proficient at Systems Design.</p>
<p><strong>How can Data Engineers make the shift to think like Software Engineers?</strong></p>
<ul>
<li>For systems design, brush up using an online course in Software Design and Architecture</li>
<li>For industry knowledge, listen to podcasts and read blogs such as Data Engineering Weekly and SF Data Weekly</li>
  <li>For general coding, practice on LeetCode. In algorithms, Data Engineers should be able to solve all easy challenges, most medium challenges, and some hard challenges.</li>
</ul>
<p>To be fair, some of these problems listed above were caused by me, fixed by me or refactored by me. No one wants to learn from mistakes but it’s better than making the same mistake twice.</p>
<p><a href="https://www.linkedin.com/pulse/data-engineers-need-better-systems-design-blaine-elliott/">https://www.linkedin.com/pulse/data-engineers-need-better-systems-design-blaine-elliott/</a></p>blaineSystems Design in Data Engineering is becoming increasingly important. As the data industry becomes increasingly complex, the cost to build frameworks is increasing, as well as the need for good systems design skills. These examples have been taken from LinkedIn, Airbnb, and Chegg.Defining the role of a Data Engineer2019-02-05T09:25:38+00:002019-02-05T09:25:38+00:00https://iblaine.github.io/defining-the-role-of-a-data-engineer<p>The role of the Data Engineer has evolved rapidly over the past few years and created a lot of confusion. Some people with the title Data Engineer spend most their time writing SQL queries while others are creating databases from scratch. This leads to a lot of confusion for Engineering and Recruiting departments. Companies may not know when to shift responsibilities to Data Engineers and Recruiting departments may have trouble creating job postings. I’m going to try to clear that up in this post.</p>
<p>Data Engineering was born out of the need to have a Business Intelligence Analyst with most of the skills of a Software Engineer. The market used to be dominated by a few large companies and the scope of technologies was relatively narrow. Since then, the industry has been infused with many different types of technologies, which has greatly increased the scope of responsibilities. Today’s Data Engineer has most of the skills of a Software Engineer with additional skills to create solutions specific to data.</p>
<p><strong>Examples of tasks for Data Engineers</strong></p>
<ul>
<li>Administer AWS EC2 instances</li>
<li>Build dimensional data marts per Kimball or Inmon</li>
<li>Build tools for managing data in Python, PHP, Ruby or Go</li>
<li>Containerize applications and services using docker, swarm or kubernetes</li>
<li>Create backup and recovery plans</li>
<li>Create CI/CD processes to manage deployments</li>
<li>Create data modeling diagrams using Visio, ERwin or LucidCharts</li>
<li>Create data pipelines for dependency management in Airflow, Gobblin, Luigi, Nifi, or Jenkins</li>
  <li>Create ELT jobs using Spark, Hive, Python, Scala, SQL, Java, Pig or PL/SQL</li>
<li>Create ETL jobs using SSIS, Informatica, Talend or Pentaho</li>
<li>Create MapReduce functions in Java</li>
<li>Create REST APIs in Python or Java to ingest data from third party tools such as SalesForce or Zuora</li>
<li>Create reports in Tableau, Qlik or Looker</li>
<li>Deploy streaming tools such as Kafka, Flink, Streamsets, Beam or AWS Kinesis</li>
  <li>Deploy and optimize… distributed databases such as Hadoop, Spark, Druid or ElasticSearch</li>
<li>Deploy and optimize… MPP databases such as Redshift, Teradata, or Snowflake</li>
<li>Deploy and optimize… NewSQL databases such as Rockset or TiDB</li>
<li>Deploy and optimize… NoSQL databases such as MongoDB, HBase or Cassandra</li>
<li>Deploy and optimize… RDBMS databases such as MySQL, Postgres, SQL Server, Oracle</li>
<li>Design and develop data security policies</li>
<li>Manage AI/ML modeling data pipelines in Pandas, Sci-kit Learn, or PySpark</li>
  <li>Migrate services between Amazon AWS, the Google Cloud Platform and on-prem solutions</li>
<li>Support end users with a large range of technical skills, from data scientists to management</li>
<li>Track down data quality issues</li>
<li>Train companies and departments on best practices related to databases and data platforms</li>
</ul>
<p>All these tasks above can fall under the scope of Data Engineering. This is a problem because all these tasks together involve a wide range of skills. It’s unrealistic to expect Data Engineers to be experts in all of these tasks. Part of the solution is that Data Engineers must be very good at learning new technologies. Another part is to create categories of Data Engineers, and we’re starting to see that today. Titles like Data Infrastructure Engineer and Data Modeling Engineer now exist. Here is a list of common specialties within Data Engineering:</p>
<p><strong>Data Engineer</strong></p>
<p>Primary purpose is to cover Data Engineering needs for the entire company. Typically comfortable with many skills but not a master of any one of them. Should understand cloud services, databases, ETL, data pipelines, SQL, at least one object-oriented language, and unix. Must be a quick learner because this type of role has you moving in many directions. Usually found at small companies or those with new Data Engineering departments.</p>
<p><strong>Sr Data Engineer</strong></p>
<p>All the skills of a Data Engineer plus the ability to lead projects that impact a single department. Comfortable with ‘Big Data’ concepts. Can contribute to existing frameworks.</p>
<p><strong>Staff Data Engineer</strong></p>
<p>All the skills of a Sr Data Engineer plus the ability to lead projects that impact the entire company. Has the ability to create frameworks from scratch. Comfortable making Build vs Buy decisions plus defining and executing roadmaps for large projects.</p>
<p><strong>Data Architect</strong></p>
<p>Primary purpose is to act as a technical leader for all things related to data. Typically provides guidance for projects, does not have direct reports and leads data governance efforts. Should have all the skills of a Staff Data Engineer. Given a project with a dozen Software Engineers and a few Data Engineers, there will be one Data Architect who provides guidance for data related technology decisions. This role is commonly found at large companies and consulting companies.</p>
<p><strong>Data Engineer - Analytics</strong></p>
<p>Primary purpose is to deliver analytical reports and analysis. Should have a good understanding of reporting tools like Tableau, Qlik and Looker. Should be able to optimize queries and databases. This role shares a lot in common with a Business Intelligence Analyst.</p>
<p><strong>Data Engineer - Data Pipelines / ETL</strong></p>
<p>Primary purpose is to create and manage data pipelines. Should be able to create their own tools for managing data pipelines or deploy tools like Airflow and Informatica. Has a good understanding of design patterns for batch processing and stream processing.</p>
<p><strong>Data Engineer - Infrastructure</strong></p>
<p>Primary purpose is to create and manage infrastructure projects. Should be able to create tools and build applications. Has a good understanding of OOP, open source projects and cloud services. Many open source projects are spawned by these types of Data Engineers.</p>
<p><strong>Data Engineer - Modeling</strong></p>
<p>Primary purpose is to create and manage dimensional models. Should be able to debate Kimball and Inmon design patterns, and has an in-depth understanding of database optimization.</p>
<p><strong>The future of Data Engineering roles</strong></p>
<p>For the future, we know that data is not going to go away; it is only going to grow. Technology will keep adjusting to handle the challenges that come with big data, both on the infrastructure side and the application/tooling side. Today, we are seeing large growth in the Data Engineer - Infrastructure role to keep up with the explosion of big data applications under Apache, AWS and GCP. Data must be accessible before it can be applied, and we need people to help bridge the gap from raw data to applied data; those roles may be classified as Data Engineers, Data Scientists, or ML Engineers.</p>
<p>Thank you Cindy Rottinghuis for your contributions and editing.</p>
<p>#dataengineering #dataengineer</p>
<p><a href="https://www.linkedin.com/pulse/defining-role-data-engineer-blaine-elliott/">https://www.linkedin.com/pulse/defining-role-data-engineer-blaine-elliott/</a></p>blaineThe role of the Data Engineer has evolved rapidly over the past few years and created a lot of confusion. Some people with the title Data Engineer spend most their time writing SQL queries while others are creating databases from scratch. This leads to a lot of confusion for Engineering and Recruiting departments. Companies may not know when to shift responsibilities to Data Engineers and Recruiting departments may have trouble creating job postings. I’m going to try to clear that up in this post.