Happy New Year, everyone! It's 2022 and time to make our resolutions for the year. Even though 4 is not a Fibonacci number, nor is it a prime number, it is the sum of 2+2, which seemed to be an appropriate number of resolutions for 2022.
So here they are: a few things I wish we could all resolve to do in 2022 to improve our data quality and make our data engineers less stressed out.
1) Always Use Version Control
Do you have 1000 different copies of the same spreadsheet between Slack, email, and your local hard drive? Do you ever wonder if you're looking at the most current version or if someone has been updating another version? Do you ever have to manually reconcile differences between independently updated versions?
These are all symptoms of an immature data architecture.
To begin your metamorphosis into an organization with more data maturity, embrace version control and editing access rights. Whether you're in the Google ecosystem, the Microsoft ecosystem, or something else, you can do both with very little friction. If you keep the same name for the file instead of changing it to v.1, v.2, etc., your document storage system should automatically track changes. To access a previous version, you typically just need to right-click the file and open its version history. Right-clicking will also bring up a menu for adjusting access privileges (e.g. admin, edit, read only). Next time you go to drop a file into your Slack channel, post a link instead and save everyone headaches in 2022!
2) Implement More Data Validation
I can't tell you how many times I've dealt with an organization's messy data that should never have been in the state it was in. The bare minimum for data validation can happen with a little JavaScript on the frontend and with the constraints built into SQL databases on the backend. If you have a field called street number, you could set it in both the frontend and the backend to accept only numbers, which would prevent someone from accidentally entering both the street number and the street name. That process works until the address is 222 1/2 Example Street.
The next level of validating a street address would be to call an address validation API (like the free USPS API) at the point of entry, and even to implement autocomplete to help the person entering the data get it right before it's staged for entry into your database.
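As a rough sketch of the backend half of that rule, the snippet below uses Python's built-in sqlite3 module with a CHECK constraint that accepts only digits. The table name `addresses`, its columns, and the helper `try_insert` are hypothetical names chosen for illustration; a real schema and database engine would look different.

```python
import sqlite3

# Hypothetical schema for illustration: street_number may contain digits only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE addresses (
        street_number TEXT NOT NULL
            CHECK (street_number <> '' AND street_number NOT GLOB '*[^0-9]*'),
        street_name TEXT NOT NULL
    )
    """
)


def try_insert(street_number: str, street_name: str) -> bool:
    """Return True if the row was stored, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO addresses (street_number, street_name) VALUES (?, ?)",
                (street_number, street_name),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the CHECK (or NOT NULL) constraint fails.
        return False


print(try_insert("222", "Example Street"))                 # digits only
print(try_insert("222 Example Street", "Example Street"))  # number and name mixed together
print(try_insert("222 1/2", "Example Street"))             # a real address that is not digits only
```

The last call is the edge case from the paragraph above: a digits-only rule keeps the street name out of the number field, but it also turns away a legitimate address like 222 1/2.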
I chose this example because everyone has encountered it on an e-commerce website and it is easy to relate to, but there are organizations that don't do this. Even if you're an organization that does implement something like it, think about how you can improve your forms in a similar way to ensure the data you're getting is clean, correct, and not in need of interpretation on the way in. If you leave any ambiguity in the form, people will find all of the edge cases you didn't think about.
3) Be Kind to the Data Engineers
If data is the new oil, then data engineers are the refineries that turn crude oil into petroleum products like the gasoline that makes your Lamborghini of an organization run. Analysts turn the data into pretty pictures and make it digestible to managers, so they get the love, but the people who build the foundation of good, clean data that populates a data warehouse or a data lake are often forgotten in the process. Managers don't see a database, but they do see a report built from one. In 2022, let's try to appreciate the work involved in turning crude oil into a refined product!
4) Manage Expectations
Self-driving vehicles have been one year away for years now, but they're still not here; at least not fully autonomous vehicles. Still, we'd all agree that cars are far more capable of driving themselves with human support than they were a decade ago, and you don't even need a Tesla to see the benefits of adaptive cruise control or lane guidance systems.
That's exactly the point we need to emphasize in 2022. If a person is able to forecast annual demand with mediocre accuracy, that person would likely keep their job even though the forecast is far from perfect; however, a machine learning model that performs with the same accuracy would most likely be scrapped. This is true even though the person's predictions are unlikely to be consistent from period to period, while the machine learning model will likely perform fairly similarly when kept fresh with new data and could potentially improve through feature engineering.
If it performs as well as or better than a human, that's a win! If you expect perfection, you'll be missing many opportunities for gradual improvement along the way.
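One way to make that comparison concrete is to score the human and the model with the same error metric. The sketch below uses mean absolute percentage error (MAPE). The demand figures are invented purely for illustration, and `mape` is a hypothetical helper, so treat this as a template rather than a result.

```python
from statistics import mean


def mape(actual: list[float], forecast: list[float]) -> float:
    """Mean absolute percentage error, in percent. Lower is better."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return 100 * mean(abs(a - f) / abs(a) for a, f in zip(actual, forecast))


# Invented numbers, for illustration only.
actual_demand = [100.0, 120.0, 80.0, 110.0]
human_forecast = [90.0, 135.0, 95.0, 100.0]
model_forecast = [92.0, 130.0, 90.0, 104.0]

human_error = mape(actual_demand, human_forecast)
model_error = mape(actual_demand, model_forecast)

print(f"human MAPE: {human_error:.1f}%")
print(f"model MAPE: {model_error:.1f}%")
print("model is at least as good as the human:", model_error <= human_error)
```

Holding both forecasters to the same yardstick is the whole idea: the question becomes whether the model matches or beats the human baseline, not whether it is perfect.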
(Side note: I don't believe we'll ever have fully autonomous cars without 1) communication between infrastructure and vehicles, and 2) no more human interference.)