The SQL Language is Just a Hurdle You Gotta Overcome.

T-SQL
32 Comments

I’m not talking just about Microsoft SQL Server specifically here, nor T-SQL. Let’s zoom out a little and think bigger picture for a second: is the SQL language itself a problem?

Sometimes when I talk to client developers, they gripe about the antiquated language.

The order of a SELECT statement doesn’t make any sense. You shouldn’t state what you’re looking for, before you even say where you wanna get the data from. The FROM should really go first so that query-completion tools like IntelliSense have a fighting chance to help you write the SELECT part. If we started writing our queries like this:
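(A rough sketch: this FROM-first syntax is hypothetical, and the table and columns are made up for illustration.)

FROM dbo.Users AS u
WHERE u.Reputation > 100000
SELECT u.Id, u.DisplayName, u.Reputation
ORDER BY u.Reputation DESC;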

Then as you started typing stuff in the SELECT, you could actually get useful stuff out of IntelliSense. How many times have you started typing a query, and query completion tools start throwing all kinds of system functions at you? Idiotic.

Exception handling is a painful mess. Let’s be honest here: the majority of stored procedures and functions out there don’t have error handling. They YOLO their way through the data, hoping and praying that things are as we expect, we have the right permissions, structures haven’t changed, and the data is in a useful state. Everybody looks the other way and mumbles, “We’ll handle errors on the application side,” when in reality those errors are either thrown directly at the innocent user, or simply suppressed and not logged anywhere.

It’s not really a standard. Oh sure, SELECT/FROM/WHERE/ORDER BY works in most databases, but even trivially simple applications break if you try to port them from one database management system to another. Your skills transfer in a similar way: even if you’re great at T-SQL exception handling, you’re still gonna have to tweak the way you do it in Postgres. The concepts are standard, but the specifics are different.

Unit testing is a pipe dream. App code developers know if their code changes will break something. Database developers just punt their stuff into development, run the query a few times, nod because no errors get thrown, and then toss it into production. When code breaks weeks or months later, all we hear is, “Nothing’s been changed.”

So why haven’t we moved on past SQL?

In some ways, we have, with object-relational mapping (ORM) tools like Entity Framework, Hibernate, and Django. The database administrator readers here in the audience usually cringe when they hear those words, but the reality is that developers leverage those tools heavily to build new applications. I don’t blame them. I would too, for all the reasons I talked about above.

What those tools do is translate your desires into SQL, though, which brings us right back where we started. Often, the SQL they generate sucks for performance, thus the typical DBA’s feelings about ORMs. So why haven’t we got a new standard way for applications to talk directly to databases, in a secure, performant, and easy-to-write way?

It’s not for lack of trying: at least once every 6 months, I see a post on HackerNews about a better replacement for SQL. Someone puts a lot of thought into the problems, puts a lot of work into a replacement, and then proudly announces it.

And nobody uses it.

Because SQL is the lowest common denominator that works damn near everywhere, for values of “works.”

It works on the back end. Remember when NoSQL came out, and everybody was all “databases r doomd”? And remember what business users said when they wanted to run their reports? NoSQL persistence layers pretty quickly changed their tune, saying, “Oh, well, uh, we meant Not Only SQL, that’s what we meant,” as they struggled to quickly slap in SQL compatibility. Even MongoDB, king of NoSQL, implemented SQL support.

It works on the front end, especially the reporting front end, which is what managers care about. The people who sign the checks wanna see their data in Power BI and Excel. Every new reporting tool that comes out, in order to check boxes and say they’re compatible with every database, implements SQL support. Oh sure, these tools write horrific queries, but they check the box to say they can get data out of all your different persistence layers, and they do it with SQL first because it’s cheap and easy to support lots of databases that way.

I’ll leave you with an amusing quote from Bjarne Stroustrup:

There are only two kinds of languages: the ones people complain about and the ones nobody uses.

32 Comments

  • Interesting list of quotes you linked!

    I think this one from Bjarne also applies to SQL –
    “Stability is an important feature for a language used for systems that have to work for decades”

    Most SQL code written decades ago still works, and that’s important too. To me anyways!

  • 50 years of SQL. Stable, reliable… it works!
    Long life to SQL.

  • Michael McCormick
    June 25, 2024 4:59 pm

    I always start with Select * From dbo.ThatDamnedTableWith149Columns, then delete the ‘*’ and start typing in column names.

  • Brian Leach
    June 25, 2024 5:05 pm

    It is a mistake to conflate T-SQL with SQL. In Oracle’s PL/SQL, error handling is trivial, not the nightmare it is in T-SQL. I started with PL/SQL, and in every non-trivial procedure I log the errors and usually re-raise them. When I needed to write some code in T-SQL, I found it was pretty much impossible to do the same thing.

    • I am somewhat confused about what is supposed to be that difficult about logging errors in TSQL. Whatever limitations it has, I typically have more problems with developers mistaking the database for an application and going way beyond logging an error and maintaining data consistency.

      • Brian Leach
        June 25, 2024 6:27 pm

        Keith,

        Thank you for your comment. In Oracle PL/SQL you have an exception clause that can catch multiple specific errors, or anything not specific. I use that to call an autonomous transaction procedure that logs the error.
        It may be my ignorance, but it appears to me that in T-SQL
        1) Different exceptions have to be captured and handled in different ways.
        2) I have not yet found a method to call an autonomous transaction to log the error. I do have a kluge where I call a procedure through a linked server to log the messages, but it is not nearly as easy and satisfactory as Oracle PL/SQL:

        Procedure xyz (parameter1 in varchar2, parameter2 in varchar2)
        as
        begin
          ....
        exception
          when no_data_found then
            null; -- ignore the exception
          when too_many_rows then
            ... do something here
          when others then
            -- anything else, log the error and re-raise it.
            make_log_error_entry(p_message => labels('p_parameter1', p_parameter1, 'p_parameter2', p_parameter2));
            raise;
        end;

        make_log_error_entry is an autonomous transaction that logs the message, the error, and the call stack. This greatly simplifies locating and defeating the error.

        I have not been successful at replicating this in T-SQL. If someone knows of a way, please direct me to some documentation as I could really use this.

        • Dave Wentzel
          June 26, 2024 3:19 am

          No, you are right.

          There are a ton of places where PL/SQL is head/shoulders above TSQL. Autonomous transactions are another (I have a big proc with inline logging and I need to rollback the entire proc, yet maintain the log). But this is nit-picking, imo. I did studies on this years ago with folks that were essentially n00bs. In order, these were the easiest to learn/code-for-common-use-cases:
          * plpgsql
          * tsql
          * mysql
          * pl/sql

          The fact is, it takes far more lines of code to do very simple things in pl/sql than in tsql. Start with “returning result sets from a ‘procedure'”. That’s not exactly native in pl/sql.
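          (For instance, the usual workaround is an OUT ref cursor; a minimal sketch with invented names, versus T-SQL where a bare SELECT inside a proc is enough:)

          create or replace procedure get_users (p_results out sys_refcursor)
          as
          begin
            open p_results for
              select id, display_name from users;
          end;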

          Point is: we can pull out one-offs to prove one is better/easier than another, but the real metric, IMHO, is being able to learn the language quickly, be productive quickly, and cover the most common use cases.

          • Brian Leach
            June 26, 2024 4:18 pm

            Hey Dave,

            Don’t want to get into a “which is better” battle, but just so you know, PL/SQL has evolved. If you want to return result sets, you use a collection or a temp table. If you need to return multiple result sets, use a collection of collections. Collections are as easy to implement as T-SQL temp tables, with their biggest disadvantage being they are in memory, so I use temp tables instead for large result sets.

            I have written code mostly in PL/SQL, but quite a bit in T-SQL, and found advantages to both languages. I personally find that in PL/SQL it takes less code to do something rather than more code as your studies found. That is my personal experience, I would not expand that experience to a generality.

            As for easy to learn, my background started with BASIC on a BASIC 4 system, FORTRAN IV, PL1, SAS, C++ (yes, before I did any C) and various scripting languages such as DOS, BASH, and CSH. I found PL/SQL to be very easy to learn, most likely due to my extensive background in a variety of languages. I struggled learning T-SQL, and to this day I don’t do what I consider adequate error handling.

        • Dave Wentzel
          June 26, 2024 3:51 am

          btw, since you asked, the “pattern” in tsql to do an autonomous transaction is to put all of the data you wish to NOT rollback (so the logs in your case) into a TABLE VARIABLE (@tablevar). Then, at the end, persist the @tablevar to your log table or whatever. @tablevars are just tsql variables and do not participate in transaction mgmt semantics.

          It’s a kludgey hack, but works.
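
          A minimal sketch of that pattern (the log table name and the work itself are invented for illustration):

          DECLARE @Log TABLE (LoggedAt datetime2 NOT NULL DEFAULT SYSDATETIME(),
                              Msg      nvarchar(4000) NOT NULL);
          BEGIN TRY
              BEGIN TRAN;
              INSERT INTO @Log (Msg) VALUES (N'step 1 starting');
              -- ... the real work goes here ...
              COMMIT;
          END TRY
          BEGIN CATCH
              INSERT INTO @Log (Msg) VALUES (ERROR_MESSAGE());
              IF @@TRANCOUNT > 0
                  ROLLBACK;  -- undoes the work, but not the rows already in @Log
              INSERT INTO dbo.ErrorLog (LoggedAt, Msg)
                  SELECT LoggedAt, Msg FROM @Log;  -- persist the log after the rollback
              THROW;  -- re-raise the original error
          END CATCH;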

        • Dave Wentzel
          June 26, 2024 3:57 am

          The “linked server trick” you reference is another way but requires the linked server to NOT participate in DTC transactions. Way back in the day b4 tsql had SEQUENCES and you needed to create your own “sequence generator” this was the trick. Within a proc, or even just a multi-statement call from odbc/jdbc, you could use the “linked server trick” to get a block of “sequences” and be sure that getting those sequences within a long running transaction didn’t block your “sequence table” for other callers.
          But that was decades ago and we have proper sequences now.
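
          These days that is just a sequence; a minimal sketch (sequence name invented), where values handed out by NEXT VALUE FOR don't block other callers and aren't given back on rollback:

          CREATE SEQUENCE dbo.OrderNumbers AS bigint START WITH 1 INCREMENT BY 1;
          DECLARE @n bigint = NEXT VALUE FOR dbo.OrderNumbers;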

        • I was legitimately curious. The data atrocities I’ve had to deal with, committed with only RAISERROR and TRY/CATCH available, have already been bad. The autonomous transaction seems to be the biggest strength in error handling. In the SQL world, though, whenever SQL makes something easier, it seems to just give more agency to create problems to users who shouldn’t even be allowed to own a computer, let alone an account with DDL permissions (let’s be honest, probably sysadmin, which IT, compliance, and the DBAs said they couldn’t have, but the Cxx insisted on providing it).

  • Since I have retired from database administration I have spent a lot of time reading papers, books and articles written by C. J. Date and Hugh Darwen back in the 80’s and 90’s on various aspects of SQL, relational theory, NULLs, and 3VL. I would recommend anyone seriously programming in SQL to do the same. I am particularly interested in their proposed language called Tutorial-D. Check out Date’s various books particularly ‘Relational Theory For Computer Professionals’ and ‘SQL and Relational Theory: How To Write Accurate SQL Code’. In my last few years in the industry I saw too many developers that were more or less just SQL ‘hacks’. When I suggested using a correlated subquery I got way too many deer-in-headlights looks from them and they thought the only way to tune a query was to add another index. Sadly this was more the case where we were using SQL Server. The developers I knew working with Oracle were much more advanced on their knowledge of and skills with SQL.

    • dave wentzel
      June 26, 2024 3:22 am

      I agree with that…but see my other comment above. pl/sql developers _in toto_ are anecdotally more “advanced” than tsql developers, but getting to “advanced” takes far more time.

      Just my 2 cents.

      • Dave Wentzel
        June 26, 2024 4:31 am

        One last point, it’s common to conflate “sql” with relational theory. SQL was never meant to adhere to relational theory. There is one particular blogger, Fabian Pascal, a CJ Date fanatic, who constantly conflates these. He complains _ad nauseam_ about how SQL-the-language sucks and how every RDBMS doesn’t adhere to relational theory.

        The thing is…he’s NOT WRONG. But he _assumes_ the goal of every RDBMS and every physical data modeler is purity to relational theory. That’s a lofty goal. Essentially, he wants the _logical data model_ to be the same as _the physical data model_. Fact is, every RDBMS carries a lot of baggage from the 1970s where shortcuts needed to be made to make the work _efficient_. Hence why the physical model diverges from the logical.

        Similarly, relational DBAs and other “sql folks” bitch and moan about NoSQL. “You can’t do JOINs efficiently” and “it violates 3NF everywhere.” Yep. That’s the whole point. If you are designing a 3NF implementation on NoSQL, well, YOU ARE DOING IT WRONG. NoSQL is meant (among a few other things), to be super-efficient in point-based lookups. So…the data model MUST take some liberties with relational theory, even more than most RDBMSs. NoSQL is NOT a replacement for an RDBMS, and if you think it is, YOU ARE DOING IT WRONG. It’s meant to make it easier to comprehend for a very small set of use cases.

        Pascal also complains about star schemas b/c they take liberties with proper relational modeling (3NF). Yep. And we do that, as we all know, because it performs BETTER for analytics. But you’d never do a star schema for OLTP. All of these things are _optimization hacks_.

        BTW, this is why the star schema is slowly dying. Those optimization hacks aren’t needed as much today as in the past. Data lakes (and their semantics) make the modeling much simpler. Bill Inmon, the Father of Data Warehousing, has a free ebook on databricks’ website where he goes over his Corporate Information Factory concepts and how his thinking has changed with the advent of data lakes and faster/cheaper compute. It’s a good read just to see how his thinking has evolved. Similarly, this is why OLAP cubing tech is dying…another _optimization hack_ that may not be needed in 2024. The same can be said of memory-resident/optimized storage structures like SAP HANA and “hekaton”. With technologies like Spark I don’t necessarily need to keep entire datasets resident in RAM to get the performance I need.

        But I’m digressing…

        The goal of SQL, star schemas, and NoSQL isn’t TRULY for performance, but really it’s to make certain things easier for folks to comprehend _for the given use case_. But as performance is being solved with better hardware, we’re finding we need less folks with PhDs in data modeling. ie, we are bringing these things to the masses. “Solve the business problem in the best way you can.” Yep, the result is likely a horrendous data model that annoys purists, but meanwhile the rest of us just want to generate business value.

        Tutorial-D is an awesome language _if you are a purist_. But the reality is…it never caught on because nobody really cares about purity, they care about _gittin stuff done_. Tutorial-D’s learning curve is just too steep for mere mortals…IMO.

        • Koen Verbeeck
          June 26, 2024 8:05 am

          Ugh, Fabian Pascal, is that dick still around?
          As you mentioned, NoSQL is great for certain use cases. But I wouldn’t use it for financial transactions 🙂

          Not sure the star schema is slowly dying. Sure, you don’t really need it for performance optimizations anymore, but it has other uses: simplicity for end users, centralizing business logic in one fact table etc. When using Power BI, star schemas are still the de facto standard.

          • Dave Wentzel
            June 26, 2024 10:03 pm

            hahaha. Yeah, he’s still around.

            I think the star schema is dying insofar as being needed for performance optimizations. Similar to OLAP cubes. For understandability, mosdef it makes a lot of sense. However, I’d posit data lake semantics are even obviating the understandability need too. For example…thought exercise…say it takes you 1 hour to develop some code to load some data without SCD type 2 into a dimension. How much time would it take you to make that code SCD2-aware? With BIML maybe it’s a checkbox activity but most ETL developers (when I’ve polled them) will say something like “for every 1 hour needed to code a dimension I need another 3-4 hours to get SCD2 working right”. And how many reports really need anything other than IsActive=1? Not many. So it’s the Pareto Principle (80% of the work covers 20% of the use cases). AND…SCD2 confuses the hell out of most business users. Data lake semantics allow you to totally skip SCD2 code and allows you to get the history, when it’s needed, via AS OF TIMESTAMP semantics (for example). We do a lot of studies on this and the conclusion is “if you don’t need SCD2 right now, don’t do it, we can always generate it later from the lake”. This saves a ton of time and confusion. I’m not suggesting that star schemas are dead, but rather the number of use cases is shrinking.
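
            For example, with a Delta table in the lake (table name invented), the history comes back on demand instead of being modeled up front:

            SELECT * FROM dim_customer TIMESTAMP AS OF '2024-01-01';  -- the dimension as it looked on that date
            SELECT * FROM dim_customer VERSION AS OF 42;              -- or as of a specific table version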

            Said differently, and to Brent’s point of this post…you gotta learn some SQL if you wanna use data. If we can simplify the models the SQL skills become attainable for the masses. For most of these users they really aren’t full-time star schema developers and dashboard creators. Rather, they are trying to answer questions that are squishy and likely don’t need to go on a dashboard. Things like, “We spend a ton on facebook ad analytics. I wonder what the rough ROI is and would it get better or worse if we spend a different amount.” These are the types of questions that can be answered with simplistic models and sql code. We aren’t looking for exact answers and certainly wouldn’t spend any time getting the data into a star schema.

  • Dave Wentzel
    June 26, 2024 3:40 am

    This is an important post and thank you Brent for writing it.

    In my role I am an evangelist for data democratization, self-service analytics, etc. The biggest “complaint” I hear is “my people can’t learn sql”. Well, that’s hogwash.

    What they really mean is “they can’t learn SQL as it is implemented in all current DBMSs”.

    The fact is, you really don’t need to know EVERYTHING about SQL to be productive. The real problem isn’t pl/sql vs tsql or understanding error handling or transaction mgmt or whatever. What most folks need is a basic grounding of filtering, joins, and grouping. Window functions, subqueries, etc…possibly that isn’t _needed_ for the average organization to get data into the hands of the masses to provide organizational _lift_.

    The fact is, ZERO of the RDBMSs do a good job of making this easier for the masses.

    Enter Spark.

    I show folks how easy SQL is by using Spark, and we walk through a use case. Get data, clean up data, join data to one table, check results, join to next table, check results, repeat, then pepper in your business logic accordingly.

    When I show this code to a SQL Server DBA the response is: “wait, you’ve effectively joined 32 tables together one-by-one and that will NEVER perform since it all goes to temp tables.” WRONG. In Spark, no work is “done” until you take an “action”…then all of the intermediate queries (the 32 table joins to 32 temp tables) are optimized into one giant query and executed…just like a SQL Server DBA would do it.

    Said differently, imagine a gnarly stored proc with lots of temp table processing. What does the optimizer do? Basically nothing, it executes each code block sequentially. Imagine, instead, if the optimizer took a few secs and said, “wait, I see what you’re doing, I can optimize this for you by eliminating a lot of the temp tables.”

    This makes SQL “knowledge” attainable for the masses. We do small operations, then check our work. In ALL CURRENT DBMSs we would need to have this giant single SELECT statement with LEFT JOINs, filters on the ON vs WHERE (which affect results, right?), GROUP BYs…how do you QA this? How do you understand this if you aren’t a “tsql developer”? How do you even begin to THINK about unit testing it?

    That is the true beauty of Spark that ALMOST NO ONE (including Spark experts) understands and appreciates. (Add to that the ability to write python interspersed with SQL where both operate on the same result sets and, whoa, that’s some power).

    If the DBMS vendors added this one simple “feature” to their language, it would make learning SQL so much easier for the masses. Basically, allow n00bs to write inefficient code littered with temp tables and, instead of running the SQL code procedurally (line by line, top-down, just like C# code), optimize it FOR THE USER by eliminating the unnecessary temp tables.
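
    A rough sketch of that style in Spark SQL (table names invented): each step just registers a temp view, and nothing actually runs until the final query.

    CREATE OR REPLACE TEMP VIEW step1 AS
        SELECT o.OrderId, o.CustomerId, o.Amount FROM orders AS o;          -- instant: nothing executes yet
    CREATE OR REPLACE TEMP VIEW step2 AS
        SELECT s.*, c.Region FROM step1 AS s
        JOIN customers AS c ON c.CustomerId = s.CustomerId;                 -- still instant; check it with a quick SELECT if you like
    CREATE OR REPLACE TEMP VIEW step3 AS
        SELECT Region, SUM(Amount) AS Total FROM step2 GROUP BY Region;     -- still instant
    SELECT * FROM step3;                                                    -- the "action": the whole chain is collapsed into one optimized plan and run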

    Interested in your thoughts

    • andrew wait
      June 28, 2024 6:33 am

      @dave wentzel

      I was really interested in your comments about spark and performance, especially ‘in Spark, no work is “done” until you take an “action” ‘ and that this is little understood/exploited

      I’ve tried leveraging google to investigate further but would be really interested if you could point to worked examples/blog posts/stack overflow etc. where this is elaborated further

      • dave wentzel
        July 1, 2024 12:55 am

        Just do a google search on “spark transformations vs actions”. Here’s a brief example, imagine a billion row table:
        * convert all the unix-timestamp cols to datetimes. Runs instantly because this is a transformation
        * concatenate firstname with lastname. Runs instantly because this is a transformation.
        * filter out all rows older than 30 days. Runs instantly b/c this is a transformation.
        * select top 100 * or select count(*) …takes awhile b/c this is an action and the previous 3 transformations will be optimized, inlined, and then run.
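
        In Spark SQL terms (table and column names invented), that might look something like this:

        CREATE OR REPLACE TEMP VIEW t1 AS
            SELECT *, timestamp_seconds(unix_ts) AS event_time FROM big_table;     -- transformation: returns instantly
        CREATE OR REPLACE TEMP VIEW t2 AS
            SELECT *, concat(firstname, ' ', lastname) AS full_name FROM t1;       -- transformation: instant
        CREATE OR REPLACE TEMP VIEW t3 AS
            SELECT * FROM t2 WHERE event_time >= date_sub(current_date(), 30);     -- transformation: instant
        SELECT count(*) FROM t3;                                                   -- action: the three steps above are inlined, optimized, and run here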

        During development this is less noticeable for most users b/c after each transformation you are likely doing a select top 100 to make sure your code worked. But when this is done in production it is totally optimized and this means that transformations can be pipelined and are much easier to write, debug, test, etc than how every other RDBMS does it today.

        Think of it this way…have you ever seen truly “pythonic” code, or powershell code that is just a series of complex commands that are separated with pipe characters, or similar code in bash? While it is super-efficient, I often wonder if code like this is purposefully written by developers that merely want the next person to think how clever they are. Fact is, code like that is difficult to understand, unit test, debug. SQL code with lots of INNER and LEFT JOINs, derived tables, correlated subqueries, etc is the same thing.

        Writing code where each “line” can “do one thing and do it well” is, IMO, the right way to do it.

        • andrew wait
          July 1, 2024 7:23 am

          Thank you so much for that reply, really appreciate the time, the comments and the google pointer..

          … and I’m adding “Writing code where each “line” can “do one thing and do it well” is, IMO, the right way to do it.” to my list of maxims

  • My first thought on day 1 of learning SQL in my current job: Why are we telling it what to find before we tell it where to find it? And then: How did people who are so smart miss/ignore such a basic human thought process?

    These days, I write the FROM clause first since as Brent said this allows IntelliSense to help instead of just making everything look like garbage.

    • Dave Wentzel
      June 26, 2024 10:16 pm

      I disagree. The human thought process, expressed in English, is “show me this and get it from here”, which is imperative and closer to the english language and maps to SELECT before FROM. By telling it where to find it and then what you want I think you are thinking more like a procedural computer. In English, that would be “after getting the data from here, show me this”. Seems unnatural.

      But mosdef it absolutely sucks in 2024 with Intellisense. This is baggage from when SQL was called SEQUEL (the first E being “English”)

      • “Go to the grocery store and get me eggs, milk, and cheese.”

        Sounds perfectly normal to me.

        • More like ‘go to aisle 5 and get me eggs, go to aisle 12 and get me milk, go to aisle 14 and get me cheese’. I never made a list like that.

        • dave wentzel
          June 27, 2024 4:18 pm

          I’m not saying you’re wrong. I’m merely saying what the original SEQUEL devs were thinking (it’s in a whitepaper somewhere). Regardless, I don’t think your analogy holds. By saying “Go to the grocery store and” ….that’s not really SELECT, which is closer in analogy to “show me” or “I want”.

          So, using grocery store analogy…”I want eggs, milk, cheese from grocery store”. It’s closer to imperative than procedural.

          I don’t disagree with you, just telling you what people were thinking at the time.

        • Brian Leach
          June 27, 2024 4:25 pm

          I only wrote a single C# program, and that was 15 years ago, but I seem to recall a SQL interface called LINQ that structured queries this way. Because I had been writing SQL for over 10 years by that time, I felt more comfortable writing the database SQL directly.

      • I wouldn’t say unnatural (see Brent’s reply for a usual way of thinking about it), but I understand your thought process too. Using the shopping example, you would write out your shopping list and then think where to (select) it from – e.g. your eggs, milk, and cheese comes from Aldi, but you really need to go to Lidl for that handy chainsaw and scuba diving mask.

  • Francesco Mantovani
    June 27, 2024 8:36 am

    Would it be possible to create a plugin for IntelliSense that could achieve this goal?
    Something like this:
    – When you write FROM, it suggests a list of tables. All tables, actually.
    – When you choose a table, it suggests you use SELECT or JOIN.
    – If you use SELECT, the plugin, behind the scenes, writes “SELECT * from dbo.Posts”, but the user doesn’t see this. This allows it to show a list of columns.
    – If you use JOIN, the plugin shows you a list of tables that have foreign keys linked to that table.

    • Dave Wentzel
      June 27, 2024 4:29 pm

      You could do that, but frankly I can’t see any good reason where the order of the clauses EVER mattered. Am I missing something?
      * you must always have a FROM and SELECT (oracle uses the notion of the dual table for table-less selects….but whatever)
      * a HAVING needs a GROUP BY
      * blah blah blah
      The only time the order matters, that I can think of, is LEFT JOIN filter conditions (it matters if they are on the ON or on the WHERE).

      This would allow you to write your SQL query pretty much however your brain works. And if you want intellisense then just put the FROM clause first.

      Another pet peeve is I wish the vendors supported the NATURAL JOIN keyword (which lets you omit the ON clause when the join columns have matching names).
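
      To illustrate the ON-vs-WHERE point, a quick sketch with made-up tables, where the two queries return different results:

      -- Filter in the ON clause: every user comes back; badge columns are NULL where there's no match.
      SELECT u.Id, b.Name
      FROM dbo.Users AS u
      LEFT JOIN dbo.Badges AS b
        ON b.UserId = u.Id
       AND b.Name = N'Popular Question';

      -- Same filter in the WHERE clause: the NULL rows get filtered out,
      -- so this quietly behaves like an INNER JOIN and returns fewer rows.
      SELECT u.Id, b.Name
      FROM dbo.Users AS u
      LEFT JOIN dbo.Badges AS b
        ON b.UserId = u.Id
      WHERE b.Name = N'Popular Question';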

