> The key point is that SQLite is very forgiving of the type of data that you pu...

at_a_remove · on Oct 20, 2022

Me!

I'm often exploring data where either there's no defined standard or the use of the data has drifted from the standard. Now, I could go over this line-by-line, but instead my go-to has been "Hey, let's throw this into SQLite, then run stats on it!" See what shakes out. SQLite kindly and obediently takes all of this mystery data, which ends up being nothing like what I was told, and just accepting it. Then I can begin prodding it and see what is actually going on.

This is something that has come up for me for at least a decade: chuck it in SQLite, then figure out what the real standard is.

thfuran · on Oct 20, 2022

It's reasonable to have an arbitrary/untyped data type but an integer data type should not accept non-integers.

galaxyLogic · on Oct 20, 2022

Right what's the point of declaring a column to have type INTEGER if it really means "This column MAY contain integers, but may contain other kinds of data as well".

rzzzt · on Oct 20, 2022

Or union types. Just tack "...or a string" onto any questionable column.

muttled · on Oct 20, 2022

It's especially great for large important legacy datasets that similarly did no enforcement. You can then interact with the data and clean it up through some commands without trying to work with massive datasets in Excel where it sometimes just crashes the entire app.

rscho · on Oct 20, 2022

I do exactly the same. You import from CSV, everything ends up as a string, and then you copy the data to a STRICT table so you can understand what's really going on.

dekhn · on Oct 20, 2022

irreversible loss of information.

rscho · on Oct 20, 2022

What do you mean? I don't think this process ends up with less info than the CSV file format itself.

Short of doing that, cleaning very dirty data has no satisfactory solution, I think. Optional typing is a nice middle ground between untyped and slow (R, Python) or strictly typed and tedious (all other DB engines).

dekhn · on Oct 20, 2022

Maybe I misunderstand, but if you take a data type that is a union of two datatypes (string and int) and push it into somethign that converts all the data to strings, you lose the information of whether a string that looks like an int was originally a string or int.

Maybe instead you mean to use the database functionality to identify problems like that in the original source data. If somebody hands me a string "11101100101001" I would not attempt to interpret it by parsing it into a binary number, but I think that's because I really like strong, simple typing.

ratrocket · on Oct 20, 2022

I can't answer for the person you're commenting to, but I use sqlite in a similar way and the loss of information isn't an issue. When you're exploring data, just trying to make sense of what someone has sent you (say, migrating data from an existing system to a new system) you probably have a pretty good idea if the column is supposed to have integers or strings in it. In my own use of it, it's very unlikely that the loss of information will cause a problem. Eventually the data will be type-checked -- not in sqlite (I haven't experimented with the new-ish "strict tables" feature yet), but in my case the data will eventually flow through a typed programming language.

Anyway, different ways to use the tool, for sure! And in some cases one would definitely need to be attuned to the issue you're raising. In the kind of situation I'm thinking of the data is usually so dirty anyway, a bit of string->integer conversion won't hurt (probably).

dekhn · on Oct 20, 2022

I get it, you don't have typing control over your data exchange. somebody might send you an unquoted-string CSV with multiple data types in the same column:

id,location,name

1,90210,Tori Spelling

2,Schitt's Creek,Eugene Levy

and you use sqlite to quickly explore the loaded data to check the set of types used in a column, and maybe even glean what the meaning is.

ratrocket · on Oct 20, 2022

Exactly. The thing I'm talking about is necessarily a semi-manual process (data exploration, usually prior to migrating data from one system to another).

In the example you give, when you've done all the exploration you need, there's a program interpreting that CSV that ensures the location column values are strings. At least, that's how I do it!

rscho · on Oct 20, 2022

> Maybe instead you mean to use the database functionality to identify problems like that in the original source data.

Yes, that's exactly my use case.

at_a_remove · on Oct 20, 2022

Like I said earlier, I don't know what I am getting. Strong, simple typing is premature at this juncture.

dekhn · on Oct 20, 2022

Got it, it's data exploration not a formal ETL pipeline code.

at_a_remove · on Oct 20, 2022

I've done it for formal ETL pipeline code, too. I make it tighter, but not too tight. Sadly, the spec given me was very ... optimistic? Bright-eyed? Many things that were not supposed to happen did happen, data that shouldn't have gotten near me did, and so on. SQLite swallowed it and kept on truckin', because when it comes to ETL, one dumb field entry shouldn't stop the show. Catch it, yes, deal with it when you note it in the logs, absolutely, but one must continue, and for that SQLite is most excellent.

throw10920 · on Oct 20, 2022

Were SQLite not such a stellar piece of software, I would complain, and ask that this kind of flexibility be disabled by default, with an option to explicitly enable it if necessary.

tehbeard · on Oct 20, 2022

While I doubt it will ever be enabled by default due to backwards compatibility, there is a strict tables[1] feature now.

[1] https://www.sqlite.org/stricttables.html

silvestrov · on Oct 20, 2022

There should be an option to have strict tables, foreign key checks and other modern features to be automatically enforced.

E.g. a command line argument when creating a DB, which then taints this database as modern. Something like what html did with "<!DOCTYPE html>".

Or simply accept that the time has come to make a version 4 of sqlite.

tomsmeding · on Oct 20, 2022

> There should be an option to have strict tables, foreign key checks and other modern features to be automatically enforced.

I know to enable foreign key checking, and I learned strict tables from this thread. What are the "other modern features" you were referring to?

EDIT: In TFA, section 8 is relevant here I guess: SQLite accepts double-quoted strings as a historical misfeature, which can be disabled with the appropriate C API functions. This is one of the "other modern features" I guess; TIL.

ratrocket · on Oct 20, 2022

That might have to be version 5 of sqlite. Version 4 came and went:

https://sqlite.org/src4/doc/trunk/www/index.wiki

> SQLite4 was an experimental rewrite of SQLite that was active from 2012 through 2014.

tehbeard · on Oct 20, 2022

The strict keyword per that doc already taints the a table for pre 3.37.0 (2021-11-27) libraries. (There is a way around that for pre 3.37 per the doc, but it's mainly enabling the PRAGMA used for db recovery operations that treats the strict at the end as an error and ignores it to "facilitate recovery").

So the solution your after would be some additional calls when initializing the database to enable the FK checks (alongside any other app related PRAGMA calls like checking the data_version, configuring journal or page size), and ensure any tables created have STRICT applied.

chrismorgan · on Oct 21, 2022

Unfortunately I found strict mode not very usable, because it only allows the six basic datatype names and doesn’t do column affinity determination <https://www.sqlite.org/datatype3.html#determination_of_colum...>. This thwarts one of the more practical forms of strictness in querying libraries, automatic column type mapping (e.g. mapping DATE (which has NUMERIC affinity) to your language’s date type, or conceivably some other convention like mapping TEXT_Foo to your own Foo type). I’m perplexed as to the rationale for this restriction which feels just gratuitous and unnecessary.

tehbeard · on Oct 21, 2022

That feels like a bit of a hack, tbh. Playing off a mechanism I think? was made for making porting from other day syntaxes easier?

Is sqlite converting an iso8601 to a timestamp, or would it just store as given when type coercion fails due to dashes and spaces? From memory, it's the latter.

It would be slightly more brittle, but surely this metadata should be at app/orm layer rather than the database?

chrismorgan · on Oct 21, 2022

> surely this metadata should be at app/orm layer rather than the database?

I cannot fathom why you would say this, because the way I see it of course it belongs in the database as part of the table definition, as it’s obviously part of the logical schema. Sure, you can’t actually enforce invariants for individual types so that they would be just conventional aliases for the underlying affinities, but that doesn’t mean you should avoid specifying meaningful types.

Which of these would you prefer:

  CREATE TABLE example (
     id      BLOB PRIMARY KEY,
     created INTEGER,
     data    TEXT
  );

  CREATE TABLE example (
     id      uuid PRIMARY KEY,
     created timestamp,
     data    json,
  );

I know that I want the latter, because it makes my life much easier when reading the schema, and lets code automatically use the right types based on inspecting the schema (though you will need to define a mapping of SQL type names to your programming language’s types, since they are still only informational rather than structural like in most SQL databases). Ideally you might be able to define your own datatype affinities (SQL even defines a suitable syntax: CREATE TYPE uuid AS BLOB), but it’s not so bad leaning into the built-in rules with BLOB_uuid, INTEGER_timestamp and TEXT_json (… though on reflection I admit this is rather perilous due to the precise affinity determination rules, shown in the example “TEXT_point” which would be INTEGER due to containing “int”, so maybe it is actually better that strict mode doesn’t blithely use the current affinity determination rules on the expressed type).

(Actually on the DATE thing I was forgetting and thinking that was a regular feature but it’s actually just the fallback affinity where a type is specified but not matched by any other rule, NUMERIC. Strike out that example as a canonical definition or anything, then. But the rest of the point still stands.)

systemvoltage · on Oct 20, 2022

> Were SQLite not such a stellar piece of software,

Why? Complaining about a rational and logical thing is orthogonal to something's popularity or how excellent it is otherwise.

State your opinions freely.

ajkjk · on Oct 20, 2022

well, it seems like you should chuck it into an untyped table first. If there are types on a column I'd... really want them to do something.

rscho · on Oct 20, 2022

> wtf. who would ever want that?

Each and every biostatistician on this planet. Especially those touching clinical data. Personally, I was saddened to learn that DuckDB did not include dynamically typed, or at least untyped, columns. Happily my data loads are usually small enough for a row-oriented data store.

CHECK constraints, now in conjunction with STRICT tables, are the best invention since sliced bread! If I could improve on one thing, it would be to remove any type notion from non-STRICT tables.

masklinn · on Oct 20, 2022

Fwiw the ANY type is valid in strict tables, and does what it says.

Slightly more so than in non-strict tables in fact: it will store exactly what you give it without trying to reinterpret it e.g. in non-strict, a quoted literal composed of digits will be parsed and stored as an integer, in strict, as a string.

Strict mode also doesn’t influence the parser, so e.g. inserting a `true` in a strict table will work, because `true` is interpreted as an integer literal.

mytherin · on Oct 20, 2022

We have a PR in the works that is adding support for sum types to DuckDB [1]. Not quite fully dynamic, but perhaps still useful :)

[1] https://github.com/duckdb/duckdb/pull/4966

dspillett · on Oct 20, 2022

This always put me off a bit too. I have memories of mySQL supporters back in the early days defending it storing dates like 2022-02-31 and other such type/range/other issues with the standard refrain of “but SELECT * is blazing fast!”…

As of last year there is an option to make things more strict (https://www.sqlite.org/stricttables.html) though as SQLite doesn't have real date types, unless you are using one of the integer storage options code could insert invalid dates like mysql used to allow.

--

EDIT: having actually read the linked article, it explicitly mentions the date type issue also.

paulclinger · on Oct 20, 2022

Even in the absence of strict tables, you can also add a CHECK constraint whenever limits are needed on the value types. Here is one example from a recent discussion in SQLite forum threads: `check(datetime(x, '+0 seconds') IS x)`.

chasil · on Oct 20, 2022

An important proviso on the strict tables is that they will not be observed on older SQLite versions lacking this feature.

SQLite has been rock-solid since attaining DO-178B, and commonly isn't ever upgraded in many installations.

CentOS 7 is using 3.7.17. Since the v3 database format is standardized, the older version can utilize a STRICT database file, but will not have the capability to alter datatype behavior.

For these cases, implementing both STRICT and the relevant CHECK constraints is advisable.

RedShift1 · on Oct 20, 2022

If you use integer unix timstamp as datatype for dates, you technically cannot insert an invalid date

dspillett · on Oct 21, 2022

Hence "unless you use one of the integer storage options". I suspect many don't because the values being directly human readable could be useful.

jrockway · on Oct 20, 2022

It's even more fun in a boolean column, where some databases accept 't'/'f', 'true'/'false', etc. SQLite accepts some of those, but treats other ones as text.

groue · on Oct 20, 2022

I do, for JSON columns. I store UTF8 strings in SQLite, so that it is easy to see JSON values with a plain `SELECT *`). And I load blobs, because I code in Swift and the standard JSON decoder eats raw UTF8 memory buffers, not strings.

This avoids two useless conversions:

- first from a C string loaded from SQLite to a Swift Unicode string (with UTF8 validation).

- next from this Swift Unicode string to a UTF8 memory buffer (so that JSONDecoder can do its job).

SQLite is smart enough to strip the trailing \0 when you load a blob from a string stored in the database :-)

systemvoltage · on Oct 20, 2022

SQLite is an excellent app database for local things. It doesn't compete with PostgreSQL or MySQL for huge number of reasons. So, to make it fit in the competition model, perhaps there are a bunch of things are glossed over.

Does anyone see a massive pro-sqlite movement going on? Sort of like what happens in JS-ecosystem. Everyone is bandwagoning on it. Criticism of SQLite is much welcomed, specifically exemplifying what its role is and which use cases it serves really well.

SPBS · on Oct 20, 2022

SQLite needs this, because changing column types is such a pain in the ass. The answer to "I want to move from INT primary keys to TEXT/UUID, how do I change the column type?" is "just insert the data into the column because SQLite allows everything".

And the reason changing column types is so hard is because, uh, SQLite stores its schema as human readable plaintext (easier to keep compatibility between versions) and not normalized tables like other databases.

dekhn · on Oct 20, 2022

What does it mean for a database to have primary keys of different types? In particular, primary keys are frequently used for clustering and other performance enhancements and I would expect that they would need a total ordering that made sense.

RedShift1 · on Oct 20, 2022

If you're talking about clustering you're probably also beyond the use case of sqlite

dekhn · on Oct 20, 2022

Not multi-machine clustering, index clustering: https://www.sqlite.org/withoutrowid.html

As much as I love sqlite, its table model is really confusing to me coming from a postgres mentality: "WITHOUT ROWID is found only in SQLite and is not compatible with any other SQL database engine, as far as we know. In an elegant system, all tables would behave as WITHOUT ROWID tables even without the WITHOUT ROWID keyword. However, when SQLite was first designed, it used only integer rowids for row keys to simplify the implementation. This approach worked well for many years. But as the demands on SQLite grew, the need for tables in which the PRIMARY KEY really did correspond to the underlying row key grew more acute. The WITHOUT ROWID concept was added in order to meet that need without breaking backwards compatibility with the billions of SQLite databases already in use at the time (circa 2013)."

tejtm · on Oct 20, 2022

> wtf. who would ever want that?

Have you met non-computer scientists?

Anything and everything is fair game and it is probably for the best.

Job security anyway