> The key point is that SQLite is very forgiving of the type of data that you put into the database. For example, if a column has a datatype of "INTEGER" and the application inserts a text string into that column, SQLite will first try to convert the text string into an integer, just like every other SQL database engine. Thus, if one inserts '1234' into an INTEGER column, that value is converted into an integer 1234 and stored. But, if you insert a non-numeric string like 'wxyz' into an INTEGER column, unlike other SQL databases, SQLite does not throw an error. Instead, SQLite stores the actual string value in the column.
I'm often exploring data where either there's no defined standard or the use of the data has drifted from the standard. Now, I could go over this line-by-line, but instead my go-to has been "Hey, let's throw this into SQLite, then run stats on it!" See what shakes out. SQLite kindly and obediently takes all of this mystery data, which ends up being nothing like what I was told, and just accepting it. Then I can begin prodding it and see what is actually going on.
This is something that has come up for me for at least a decade: chuck it in SQLite, then figure out what the real standard is.
Right what's the point of declaring a column to have type INTEGER if it really means "This column MAY contain integers, but may contain other kinds of data as well".
It's especially great for large important legacy datasets that similarly did no enforcement. You can then interact with the data and clean it up through some commands without trying to work with massive datasets in Excel where it sometimes just crashes the entire app.
I do exactly the same. You import from CSV, everything ends up as a string, and then you copy the data to a STRICT table so you can understand what's really going on.
What do you mean? I don't think this process ends up with less info than the CSV file format itself.
Short of doing that, cleaning very dirty data has no satisfactory solution, I think. Optional typing is a nice middle ground between untyped and slow (R, Python) or strictly typed and tedious (all other DB engines).
Maybe I misunderstand, but if you take a data type that is a union of two datatypes (string and int) and push it into somethign that converts all the data to strings, you lose the information of whether a string that looks like an int was originally a string or int.
Maybe instead you mean to use the database functionality to identify problems like that in the original source data. If somebody hands me a string "11101100101001" I would not attempt to interpret it by parsing it into a binary number, but I think that's because I really like strong, simple typing.
I can't answer for the person you're commenting to, but I use sqlite in a similar way and the loss of information isn't an issue. When you're exploring data, just trying to make sense of what someone has sent you (say, migrating data from an existing system to a new system) you probably have a pretty good idea if the column is supposed to have integers or strings in it. In my own use of it, it's very unlikely that the loss of information will cause a problem. Eventually the data will be type-checked -- not in sqlite (I haven't experimented with the new-ish "strict tables" feature yet), but in my case the data will eventually flow through a typed programming language.
Anyway, different ways to use the tool, for sure! And in some cases one would definitely need to be attuned to the issue you're raising. In the kind of situation I'm thinking of the data is usually so dirty anyway, a bit of string->integer conversion won't hurt (probably).
I get it, you don't have typing control over your data exchange. somebody might send you an unquoted-string CSV with multiple data types in the same column:
id,location,name
1,90210,Tori Spelling
2,Schitt's Creek,Eugene Levy
and you use sqlite to quickly explore the loaded data to check the set of types used in a column, and maybe even glean what the meaning is.
Exactly. The thing I'm talking about is necessarily a semi-manual process (data exploration, usually prior to migrating data from one system to another).
In the example you give, when you've done all the exploration you need, there's a program interpreting that CSV that ensures the location column values are strings. At least, that's how I do it!
I've done it for formal ETL pipeline code, too. I make it tighter, but not too tight. Sadly, the spec given me was very ... optimistic? Bright-eyed? Many things that were not supposed to happen did happen, data that shouldn't have gotten near me did, and so on. SQLite swallowed it and kept on truckin', because when it comes to ETL, one dumb field entry shouldn't stop the show. Catch it, yes, deal with it when you note it in the logs, absolutely, but one must continue, and for that SQLite is most excellent.
Were SQLite not such a stellar piece of software, I would complain, and ask that this kind of flexibility be disabled by default, with an option to explicitly enable it if necessary.
> There should be an option to have strict tables, foreign key checks and other modern features to be automatically enforced.
I know to enable foreign key checking, and I learned strict tables from this thread. What are the "other modern features" you were referring to?
EDIT: In TFA, section 8 is relevant here I guess: SQLite accepts double-quoted strings as a historical misfeature, which can be disabled with the appropriate C API functions. This is one of the "other modern features" I guess; TIL.
The strict keyword per that doc already taints the a table for pre 3.37.0 (2021-11-27) libraries. (There is a way around that for pre 3.37 per the doc, but it's mainly enabling the PRAGMA used for db recovery operations that treats the strict at the end as an error and ignores it to "facilitate recovery").
So the solution your after would be some additional calls when initializing the database to enable the FK checks (alongside any other app related PRAGMA calls like checking the data_version, configuring journal or page size), and ensure any tables created have STRICT applied.
Unfortunately I found strict mode not very usable, because it only allows the six basic datatype names and doesn’t do column affinity determination <https://www.sqlite.org/datatype3.html#determination_of_colum...>. This thwarts one of the more practical forms of strictness in querying libraries, automatic column type mapping (e.g. mapping DATE (which has NUMERIC affinity) to your language’s date type, or conceivably some other convention like mapping TEXT_Foo to your own Foo type). I’m perplexed as to the rationale for this restriction which feels just gratuitous and unnecessary.
That feels like a bit of a hack, tbh. Playing off a mechanism I think? was made for making porting from other day syntaxes easier?
Is sqlite converting an iso8601 to a timestamp, or would it just store as given when type coercion fails due to dashes and spaces? From memory, it's the latter.
It would be slightly more brittle, but surely this metadata should be at app/orm layer rather than the database?
> surely this metadata should be at app/orm layer rather than the database?
I cannot fathom why you would say this, because the way I see it of course it belongs in the database as part of the table definition, as it’s obviously part of the logical schema. Sure, you can’t actually enforce invariants for individual types so that they would be just conventional aliases for the underlying affinities, but that doesn’t mean you should avoid specifying meaningful types.
Which of these would you prefer:
CREATE TABLE example (
id BLOB PRIMARY KEY,
created INTEGER,
data TEXT
);
CREATE TABLE example (
id uuid PRIMARY KEY,
created timestamp,
data json,
);
I know that I want the latter, because it makes my life much easier when reading the schema, and lets code automatically use the right types based on inspecting the schema (though you will need to define a mapping of SQL type names to your programming language’s types, since they are still only informational rather than structural like in most SQL databases). Ideally you might be able to define your own datatype affinities (SQL even defines a suitable syntax: CREATE TYPE uuid AS BLOB), but it’s not so bad leaning into the built-in rules with BLOB_uuid, INTEGER_timestamp and TEXT_json (… though on reflection I admit this is rather perilous due to the precise affinity determination rules, shown in the example “TEXT_point” which would be INTEGER due to containing “int”, so maybe it is actually better that strict mode doesn’t blithely use the current affinity determination rules on the expressed type).
(Actually on the DATE thing I was forgetting and thinking that was a regular feature but it’s actually just the fallback affinity where a type is specified but not matched by any other rule, NUMERIC. Strike out that example as a canonical definition or anything, then. But the rest of the point still stands.)
Each and every biostatistician on this planet. Especially those touching clinical data. Personally, I was saddened to learn that DuckDB did not include dynamically typed, or at least untyped, columns. Happily my data loads are usually small enough for a row-oriented data store.
CHECK constraints, now in conjunction with STRICT tables, are the best invention since sliced bread! If I could improve on one thing, it would be to remove any type notion from non-STRICT tables.
Fwiw the ANY type is valid in strict tables, and does what it says.
Slightly more so than in non-strict tables in fact: it will store exactly what you give it without trying to reinterpret it e.g. in non-strict, a quoted literal composed of digits will be parsed and stored as an integer, in strict, as a string.
Strict mode also doesn’t influence the parser, so e.g. inserting a `true` in a strict table will work, because `true` is interpreted as an integer literal.
This always put me off a bit too. I have memories of mySQL supporters back in the early days defending it storing dates like 2022-02-31 and other such type/range/other issues with the standard refrain of “but SELECT * is blazing fast!”…
As of last year there is an option to make things more strict (https://www.sqlite.org/stricttables.html) though
as SQLite doesn't have real date types, unless you are using one of the integer storage options code could insert invalid dates like mysql used to allow.
--
EDIT: having actually read the linked article, it explicitly mentions the date type issue also.
Even in the absence of strict tables, you can also add a CHECK constraint whenever limits are needed on the value types. Here is one example from a recent discussion in SQLite forum threads: `check(datetime(x, '+0 seconds') IS x)`.
An important proviso on the strict tables is that they will not be observed on older SQLite versions lacking this feature.
SQLite has been rock-solid since attaining DO-178B, and commonly isn't ever upgraded in many installations.
CentOS 7 is using 3.7.17. Since the v3 database format is standardized, the older version can utilize a STRICT database file, but will not have the capability to alter datatype behavior.
For these cases, implementing both STRICT and the relevant CHECK constraints is advisable.
It's even more fun in a boolean column, where some databases accept 't'/'f', 'true'/'false', etc. SQLite accepts some of those, but treats other ones as text.
I do, for JSON columns. I store UTF8 strings in SQLite, so that it is easy to see JSON values with a plain `SELECT *`). And I load blobs, because I code in Swift and the standard JSON decoder eats raw UTF8 memory buffers, not strings.
This avoids two useless conversions:
- first from a C string loaded from SQLite to a Swift Unicode string (with UTF8 validation).
- next from this Swift Unicode string to a UTF8 memory buffer (so that JSONDecoder can do its job).
SQLite is smart enough to strip the trailing \0 when you load a blob from a string stored in the database :-)
SQLite is an excellent app database for local things. It doesn't compete with PostgreSQL or MySQL for huge number of reasons. So, to make it fit in the competition model, perhaps there are a bunch of things are glossed over.
Does anyone see a massive pro-sqlite movement going on? Sort of like what happens in JS-ecosystem. Everyone is bandwagoning on it. Criticism of SQLite is much welcomed, specifically exemplifying what its role is and which use cases it serves really well.
SQLite needs this, because changing column types is such a pain in the ass. The answer to "I want to move from INT primary keys to TEXT/UUID, how do I change the column type?" is "just insert the data into the column because SQLite allows everything".
And the reason changing column types is so hard is because, uh, SQLite stores its schema as human readable plaintext (easier to keep compatibility between versions) and not normalized tables like other databases.
What does it mean for a database to have primary keys of different types? In particular, primary keys are frequently used for clustering and other performance enhancements and I would expect that they would need a total ordering that made sense.
As much as I love sqlite, its table model is really confusing to me coming from a postgres mentality:
"WITHOUT ROWID is found only in SQLite and is not compatible with any other SQL database engine, as far as we know. In an elegant system, all tables would behave as WITHOUT ROWID tables even without the WITHOUT ROWID keyword. However, when SQLite was first designed, it used only integer rowids for row keys to simplify the implementation. This approach worked well for many years. But as the demands on SQLite grew, the need for tables in which the PRIMARY KEY really did correspond to the underlying row key grew more acute. The WITHOUT ROWID concept was added in order to meet that need without breaking backwards compatibility with the billions of SQLite databases already in use at the time (circa 2013)."
wtf. who would ever want that?