Category Archives: Database

SQL Basics (Part 2)

JOINS

Joins allow us to combine multiple tables into a single result set. So for example we might want the names of plants from our plant database along with related information from other tables.

As a more concrete example, we might have a foreign key from our Plant table which relates to the Plant Type (i.e. whether a plant is a Tree, Vegetable, Flower etc.)

CROSS JOIN

Before we look at an example of a JOIN as outlined above, let’s look at a CROSS JOIN. A CROSS JOIN is basically a JOIN of two tables without any where clause, so for example

select p.CommonName, pt.Name 
from Plants p, PlantTypes pt
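
The same result can be written with the explicit CROSS JOIN keywords, which makes the intent a little clearer:

select p.CommonName, pt.Name 
from Plants p cross join PlantTypes pt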

This type of join is expensive in that it’s simply taking two tables and merging data, and in many ways it’s probably of little use in terms of the result set produced. For example in my plant data at the moment I have 29 plants listed (not many I admit, but it’s early days). I also have 4 plant types. The result set of the CROSS join above is 4 * 29 = 116 rows. Basically the result set lists each plant CommonName against each plant type Name.

INNER JOIN

An inner join is generally the most used JOIN, whereby we’re looking for all items from one table matching a column to data from another table. So again using our Plants and PlantTypes tables, we might want to see the plant type associated with each plant in our database.

Note: We can create such a join without the INNER keyword; by default in SQL Server a plain JOIN is an inner join anyway

select p.CommonName, pt.Name 
from Plants p inner join PlantTypes pt
on p.Type_Id = pt.Id

So this assumes that Type_Id is a foreign key into the PlantTypes table and relates to its primary key.
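
And as the note above says, the INNER keyword is optional in SQL Server, so the same query can be written as:

select p.CommonName, pt.Name 
from Plants p join PlantTypes pt
on p.Type_Id = pt.Id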

FULL OUTER JOIN

A FULL OUTER JOIN will return NULL data from either side. So for example if we allowed NULLs for our Type_Id, or there are no matches for a PlantTypes Id, then we’ll see NULL values in the output columns. The query would look like

select p.CommonName, pt.Name 
from Plants p full outer join PlantTypes pt
on p.Type_Id = pt.Id

Our result set may now display a NULL CommonName (for example) if a PlantType Id does not
have a matching Plant, i.e. we’ve not added any Trees yet to our Plants table.

Note: The FULL OUTER JOIN syntax is not supported by MySQL

LEFT OUTER JOIN

As we’ve seen with a FULL OUTER JOIN, we might have a PlantType which is not yet used in the Plants table, in which case we’ll see a NULL for the CommonName column. However we might also have a plant which doesn’t yet have a Type_Id and hence would show a NULL for the plant type’s Name column.

If we don’t want to view plant types which have NULL CommonNames (in other words we only really care about the plants, not the types), but we do want to see all plants and their plant types regardless of whether the type is NULL, we can use a LEFT OUTER JOIN

select p.CommonName, pt.Name 
from Plants p left outer join PlantTypes pt
on p.Type_Id = pt.Id

In this case we get all rows from the left-hand table whether or not they have a matching value in the right-hand table.

RIGHT OUTER JOIN

As you’ve guessed, a RIGHT OUTER JOIN is the reverse: it returns all rows from the right-hand table (PlantTypes), with matching plants where they exist, plus NULLs for CommonName where a plant type has no plants associated with it.

select p.CommonName, pt.Name 
from Plants p right outer join PlantTypes pt
on p.Type_Id = pt.Id

SELF JOIN

We’ve seen how to create joins with other tables but actually we can join a table with itself. There are no SELF JOIN keywords, it’s more the concept of joining one table with itself. The most obvious use of such a join is with hierarchical data. For example, a plant type might have a parent plant type, in which case we could join the child plant type against the parent plant type on the same table.
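
A minimal sketch of such a self join, assuming a hypothetical Parent_Id column on PlantTypes, might look like this:

select child.Name, parent.Name as ParentName
from PlantTypes child
left outer join PlantTypes parent
on child.Parent_Id = parent.Id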

SQL basics (Part 1)

Let’s take a look at some SQL basics.

Note: Unless otherwise stated, the queries are tested in SQL Server only

Select

So to retrieve data from a database using SQL we write

select <columns> from <table>

and optionally we can add a where clause to reduce the result set based upon Boolean logic.

select <columns> from <table> where <boolean logic>
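
For example, to return only plants of a particular genus we might write something like this (the genus value here is purely illustrative):

select CommonName from Plants where Genus = 'Sorbus'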

We can view all columns within a table using the wildcard *. However, in a production environment (at least where an application is retrieving data from the database) we would be better off specifying the columns we want, both to improve performance and to ensure that if new columns are added our query still produces the same “expected” result set, column-wise anyway.

So using the wild card we would have something like

select * from Plants

or specifying the required columns we use something like

select CommonName from Plants

or retrieving multiple columns

select CommonName, Genus from Plants

When specifying columns we might be querying from multiple tables so it’s best to alias the table name in case of column name duplication across different tables, hence we get something like

select p.CommonName from Plants p

Aliasing columns

Using the alias table query (from above) we will get a result set (in SQL Server Management Studio) with the column name CommonName. We can assign an alias to the column during a query and therefore change the name output for the column, for example

select p.CommonName as 'Common Name' from Plants p

this will output a column named Common Name now.

Aliasing can also be used on the output from functions etc. So, for example

select count(*) as Count from Plants p

will output a column named Count.

Count

Count returns a scalar value. We can get the number of rows in a table using the wildcard

select count(*) from Plants

This will return the total number of rows in the Plants table; however, using a column name within count returns the number of rows where that column is non-NULL, hence

select count(p.CommonName) from Plants p

Min/Max

Both MIN and MAX return a single value indicating the minimum (MIN) or maximum (MAX) non-NULL value from a column; both can be used on numeric or non-numeric columns.

Here’s an example of the usage

select max(p.Height) from Plants p

The above returns the maximum height found in the Plants table and

select min(p.Height) from Plants p

returns the minimum height.

AVG

AVG returns a single value indicating the average value within a selected column. AVG only works on numeric columns.

select avg(p.Height) from Plants p

SUM

SUM can be used on a numeric column and returns the sum of all values

select sum(p.Height) from Plants p

DISTINCT

The distinct keyword allows us to get only distinct (non-duplicating) values, i.e.

select distinct p.Genus from Plants p

The above will basically remove duplicates.

GROUP BY

We can create sub-groups from our data using the GROUP BY clause. For example, say we want to replicate the DISTINCT functionality by returning plant common names with no duplicates, we can do this with GROUP BY as

select p.CommonName from Plants p group by p.CommonName

Basically we create groups based upon the CommonName and then output each group’s CommonName. But we can think of the result set as groups (or arrays) of data so we can also do things like list a count for the number of duplicated names against each group

select p.CommonName, count(p.CommonName) from Plants p group by p.CommonName

This will now list each distinct group name and list a count alongside it to show how many items have that CommonName.

HAVING

The HAVING clause is used in a similar way to the WHERE clause but for GROUP BY result sets. An example might be where we’ve grouped by the CommonName of a plant in our database but are only interested in those names with more than a certain number of occurrences, thus

select p.CommonName, count(p.CommonName) 
from Plants p 
group by p.CommonName having count(p.CommonName) > 3

Now we basically get a result set with the CommonName and the count for those CommonNames duplicated more than 3 times.

Note: In SQL Server we cannot alias the count(p.CommonName) as c, for example, and then use c in the having clause, whereas MySQL does allow this syntax
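
For reference, this is the MySQL-only form described in the note above (it fails in SQL Server):

select p.CommonName, count(p.CommonName) as c
from Plants p
group by p.CommonName
having c > 3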

Database indexes

At some point we may need to look at adding indexes to our tables. Indexes can help with performance when retrieving rows from our database.

An index helps the SQL query engine to better search commonly queried data by giving it “hints” about the searchable columns. With an index the query engine doesn’t have to scan all rows in a table by default but can instead use a more efficient lookup on the indexed columns. So for example, in an Employee database we might have a primary key on an employee number (which will usually be indexed) and we might add an index to the surname column, in the knowledge that this is often used in queries.

Sounds great, right? But there is a downside to creating indexes: they can have a detrimental effect on performance when inserting data, because indexes need updating when new rows are added. Hence when we add a new item of data the database engine may need to update or regenerate indexes each time.

An index is usually represented as a B-Tree structure (a balanced tree), see B-tree for examples of implementations.

Before we get started

The following examples work on SQL Server. In some cases they’ll work exactly “as is” on other databases, in some you may need to tweak things. I’ve simply not tried them out on anything other than SQL Server at this time.

Creating an index

As a naming convention, we will prefix our index name with IX_, so for example

CREATE INDEX IX_TITLE
ON FlashCard (Title)

this creates an index named IX_TITLE on the table “FlashCard” and the column “Title”. By default we’ll have created a NONCLUSTERED index. If we want to create a CLUSTERED index we need to change the SQL to

CREATE CLUSTERED INDEX IX_TITLE
ON FlashCard (Title)

See CREATE INDEX (Transact-SQL) for the full syntax for creating an index using SQL Server.

According to the post Maximum Capacity Specifications for SQL Server, you can create 999 nonclustered indexes per table. However you can only create one clustered index.

If we can create an index we must be able to drop one, so here goes

DROP INDEX IX_TITLE 
ON FlashCard

Viewing indexes

Using a UI tool such as SQL Server Management Studio we simply open the table in the Object Explorer and look at the Indexes section, which lists all the indexes on the table. In SQL we query the sys.indexes catalog view, so for example we can use the following against SQL Server

select * from sys.indexes

or to view our recently added index

select * from sys.indexes where name = 'IX_TITLE'
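
If we want to see every index on a particular table instead, we can filter on the table’s object id, for example:

select i.name, i.type_desc
from sys.indexes i
where i.object_id = OBJECT_ID('FlashCard')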

CLUSTERED, NONCLUSTERED and UNIQUE indexes

In their simplest form: a NONCLUSTERED index is an index which may contain duplicate values, whereas (probably obviously) a UNIQUE index may not, as UNIQUE indexes have a unique constraint on them. A CLUSTERED index has the addition that the data is ordered on disk to match the index.

By default your primary key will be created as a clustered index. For example this is the result of getting SQL Server Management Studio to create a CREATE script (abridged) for a simple FlashCard database

CREATE TABLE [dbo].[FlashCard](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[Title] [nchar](20) NOT NULL,
	[Front] [ntext] NULL,
	[Back] [ntext] NULL,
 CONSTRAINT [PK_Card] PRIMARY KEY CLUSTERED 

And of course, as only one clustered index may exist per table, if you attempt to create a clustered index on a table with a primary key set as clustered (its default state) you’ll get the error

Cannot create more than one clustered index on table ‘FlashCard’. Drop the existing clustered index ‘PK_Card’ before creating another.

in which case you’ll need to decide whether your primary key is best suited to be clustered; if not then you’ll need to drop the primary key constraint and then re-add it, for example

ALTER TABLE FlashCard 
DROP CONSTRAINT PK_Card
GO
ALTER TABLE FlashCard 
ADD CONSTRAINT PK_Card PRIMARY KEY NONCLUSTERED (Id)

or of course create the constraint as nonclustered in the first place.

References

For an excellent and far more comprehensive look at indexes, check out What every developer should know about SQL performance.

SQL Server Index Design Guide

Mongodb Replication & Recovery

In a non-trivial system, we’d normally look to have three types of database set-up. A primary would be set-up as a writeable database, one or more secondary databases would be set-up as readonly databases, and finally an arbiter would be set-up to help decide which secondary database takes over in the case of the primary database going down.

Note: An arbiter is added to prevent tied votes when deciding which secondary should take over as primary, and thus should only be used where an even number of mongodb instances exists in a replication set.

The secondary databases will be “eventually consistent” in that when data is written to the primary database it is not immediately replicated to the secondary databases, but will “eventually” be replicated.

Let’s look at an example replication set…

To set-up a replication set, we would start with a minimum of three instances of, or machines running, mongodb. As previously mentioned, this replication set will consist of a primary database, a secondary database and an arbiter.

Let’s run three instances on a single machine to begin with, so we need to create three database folders, for example

mkdir MyData\database1
mkdir MyData\database2
mkdir MyData\database3

Obviously, if all three are running on the same machine, we need to give the mongodb instances their own ports, for example run the following commands each in their own command prompt

mongod --dbpath MyData\database1 --port 30000 --replSet "sample"
mongod --dbpath MyData\database2 --port 40000 --replSet "sample"
mongod --dbpath MyData\database3 --port 50000 --replSet "sample"

“sample” denotes an arbitrary, user-defined name for our replication set. However the replication set still hasn’t been created at this point. We instead need to run the shell against one of the servers, for example

mongo --port 30000

Note: the sample above, showing all databases on the same machine, is solely an example; obviously no production system should implement this strategy. Each instance of primary, secondary and arbiter should be run on its own machine.

Now we need to create the configuration for our replication set, for example

var sampleConfiguration =
{ _id : "sample", 
   members : [
     {_id : 0, host : 'localhost:30000', priority : 10 },
     {_id : 1, host : 'localhost:40000'},
     {_id : 2, host : 'localhost:50000', arbiterOnly : true } 
   ]
}

This sets up the replication set, stating that the host on port 30000 is the primary (due to its priority being set, in this example). The host on port 40000 doesn’t have a priority (or arbiterOnly) so this is the secondary, and finally we have the arbiter.

At this point we’ve created the configuration but we still need to actually initiate/run the configuration. So, again, from the shell we write

rs.initiate(sampleConfiguration)

Note: This will take a few minutes to configure all the instances which make up the replication set. Eventually the shell will return from the initiate call and should say “ok”.

The shell prompt should now change to show the replication set name and the state of the currently connected server (i.e. sample:PRIMARY).

Now if we write data to the primary it will “eventually” be replicated to all secondary databases.
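
As a quick check of this (a sketch; the cds collection here is the one used elsewhere in these posts), we can write to the primary and then read from a secondary. Reads on a secondary are disabled by default, so we first tell the shell to allow them:

// in the shell connected to the primary
db.cds.insert({artist:"Alice Cooper"})

// in a shell connected to the secondary (e.g. mongo --port 40000)
rs.slaveOk()
db.cds.find()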

If we take the primary database offline (or worse still a fault occurs and it’s taken offline without our involvement) a secondary database will be promoted to become the primary database (obviously in our example we only have one secondary, so this will take over as the primary). If/when the original primary comes back online, it will again become the primary database and the secondary will, of course, return to being a secondary database.

Don’t forget you can use

rs.help()

to view help for the various replication commands.

MongoDB notes

This post is meant as a catch-all for those bits of information which do not need a full post (of course this doesn’t mean I won’t create a full post at a later date should the need arise).

Atomicity

Whilst MongoDB guarantees atomicity of a write operation at the single-document level, there is no such guarantee across multiple documents.

So for example if one is creating a relational style model with related data in more than one document (i.e. not an embedded relational model) then writing to the multiple documents is NOT atomic.

Indexes

By default an index is created for the _id of a document. To create our own we use the ensureIndex method on a collection.
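
For example (again assuming the cds collection used elsewhere in these posts), we can create an ascending index on the artist field like so:

db.cds.ensureIndex({artist: 1})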

The MongoDB documentation states that

  • Each index requires at least 8KB of data space
  • Whilst query performance on the indexed data is increased, write operations will see a negative performance impact – therefore indexes can be expensive if there’s a high write-to-read ratio on documents

Overriding Default functionality

As mongodb uses JavaScript we can use JavaScript code to override some default functionality.

Let’s say for example, we want to stop users accidentally dropping a database. We can override the dropDatabase function thus

DB.prototype.dropDatabase = function() {
   print("dropDatabase disabled");
}

db.dropDatabase = DB.prototype.dropDatabase;

Loading scripts from the shell

Typing in code such as the above override of the dropDatabase function every time would end up being time consuming. Given that this is JavaScript, it may come as no surprise that the mongo shell can load script files.

Using load('myscript.js');

Loading scripts at startup

In the above example, we’ve created a script which we might need to load every time we start mongo. So it’d be better if we could just load the script file automatically at startup.

So to load scripts at startup on a Windows machine, place the script into c:\users\<username>\.mongorc.js

Note: It’s also possible that we might want to ignore scripts at startup when debugging, in such a case we start mongo using

mongo --norc

More to come…

Indexing your MongoDB data

By default MongoDB creates an index on the _id (ObjectId) field, but we can easily add indexes to other fields.

Using the JavaScript shell

In the JavaScript shell simply call ensureIndex on the collection.

Using the 10gen drivers in C#

In C# using the 10gen drivers we can create an index using the following

collection.EnsureIndex(new IndexKeysBuilder().Ascending("artist"));

where collection is

MongoCollection<CD> collection = db.GetCollection<CD>("cds");

To remove an index we can simply use

collection.DropIndex(new IndexKeysBuilder().Ascending("artist"));

Handling case-sensitive mapping from MongoDB to a POCO

So the convention appears to be to use camel case for column/field names within MongoDB. For example we might create an entry such as db.cds.insert({artist:"Alice Cooper"}).

In C# the convention is for properties, for example, to be written in Pascal case. So we’d have something like

public class CD
{
   public ObjectId Id { get; set; }
   public string Artist { get; set; }
   public string Title { get; set; }
   public string Category { get; set; }
}

So obviously MongoDB has a field named artist and we need to map it to the property named “Artist”.

To handle this we can use the BsonElement attribute from the MongoDB.Bson.Serialization.Attributes namespace, as per

public class CD
{
   public ObjectId Id { get; set; }
   [BsonElement("artist")]
   public string Artist { get; set; }
   [BsonElement("title")]
   public string Title { get; set; }
   [BsonElement("category")]
   public string Category { get; set; }
}

or we can set up the mappings using the following

BsonClassMap.RegisterClassMap<CD>(cm =>
{
   cm.AutoMap();
   cm.GetMemberMap(c => c.Artist).SetElementName("artist");
   cm.GetMemberMap(c => c.Title).SetElementName("title");
   cm.GetMemberMap(c => c.Category).SetElementName("category");
});

Note: we do not need to set up any mapping for the Id field as this appears to be mapped based upon its type.

A class map may only be registered once, we can use BsonClassMap.IsClassMapRegistered if need be to ensure this.
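
A minimal sketch of such a guard (the registration body is the same as shown above):

if (!BsonClassMap.IsClassMapRegistered(typeof(CD)))
{
   BsonClassMap.RegisterClassMap<CD>(cm =>
   {
      cm.AutoMap();
      cm.GetMemberMap(c => c.Artist).SetElementName("artist");
      cm.GetMemberMap(c => c.Title).SetElementName("title");
      cm.GetMemberMap(c => c.Category).SetElementName("category");
   });
}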

More information can be found at Serialize Documents with the CSharp Driver

Starting out with MongoDB

This is a post on some of the basics of using MongoDB (on Windows).

mongod.exe

So to run the mongo server we use mongod.exe; if all goes well the server will start up and state that it’s waiting for connections.

mongo.exe

The JavaScript shell used as the client admin application for the mongo server is named mongo.exe.

Shell Commands

  1. use <database> is used to switch the shell to using the named <database>. If this command is not used, by default the test database will be used.
  2. show dbs can be used to get a list of all the databases stored in the instance of MongoDB.
  3. show collections lists the available collections within the selected database. This is analogous to listing the tables in an SQL database.
  4. show users lists the users added to the database. To add a user via the JavaScript Shell we can use the following
    db.addUser({user: "UserName", pwd: "Password", roles: ["readWrite", "dbAdmin"]})
    

    Beware, as the roles are not validated against any schema or role list, so you can assign role names which are not recognised as specific mongo roles, i.e. I made a typo for dbAdmin and it was accepted as a role.

  5. db is the database object, so we can execute commands such as db.stats()
  6. db.myobjects: let’s assume we created a database and added some objects into a collection named myobjects. Because we’re using a JavaScript shell we can access the myobjects collection off of the db object. So for example we can now use db.myobjects.count() to get a count of the number of “rows” in the collection. Equally we can use commands such as db.myobjects.find() to list the rows in the collection.

Example workflow

This is a simple example of using mongo.exe to create a database and some data, then querying the database.

For this example we’ll create a CD database which will be used to contain information on all the CDs we own.

  1. If it’s not already running, run mongod.exe
  2. In a separate command prompt, run mongo.exe
  3. We’ll create the database by typing use CDDatabase; remember, it will not be fully created until some data is added to it
  4. We can check which database we’re currently viewing by typing db. This will show us we’re in the CDDatabase, but if you now type show dbs you’ll notice the database still doesn’t actually exist in the list of available databases
  5. We can now create our first entry in the database. Notice, unlike an SQL database we do not create tables etc. So to create an entry we can type the following
    a = {artist:"Alice Cooper", title:"Billion Dollar Babies", category:"Rock"}
    db.cds.insert(a)
    

    Note: we don’t need to create a variable as we’ve done here, we can just replace ‘a’ in the insert method with the right hand side of the = operator as per

    db.cds.insert({artist:"Alice Cooper", title:"Billion Dollar Babies", category:"Rock"})
    

    So we should have now successfully added an entry into the “cds” collection. Notice again we do not need to define the “cds” collection first, it’s dynamically created when we start using it.

    We can use the command show collections to display a list of the collections that make up the database.

    Feel free to add more data if you wish.

  6. Let’s now see what’s in the cds collection. To do this simply type db.cds.find(). You should now see all the entries within the “cds” collection that you created. Notice that a field _id of type ObjectId has been added to our data.

    To find a single entry, in this example finding an artist by name, we simply use db.cds.find({artist:"Alice Cooper"})

  7. So as part of the standard CRUD operations on data, we’ve seen how to Create and Read data; now let’s look at updating data. Run the following command
    db.cds.insert({artist:"Iron Maiden", title:"Blood Brothers"})
    

    Now I’ve left off the category here, so we need to update this entry. We use the collection’s update method, thus

    db.cds.update({artist:"Iron Maiden"}, {$set:{category:"Heavy Metal"}})
    

    The first argument in the update method, {artist:"Iron Maiden"}, is the query, in other words update where artist is “Iron Maiden”. The second argument is the update action, which simply tells update to set the category to “Heavy Metal”.

    If we run this command we’ll see the category added, but this command works only on the first item found, so if there were multiple “Iron Maiden” entries missing the category this would be a bit of a problem. However we can apply a third argument to update to solve this.

    db.cds.update({category:"Rock"}, {$set:{category:"Heavy Rock"}},{multi:true})
    

    The addition of {multi:true} will allow the update to be applied to all matching entries.

  8. Now to the final CRUD operation – Delete. This is simply a case of using the collection’s remove method. For example
    db.cds.remove({artist:"Alice Cooper"})
    

    Be careful not to forget the query part of this method call, the {artist:"Alice Cooper"}, as using db.cds.remove() will remove all entries from the collection.

Creating a simple database in MongoDB with C#

This is quick post on getting up and running with MongoDB using the “Official MongoDB C# driver” and both creating a database and adding some data to it.

  1. Using Nuget, add the “Official MongoDB C# driver” from 10gen to your project
  2. Add the following using clauses
    using MongoDB.Bson;
    using MongoDB.Driver;
    
  3. Create a POCO object to represent the data you want to persist. The key thing to remember is to add a property of type ObjectId; without this we’ll get an exception stating “No IdGenerator found”.

    So an example POCO might look like this

    public class Person
    {
       public ObjectId Id { get; set; }
       public string FirstName { get; set; }
       public string LastName { get; set; }
       public int Age { get; set; }
    }
    
  4. Now we need to connect to the server and create/use a database.

    MongoClient client = new MongoClient();
    MongoServer server = client.GetServer();
    MongoDatabase db = server.GetDatabase("MyDatabase");
    

    Obviously the first two lines create a mongo client and then get access to the server. The line server.GetDatabase("MyDatabase") will get the database (if it exists) but will also create the database if it doesn’t exist.

    Note: if you are creating a database using GetDatabase it will not exist until you actually store data in it.

  5. Next we’re going to assume we want to store a collection of employees (a collection of Person objects), so we want to get the collection of “employees”. As with creating the database, if no employees currently exist we still get a collection object which we can then save data to.

    MongoCollection<Person> collection = db.GetCollection<Person>("employees");
    
  6. Let’s now create a Person object ready for adding to the collection and ultimately to the database.

    Create the following

    Person p = new Person
    {
       Id = ObjectId.GenerateNewId(),
       FirstName = "Bob",
       LastName = "Baker",
       Age = 36
    };
    

    Notice that we generate the Id using ObjectId.GenerateNewId().

  7. Our next step is to save the new Person to the collection; this will add the data to the collection and thus the database, and we can then query for this data afterwards using the JavaScript shell.

    collection.Save(p);
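
    If we want to query the data back from C# as well, here’s a minimal sketch using the driver’s Query builder (this assumes an extra using for MongoDB.Driver.Builders):

    using MongoDB.Driver.Builders;

    // find the employee we just saved by first name
    Person found = collection.FindOne(Query.EQ("FirstName", "Bob"));
    Console.WriteLine(found.LastName); // Baker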
    

What version of Oracle am I connecting to?

We recently moved from using Oracle 10g to Oracle 11g.

Note: I have access to the DB in this scenario but only via Oracle SQL Developer

I wanted to verify that the version I was connecting to was indeed 11g, so I found this

SELECT * FROM PRODUCT_COMPONENT_VERSION;

Which returns the version details for each product component.

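Alternatively, Oracle’s v$version view gives similar information (assuming you have the privileges to query it):

SELECT * FROM v$version;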