R - Data Frame

Data Frame

  • A data frame in R is two-dimensional like a matrix, but its columns can hold different types of data (heterogeneous data)

  • In a way, a data frame is like a list whose components are its columns

  • The components of a data frame must be vectors (numeric, character or logical), factors, numeric matrices, lists or other data frames

  • Vector structures appearing as variables of a data frame must all have the same length and matrix structures must all have the same row size
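
The points above can be checked with a short sketch. The vectors name, age and passed and the data frame students are made up for illustration and do not come from the examples in this tutorial.

```r
# Hypothetical columns of different types, all of length 3
name   <- c("Asha", "Ben", "Chen")
age    <- c(21, 23, 22)
passed <- c(TRUE, FALSE, TRUE)

students <- data.frame(name, age, passed)

# A data frame is stored as a list whose components are its columns
is.list(students)        # TRUE
length(students)         # 3 (the number of columns)
sapply(students, class)  # one class per column

# Columns whose lengths cannot be matched up cause an error;
# tryCatch() captures the error message instead of stopping the script
mismatch <- tryCatch(
  data.frame(name, age = c(21, 23)),
  error = function(e) conditionMessage(e)
)
mismatch
```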

Syntax:

  • data.frame(vectors, row.names = NULL, etc…)

  • as.data.frame(x)

    • where x can be a vector, list, factor or matrix

  • is.data.frame(x)

    • it checks if the variable is a data frame or not
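
A minimal sketch of the two helper functions, assuming a small made-up matrix m and a made-up list (neither appears elsewhere in this tutorial):

```r
# Hypothetical 2 x 3 integer matrix
m <- matrix(1:6, nrow = 2)
is.data.frame(m)         # FALSE

# Convert the matrix; each matrix column becomes a data frame column
m_df <- as.data.frame(m)
is.data.frame(m_df)      # TRUE
dim(m_df)                # 2 3

# A named list with equal-length components converts as well
l_df <- as.data.frame(list(id = 1:3, ok = c(TRUE, FALSE, TRUE)))
names(l_df)              # "id" "ok"
```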

Creating a data frame named myScore by using the data.frame() function

	Subjects <- c("Math","Science","English","Social")
	Marks <- c(99,67,74,62)
	myScore <- data.frame(Subjects, Marks)
	myScore

Output:

Consider a matrix ‘mat1’. Use the as.data.frame() function to convert it to a data frame named ‘NewDF’

	NewDF <- as.data.frame(mat1)
	NewDF

Output:

Consider the two heterogeneous vectors, Subjects and Marks, of character and numeric type respectively, and the myScore data frame created in the previous example

In R versions before 4.0.0, the ‘Subjects’ character vector is converted to a factor when the data frame is created, because stringsAsFactors defaults to TRUE. From R 4.0.0 onwards the default is FALSE and the vector stays a character

To ensure that the ‘Subjects’ vector remains a character regardless of the R version, use the option stringsAsFactors = FALSE

	data.frame(Subjects, Marks, stringsAsFactors = FALSE)
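
The effect of the option can be checked with class(). This sketch reuses the Subjects and Marks vectors from the earlier example; the names as_factor and as_char are made up here.

```r
Subjects <- c("Math", "Science", "English", "Social")
Marks <- c(99, 67, 74, 62)

# Setting the option explicitly makes the result independent of the R version
as_factor <- data.frame(Subjects, Marks, stringsAsFactors = TRUE)
as_char   <- data.frame(Subjects, Marks, stringsAsFactors = FALSE)

class(as_factor$Subjects)  # "factor"
class(as_char$Subjects)    # "character"
```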

  • names() function

    • can be used to retrieve the column names

    • can be used to modify the column names

Consider the GMAT.df data frame from the previous GMAT example. To change the column name ‘Tom’ to ‘John’:

	names(GMAT.df)
	[1] "Jane" "Tom" "Katy" "James"
	names(GMAT.df) <- c("Jane", "John", "Katy", "James")
	GMAT.df

Output:

  • colnames() & rownames() functions

    • can be used to retrieve or modify the column and row names respectively

	colnames(GMAT.df)
	[1] "Jane" "John" "Katy" "James"
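
The naming functions above can be tried on a self-contained data frame. The data frame scores.df, its values and the row labels are made up for illustration; they are not the GMAT.df data.

```r
# Hypothetical scores; the values are made up for illustration
scores.df <- data.frame(Jane = c(98.1, 97.5, 96.0),
                        Tom  = c(95.2, 94.8, 97.3))

names(scores.df)               # "Jane" "Tom"
names(scores.df)[2] <- "John"  # rename only the second column
colnames(scores.df)            # "Jane" "John"

rownames(scores.df)            # default row names "1" "2" "3"
rownames(scores.df) <- c("verbal", "math", "writing")  # hypothetical labels
scores.df
```

Indexing names() with [2] renames a single column without retyping the others, which is less error-prone than assigning the full vector of names.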

  • The '$' operator can be used to access a specific column by name

	GMAT.df$Katy
	[1] 99.4 99.7 98.9

  • dataFrameName[row, column]

    • an element can be retrieved by its row and column position in the data frame

	GMAT.df[2,3]
	[1] 99.7

	# 99.7 is the math score of Katy
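
The access styles can be compared side by side in a small sketch. The data frame scores.df is hypothetical: the Katy column reuses the values shown above, while the Jane column is made up.

```r
# Hypothetical two-column data frame (Jane's values are invented)
scores.df <- data.frame(Jane = c(98.1, 97.5, 96.0),
                        Katy = c(99.4, 99.7, 98.9))

scores.df$Katy       # whole column as a vector
scores.df[["Katy"]]  # same column, selected by name
scores.df[, "Katy"]  # same column, selected inside [row, column]
scores.df[2, 2]      # single element: row 2, column 2
scores.df[2, ]       # entire second row, returned as a data frame
```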

 

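  • dim() function can be used to check the dimensions (number of rows, number of columns) of the data frame. Unlike with a matrix, assigning to dim() is not a reliable way to reshape a data frame; its dimensions change when rows or columns are added or removed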

	dim(GMAT.df)
	[1] 3 4
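
A short sketch of inspecting dimensions, using a made-up data frame toy.df. The functions nrow() and ncol() are base R helpers not mentioned above; they return the two components of dim() separately.

```r
# Hypothetical data frame with 3 rows and 2 columns
toy.df <- data.frame(x = 1:3, y = c("a", "b", "c"))

dim(toy.df)    # 3 2
nrow(toy.df)   # 3
ncol(toy.df)   # 2

# Adding a column changes the reported dimensions
toy.df$z <- toy.df$x * 2
dim(toy.df)    # 3 3
```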

One way to subset a data frame is to use the ‘subset()’ function

Syntax:

  • subset(x, condition, select, ...)
    • x: the data frame
    • condition: a logical expression that selects the rows to keep
    • select: columns to be displayed in the output
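
The syntax can be tried on a small self-contained data frame. The data frame scores and all of its values are made up for illustration. Inside subset(), column names can be written directly, without the data frame name and '$'.

```r
# Hypothetical data; names and marks are invented
scores <- data.frame(name  = c("Asha", "Ben", "Chen", "Dina"),
                     marks = c(99, 96, 97, 90),
                     grade = c("A", "A", "A", "B"))

# Rows where marks >= 96, all columns
high <- subset(scores, marks >= 96)
high

# Rows where 96 < marks < 99, keeping only the name and marks columns
mid <- subset(scores, marks > 96 & marks < 99, select = c(name, marks))
mid
```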

Consider the ‘math’ data frame

To get the details of students who scored 96 marks or more:

	subset(math, marks >= 96)

Output:

To get the names and marks of students who scored more than 96 and less than 99:

	subset(math, marks > 96 & marks < 99, select = c(name, marks))

Output:

What will happen if there are missing values in my data frames?

Most operations performed on missing data (NA) will result in NA, but there is an option to resolve this issue. Let us see what it is.